Open Science
Open Data
Open Data means the publication of research data for the purpose of transparency and subsequent use by others. As a member of the EUHA, Charité has signed the Sorbonne Declaration on Research Data Rights and is committed to promoting open data.
The FAQ Open Data (see below) answers all important questions about Open Data. Charité promotes the sharing of data by awarding an Open Data LoM (performance-based allocation of funds). The BIH also promotes this as part of the IOM. The Charité Metrics Dashboard also takes into account the openness of the data and its reusability according to the FAIR criteria. Please also note the FAQs on Data Management Plans (DMPs).
FAQ Open Data
Open Data is data that is accessible to everyone online for free, and which can be used, reused and distributed provided that the data source is attributed. More broadly, the term Open Data applies to the practice of sharing data in the manner described above. Open Data is one of the pillars of Open Science, alongside Open Access to publications, Open Source principles for the creation and availability of software, and others.
Open Data is typically shared through open online repositories. Access restrictions or monetary requirements imposed on the data user are incompatible with Open Data. Registration requirements are sometimes considered incompatible with Open Data as well. While sharing data under access restrictions is not considered "Open Data", it is often the only way of sharing due to privacy reasons, and is thus also very valuable.
Sharing data as openly as possible is crucial for two main reasons:
First, data availability guarantees transparency and supports the replicability of studies. Thus, it is part of good scientific practice in line with the DFG Code on Safeguarding Good Scientific Practice (Guidelines for Safeguarding Good Research Practice) and the DFG Guidelines on the Handling of Research Data (also see the DFG Checklist on the handling of research data in research projects).
Second, making data open allows for different analyses, combinations, and meta-analyses of data, thus opening up new avenues for scientific exploration.
For these two reasons, a wide range of governments, funders and professional associations support and often require open data practices. This includes EU research funding, where Open Data is the default. Charité and MDC, as well as all three large Berlin universities (HU, FU, TU), have signed the Berlin Declaration on Open Access to Knowledge in the Sciences and Humanities, which includes a commitment to the importance of sharing research data.
In addition to the benefits for the scientific enterprise overall, as stated in the previous section, data sharing also has advantages for individual researchers. Firstly, studies with corresponding Open Data have been shown to have a citation advantage, see e.g. Colavizza et al., 2020. Secondly, having the data freely available online provides security in case the data are requested later. Whatever the reason for the request, a later inability to provide the data would raise doubts about their validity, and in extreme cases might even provoke a retraction. Thirdly, preparing data for sharing encourages a data structure and annotation that are comprehensible and complete. This is also highly helpful if you need to access the data yourself later. Lastly, in the current environment where data sharing is still the exception, proven openness reflects one’s own commitment to data integrity and transparency.
If you are a researcher from Charité, also see How will I be recognised for sharing data?
Data in this context includes all digital outputs of research. Thus, it includes raw and pre-processed (e.g. normalized or aligned) data, metadata (i.e. "data on data"), software code, protocols and any other output necessary to understand, reproduce or reanalyze data. The sharing of heavily processed data alone (e.g. summary statistics data underlying figures) is, however, not considered Open Data. Sharing of data which underlie publications is more common, but sharing of well-documented stand-alone datasets can be very valuable for (meta-)analysis by other researchers as well. Ideally, you would provide raw data, pre-processed data and, if applicable, the code and/or protocol necessary to arrive at the processed data.
How can I share personal data while accounting for data protection?
Personal data, including data from patients and healthy participants, are subject to restrictions, in particular the EU General Data Protection Regulation (GDPR). Despite these restrictions, sharing personal data is possible in many cases if appropriate steps are taken. Two ways of sharing personal data are most common: sharing fully anonymized data openly, or sharing de-identified (but not fully anonymous) data under restrictions.
Full anonymization can be possible, but often this is not the case for datasets characterized by multiple rare clinical variables and/or small study samples and/or rare disease populations. In addition, certain types of data, such as imaging and genetic data, are particularly difficult to anonymize. In the case of fully anonymized data, the fact that a link to identifying data remains in the study center is not necessarily an obstacle to data sharing. Most sources agree that legally there is no obligation to obtain consent for the anonymization step. However, this is not fully clear, and in any case it is recommended to obtain consent for ethical reasons if anonymization and subsequent sharing of data are planned.
The majority of datasets cannot be made fully anonymous even by removing direct identifiers (such as name, address, date of birth, ID numbers). In these cases, patient or participant consent for sharing is always required. In addition, such data can typically only be shared under access restrictions. The relevant forms and information (data protection information and consent form, data protection concept) need to be prepared and submitted to the responsible ethics committee and data protection office (or, for clinical studies at Charité, the Clinical Trial Office).
Thus, sharing human data is possible if the right steps regarding consent, de-identification and/or data access restrictions are taken. In some cases data can be made fully open, while in others access restrictions need to be applied. In either case, making personal data available for reuse whenever possible is very important given the ethical imperative to make the best use of such data (see e.g. Mello et al., 2018), as well as the effort and cost of collecting them.
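The removal of direct identifiers mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column names and the example record are hypothetical, and, as explained above, dropping direct identifiers alone usually yields de-identified rather than fully anonymous data, so consent and access restrictions may still be required.

```python
# Hypothetical column names; adapt them to your own dataset.
DIRECT_IDENTIFIERS = {"name", "address", "date_of_birth", "id_number"}


def drop_direct_identifiers(records, identifiers=DIRECT_IDENTIFIERS):
    """Return copies of the records without the listed direct-identifier fields."""
    return [
        {key: value for key, value in record.items() if key not in identifiers}
        for record in records
    ]


# Made-up example record, for illustration only.
participants = [
    {"name": "Jane Doe", "date_of_birth": "1980-01-01", "age_group": "40-49", "score": 17},
]
shareable = drop_direct_identifiers(participants)
print(shareable)
```

The original records are left untouched, so the identifying data can remain in the study center while only the reduced copy is considered for sharing. Whether the reduced copy may be shared at all, and under which restrictions, still has to be assessed case by case.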
The BIH QUEST Center can provide example consent statements and consult on the implementation of data sharing. If you need support with data anonymization, you can contact the Medical Informatics Group. Two departments of the Charité can support you with questions regarding data protection. You may contact the Clinical Trial Office for support in a specific study or clinical trial. For other cases where personal data is handled, you can contact the Data Protection Support. Both can provide the necessary forms, as well as support with handling them. We highly recommend contacting one of these departments if you are in any doubt regarding the legal basis or the mode of sharing personal data.
In particular the following reasons can restrict or even exclude the open sharing of data:
- Intellectual property & patents: Data with commercialization value need to be assessed by Charité BIH Innovation, the Charité and BIH technology transfer office
- Threat: Data which pose a threat to security (e.g. biohazard research) might not be openly shareable
- Contracts: Contracts, especially with industry sponsors, might restrict or prohibit open data sharing
- Data protection: Openly sharing personal data is not always possible, and must comply with the General Data Protection Regulation (GDPR).
Please consult the section "How can I share personal data while accounting for data protection?" of this FAQ.
Open Data is made available through repositories. There are two broad categories currently available to Charité researchers.
- If an established disciplinary repository exists in your field, this is typically the best choice, as the data will be most visible and the annotation and data structure will facilitate reuse. Re3data is a search engine for more than 2000 repositories. FAIRsharing also provides a comprehensive and highly curated list of data repositories, as well as standards and policies.
- An all-purpose repository such as Zenodo, figshare, or Open Science Framework is a good choice if no established disciplinary repository exists. Of these, Zenodo is the most recommended due to its European location, long-term funding, and good data findability. The level of structure and annotation will typically be lower than in disciplinary repositories, decreasing reusability, but the high degree of consolidation can be an advantage with respect to long-term availability.
- For analysis software, git-based repositories that allow code versioning (most commonly GitHub) are recommended while the project is ongoing. After its completion, it is best practice to save a "snapshot" of the git repository in an all-purpose repository. Zenodo is particularly suitable for this.
- If the data underlie a "regular" research publication, the publisher will sometimes point to specialized repositories in the author guidelines, or even require the use of a repository with which it has a contract.
- Data shared as supplementary materials are difficult to find and integrate, and therefore this practice is not recommended. This is due to supplementary materials typically not receiving their own DOI (Digital Object Identifier) and not being annotated with metadata.
- If the dataset is multi-faceted, extensive or otherwise valuable to the community and/or if the findings are "negative", a publication in a data journal (see list of data journals) can be considered. A data journal publication consists of a detailed description of the dataset and links to the data described.
- Other options include attaching data to a preprint (e.g. on bioRxiv) or publishing dataset descriptions as micropublications or via publishing platforms.
- However, sharing data in a repository only is a low-effort and legitimate way of making data available in a well-documented and easily accessible way – it can even be useful for future use of data by oneself, as well as by members of the same institution or lab.
- The QUEST Center provides the ‘match-making’ tool fiddle, which helps to identify publishing venues for datasets, as well as for articles with null or neutral findings, which are often difficult to publish in traditional journals. The need to combat publication bias, as well as the tool itself, is described in a publication (Bernard et al., 2020).
The FAIR data principles (Wilkinson et al., 2016) give an overview of how to make data Findable, Accessible, Interoperable, and Reusable. Once you have addressed any relevant regulations and identified a repository, it’s important to consider the following when sharing data:
- Share the data in machine-readable, ideally open formats. For example, for tables the .csv format is ideal and .xlsx is also acceptable, while .pdf is strongly discouraged and is not considered to be Open Data by some definitions.
- Structure and name the files in a comprehensible way, and document it in a readme file.
- Document the data with appropriate metadata. When preparing your metadata, think about what information someone else would need to understand your analyses or use your dataset. This can include whatever experimental information is not included in the data itself, as well as abbreviations, keywords, an abstract explaining the dataset, author names and affiliations, creation date, a primary contact, etc. While some basic metadata will be required in all repositories, experiment-specific metadata might not be a standard requirement in all-purpose databases. However, this information is critical to understand and reuse the data.
- Attach a license to the data, ideally a Creative Commons (CC) license, to make sure they are shared and reused in the desired way. CC-BY (unlimited use with citation) is the most common for research data.
It should also be noted that, according to some legal scholars, most types of research data are not subject to copyright in the German legal framework, because the threshold of originality (Schöpfungshöhe) is not attained. This would make a license (CC or otherwise) attached to a dataset invalid, and even potentially misleading. However, the use of CC licenses for research data has so far not been legally challenged, and remains widely recommended, including by the European Union.
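The recommendations above (an open tabular format, a readme file, descriptive metadata and a license) can be combined in a short Python sketch. It is a minimal illustration, not a repository requirement: the file names, metadata fields and example values are assumptions chosen for this example, and a given repository may ask for different or additional fields.

```python
import csv
import json
import tempfile
from pathlib import Path


def write_dataset(folder, rows, metadata):
    """Write a table as CSV plus a README and a JSON metadata file."""
    out = Path(folder)
    out.mkdir(parents=True, exist_ok=True)

    # Tabular data in an open, machine-readable format (.csv).
    with open(out / "data.csv", "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)

    # Descriptive metadata, including the license.
    (out / "metadata.json").write_text(json.dumps(metadata, indent=2), encoding="utf-8")

    # README documenting the file structure, license and contact.
    readme = [
        metadata["title"],
        "",
        "Files:",
        "- data.csv: one row per sample",
        "- metadata.json: descriptive metadata",
        "",
        f"License: {metadata['license']}",
        f"Contact: {metadata['contact']}",
    ]
    (out / "README.txt").write_text("\n".join(readme) + "\n", encoding="utf-8")
    return out


# Made-up example values, for illustration only.
rows = [
    {"sample_id": "S1", "value_mmol_l": 4.2},
    {"sample_id": "S2", "value_mmol_l": 5.1},
]
metadata = {
    "title": "Example dataset",
    "creators": ["Doe, Jane"],
    "created": "2023-09-01",
    "keywords": ["example"],
    "license": "CC-BY-4.0",
    "contact": "jane.doe@example.org",
}
folder = write_dataset(tempfile.mkdtemp(), rows, metadata)
print(sorted(path.name for path in folder.iterdir()))
```

Putting the unit into the column name (here the hypothetical `value_mmol_l`) is one simple way to keep experiment-specific information with the data. Longer explanations of abbreviations and methods belong in the readme file or the metadata.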
Data should ideally be shared at the time of publishing the corresponding manuscript, which allows the dataset to be referenced and linked in the publication. However, there can be reasons why data should not be made open immediately, e.g. to protect intellectual property. In this case, publication at a later date should be considered. To accommodate this, many repositories, including Zenodo, allow an embargo period. Data are inaccessible during this period, and automatically become open thereafter.
If the dataset continues to grow after it has first been shared, it can be updated. Many repositories like Zenodo allow versioning, which guarantees that different versions are accessible, while the top-level DOI linked in a publication also remains valid. For datasets which are collected over an extended period of time and/or where timely publication is crucial (e.g. in epidemiology research), a regular or even online updating of the dataset ("dynamic data") is ideal, but currently this is not supported by all-purpose repositories.
- Many journals require a data availability statement and will detail where and how this is to be provided. If the journal does not stipulate this, it is recommended to include such a statement in the methods section. Also include re-used datasets in the data availability statement.
- If the data underlying the publication have been shared via a repository, the dataset should be cited similarly to citing a research paper (see how-to guide of the Digital Curation Centre). Importantly, your own data should also be cited in the reference list, just like other datasets (or your own publications).
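As an illustration of citing a dataset like a research paper, the following Python sketch assembles a reference from the elements commonly used in data citations (creators, year, title, version, repository, DOI). The element order is an assumption based on common practice, and the names and the DOI are hypothetical placeholders; the journal's own reference style takes precedence.

```python
def format_dataset_citation(creators, year, title, publisher, doi, version=None):
    """Build a citation: Creators (Year). Title. [Version.] Publisher. DOI link."""
    parts = [f"{'; '.join(creators)} ({year})", title]
    if version:
        parts.append(f"Version {version}")
    parts.append(publisher)
    return ". ".join(parts) + f". https://doi.org/{doi}"


# Made-up example values; the DOI is a placeholder, not a real dataset.
citation = format_dataset_citation(
    creators=["Doe, J.", "Roe, R."],
    year=2023,
    title="Example dataset",
    publisher="Zenodo",
    doi="10.1234/example",
    version="1.0",
)
print(citation)
# Doe, J.; Roe, R. (2023). Example dataset. Version 1.0. Zenodo. https://doi.org/10.1234/example
```

Giving the DOI as a full link makes the dataset directly resolvable from the reference list.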
If an established disciplinary repository exists, it can of course be searched directly. If the repository containing the data is not known or the search is not restricted to a single repository, search engines for datasets can be used. The most important ones include DataCite and Google Dataset Search, and often a search in Google or Google Scholar will also yield results. Commercial citation databases will also allow you to search for datasets. As of 09/2023, these search engines are, however, not yet as powerful as those for publication searches.
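Dataset searches can also be scripted. The Python sketch below only builds a search URL for DataCite's public REST API and does not send a request. The endpoint and the parameter names (`query`, `resource-type-id`, `page[size]`) are given here as assumptions and should be checked against the current DataCite API documentation before use; the search terms are a made-up example.

```python
from urllib.parse import urlencode

# Assumed endpoint of the DataCite REST API; verify against the official documentation.
DATACITE_DOIS_ENDPOINT = "https://api.datacite.org/dois"


def build_dataset_search_url(search_terms, page_size=10):
    """Return a URL that searches DataCite records restricted to datasets."""
    params = {
        "query": search_terms,
        "resource-type-id": "dataset",  # assumed filter for dataset records
        "page[size]": page_size,        # assumed paging parameter
    }
    return f"{DATACITE_DOIS_ENDPOINT}?{urlencode(params)}"


url = build_dataset_search_url("stroke imaging")
print(url)
```

The resulting URL can be opened in a browser or fetched with any HTTP client; the response is expected to be JSON describing the matching records.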
How will I be recognised for sharing data?
This will depend on the way the data are published. For a dataset published in a repository, recognition is expected in the form of citations. In the case of publishing the dataset description in a data journal, the foremost recognition will typically be for the journal article, adding to citations of the dataset itself. It has also been shown that making data available increases the citation rate of corresponding "regular" journal articles. In addition, shared data are increasingly recognized as a valuable scientific output in their own right, and thus shared datasets or code can e.g. be listed in CVs or via ORCID.
At the Charité, considering such scientific contributions is now formally part of assessing applications for professorship positions and intramural funding schemes, including the BIH-Charité Clinician Scientist Program.
In addition, since 2019, Charité researchers have received additional performance-oriented funding if they have openly shared the data underlying their article publications. At the Charité, this funding is distributed as part of the LoM (Leistungsorientierte Mittelvergabe), and at the BIH as part of the IOM. For information on the inclusion of Open Data as an indicator in performance-oriented funding, see the program description. Bobrov et al. (2023) detail the applied Open Data definition, while Iarkaeva et al. (2023) describe the process of detecting the publications with shared underlying data. This process consists of an automated screening step using the ODDPub algorithm, developed by the QUEST Center, followed by a manual check of putative open data publications. Sharing data under access restrictions is also in line with "Open Data LoM" criteria if the data could not be shared openly for privacy reasons. For questions or comments on the rewarding of Open Data at Charité, please contact quest@bih-charite.de.
Furthermore, since 7/2023, the QUEST Center has been awarding a 1000€ Open Data Contributor Award to Charité researchers whose publicly shared datasets have formed the basis for a publication by other, unrelated researchers.
Data sharing (which need not be completely in line with open data principles) is also rewarded by the so-called Parasite Awards, which were created following a controversy over the description of data re-users as research parasites.
Publications based on secondary data analyses, which harness the potential of re-using open data, are still relatively uncommon in many areas of biomedical research. To support open data re-use and highlight its potential, the QUEST Center is giving away a 1000€ Open Data Reuse award for such publications. Please see here for details of the award call.
Data re-use is also rewarded by the so-called Symbiont Awards, which are given to researchers for "outstanding contributions to the rigorous secondary analysis of data". In this case, the re-used data need not have been completely open.
The QUEST Center offers a seminar Data Sharing in Clinical Research, which includes a self-learning and an interactive part. In addition, the three universities of the Berlin University Alliance regularly offer courses on different aspects of research data management, including open data. Similarly, in online training resources, Open Data is often addressed within the wider frameworks of either research data management (RDM) or Open Science. A course providing a general overview on Open Data specifically is Open Data Essentials, developed by the Open Data Institute for the European Commission. The Data Management Skillbuilding Hub offers a good collection of training materials on RDM, although some aspects are specific to the earth sciences. A good online course with a general Open Science focus is Open Science: Sharing your research with the world. If you work in RDM support, especially as a librarian, the curriculum of the RDM Librarian Academy is worthwhile.
For further information, please contact Dr. Evgeny Bobrov, open data and research data management officer of the BIH QUEST Center. He is also available for talks and presentations on Open Data and research data management for individual groups and departments.
Dr. Evgeny Bobrov
Berliner Institut für Gesundheitsforschung in der Charité / Berlin Institute of Health at Charité (BIH)
QUEST – Quality | Ethics | Open Science | Translation
BIH Center for Responsible Research
Project team leader open data and research data management
Tel. +49 (0)30 450 543 069
Further information on open data within the wider context of research data management is provided by the following sources:
- Information on the EU’s open science policy, including Open Data
- National Contact Point for EU-Program Horizon 2020: overview of requirements and opt-out options for EU-funded projects (in German)
- DFG Guidelines on the Handling of Research Data, including a checklist for the description of data management in funding applications
- NIH Policy for Data Management and Sharing, in effect since January 2023; an overview is also available