FAQ Open Data
Open Data is data that is accessible to everyone online for free, and which can be used, reused and distributed provided that the data source is attributed. More broadly, the term Open Data applies to the practice of sharing data in the manner described above. Open Data is one of the pillars of Open Science, alongside Open Access to publications, Open Source principles for the creation and availability of software, and others.
Open Data is typically shared through open online repositories. Access restrictions or monetary requirements imposed on the data user are incompatible with Open Data. Registration requirements are sometimes considered incompatible with Open Data as well.
Sharing data as openly as possible is crucial for two main reasons:
First, data availability guarantees transparency and supports the replicability of studies. Thus, it is part of good scientific practice in line with the DFG Code on Safeguarding Good Scientific Practice (Guidelines for Safeguarding Good Research Practice) and the DFG Guidelines on the Handling of Research Data.
Second, making data open allows for different analyses, combinations and meta-analyses of data, thus opening up new avenues for scientific exploration.
For these two reasons, a wide range of governments, funders and professional associations support and often require open data practices. This includes EU research funding, where Open Data is the default. Charité and MDC, as well as all three large Berlin universities (HU, FU, TU), have signed the Berlin Declaration on Open Access to Knowledge in the Sciences and Humanities, which includes a commitment to the importance of sharing research data.
In addition to the benefits for the scientific enterprise overall, as stated in the previous section, data sharing also has advantages for individual researchers. Firstly, studies with corresponding Open Data have been shown to have a citation advantage, see e.g. Colavizza et al., 2019 (preprint). Secondly, having the data freely available online provides security in case the data are requested later. Whatever the reason for the request, a later inability to provide the data would raise doubts about their validity, and in extreme cases might even provoke a retraction. Thirdly, data sharing encourages a data structure and annotation that are comprehensible and complete. This can also be very helpful if you need to access the data yourself later. Lastly, in the current environment where data sharing is still the exception, demonstrated openness reflects one’s own commitment to data integrity and transparency.
If you are a researcher from Charité or MDC, also see “How will I be recognized for sharing data?”
Data in this context includes all digital outputs of research. Thus, it includes raw and pre-processed (e.g. normalized or aligned) data, metadata (i.e. “data on data”), software code, protocols and any other output necessary to understand, reproduce or reanalyze data. The sharing of heavily processed data alone (e.g. summary statistics data underlying figures) is, however, not considered Open Data. Sharing of data which underlie publications is more common, but sharing of well-documented stand-alone datasets can be very valuable for (meta-)analysis by other researchers as well. Ideally, raw data, pre-processed data and, if applicable, the code or steps necessary to arrive at the processed data are all provided.
Personalized data, including data from patients and healthy participants, are subject to restrictions, including those based on the EU General Data Protection Regulation (GDPR). Despite restrictions, sharing personalized data is possible in many cases if appropriate steps are taken. Typically, sharing requires either full anonymization (if possible) or so-called ‘de facto anonymization’, as well as corresponding patient or participant consent in either case. The corresponding forms and information require a positive assessment by the ethics committee responsible for the clinical study.
Thus, sharing human data is possible if the right steps regarding consent, de-identification and/or data access restrictions are taken. In some cases data can be made fully open, while in others access restrictions need to be applied. In either case, making personal data available for reuse whenever possible is very important given the ethical imperative to make best use of such data, as well as the effort and cost to collect them. Best practices in sharing sensitive data are conveyed in the seminars ‘Data sharing’ organized by the QUEST Center (see here for contents of seminar on June 3rd, 2020).
For further information on open sharing of personalized data, as well as alternatives where full openness is not possible, see the document “Grundlegende Informationen zu Open Data-Publikationen bei Studien mit personenbezogenen Daten” (in German). This document was endorsed by the Charité data protection office, and received a positive vote from the Charité ethics committee. Please note that the document is not legally binding, and that at other institutions the organizational and legal situation might be different. For Charité members, the data protection office provides further information and offers consultation services. If you are working with data obtained from another institution or source, it is important to also check their data sharing requirements and regulations.
In particular the following reasons can restrict or even exclude the open sharing of data:
- Intellectual property & patents: Data with commercialization value need to be assessed by Berlin Health Innovations, the Charité-BIH technology transfer office
- Threat: Data which pose a threat to security (e.g. biohazard research) might not be openly shareable
- Contracts: Contracts, especially with industry sponsors, might restrict or prohibit open data sharing
- Data protection: Openly sharing personal data is not always possible, and must comply with the General Data Protection Regulation (GDPR).
Please consult the section “How can I share personal data while accounting for data protection?” of this FAQ.
Open Data is made available through repositories. There are three broad categories:
- If an established disciplinary repository exists in your field, this is typically the best choice, as the data will be most visible and the annotation and data structure will facilitate reuse. Re3data is a search engine for more than 2000 repositories. FAIRsharing also provides a comprehensive and highly curated list of data repositories, as well as standards and policies.
- An all-purpose repository such as Zenodo or Open Science Framework is a good choice if no established disciplinary repository exists. The level of structure and annotation will typically be lower, decreasing reusability, but the high degree of consolidation can be an advantage with respect to long-term availability.
- An institutional repository can be a fallback option, but is typically the least visible and integrated. As of 1/2019, neither Charité nor MDC provides such a repository, but if you need one, please communicate this to the QUEST Center to support future infrastructure development.
- For analysis software, git-based repositories allowing code versioning are recommended while the project is ongoing. After its completion, it is best practice to save a “snapshot” of the git-repository in an all-purpose repository.
- If the data underlie a “regular” research publication, the publisher will sometimes point to specialized repositories in the author guidelines, or even require the use of a repository with which it has a contract.
- Data shared as supplementary materials are difficult to find and integrate, and therefore this practice is not recommended. This is because supplementary materials typically do not receive their own DOI (Digital Object Identifier) and are not annotated with metadata.
- If the dataset is multi-faceted, extensive or otherwise valuable to the community and/or if the findings are “negative”, a publication in a data journal (see list) can be considered. A data journal publication consists of a detailed description of the dataset and links to the data described.
- Other options include attaching data to a preprint (e.g. on bioRxiv) or publishing dataset descriptions as micropublications or via publishing platforms.
- However, sharing data in a repository only is a low-effort and legitimate way of making data available in a well-documented and easily accessible way – it can even be useful for future use for oneself or for members within the same institution or lab.
- The QUEST Center provides the ‘match-making’ tool fiddle, which helps to identify publishing venues for datasets, as well as for articles with null or neutral findings, which are often difficult to publish in traditional journals. The need to combat publication bias, as well as the tool itself, is described in a publication (see preprint).
The FAIR data principles (Wilkinson et al., 2016) give an overview of how to make data Findable, Accessible, Interoperable, and Reusable. Once you have addressed any relevant regulations and identified a repository, it’s important to consider the following when sharing data:
- Share the data in machine-readable and ideally open formats. For example, for tables the .csv format is ideal and .xls is also acceptable, while .pdf is strongly discouraged and is not considered to be Open Data by some definitions.
- Structure and name the files in a comprehensible way, and document it in a readme file.
- Document the data with appropriate metadata. When preparing your metadata, think about what information someone else would need to understand your analyses or use your dataset. This can include whatever experimental information is not included in the data itself, as well as abbreviations, keywords, an abstract explaining the dataset, author names and affiliations, creation date, a primary contact, etc. While some basic metadata will be required in all repositories, experiment-specific metadata might not be a standard requirement in all-purpose databases. However, this information is critical to understand and reuse the data.
- Attach a license to the data, ideally a Creative Commons (CC) license, to make sure they are shared and reused in the desired way. CC-BY (unlimited use with citation) is the most common for research data. However, some institutions have specific requirements for what types of licenses are permitted, and this needs to be considered for multicenter studies or collaborations.
It should also be noted that, according to some legal scholars, most types of research data are not subject to copyright in the German legal framework, because the threshold of originality (Schöpfungshöhe) is not attained. This would make a license, CC or otherwise, attached to a dataset invalid, and even potentially misleading. However, the use of CC licenses for research data has so far not been legally challenged, and remains widely recommended, including by the European Union.
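The points above on formats, documentation and licensing can be sketched in a few lines of Python. This is only a minimal illustration: the file names, the metadata field names and all example values are assumptions made here, not a schema required by any particular repository.

```python
import csv
import json
from pathlib import Path


def write_dataset(out_dir, rows, metadata):
    """Write rows to data.csv and metadata to metadata.json inside out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Tabular data as plain CSV: machine-readable and not tied to one program.
    data_path = out_dir / "data.csv"
    with data_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)

    # Descriptive metadata stored next to the data (hypothetical field names).
    meta_path = out_dir / "metadata.json"
    with meta_path.open("w", encoding="utf-8") as f:
        json.dump(metadata, f, indent=2, ensure_ascii=False)

    return data_path, meta_path


# Hypothetical example values, for illustration only.
example_rows = [
    {"sample_id": "S1", "group": "control", "value_mmol_l": 4.8},
    {"sample_id": "S2", "group": "treated", "value_mmol_l": 5.3},
]
example_metadata = {
    "title": "Example dataset",
    "abstract": "Short description of what was measured and how.",
    "authors": [{"name": "Doe, Jane", "affiliation": "Example Institute"}],
    "contact": "jane.doe@example.org",
    "created": "2020-01-01",
    "license": "CC-BY-4.0",
    "keywords": ["example", "open data"],
    "abbreviations": {"mmol_l": "millimoles per litre"},
}
```

Calling `write_dataset("example_dataset", example_rows, example_metadata)` would create a folder containing both files. Most repositories additionally ask for some of these fields in their own upload form, so the sidecar file complements rather than replaces the repository's metadata.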
Data should ideally be shared at the time of publishing the corresponding manuscript, which allows the dataset to be referenced and linked in the publication. However, there can be reasons why data should not be made open immediately, e.g. to protect intellectual property. In this case, publication at a later date should be considered. To accommodate this, many repositories, including Zenodo, allow an embargo period. Data are inaccessible during this period, and become automatically open thereafter.
If the dataset continues to grow after it has first been shared, it can be updated. Many repositories like Zenodo allow versioning, which guarantees that different versions are accessible, while the top-level DOI linked in a publication also remains valid. For datasets which are collected over an extended period of time and/or where timely publication is crucial (e.g. in epidemiology research), a regular or even online updating of the dataset (“dynamic data”) is ideal, but currently this is not supported by all-purpose repositories.
- Many journals already require a data availability statement and will detail where and how this is to be provided. If the journal does not stipulate this, it is recommended to include such a statement in the methods section.
- If the data underlying the publication have been shared via a repository, the dataset should be cited similarly to citing a research paper (see how-to guide of the Digital Curation Centre).
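As a rough sketch of what such a dataset citation can look like, the following Python helper assembles one from its parts. The element order used here (authors, year, title, version, repository, DOI link) is an assumption for illustration, and all values, including the DOI, are placeholders. The style prescribed by the journal or by the how-to guide mentioned above takes precedence.

```python
def format_dataset_citation(authors, year, title, repository, doi, version=None):
    """Return a one-line dataset citation string.

    The element order is an illustrative assumption; journals may
    prescribe a different citation style.
    """
    author_part = "; ".join(authors)
    parts = [f"{author_part} ({year}).", f"{title}."]
    if version:
        parts.append(f"Version {version}.")
    parts.append(f"{repository}.")
    # Express the DOI as a resolvable link so readers can follow it directly.
    parts.append(f"https://doi.org/{doi}")
    return " ".join(parts)


# Hypothetical values; the DOI below is a placeholder, not a real dataset.
citation = format_dataset_citation(
    authors=["Doe, J.", "Roe, R."],
    year=2020,
    title="Example dataset",
    repository="Zenodo",
    doi="10.1234/example.5678",
    version="1.0",
)
print(citation)
```

Citing the DOI rather than a plain web address keeps the reference stable even if the repository later changes its URLs.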
If an established disciplinary repository exists, it can of course be searched directly. If the repository containing the data is not known or the search is not restricted to a single repository, search engines for datasets can be used. The most important ones include dataMED and Google dataset search, and often a search in Google or Google scholar will also yield results. Commercial citation databases will also allow you to search for datasets. As of 1/2019, these search engines are not yet as powerful as those for publication searches (see also QUEST blog entry “How to find open research data”).
This will depend on the way data are published. For a dataset published in a repository, recognition is expected in the form of citations. In the case of publishing the dataset description in a data journal, the foremost recognition will typically be for the journal article, adding to citations of the dataset itself. It has also been shown that making data available increases the citation rate of corresponding “regular” journal articles. In addition, shared data are increasingly recognized as a valuable scientific output in their own right, and thus shared datasets or code can e.g. be listed in CVs or via ORCID.
At the Charité, considering such scientific contributions (as part of the so-called QUEST-criteria) is now formally part of assessing applications for professorship positions and intramural funding schemes, including the BIH-Charité Clinician Scientist Program.
In addition, starting from 2019, Charité and MDC researchers receive additional performance-oriented funding if they openly shared the data underlying their article publications. At the Charité, this funding is distributed as part of the LoM (Leistungsorientierte Mittelvergabe). For the detection of publications with attached open data, the ODDPub algorithm, developed by the QUEST Center, is used for prescreening, followed by a manual check of putative open data publications.
For information on the inclusion of open data as an indicator in performance-oriented funding, see the program description as well as the detailed criteria for eligible publications. For questions or comments on the rewarding of Open Data at Charité and MDC, please contact email@example.com.
Data sharing (which need not be completely in line with open data principles) is also rewarded by the so-called “Parasite Awards”, which were created following a controversy over the description of data re-users as “research parasites”. In addition, the ‘Stifterverband’ (German donors’ association) gives an “Open Data Impact Award” for open research data which already have or are expected to have societal impact.
Publications based on secondary data analyses, which harness the potential of re-using open data, are still relatively uncommon in many areas of biomedical research. To support open data re-use and highlight its potential, the QUEST Center is giving away a 1000€ Open Data Reuse award for such publications. Please see here for details of the award call.
Data re-use is also rewarded by the so-called ‘Symbiont Awards’, which are given to researchers for ‘outstanding contributions to the rigorous secondary analysis of data’. In this case, the re-used data need not have been completely open.
The QUEST Center and the Charité Medical Library jointly offer Open Data Workshops. These three-hour workshops, offered alternately in German and English, cover topics similar to this FAQ, but in more detail. In addition, some related topics like metadata, digital object identifiers and ORCID are covered, and there is opportunity for practical exercises with data search and databases.
Upcoming Open Data Workshops are listed on the QUEST website “courses and events”. In this list you will also find announcements of workshops and seminars on related topics like data sharing and research data management.
In online training resources, Open Data is typically addressed within the wider frameworks of either research data management or Open Science. The course Best Practices for Biomedical Research Data Management of the Harvard Medical School is particularly recommended. A good online course with an Open Science focus is “Open Science: Sharing your research with the world”.
For further information, please contact Dr. Evgeny Bobrov, open data and research data management officer of the BIH QUEST Center. He is also available for talks and presentations on Open Data for individual groups and departments.
Dr. Evgeny Bobrov
Berliner Institut für Gesundheitsforschung (BIG) / Berlin Institute of Health (BIH)
QUEST – Quality | Ethics | Open Science | Translation
BIH Center for Transforming Biomedical Research
Open Data and Research Data Management Officer
Tel. +49 (0)30 450 543 069
Further information on open data within the wider context of research data management is provided by the following sources:
- European Commission: information on funders’ and journals’ policies, attitudes towards data sharing etc.
- National Contact Point for EU-Program Horizon 2020: overview of requirements and opt-out options for EU-funded projects (in German)
- DFG Guidelines on the Handling of Research Data (in German)
- BMBF recommendations on the management of research data (in German)