FAQ Open Data

What is Open Data?

Open Data is data that is accessible to everyone online for free, and which can be used, reused and distributed provided that the data source is attributed. More broadly, the term Open Data applies to the practice of sharing data in the manner described above. Open Data is one of the pillars of Open Science, alongside Open Access to publications, Open Source principles for the creation and availability of software, and others.

Where is Open Data?

Open Data is typically shared through open online repositories. Access restrictions or monetary requirements imposed on the data user are incompatible with Open Data. Registration requirements are sometimes considered incompatible with Open Data as well.

Why is Open Data important?

Sharing data as openly as possible is crucial for two main reasons:

First, data availability guarantees transparency and supports the replicability of studies. Thus, it is part of good scientific practice in line with the Charité Charter on Good Scientific Practice and the DFG Guidelines on the Handling of Research Data.

Second, making data open allows for different analyses, combinations and meta-analyses of data, thus opening up new avenues for scientific exploration.

For these two reasons, a wide range of governments, funders and professional associations support and often require open data practices. This includes EU research funding, where Open Data is the default. Charité and MDC, as well as all three large Berlin universities (HU, FU, TU), have signed the Berlin Declaration on Open Access to Knowledge in the Sciences and Humanities, which includes a commitment to the importance of sharing research data.

How might data sharing benefit me?

In addition to the benefits for the scientific enterprise overall, as stated in the previous section, data sharing also has advantages for individual researchers. Firstly, studies with corresponding Open Data have been shown to have a citation advantage, at least in the field of genetics where this was studied. Secondly, having the data freely available online provides security in case the data are requested later. Whatever the reason for the request, a later inability to provide the data would raise doubts about their validity, and in extreme cases might even provoke a retraction. Thirdly, preparing data for sharing encourages a data structure and annotation which are comprehensible and complete. This can also be very helpful if you need to access the data yourself later. Lastly, in the current environment where data sharing is still the exception, demonstrated openness reflects one’s own commitment to data integrity and transparency.

Which data should I share?

Data in this context includes all digital outputs of research. Thus, it includes raw and pre-processed (e.g. normalized or aligned) data, metadata (i.e. “data on data”), software code, protocols and any other output necessary to understand, reproduce or reanalyze data. The sharing of heavily processed data alone (e.g. summary statistics data underlying figures) is, however, not considered Open Data. Sharing of data which underlie publications is more common, but sharing of well-documented stand-alone datasets can be very valuable for (meta-)analysis by other researchers as well. Ideally, you would provide the raw data, the pre-processed data and, if applicable, the code or steps necessary to arrive at the processed data.

How can I share personal data while accounting for data protection?

Personal data, including data from patients and healthy participants, are subject to restrictions, including those based on the EU General Data Protection Regulation (GDPR). Despite these restrictions, sharing personal data is possible in many cases if appropriate steps are taken. Typically, sharing requires either full anonymization (if possible) or so-called ‘de facto anonymization’, as well as corresponding patient or participant consent in either case. The consent forms and participant information require a positive assessment by the ethics committee responsible for the clinical study.

Thus, human data can also be Open Data, and indeed constitute some of the most important datasets, given the difficulty and cost of data collection.

For further information on open sharing of personal data, as well as alternatives where full openness is not possible, see the document “Grundlegende Informationen zu Open Data-Publikationen bei Studien mit personenbezogenen Daten” (in German). This document was endorsed by the Charité data protection office, and received a positive vote from the Charité ethics committee. Please note that the document is not legally binding, and that the organizational and legal situation might be different at other institutions. For Charité members, the data protection office provides further information and offers consultation services. If you are working with data obtained from another institution or source, it is important to also check their data sharing requirements and regulations.

Which data might not be suitable?

The following reasons, in particular, can restrict or even preclude the open sharing of data:

  • Intellectual property & patents: Data with commercialization value need to be assessed by Berlin Health Innovations, the Charité-BIH technology transfer office
  • Threat: Data which pose a threat to security (e.g. biohazard research) might not be openly shareable
  • Contracts: Contracts, especially with industry sponsors, might restrict or prohibit open data sharing
  • Data protection: Openly sharing personal data is not always possible, and must comply with the General Data Protection Regulation (GDPR).
    Please consult the section “How can I share personal data while accounting for data protection?” of this FAQ.

Where can I share data?

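Open Data is made available through repositories. There are three broad categories of repository, described in the first three items below. The remaining items cover software, publisher requirements and supplementary materials: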

  • If an established disciplinary repository exists in your field, this is typically the best choice, as the data will be most visible and the annotation and data structure will facilitate reuse. Re3data is a search engine for more than 2000 repositories. FAIRsharing also provides a comprehensive and highly curated list of data repositories, as well as standards and policies.
  • An all-purpose repository such as Zenodo or Open Science Framework is a good choice if no established disciplinary repository exists. The level of structure and annotation will typically be lower, decreasing reusability, but the high degree of consolidation can be an advantage with respect to long-term availability.
  • An institutional repository can be a fallback option, but is typically the least visible and integrated. As of 1/2019, neither Charité nor MDC provides such a repository. If you need one, please let the QUEST Center know, as this supports future infrastructure development.
  • For analysis software, git-based repositories allowing code versioning are recommended while the project is ongoing. After its completion, it is best practice to save a “snapshot” of the git-repository in an all-purpose repository.
  • If the data underlie a “regular” research publication, the publisher will sometimes point to specialized repositories in the author guidelines, or even require the use of a repository with which it has a contract.
  • Data shared as supplementary materials are difficult to find and integrate, because supplementary materials typically do not receive their own DOI (Digital Object Identifier) and are not annotated with metadata. This practice is therefore not recommended.

Where can I share data that do not accompany a full-length article publication?

  • If the dataset is multi-faceted, extensive or otherwise valuable to the community and/or if the findings are “negative”, a publication in a data journal (see list) can be considered. A data journal publication consists of a detailed description of the dataset and links to the data described.
  • Other options include attaching data to a preprint (e.g. on bioRxiv) or publishing dataset descriptions as micropublications or via publishing platforms.
  • However, sharing data solely in a repository, without an accompanying publication, is a low-effort and legitimate way of making data available in a well-documented and easily accessible form. It can even be useful for your own future use, or for members of the same institution or lab.

How can I share data?

The FAIR data principles (Wilkinson et al., 2016) give an overview of how to make data Findable, Accessible, Interoperable, and Reusable. Once you have addressed any relevant regulations and identified a repository, it’s important to consider the following when sharing data:

  • Share the data in machine-readable, ideally open formats. For example, for tables .csv format is ideal and .xls is also acceptable, while .pdf is strongly discouraged and is not considered to be Open Data by some definitions.
  • Structure and name the files in a comprehensible way, and document the structure and naming scheme in a readme file.
  • Document the data with appropriate metadata. When preparing your metadata, think about what information someone else would need to understand your analyses or use your dataset. This can include whatever experimental information is not included in the data itself, as well as abbreviations, keywords, an abstract explaining the dataset, author names and affiliations, creation date, a primary contact, etc. While some basic metadata will be required in all repositories, experiment-specific metadata might not be a standard requirement in all-purpose databases. However, this information is critical to understand and reuse the data.
  • Attach a license to the data, ideally a Creative Commons license, to make sure they are shared and reused in the desired way. CC-BY (unlimited use with citation) is the most common for research data. However, some institutions have specific requirements for what types of licenses are permitted, and this needs to be considered for multicenter studies or collaborations.
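
The points above can be illustrated with a minimal Python sketch that writes a small table as .csv, together with a metadata file and a readme. This is only an illustration under stated assumptions: the file names (measurements.csv, metadata.json, README.txt), the column names, the metadata fields and all values are hypothetical placeholders, not a schema required by any repository. Check the requirements of your chosen repository and institution before depositing.

```python
import csv
import json
import tempfile
from pathlib import Path


def write_dataset(folder, rows, metadata):
    """Write a table as CSV, plus a metadata file and a readme (hypothetical layout)."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)

    # Tabular data in an open, machine-readable format (.csv).
    with open(folder / "measurements.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)

    # Descriptive metadata kept next to the data (field names are assumptions).
    with open(folder / "metadata.json", "w", encoding="utf-8") as f:
        json.dump(metadata, f, indent=2, ensure_ascii=False)

    # Readme documenting the file structure, the columns and the license.
    readme_lines = [
        metadata["title"],
        "",
        "Files:",
        "  measurements.csv  one row per sample",
        "  metadata.json     description of the dataset",
        "",
        "Columns in measurements.csv: " + ", ".join(rows[0].keys()),
        "License: " + metadata["license"],
        "Contact: " + metadata["contact"],
    ]
    (folder / "README.txt").write_text("\n".join(readme_lines) + "\n", encoding="utf-8")

    return sorted(p.name for p in folder.iterdir())


# Placeholder content for illustration only.
rows = [
    {"sample_id": "S01", "group": "control", "value_mmol_l": 4.2},
    {"sample_id": "S02", "group": "treated", "value_mmol_l": 5.1},
]
metadata = {
    "title": "Example dataset (placeholder)",
    "abstract": "Short description of what was measured and how.",
    "creators": ["Doe, J."],
    "created": "2019-01-15",
    "keywords": ["example", "open data"],
    "license": "CC-BY-4.0",
    "contact": "jane.doe@example.org",
}

with tempfile.TemporaryDirectory() as tmp:
    files = write_dataset(tmp, rows, metadata)
    # Read the table back to confirm that it is machine-readable.
    with open(Path(tmp) / "measurements.csv", newline="", encoding="utf-8") as f:
        reread = list(csv.DictReader(f))
```

Keeping the metadata in a separate, structured file next to the data means it can be read by both people and software, while the readme gives a human reader a quick overview of the file structure.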

When to share data?

Data should ideally be shared at the time of publishing the corresponding manuscript, which allows the dataset to be referenced and linked in the publication. However, there can be reasons why data should not be made open immediately, e.g. to protect intellectual property. In this case, publication of the data at a later date should be considered. To accommodate this, many repositories, including Zenodo, allow an embargo period. Data are inaccessible during this period, and become automatically open thereafter.

If the dataset continues to grow after it has first been shared, it can be updated. Many repositories, such as Zenodo, allow versioning, which guarantees that the different versions remain accessible, while the top-level DOI linked in a publication also remains valid. For datasets which are collected over an extended period of time and/or where timely publication is crucial (e.g. in epidemiology research), regular or even continuous online updating of the dataset (“dynamic data”) is ideal, but this is currently not supported by all-purpose repositories.

How should I reference a dataset in my publication?

  • Many journals already require a data availability statement and will detail where and how it is to be provided. If the journal does not specify this, it is recommended to include such a statement in the methods section.
  • If the data underlying the publication have been shared via a repository, the dataset should be cited similarly to citing a research paper (see how-to guide of the Digital Curation Centre).
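
As a rough illustration of what such a dataset citation can look like, the following Python sketch assembles a citation string from basic metadata. It assumes one commonly used pattern (creators, publication year, title, optional version, repository as publisher, DOI). The function name, the field names and the DOI shown are hypothetical placeholders, and the journal’s or repository’s own citation guidance takes precedence.

```python
def format_dataset_citation(creators, year, title, publisher, doi, version=None):
    """Build a dataset citation string (one common pattern, not a fixed standard)."""
    authors = "; ".join(creators)
    parts = [f"{authors} ({year})", title]
    if version:
        parts.append(f"Version {version}")
    parts.append(publisher)
    # The DOI is given as a resolvable link so that readers can reach the dataset.
    return ". ".join(parts) + f". https://doi.org/{doi}"


# Placeholder values for illustration only; the DOI below is not a real record.
citation = format_dataset_citation(
    creators=["Doe, J.", "Roe, R."],
    year=2019,
    title="Example imaging dataset",
    publisher="Zenodo",
    doi="10.5281/zenodo.0000000",
    version="1.0",
)
print(citation)
```

The resulting string can be placed in the reference list, while the data availability statement points to the same DOI.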

How can I find data for re-use?

If an established disciplinary repository exists, it can of course be searched directly. If the repository containing the data is not known or the search is not restricted to a single repository, search engines for datasets can be used. The most important ones include dataMED and Google Dataset Search, and often a search in Google or Google Scholar will also yield results. Commercial citation databases also allow you to search for datasets. As of 1/2019, these search engines are not yet as powerful as those for publication searches (see also QUEST blog entry “How to find open research data”).

How will I be recognized for sharing data?

This will depend on the way the data are published. For a dataset published in a repository, recognition is expected in the form of citations. In the case of publishing the dataset description in a data journal, the foremost recognition will typically be for the journal article, adding to citations of the dataset itself. It has also been shown that making data available increases the citation rate of corresponding “regular” journal articles. In addition, shared data are increasingly recognized as a valuable scientific output in their own right, and thus shared datasets or code can e.g. be listed in CVs or via ORCID. At the Charité, considering such scientific contributions (as part of the so-called QUEST-criteria) is now formally part of assessing applications for professorship positions and intramural funding schemes, including the BIH-Charité Clinician Scientist Program.

How will I be recognized for re-using data?

Publications based on secondary data analyses, which harness the potential of re-using open data, are still relatively uncommon in many areas of biomedical research. To support open data re-use and highlight its potential, the QUEST Center is giving away a 1000€ Open Data Reuse award for such publications. Please see here for details of the award call.

Where can I get support?

Dr. Evgeny Bobrov

Berliner Institut für Gesundheitsforschung (BIG) / Berlin Institute of Health (BIH)
QUEST – Quality | Ethics | Open Science | Translation
BIH Center for Transforming Biomedical Research
Open Data and Research Data Management Officer
evgeny.bobrov@bihealth.de
Tel. +49 (0)30 450 543 069

Where can I find more information?

Further information on open data within the wider context of research data management is provided by the following sources: