QUEST Blog: How to find open research data

Blog by Dr. Ulf Tölch

In an ideal Open Science world, we would all be able to find and reuse the data generated in previous studies. Yet the world looks very different today: the landscape of data repositories and search engines is fragmented. Datasets are stored and labeled in different ways. They can be found in the supplements of publications, as separate data publications, in community-specific repositories, or in general-purpose repositories like Zenodo, Figshare, or OSF.

All of these options can be an important part of the solution for making research data more reusable. Community-specific repositories in particular allow researchers from the same field, where similar datasets are routinely produced, to define field-specific standards that facilitate data interoperability. But the many different ways of sharing data also make it harder to find useful, openly available datasets.

One solution to this problem would be a meta search engine that searches across all those different data sources and combines the results. There is not yet a search engine broad enough to make dataset search as easy as publication search, but there are still some viable options:

Datamed.org:
Datamed searches 75 large biomedical repositories. It currently reports 2,336,000 indexed datasets and offers many ways to search them (e.g. by topic or disease, but also by author or organization). It also provides an API for programmatic access to the database.

Data Citation Index (DCI):
The DCI is integrated into the Web of Science and offers a broad variety of search options. It covers 350 repositories with 7 million records from various disciplines, which makes it the search engine with the largest number of records in this list. It can also show usage and citation statistics for the datasets. The downside, however, is that it is a commercial service, and you only have access if your institution pays for it.

Google Dataset Search:
Searches for datasets in “thousands of repositories in the web” (though I could not find a list of these repositories anywhere). This is not the usual Google full-text search, however: the search engine relies on the presence of a certain metadata format. The number of datasets it finds will thus depend on how widely this format is adopted in the future.
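
To give an idea of what such metadata can look like: as far as Google's public documentation goes, the format is the schema.org "Dataset" vocabulary, usually embedded in a dataset's landing page as JSON-LD. The following Python sketch builds a minimal record of that kind. The helper name build_dataset_jsonld and all example values (title, URL, creator) are made up for illustration and do not describe a real dataset, and the set of fields shown is only a small, assumed subset of what a repository might provide.

```python
import json


def build_dataset_jsonld(name, description, url, license_url, creator):
    """Return a minimal schema.org Dataset description as a JSON-LD string.

    Hypothetical helper for illustration; real repositories usually
    provide many more properties (identifiers, dates, distributions).
    """
    record = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "license": license_url,
        "creator": {"@type": "Person", "name": creator},
    }
    return json.dumps(record, indent=2)


# Made-up example values, not a real dataset.
example = build_dataset_jsonld(
    name="Example reaction time measurements",
    description="Trial-level reaction times from a hypothetical behavioural study.",
    url="https://example.org/datasets/reaction-times",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    creator="Jane Doe",
)
print(example)
```

A landing page would typically carry this JSON inside a script element of type application/ld+json, which is what a crawler looks for instead of reading the page text.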

Dataverse:
Not really a meta search engine, but a set of data repositories that are hosted locally at different institutions “to share, preserve, cite, explore, and analyze research data”. Currently, there are approx. 3,000 local dataverses with a total of approx. 80,000 searchable datasets (the majority of them from the social sciences).
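
To make the idea of a meta search engine mentioned earlier a little more concrete, here is a minimal Python sketch of its core step: taking result lists from several sources and combining them, so that a dataset found in more than one place appears only once. It is a sketch under simplifying assumptions, not a description of how any of the services above work. The names DatasetHit, normalize_doi and merge_results, the source labels, and the DOIs are all hypothetical, and it assumes that every hit carries a DOI that can serve as a shared identifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetHit:
    """One search result as returned by a single (hypothetical) source."""
    title: str
    doi: str
    source: str


def normalize_doi(doi):
    """Lower-case a DOI and strip common prefixes so duplicates compare equal."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def merge_results(*result_lists):
    """Combine hits from several sources, keeping one entry per DOI.

    Each merged entry records every source that reported the dataset.
    """
    merged = {}
    for hits in result_lists:
        for hit in hits:
            key = normalize_doi(hit.doi)
            entry = merged.setdefault(
                key, {"title": hit.title, "doi": key, "sources": []}
            )
            if hit.source not in entry["sources"]:
                entry["sources"].append(hit.source)
    return list(merged.values())


# Made-up results from two made-up sources.
from_a = [DatasetHit("Mouse imaging data", "10.1234/ABC", "repository_a")]
from_b = [
    DatasetHit("Mouse imaging data", "https://doi.org/10.1234/abc", "repository_b"),
    DatasetHit("Survey responses", "doi:10.1234/xyz", "repository_b"),
]

combined = merge_results(from_a, from_b)
for entry in combined:
    print(entry["doi"], entry["title"], entry["sources"])
```

In practice the hard part is exactly what this sketch assumes away: many datasets have no DOI, and titles and metadata fields differ between repositories, so matching records across sources takes considerably more work.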

This list is not exhaustive, and hopefully the options for dataset search will grow in the future. If you think a search engine is missing from this list, please feel free to write a comment.