Architecture of the Virtual Research Environment

How is the Virtual Research Environment (VRE) structured? Which other infrastructures does the platform interface with, and how does it enable researchers to access and use data securely? Here you will find a schematic representation of the VRE and an explanation of its key elements.

Key elements

Research Portal

The primary interface for researchers to access VRE functions and resources, including data capture tools, interactive dashboards and viewers, query tools and analysis workspaces. 

Data Gateway API

An application programming interface that enables the Research Portal to exchange data and metadata with VRE systems, and allows the VRE to be interoperable with other data platforms, data sources and systems.

Green Room

An environment in which hospital data, such as data derived from electronic health records (EHRs), picture archiving and communication systems (PACS), laboratories, and other sources, can be de-identified, transformed and prepared prior to being transferred to VRE systems and made available for research use.
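As an illustration only, the de-identification step described above might look like the following minimal sketch. The field names, the list of direct identifiers and the `deidentify` function are assumptions made for this example; they do not describe the actual Green Room pipelines.

```python
# Hypothetical set of direct identifiers; a real deployment would follow the
# applicable data protection policy rather than this fixed list.
DIRECT_IDENTIFIERS = {"name", "address", "phone", "insurance_number"}

# Fields that are not copied as-is because they are replaced or coarsened.
REPLACED_FIELDS = {"patient_id", "birth_date"}


def deidentify(record: dict, pseudonym: str) -> dict:
    """Return a de-identified copy of `record`; the input is not modified.

    Direct identifiers are dropped, the date of birth (ISO format assumed)
    is coarsened to a year, and the hospital patient ID is replaced by the
    supplied pseudonym.
    """
    cleaned = {
        key: value
        for key, value in record.items()
        if key not in DIRECT_IDENTIFIERS and key not in REPLACED_FIELDS
    }
    if "birth_date" in record:
        cleaned["birth_year"] = int(record["birth_date"][:4])
    cleaned["participant_id"] = pseudonym
    return cleaned
```

In this sketch the pseudonym is passed in from outside, which keeps the mapping between hospital IDs and research identifiers separate from the transformation itself.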

Data Lake

A zone within the VRE in which data of any type can be received, stored, catalogued, and ultimately ingested into platform databases. Standard data models and ontologies are applied to allow datasets to be aggregated and processed. Data can be further de-identified here for broader sharing. Quality assurance, quality control and preprocessing pipelines prepare the data for visualisation and analysis. 

Data Warehouse

A set of databases and services which transform diverse data into a unified, federated context. Federation is critical for harmonizing data and metadata so that information about participants and datasets can be queried, visualized, and analyzed across studies, data sources and modalities. The Metadata Repository and Knowledge Graph implement standardized and extensible schemas to represent metadata derived from datasets as well as annotations generated by researchers. 

Shared Services

Automated services that are required by VRE components. The Participant Registry includes systems for generating and storing unique pseudonymized identifiers for all research participants contributing data to the VRE. Identity and access management systems allow the Research Portal and other VRE components to assign or validate the identity and permissions of VRE users.
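To make the idea of a Participant Registry more concrete, here is a minimal in-memory sketch, assuming that pseudonyms are random and that the registry stores the mapping from a source system and local ID to the pseudonym. The class name, the `VRE-` prefix and the method names are hypothetical and are not taken from the VRE implementation.

```python
import secrets


class ParticipantRegistry:
    """Toy registry that issues one stable pseudonym per (source, local ID)."""

    def __init__(self) -> None:
        self._by_source_id = {}  # (source, local_id) -> pseudonym
        self._issued = set()     # every pseudonym handed out so far

    def get_or_create(self, source: str, local_id: str) -> str:
        """Return the existing pseudonym for this participant or issue a new one."""
        key = (source, local_id)
        if key not in self._by_source_id:
            self._by_source_id[key] = self._new_pseudonym()
        return self._by_source_id[key]

    def _new_pseudonym(self) -> str:
        # Random identifiers carry no information about the participant;
        # the loop guards against the (very unlikely) case of a collision.
        while True:
            candidate = "VRE-" + secrets.token_hex(8)
            if candidate not in self._issued:
                self._issued.add(candidate)
                return candidate
```

A production registry would persist the mapping in a protected database and restrict who may resolve a pseudonym back to its source identifier.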

Workspaces and Analytics Resources

Workspaces are flexible and interactive environments in which users can access, visualise and analyse their data with a range of analysis and visualisation tools. These are supported by underlying computing infrastructure as well as privacy-preserving linkage systems that allow de-identified datasets to be linked and compared.
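The text does not state which linkage method the VRE uses. One common approach, shown here purely as a sketch, derives a keyed hash (HMAC-SHA256) from normalised identifying attributes, so that two datasets can be joined on matching tokens without exchanging the attributes themselves. The function names and the choice of attributes are assumptions for this example.

```python
import hashlib
import hmac


def linkage_token(secret_key: bytes, given_name: str, family_name: str, birth_date: str) -> str:
    """Derive a linkage token from normalised identifying attributes.

    Without `secret_key` the token cannot be recomputed from the attributes.
    """
    normalised = "|".join(
        part.strip().lower() for part in (given_name, family_name, birth_date)
    )
    return hmac.new(secret_key, normalised.encode("utf-8"), hashlib.sha256).hexdigest()


def link(left: dict, right: dict) -> list:
    """Pair up records from two {token: record} mappings that share a token."""
    return [(left[token], right[token]) for token in sorted(left.keys() & right.keys())]
```

Exact-match tokens like these are sensitive to spelling variations in the source data; tolerant schemes exist but are beyond the scope of this sketch.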

Charité/BIH Infrastructure and Services

Various IT systems and services are integrated to provide networking, computing, storage and other infrastructure required by the VRE. This includes services that transfer data from existing research and clinical data sources (e.g. REDCap, PACS) into the Green Room; Hadoop and other systems deployed within the Health Data Platform; high-performance computing clusters for running preprocessing pipelines; and backup, recovery and data archival systems that ensure that data are kept safe and available at all times.

VRE Architecture explained in video

The VRE Portal allows research teams to discover and query data, access user-friendly dashboards and visualizations, directly process and analyze datasets within interactive workbenches, and manage their projects and permissions.

Data uploaded by researchers lands in the VRE Green Room, a restricted zone which provides pipelines and interfaces for pseudonymizing and preparing datasets before they can be processed further and shared. Once this has been done, approved data are copied from the Green Room into the Data Lake located in the VRE Core zone. The Data Lake includes services and pipelines for data cataloguing, quality control, curation, standardization and processing.
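The hand-over from the Green Room to the Data Lake can be pictured as a simple gate. The sketch below assumes that each dataset carries two flags, one for pseudonymization and one for approval; the `Dataset` class and the `copy_approved` function are hypothetical and only illustrate the rule that nothing leaves the Green Room until both conditions are met.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    name: str
    pseudonymized: bool = False
    approved: bool = False


def copy_approved(green_room: list, data_lake: list) -> list:
    """Copy datasets that are pseudonymized and approved into the Data Lake.

    Datasets are copied, not moved, so the Green Room list is left untouched.
    Returns the names of the datasets that were held back.
    """
    held_back = []
    for dataset in green_room:
        if dataset.pseudonymized and dataset.approved:
            if dataset not in data_lake:  # avoid duplicates on repeated runs
                data_lake.append(dataset)
        else:
            held_back.append(dataset.name)
    return held_back
```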

Datasets collected and processed across research projects and modalities, including data points from clinical, imaging and other data types such as genomics, can be aggregated into a centralized Data Warehouse. Metadata on these datasets are also captured centrally within the Knowledge Graph and the Metadata Repository. These systems support researchers in finding and extracting their data for the purposes of downstream analysis and visualization.

Using the VRE Analysis Workbench, researchers can work securely inside the platform to process and analyze their datasets. For example, image data processing pipelines can be developed and executed in a high-performance computing environment provided by the BIH HPC cluster, which is integrated with the VRE. Simulations and analyses based on machine learning or other computationally demanding methods can be conducted within customized virtual machines or containerized environments. The results of this processing and analysis can flow back into the Data Lake to ensure that full provenance and data lineage are maintained across the data lifecycle.
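One way to think about data lineage is that every derived dataset records which datasets it was produced from and by which activity. The following sketch assumes such a record structure; `LineageRecord` and `trace` are illustrative names and do not reflect how the VRE stores provenance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageRecord:
    dataset_id: str
    derived_from: tuple = ()   # IDs of the input datasets
    activity: str = "ingest"   # processing step that produced this dataset


def trace(dataset_id: str, records: dict) -> list:
    """Return the IDs of all ancestors of `dataset_id`, nearest first."""
    seen, order = set(), []
    queue = list(records[dataset_id].derived_from)
    while queue:
        current = queue.pop(0)  # breadth-first walk back towards the sources
        if current in seen:
            continue
        seen.add(current)
        order.append(current)
        if current in records:
            queue.extend(records[current].derived_from)
    return order
```

With records of this kind, a result produced in a workspace can be followed back through each processing step to the datasets originally ingested.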

A number of essential Platform Services are also deployed to support the functionality of the Portal, Green Room and VRE Core and to ensure that the platform operates securely and efficiently. These include an API Gateway, Authentication and Authorization services, Pipeline Orchestration and Messaging. Overall, the VRE has been designed and implemented using a microservices architecture, which enables the platform to be reliable, resilient, scalable, and highly adaptable to evolving research requirements and to changes to the underlying IT infrastructure.
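How an API Gateway combines authentication, authorization and routing to backend microservices can be sketched in a few lines. Everything in this example is assumed for illustration: the route table, the service names, the token store and the `route` function are not the VRE's actual configuration.

```python
# Hypothetical route table: path prefix -> backend microservice.
ROUTES = {
    "/datasets": "metadata-service",
    "/pipelines": "pipeline-orchestrator",
}

# Hypothetical token store: token -> identity with the prefixes it may call.
TOKENS = {
    "token-alice": {"user": "alice", "scopes": {"/datasets"}},
}


def route(path: str, token: str) -> tuple:
    """Return (HTTP status, target service) for a request.

    401: unknown token, 403: caller lacks permission, 404: no such route.
    """
    identity = TOKENS.get(token)
    if identity is None:
        return 401, None
    for prefix, service in ROUTES.items():
        if path == prefix or path.startswith(prefix + "/"):
            if prefix not in identity["scopes"]:
                return 403, None
            return 200, service
    return 404, None
```

Placing these checks in one gateway means individual services do not each have to re-implement them, which is one reason the pattern is common in microservices architectures.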

The VRE will support a variety of research data needs, ultimately leading to an ecosystem of researchers, data scientists, and developers working together to make medical research data secure, findable and usable.