Overview

One of the stated goals of the NIH Big Data to Knowledge (BD2K) initiative is to harvest the wealth of information contained in Big Data to advance our knowledge of human health and disease. As part of this initiative the Harvard University Medical School (HMS) Department of Biomedical Informatics Patient-Centered Informatics Common: Standard Unification of Research Elements (PIC-SURE) is developing an open-source infrastructure that will foster the incorporation of multiple heterogeneous patient level clinical, omic, and environmental datasets. This system embraces the idea of decentralized repositories (resources) of varying types, and protocols. It provides a single simple secure communication interface that can perform queries, joins, and computations across different resources. To implement this we created the BD2K PIC-SURE RESTfull API. Short acronym is IRCT (Inter Resource Communication Tool) which provides a resource agnostic service through which multiple resources of varying types, and protocols can be accessed. Users communicate with this service using a series of Representative State Transfer (RESTful) calls.

The easiest way to understand what the IRCT is by explaining what it isn’t. First and foremost it is not an attempt at a universal API. It is not going to be the one API, or protocol that all biomedical application should implement. It does not require existing resource to change their protocol. It does not define an ontology, or prefer one type of ontology over another. And it does not restrict the where, or how data is stored. The core idea behind IRCT is that no API, protocol, ontology, or application is going to be a perfect fit for all biomedical research. However being able to combine data across different resources will provide a powerful tool for researches. The IRCT accomplishes this by allows different resources to be defined – by initial configuration - what they are, what they have, and what they can do. It can support existing resources without requiring them to make any changes, and allows new resources to be quickly integrated. This is accomplished by creating a ‘Resource-Driven API’.

All actions that can be performed by resources from the IRCT are abstracted into a set of definitions. These definitions provide basic information such as what predicates are supported, and can be executed on the resource. Since the IRCT doesn’t restrict what actions, and commands can be performed individual resources can define any number of predicate operators, data types, parameters, and. This model is flexible enough to support several different resource types, but still rigid enough to allow users to quickly create and execute different actions with the IRCT.

After a user executes an action in the IRCT it needs to be passed to the resource(s) that will execute it. It accomplished this by having a set of resource interfaces that translate that action into the protocol of the resource. Each resource interface supports a specific protocol, and can connect to multiple resources of that type. Results from resources that are returned from those actions are also converted into an IRCT result, which can be returned in many different formats. The result set can also be used to feed into other actions, or returned to the end user in a variety of different formats.

System Architecture

Our approach consisted of building four layers: a communication layer that implements the RESTful service for interacting with an end user; a process engine that manages the requests for queries, and process before returning it to the requester; a collection of extensions that can be used to add or remove additional functionality; and a set of resource interfaces that provides a means of communicating between the IRCT and different services such as i2b2/tranSMART, and ElasticSearch.

The communication layer implements a RESTful service that allows for a simple, and secure creation of actions and retrieving results. All communication with the server is handled through a secure connection, and only authorized users are allowed to use the system. RESTful requests from the user are used to create a series of commands that create queries, and processes that are then passed to the process engine. When a query is complete, a user can then request the results that are returned in the a desired format.

The core consists of two major components: an execution handler, and a results handler. The execution handler initiates, and manages processes that are requested by the communication layer. As part of this it ensure that all processes are run on the correct resources through the resource interface and return successful results. The results handler takes the results from the resource interfaces and combines them with other results, and processes before passing them back to the user through the communication layer to the requester.

The resource interface communicates with the different resources in their supported protocol. It accomplished this by converting the query, or process into the native calls for the given resource. When the given process or query is complete it converts the results from the native format to a different formats such as CSV, XML, JSON, and XSLT. Different resources interfaces can be easily created and added to the IRCT.