
How to add a new data source to the Open Targets Platform - part 1

I have recently been working on a project for Genomics England that involved adding a new data source to the Open Targets Platform. This two-part series of blog posts describes my experience, with the intention that it will serve as a guide for anyone else who needs to do this in future.

A bit of terminology first.

The Open Targets Platform has a number of different data sources, each of which is of a particular data type. For example, the data source "GWAS Catalog" is of data type "Genetic associations".

In this post, I will only cover adding a new data source, not a new data type. The Platform has such a wide range of data types already that your new data source will most likely fall into one of the existing data types. This also means that you can use one of the existing data sources in that data type as a template for the data source you would like to add.

You can see a complete list of data sources and data types in the Open Targets Platform documentation.

The prerequisites

There are a few things you will need to know before you start adding your new data source.

Open Targets Platform Pipeline

The most important is to be familiar with the Open Targets Platform pipeline, which is open source, with extensive documentation in the Open Targets GitHub. Ideally you should be able to run the pipeline from end to end before attempting any changes.

Python and Unix-like command-lines

The pipeline is written in Python. If you are just adding a new data source you probably won't need to write much code, but familiarity with the Python ecosystem (package management, virtualenvs, etc.) will help. You will also need to be comfortable running things on the command line in Linux or macOS.

JSON and schemas

New evidence is pulled into the pipeline as JSON. Open Targets provide the JSON schema, which defines how the data should be structured and places constraints on some of the fields, as well as a tool for validating any custom evidence you produce.

Docker

Open Targets use Docker containers extensively: for running the pipeline, for serving the data over the REST API, and for running the web application.

Elasticsearch

Elasticsearch is core to the pipeline itself, as well as being the ultimate repository that the REST API and web application depend upon.

A good understanding of your data

Last but not least, the main thing you will need is a good understanding of the data that you want to add to the Open Targets Platform. You will need to make lots of decisions about how your data fields should map to their equivalents in the Open Targets Platform, so the more you know about your data source the better.

Hardware requirements

You will need access to a compute environment where you have the ability to install software via sudo. I suggest running Elasticsearch on a separate instance. Running the pipeline currently requires a machine with about 16 cores and 250 GB of memory, although the Open Targets team are working to reduce this requirement.

Steps

The high-level overview of the steps you will need to go through is:

  1. Prepare your data
  2. Convert your data to JSON and ensure it validates
  3. Modify the pipeline to include your new data source
  4. Run the pipeline and check that the results are what you expect
  5. Configure the API and web application to include your new data source
  6. Bask in the glory of your new data source being included in the excellent Open Targets Platform

I will go into more detail on each of these steps below.

1. Prepare your data

This step will vary depending on your data source. You will need to extract the data into a format that you can then convert to JSON, according to the Open Targets Platform JSON schema. In my case, I exported the data from an instance of LabKey to a TSV file.

2. Convert your data to JSON

Once you have your data, you will need to transform it into JSON that is valid according to the Open Targets JSON schema. The schemas differ depending on the data type. The Genomics England data I was working with was a type of genetic association data, so I used the genetics.json schema from the set of JSON schemas that Open Targets make available.

It took me a while to get my head around the structure of the schema, since each schema file includes other schema files. It was very useful to have an existing data source to use as a template; I picked EVA, the European Variation Archive, since it was the closest to my data.

I wrote a Python script that built up a Python data structure of the correct form, then used the built-in JSON library to dump it to a file for validation. This took a bit of trial and error, which is always fun!
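The skeleton of such a script might look like the following. This is a sketch, not my actual script: the TSV column names, identifier prefixes, and output field names are illustrative placeholders, and the real schema requires many more fields.

```python
import csv
import json

def row_to_evidence(row):
    """Map one TSV row to an evidence dict.

    Field names and identifier prefixes here are hypothetical; consult the
    relevant Open Targets JSON schema for the real required structure.
    """
    return {
        "target": {"id": "http://identifiers.org/ensembl/" + row["gene_id"]},
        "variant": {"id": "http://identifiers.org/dbsnp/" + row["snp_id"]},
        "disease": {"id": "http://www.ebi.ac.uk/efo/" + row["efo_id"]},
    }

def tsv_to_json_lines(tsv_path, json_path):
    """Convert a TSV export to newline-delimited JSON, one object per line."""
    with open(tsv_path) as tsv, open(json_path, "w") as out:
        for row in csv.DictReader(tsv, delimiter="\t"):
            # One compact JSON object per line; no pretty-printing.
            out.write(json.dumps(row_to_evidence(row)) + "\n")
```

Writing one compact object per line from the start saves pain later, since that is the format the validator and pipeline expect.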

2.1 JSON structure

The structure of your data will differ depending on the data type. For the genetic association data I was dealing with, each piece of evidence had the following overall structure:

  • Target - the target, in my case a gene specified by an Ensembl gene stable ID
  • Variant - the variant that is associated, in my case a SNP defined by a dbSNP ID
  • Disease - the disease, defined by an EFO ID (see phenotype mapping below)
  • Evidence - the evidence that links the gene to the variant and the variant to the disease, as well as the score
  • Unique association fields - the fields that make this individual piece of evidence unique

You will need to both generate the correct structure and decide which of your data fields best fits each field in the JSON. The schema can help here, as for some fields it gives an enumeration of the possible field values.
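Putting the pieces above together, a single piece of genetic association evidence has roughly this shape. Again, this is a simplified sketch; the field names are placeholders and the real genetics.json schema mandates the exact structure.

```python
# Illustrative shape only; consult genetics.json for the exact required fields.
evidence = {
    "target": {"id": "http://identifiers.org/ensembl/ENSG00000157764"},
    "variant": {"id": "http://identifiers.org/dbsnp/rs113488022"},
    "disease": {"id": "http://www.ebi.ac.uk/efo/EFO_0000616"},
    "evidence": {
        "gene2variant": {},     # how the gene links to the variant (details elided)
        "variant2disease": {},  # how the variant links to the disease, incl. the score
    },
    "unique_association_fields": {
        "gene": "ENSG00000157764",
        "variant": "rs113488022",
        "disease": "EFO_0000616",
    },
}
```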

2.2 Validation

Once you have a JSON file with your data, it is time to try validating it. I found that validating just one or a handful of evidence items at a time worked best; otherwise it is easy to get swamped by validation errors.
Open Targets provide a tool which you can use to validate your JSON. This uses the same validation code as the pipeline, so if the tool says your file is valid, you can be reasonably confident it will run through the pipeline. Note that the validator expects a single JSON object per line, so pretty-printed JSON structures won't be valid and will cause a flood of validation errors.
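Before reaching for the official validator, a stdlib-only sanity check can catch the cheap mistakes (pretty-printed objects, missing top-level sections) quickly. REQUIRED_KEYS below is an illustrative subset I chose, not the schema's actual required set, and this is no substitute for the real validation tool.

```python
import json

# Illustrative subset of top-level keys; the schema defines the real requirements.
REQUIRED_KEYS = {"target", "disease", "unique_association_fields"}

def sanity_check(json_path):
    """Return (line_number, message) pairs for lines that are not valid
    single-line JSON objects, or that are missing expected top-level keys."""
    problems = []
    with open(json_path) as f:
        for lineno, line in enumerate(f, start=1):
            try:
                obj = json.loads(line)
            except json.JSONDecodeError as e:
                problems.append((lineno, "not valid JSON: " + e.msg))
                continue
            missing = REQUIRED_KEYS - obj.keys()
            if missing:
                problems.append((lineno, "missing keys: " + str(sorted(missing))))
    return problems
```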

2.3 Scoring

Target to disease association scores are fundamental to the way Open Targets generates associations for its Platform. You will need to come up with a scoring mechanism that is appropriate for your data. As with data types, there is already a library of score types that Open Targets supports: p-values, rank, probability, and summed total. I chose probability since I was able to generate scores from Genomics England in the range 0 to 1.
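As a sketch of what such a mechanism might look like: a simple lookup that maps a tier label to a probability-style score in [0, 1]. The tier names and the specific values below are hypothetical; choosing the actual mapping is a judgement call you make about your own data.

```python
def tier_to_score(tier):
    """Hypothetical mapping from a tier label to a probability-style score
    in the range [0, 1]; the real mapping depends on your data source."""
    mapping = {"TIER1": 1.0, "TIER2": 0.5, "TIER3": 0.1}
    return mapping.get(tier.upper(), 0.0)
```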

2.4 Phenotype mapping

In many cases you will have phenotypes (diseases or otherwise) which are either expressed in text form, or denoted by an identifier from a particular ontology. Open Targets uses the Experimental Factor Ontology, so you'll need to be able to map your phenotypes to EFO IDs. Ontology mapping is notoriously difficult; fortunately Open Targets provide the OnToma tool which does a lot of the heavy lifting for you. I used it to create a list of textual phenotype to EFO mappings, which I cached and used in the Python script I referred to above.
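The caching side of this is straightforward. The two-column TSV format below is my own convention for storing the results of an earlier OnToma run, not something OnToma itself produces:

```python
import csv

def load_efo_cache(tsv_path):
    """Load a cached phenotype -> EFO ID mapping from a two-column TSV
    (columns 'phenotype' and 'efo_id'; this file format is my own convention)."""
    with open(tsv_path) as f:
        return {row["phenotype"].lower(): row["efo_id"]
                for row in csv.DictReader(f, delimiter="\t")}

def phenotype_to_efo(cache, phenotype):
    """Case-insensitive lookup; None means the term needs manual curation."""
    return cache.get(phenotype.lower())
```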

2.5 Unique association fields

The unique_association_fields section of the schema also warrants some explanation. It is used to specify a set of fields which, when their values are taken together, are enough to uniquely identify this piece of evidence. For the technically minded, the fields are hashed together to produce a digest that is used as part of Open Targets' duplicate detection.
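The idea can be illustrated like this. The pipeline's own hashing may differ in detail; this sketch just shows how identical evidence collapses to the same digest regardless of key order, while any change in a value yields a different one:

```python
import hashlib
import json

def association_digest(unique_association_fields):
    """Hash the unique association fields into a stable digest.

    Serialising with sort_keys=True makes the digest independent of key
    order, so two identical pieces of evidence always collide.
    """
    canonical = json.dumps(unique_association_fields, sort_keys=True)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()
```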

2.6 Include data to build links back to original source

You will probably want to include fields which are used to link individual pieces of evidence back to their source in your organisation, so be sure to include them in your JSON. There is a mechanism for using these fields to construct URLs which point to the appropriate original source, and the links are displayed alongside the data in the associations table.

3. Modify pipeline

OK, so you have your JSON, it contains all the data you need, and it validates. The next step is to modify the pipeline to include your new data source. I'll use my data source, genomics_england_tiering, as an example. Remember that it is of type genetic_association.

Note: I am working with the Open Targets team to come up with a way to make it easier to add new data sources without modifying the code. Watch this space!

Assuming you have checked out the data_pipeline repository from GitHub, you only have to change one file, namely mrtarget/Settings.py. If you look at this file, you'll see a section which defines a Python dict called DATASOURCE_TO_DATATYPE_MAPPING. Add your data source to the dictionary, with the key being the data source name and the value being the data type; for example:

DATASOURCE_TO_DATATYPE_MAPPING['genomics_england_tiering'] = 'genetic_association'

Be consistent about using the same data source name for the changes you make.

If you are using version 18.10 of the pipeline or earlier, there is one more step: add your new data source to the end of the GLOBAL_STATS_SOURCES_TO_INCLUDE list. In my case, this ended up looking like:

GLOBAL_STATS_SOURCES_TO_INCLUDE = ['expression_atlas', 'phenodigm',
                                   'chembl', 'europepmc',
                                   'reactome', 'slapenrich',
                                   'intogen', 'eva_somatic',
                                   'cancer_gene_census', 'uniprot_somatic',
                                   'eva', 'gwas_catalog', 'uniprot',
                                   'uniprot_literature', 'gene2phenotype',
                                   'phewas_catalog', 'genomics_england',
                                   'genomics_england_tiering']

And that should be it!

To be continued

In the second part of this blog series, I will be covering the remaining steps of this process. Stay tuned to learn how to:

  1. Run the pipeline and check that the results are what you expect
  2. Configure the API and web application to include your new data source
  3. Bask in the glory of your new data source being included in the excellent Open Targets Platform

I will also suggest ways of contributing back to Open Targets. In the meantime, if you have any questions, feel free to email me - contact details are in my bio below.

Glenn Proctor

Glenn is an independent consultant. He worked on the Ensembl project, then led software development at Eagle Genomics. He helps clients with product advice, cloud strategy & software implementations.
