How to access Open Targets with R

You can access the Open Targets data using the web application or with our public REST API, for which we've posted the Target Validation API Tutorial: Getting Started.

We now describe ropentargets, an R package that implements a client for extracting data from the Target Validation platform.

Installing the R client

ropentargets is available from our Open Targets GitHub repository. You can install it using the R command line or RStudio. You should have R version 3.2.0 or above. You will also need the devtools package.

install.packages("devtools", dependencies = TRUE)  
library(devtools)  
install_github("cttv/ropentargets")  
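Before installing, it can be worth confirming that the prerequisites above are met. The following is a minimal sketch using base R only; the variable names are illustrative and not part of ropentargets.

```r
# Check the R version requirement stated above (3.2.0 or later).
minVersion <- "3.2.0"
rVersionOK <- getRversion() >= minVersion
if (!rVersionOK) {
  stop("ropentargets needs R >= ", minVersion,
       "; found ", as.character(getRversion()))
}

# requireNamespace() returns FALSE instead of failing when a package is absent.
hasDevtools <- requireNamespace("devtools", quietly = TRUE)
if (!hasDevtools) {
  message("devtools is not installed; run install.packages(\"devtools\") first")
}
```

If both checks pass silently, the install_github call should have what it needs.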

The source code

We've written ropentargets using R reference classes. Each class is defined in its own file in a directory named R. There is also a directory named scripts that contains R script examples for each of the classes we discuss below.

If you want to inspect the source, clone it from our GitHub repository (https://github.com/CTTV/ropentargets), then run the tests from the package directory with devtools::test(). If all the tests pass, you are ready to start using ropentargets.

Getting summary counts

After importing ropentargets, instantiate the class DataStats and invoke both getRESTAPIVersion and getSummaryCountsAsDataFrame methods to retrieve:

  1. The API version
  2. A data frame with summary counts

Here is the code for these tasks:

library(ropentargets)  
dataStatsObj <- ropentargets::DataStats$new()  
apiVersion <- dataStatsObj$getRESTAPIVersion()  
summaryCountsAsDataFrame <- dataStatsObj$getSummaryCountsAsDataFrame()  

The data frame summaryCountsAsDataFrame will look like the following in RStudio:

This data frame provides a convenient overview of what is available in our platform.

Finding out more about a gene

Use the Gene class to discover more about your favourite gene. If you know the Ensembl gene ID for your gene of interest, you can retrieve the data frame as follows:

library(ropentargets)  
ensemblGeneID <- 'ENSG00000167207' # "NOD2"  
geneObjNoName <- ropentargets::Gene$new()  
cutoffScore <- 0.2  
diseasesEvidenceForGeneAsDataFrame <- geneObjNoName$getEvidenceDiseasesForGeneAsDataFrame(ensemblGeneID, cutoffScore)  

If you don't know the Ensembl gene ID, you can create an instance of Gene, passing a gene symbol or name as an argument to new. The underlying Elasticsearch engine is then queried and returns a list of matches. For an official gene symbol or name, the first element in the returned list can be queried to give the Ensembl gene ID. This can then be used to generate the data frames as shown above.

Let's start with a gene name and use the search results to match them to an Ensembl gene ID:

library(ropentargets)  
geneName <- 'NOD2'  
geneObj <- ropentargets::Gene$new(geneName)  
# See what the search engine returned by printing the list returned by the following method call.
geneObj$getGeneIDNameMap()  
# Get the Ensembl gene ID for the first match in the list.
ensemblGeneID <- geneObj$getFirstEnsemblGeneID()  
# Get summary of information for this gene.
firstGeneSummary <- geneObj$getFirstGeneSummaryList()  
# The list 'firstGeneSummary' can be inspected in RStudio or printed.

Once you've got the Ensembl gene ID, you can call methods that return data frames with evidence and association data linking the gene to diseases.

How to get the evidence:

# Set a cut-off score (0.2 is a good starting point)
cutoffScore <- 0.2  
# Generate the data frame
evidenceDiseasesForGene <- geneObj$getEvidenceDiseasesForGeneAsDataFrame(ensemblGeneID, cutoffScore)  

Your output would look like this:

where the columns show:

  1. Gene symbol
  2. Ensembl gene ID
  3. Data source name
  4. Disease name as given in the Experimental Factor Ontology (EFO) label
  5. Disease ID
  6. Association score (https://www.targetvalidation.org/scoring)
  7. Data type, which can have one or more data sources
  8. Unique 32 character string assigned to each evidence loaded into Open Targets

Note: we will use the character strings in the column named id later to retrieve the full evidence string JSONs.
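Since the id values are what you pass on later, it can help to select them by score explicitly rather than rely on row order. The sketch below uses a toy data frame in place of the real evidence data frame: only the id column name comes from the description above, while the score column name and all the values are assumptions made for illustration.

```r
# Toy stand-in for an evidence data frame (hypothetical values;
# the 'score' column name is an assumption, 'id' is from the text).
evidence <- data.frame(
  id = c("aaa", "bbb", "ccc", "ddd"),
  score = c(0.35, 0.90, 0.62, 0.21),
  stringsAsFactors = FALSE
)

# Sort by descending score, then keep the ids of the top two rows.
evidenceSorted <- evidence[order(evidence$score, decreasing = TRUE), ]
topIDs <- head(evidenceSorted$id, 2)
print(topIDs)
```

With a real data frame you would substitute the actual score column name, which you can look up with names().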

How to get the associations:

You can find the disease associations for the gene of interest with the method getAssociationDiseasesForGeneAsDataFrame, which will return a data frame. We compute associations directly from the evidence available in the platform, and infer them from disease relationships in the EFO. We use a flag in the column named Direct to indicate if we computed the association directly from evidence ('true') or through inference ('false').

diseaseAssocsForGene <- geneObj$getAssociationDiseasesForGeneAsDataFrame(ensemblGeneID, cutoffScore)

In this example, we use the same Gene instance, cut-off and Ensembl gene ID as for the evidence example. The data frame will look like the following:

The gene-disease association string in column 1 is called id. This id is different from the evidence id that we described before. Columns 2 - 8 contain the association scores for each of the data types. Column 9 gives the combined association score. Columns 10 - 13 give the target (gene) and disease symbols and IDs. Column 14 indicates whether the association is direct or indirect. The gene ID - disease ID combination given in column 1 as id will be used in a later example to retrieve an Association object.
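To work with direct associations only, you can filter on the Direct flag described above. This is a minimal sketch on a toy data frame: the Direct column name and its 'true'/'false' values come from the text, whereas the overall column name, the id separator and the values are assumptions.

```r
# Toy stand-in for an association data frame (hypothetical values;
# 'overall' and the '-' separator in 'id' are assumptions).
assocs <- data.frame(
  id = c("ENSG00000167207-EFO_0000384",
         "ENSG00000167207-EFO_0003767",
         "ENSG00000167207-EFO_0000540"),
  overall = c(1.0, 0.8, 0.4),
  Direct = c("true", "false", "true"),
  stringsAsFactors = FALSE
)

# Handle the flag whether it is stored as text or as a logical.
isDirect <- tolower(as.character(assocs$Direct)) == "true"
directOnly <- assocs[isDirect, ]

# Rank the direct associations by descending score.
directOnly <- directOnly[order(directOnly$overall, decreasing = TRUE), ]
print(directOnly)
```

Comparing via as.character() and tolower() is a defensive choice, so the filter works whether the flag arrives as 'true', 'TRUE' or a logical TRUE.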

Finding out more about a disease

Use the Disease class to find more about a disease. This will look familiar now that you've used the Gene class. If you know the disease EFO ID, you can create the instance without an argument to new. You can create the data frames with evidence and association data as follows:

library(ropentargets)  
efoID <- 'EFO_0000384' # "Crohn's disease"  
diseaseObjNoName <- ropentargets::Disease$new()  
cutoffScore <- 0.2  
genesEvidenceForDiseaseAsDataFrame <- diseaseObjNoName$getEvidenceGenesForDiseaseAsDataFrame(efoID, cutoffScore)  

If you don't know the disease EFO ID, you can use methods for checking what has been returned by the search engine for the disease name argument. Once you have the disease ID, you can return data frames containing evidence or associations.

Let's take Crohn's disease as an example of a disease of interest, searching with the term 'Crohns disease':

How to get the evidence:

library(ropentargets)  
diseaseName <- 'Crohns disease'  
diseaseObj <- ropentargets::Disease$new(diseaseName)  
# Inspect what the search engine has returned for the input disease term
diseaseObj$getDiseaseIDNameMap()  
# Print summary information for the first disease match returned
diseaseObj$getFirstDiseaseSummaryList()  
# Store the EFO ID in a variable
efoID <- diseaseObj$getFirstDiseaseEFOID()  
cutoffScore <- 0.2  
# Create the evidence data frame
evidenceGenesForDisease <- diseaseObj$getEvidenceGenesForDiseaseAsDataFrame(efoID, cutoffScore)  

These are the first 10 rows of the resulting data frame:

The columns are the same as we've listed above.

How to get the associations:

associationGeneForDiseaseDirect <- diseaseObj$getAssociationGenesForDiseaseAsDataFrame(efoID, cutoffScore, 'true')  

Exporting a data frame to Excel

If you want to view the data in Excel, you can install and use the xlsx package. You can export all the disease associations for a gene to an Excel spreadsheet with the script below:

library(ropentargets)  
library(xlsx)

ensemblGeneID <- 'ENSG00000171105'  
geneObjNoName <- ropentargets::Gene$new()  
cutoffScore <- 0.2  
associationDiseaseForGene <- geneObjNoName$getAssociationDiseasesForGeneAsDataFrame(ensemblGeneID, cutoffScore)  
xlFile <- 'ENSG00000171105_DiseaseAssocs.xlsx'  
# File will be written to the directory returned by the R function "getwd()".
write.xlsx(associationDiseaseForGene, xlFile, sheetName="GeneAssocs")
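If the xlsx package is not an option (it depends on Java), a CSV file opens in Excel as well. The sketch below is an alternative using base R only, with a toy data frame standing in for the association data frame; the column names and values are hypothetical.

```r
# Toy stand-in for the association data frame (hypothetical values).
associationToy <- data.frame(
  id = c("gene1-disease1", "gene1-disease2"),
  overall = c(0.9, 0.3),
  stringsAsFactors = FALSE
)

# Write to a temporary file; use a path of your choice in practice.
csvFile <- tempfile(fileext = ".csv")
write.csv(associationToy, csvFile, row.names = FALSE)

# Read the file back to confirm that the export round-trips.
roundTrip <- read.csv(csvFile, stringsAsFactors = FALSE)
print(roundTrip)
```

Setting row.names = FALSE keeps R's row numbers out of the first spreadsheet column.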

Retrieve individual evidence strings as JSON

We use a unique 32-character string to identify JSON evidence strings. In the gene and disease evidence data frames, they are named id. We can use the id to generate full JSON evidence strings and write them to a file:

library(ropentargets)  
# Create a Gene object
geneName <- 'APOE'  
geneObj <- ropentargets::Gene$new(geneName)  
# Check that the search engine has returned expected result
geneObj$getFirstGeneSummaryList()  
# Assign the Ensembl gene ID
ensemblGeneID <- geneObj$getFirstEnsemblGeneID()  
# Set a higher cutoff score than before
cutoffScore <- 0.5  
# Get the evidence data frame
evidenceAPOE <- geneObj$getEvidenceDiseasesForGeneAsDataFrame(ensemblGeneID, cutoffScore)  
# Extract the IDs for the top 10 scores and assign them to a vector
idsForTop10 <- evidenceAPOE[1:10,]$id  
# Create the Evidence object
evidenceObj <- ropentargets::Evidence$new()  
# Set the output file path
outputFile <- '~/apoe_top10_evidence.txt'  
# Write the output to a file
result <- evidenceObj$writeEvStrAsJsonToFile(idsForTop10, outputFile)  
# Check that the return value is TRUE (indicates success)
print(result)  

This script generates a file with two columns. The first column contains the 32-character ID for the evidence string, whereas the second contains the entire JSON evidence string in a single line.
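You can also read the file back into R instead of using shell tools. The sketch below assumes the two columns are tab-separated (which the cut -f2 example implies) and writes a small toy file so that it runs on its own; the ids and JSON content are made up. Parsing uses the jsonlite package, which is our choice here and not a ropentargets dependency that we know of, so that step is skipped if jsonlite is not installed.

```r
# Build a toy two-column, tab-separated file (hypothetical ids and JSON).
toyFile <- tempfile(fileext = ".txt")
writeLines(c(
  paste("id_one", '{"target": {"id": "ENSG00000130203"}}', sep = "\t"),
  paste("id_two", '{"target": {"id": "ENSG00000130203"}}', sep = "\t")
), toyFile)

# quote = "" stops R from treating the JSON double quotes as field quoting.
evidenceRows <- read.delim(toyFile, header = FALSE, sep = "\t", quote = "",
                           col.names = c("id", "json"),
                           stringsAsFactors = FALSE)

# Parse each JSON string into a nested R list, if jsonlite is available.
if (requireNamespace("jsonlite", quietly = TRUE)) {
  parsed <- lapply(evidenceRows$json, jsonlite::fromJSON,
                   simplifyVector = FALSE)
  targetIDs <- vapply(parsed, function(x) x$target$id, character(1))
  print(targetIDs)
}
```

For the real output file, replace toyFile with the path you passed to writeEvStrAsJsonToFile.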

You can extract and format the JSONs using Unix tools and jq, for example:

# jq or a similar tool must be installed.
head -1 ~/apoe_top10_evidence.txt | cut -f2 | jq '.' > apoe_evidence_example_formatted.json

jq is a very powerful command line tool for manipulating JSON output and can be used to format and extract sub-sections of the input JSON. If you want to see only part of the JSON document, for example the target section, you can run the Unix command below:

head -1 ~/apoe_top10_evidence.txt | cut -f2 | jq '.target'  

For the JSON document, the following output is obtained:

{
  "target_type": "protein_evidence",
  "gene_info": {
    "symbol": "APOE",
    "name": "apolipoprotein E",
    "geneid": "ENSG00000130203"
  },
  "id": "ENSG00000130203",
  "activity": "up_or_down"
}

The association object

If you know both the Ensembl gene and disease EFO IDs, you can generate the association object (represented as an R list). This object contains a summary of the information present in Open Targets that is used to associate the gene-disease pair. Here is an example of how to use it:

library(ropentargets)  
ensemblGeneID <- 'ENSG00000073756' # PTGS2  
efoID <- 'EFO_0003767' # inflammatory bowel disease  
assocObj <- ropentargets::Association$new(ensemblGeneID, efoID)  
geneDiseaseAssoc <- assocObj$getAssociationObjectAsList()  
# Extract elements of the 'geneDiseaseAssoc' list
assocTargetInfo <- geneDiseaseAssoc$data[[1]]$target  
assocDiseaseInfo <- geneDiseaseAssoc$data[[1]]$disease  
evidenceCountForAssocTotal <- geneDiseaseAssoc$data[[1]]$evidence_count$total  
dataTypeMaxScores <- assocObj$getDataTypesMaxScores()  

The geneDiseaseAssoc list above is rather large and complex. Here we show how to navigate it using standard R list sub-setting. If you prefer, you can convert the list to JSON (for example with the toJSON function from the rjson package), write it to a file and format it using a tool such as jq.
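When a nested list is too large to read in full, base R's str and names functions give a quick outline before you subset. The sketch below uses a toy list that mirrors only the paths accessed in the script above (data, target, disease, evidence_count, total); the values, including the evidence count, are made up.

```r
# Toy stand-in for the association list (hypothetical values).
geneDiseaseAssocToy <- list(
  data = list(
    list(
      target = list(id = "ENSG00000073756"),
      disease = list(id = "EFO_0003767"),
      evidence_count = list(total = 42)
    )
  )
)

# Outline the structure without printing every leaf.
str(geneDiseaseAssocToy, max.level = 3)

# List the sections available in the first data element.
sectionNames <- names(geneDiseaseAssocToy$data[[1]])
print(sectionNames)

# Drill down with $ for named elements and [[ ]] for positions.
totalCount <- geneDiseaseAssocToy$data[[1]]$evidence_count$total
```

Raising or lowering max.level controls how deep str descends, which is handy for objects with many levels.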

What to do if you run into problems?

We have installed and run ropentargets on Mac OS X, Linux (Ubuntu) and Windows, but if you come across any issues or have suggestions, please email us. You can also contact us via Twitter and Facebook for questions and comments.

And don't forget: if you are using ropentargets behind a firewall, you may encounter proxy issues. You can resolve this by executing the following commands in a bash environment:

export http_proxy=http://username:password@server.com:port  
export https_proxy=http://username:password@server.com:port
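The same variables can be set from inside an R session with Sys.setenv, which may be more convenient in RStudio. This sketch assumes that the HTTP library underlying the client honours these environment variables, as libcurl-based R packages generally do; the URL is the same placeholder as above and must be replaced with your own proxy details.

```r
# Placeholder proxy URL, copied from the bash example; substitute real values.
proxyURL <- "http://username:password@server.com:port"

# Set the variables for the current R session only.
Sys.setenv(http_proxy = proxyURL, https_proxy = proxyURL)

# Confirm what the session now sees.
proxySettings <- Sys.getenv(c("http_proxy", "https_proxy"))
print(proxySettings)
```

Settings made this way last until the R session ends and do not affect other programs.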