Summer School: A Crash Course in Open Targets, Part 2

Genetics Deep Dive

This blog post was written by Clare West, and originally published on her blog. We have reposted it here with her permission.

In Part 1 of this crash course, we learnt how to browse and access data in Open Targets through the web interface and the GraphQL APIs. In Part 2, I’ll go into more detail about the kind of data that’s available, and demonstrate how you might explore the genetic evidence linking targets and diseases.

A Crash Course in Open Targets, Part 1
Clare West guides you through how to access data from the Open Targets Platform and Genetics Portal using the web interface and GraphQL API.
Missed Part 1? Read it now!

Querying evidence for target-disease associations

In one of the Part 1 examples, I demonstrated how to retrieve the top 5 targets associated with Coronary Artery Disease using R:

library(dplyr)
library(ghql)
library(jsonlite)

## Set up to query Open Targets Platform API
otp_cli <- GraphqlClient$new(url = 'https://api.platform.opentargets.org/api/v4/graphql')
otp_qry <- Query$new()

## Query for targets associated with a disease
otp_qry$query('simple_query', 'query simpleQuery($efoId: String!){
  disease(efoId: $efoId){
    name
    associatedTargets{
      rows{
        target{
          id
          approvedName
        }
        datatypeScores{
          id
          score
        }
      }
    }
  }
}'
)

## Execute the query
variables <- list(efoId = 'EFO_0001645')
result <- fromJSON(otp_cli$exec(otp_qry$queries$simple_query, variables, flatten = TRUE))$data$disease

top_targets <- 
  as.data.frame(result$associatedTargets$rows) %>% 
  flatten() %>% 
  tidyr::unnest(datatypeScores) %>% 
  tidyr::pivot_wider(names_from = "id", values_from = "score")

head(top_targets, 5)
## # A tibble: 5 x 6
##   target.id target.approved… known_drug literature genetic_associa… animal_model
##   <chr>     <chr>                 <dbl>      <dbl>            <dbl>        <dbl>
## 1 ENSG0000… guanylate cycla…      0.995     0.0304            0.778       NA    
## 2 ENSG0000… apolipoprotein E     NA         0.948             0.855        0.511
## 3 ENSG0000… proprotein conv…      0.937     0.937             0.755        0.262
## 4 ENSG0000… LDL receptor re…     NA         0.518             0.830        0.423
## 5 ENSG0000… phosphodiestera…      0.970     0.0469            0.407       NA

Let’s take a deeper look at the evidence supporting these associations. There are many fields available in the evidence entity, representing the many different sources of evidence that can contribute to an association in Open Targets.

In this example, let’s look at the GWAS evidence linking these top 5 targets to Coronary Artery Disease.

  • datasourceIds: ["ot_genetics_portal"] restricts the query to evidence from Open Targets Genetics Portal.
  • The field diseaseFromSource will tell me the actual trait of the GWAS study, while disease.id and disease.name represent the EFO term to which this has been mapped. By default, evidence linked to descendant terms of my EFO query term is included in associations - this is known as ‘indirect evidence’, and you can read more about whether you should care about this here. In this example, GWAS evidence for descendant terms such as “Coronary atherosclerosis” have been included in the output of my search for “Coronary artery disease”.
  • score is the resource score - in this case, it’s the locus-to-gene score from the genetics portal, which represents the confidence that this trait-associated locus is linked to this target/gene.
  • The other fields I’ve included here are the GWAS study ID and the year it was published, the effect size (beta or odds ratio), the lead variant IDs, and the predicted functional consequence of the variant.
## Query for genetic evidence
otp_qry$query(
    'genetic_evidence_query',
    'query geneticEvidenceQuery($efoId: String!, $ensemblIds: [String!]!){
  disease(efoId: $efoId){
    evidences(ensemblIds: $ensemblIds,
    datasourceIds: ["ot_genetics_portal"])
    {
      rows{
        target{
          id
          approvedSymbol
        }
        disease{
          id
          name
        }
        score
        diseaseFromSource
        studyId
        publicationYear
        oddsRatio
        beta
        variantId
        variantRsId
        variantFunctionalConsequence{
          label
        }
      }
    }
  }
}
')

## Execute the query
variables <-
    list(efoId = 'EFO_0001645',
         ensemblIds = head(top_targets, 5)$target.id)
result <-
    fromJSON(otp_cli$exec(otp_qry$queries$genetic_evidence_query, variables, flatten = TRUE))$data

top5_evidence <-
    as.data.frame(result$disease$evidences$rows) %>% flatten() %>% head(10)

As an example, let’s look at the sixth piece of evidence, which shows that a 2021 GWAS study found a G to T missense variant in the gene for PCSK9 on chromosome 1 to be associated with an increased risk of a major coronary heart disease event (OR 0.79).

interesting_evidence <- top5_evidence[6,] 
as.list(interesting_evidence)
## $score
## [1] 0.806007
## 
## $diseaseFromSource
## [1] "Major coronary heart disease event"
## 
## $studyId
## [1] "FINNGEN_R5_I9_CHD"
## 
## $publicationYear
## [1] 2021
## 
## $oddsRatio
## [1] 0.7875724
## 
## $beta
## [1] NA
## 
## $variantId
## [1] "1_55039974_G_T"
## 
## $variantRsId
## [1] "rs11591147"
## 
## $target.id
## [1] "ENSG00000169174"
## 
## $target.approvedSymbol
## [1] "PCSK9"
## 
## $disease.id
## [1] "EFO_0001645"
## 
## $disease.name
## [1] "coronary artery disease"
## 
## $variantFunctionalConsequence.label
## [1] "missense_variant"

Open Targets Genetics Portal

Once you’re delving deep enough into GWAS evidence, you might want to make the jump to the Genetics Portal. The Portal also has a GraphQL API, but the data is structured around variant, study, and gene entities. Unlike the Platform, where evidence for the same disease is grouped together, GWAS studies are kept separate.

The following query will get more information about the GWAS evidence above linking PCSK9 to coronary artery disease. Firstly, it retrieves details about the sample size, ancestry, and total number of associated loci from the GWAS study. It also retrieves all other possible gene mappings for the locus, including the locus-to-gene score, the distance to the gene, and whether there is molecular trait colocalisation linking this locus to the gene. In this case, the locus-to-gene scores are very low for genes other than PCSK9 (only locus-to-gene scores greater than 0.05 will be used as evidence in the Platform):

## Set up to query Open Targets Genetics API
otg_cli <- GraphqlClient$new(url = "https://api.genetics.opentargets.org/graphql")
otg_qry <- Query$new()

## Query for GWAS study locus details
otg_qry$query('l2g_query', 'query l2gQuery($studyId: String!, $variantId: String!){
    studyInfo(studyId: $studyId){
    numAssocLoci
    ancestryInitial
    nTotal
    nCases
    pubAuthor
  }
  studyLocus2GeneTable(studyId: $studyId, variantId: $variantId){
    rows {
      gene {
        id
        symbol
      }
      hasColoc
      yProbaModel
      distanceToLocus
    }
  }
}')

## Execute the query 
variables <- list(studyId = interesting_evidence$studyId, variantId = interesting_evidence$variantId)
result <- fromJSON(otg_cli$exec(otg_qry$queries$l2g_query, variables, flatten = TRUE))$data
result$studyInfo
## $numAssocLoci
## [1] 23
## 
## $ancestryInitial
## [1] "European=218792"
## 
## $nTotal
## [1] 218792
## 
## $nCases
## [1] 21012
## 
## $pubAuthor
## [1] "FINNGEN_R5"
result$studyLocus2GeneTable
## $rows
##            gene.id gene.symbol hasColoc yProbaModel distanceToLocus
## 1  ENSG00000006555       TTC22    FALSE 0.012453916          238651
## 2  ENSG00000116133      DHCR24    FALSE 0.028346576          152779
## 3  ENSG00000143001      TMEM61    FALSE 0.071031295           59346
## 4  ENSG00000162390      ACOT11    FALSE 0.008267773          497717
## 5  ENSG00000162391     FAM151A    FALSE 0.007535086          416418
## 6  ENSG00000162396       PARS2    FALSE 0.009444200          275451
## 7  ENSG00000162398        LEXM    FALSE 0.014227788          233911
## 8  ENSG00000162399        BSND    FALSE 0.057393700           41781
## 9  ENSG00000162402       USP24    FALSE 0.074823588          175390
## 10 ENSG00000169174       PCSK9    FALSE 0.806006968             426
## 11 ENSG00000184313       MROH7    FALSE 0.007656511          398220
## 12 ENSG00000243725        TTC4    FALSE 0.009801239          324113
## 13 ENSG00000271723  MROH7-TTC4    FALSE 0.008091008          398188

Lead and tag variants

Time for a quick genetics refresher…

The lead variant reported for a GWAS association is usually the SNP with the smallest p-value at the locus (i.e the most significant SNP), but this is not necessarily the causal variant. This is important when investigating individual genetic associations and making comparisons across studies. The causal variant may be a nearby less-significant SNP, or may be an unmeasured SNP that correlates with the lead SNP but was not included in the GWAS microarray.

SNPs are correlated if they are inherited together more frequently than would be expected by chance, known as Linkage Disequilibrium (LD). The extent of LD between SNPs depends on the distance between them, how often recombination occurs in the genomic region, as well as population structure. In regions where large genomic units are frequently inherited together in the study population, there can be a large number of possible causal variants that are difficult to disentangle. Authors are increasingly encouraged to deposit full summary statistics - which include p-values and effect sizes for all SNPs measured in the GWAS study - but for many older studies only the lead variant information is available.

In Open Targets Genetics, the lead variants are expanded into a more comprehensive set of candidate causal variants referred to as the tag variants. For studies where summary statistics are available, fine-mapping is used to identify a credible set of possible causal variants based on the GWAS results. Where summary statistics are not available, tag variants include those that are highly correlated (r2>0.7) with the lead variants; LD information is calculated from a reference population that most closely matches the study population’s ancestry.

So what?

So, there’s more to GWAS than lead variants!

The following query retrieves the credible set of variants for one of the GWAS associations linking APOE to Coronary Artery Disease (in this case, there are three variants):

## Query for GWAS study locus details
otg_qry$query('credset_query', 'query credsetQuery($studyId: String!, $variantId: String!){
  gwasCredibleSet(studyId: $studyId, variantId: $variantId) {
    tagVariant {
      id
    }
    beta
    postProb
    pval
  }
}')

## Execute the query 
variables <- list(studyId = interesting_evidence$studyId, variantId = interesting_evidence$variantId)
result <- fromJSON(otg_cli$exec(otg_qry$queries$credset_query, variables, flatten = TRUE))$data

result$gwasCredibleSet %>% flatten()
##                   id    beta   postProb      pval
## 1     1_54982575_G_A -0.2309 0.16240063 5.138e-11
## 2     1_55039974_G_T -0.2388 0.76268851 1.247e-11
## 3 1_55101964_CTTGA_C -0.2348 0.03686097 2.472e-10

Querying colocalisation information

Colocalisation analysis is performed between all studies in the Portal with at least one overlapping associated locus. This analysis tests whether two independent associations at the same locus are consistent with having a shared causal variant. Colocalisation of two independent associations from two GWAS studies may suggest a shared causal mechanism. Colocalisation of a locus associated with a trait (through a GWAS) and with protein levels (through a pQTL study) may suggest a link between the protein and the trait.

For example, for the top 5 targets linked to Coronary Artery Disease that I retrieved earlier from the Platform, I can see whether there is evidence of colocalisation with loci associated with a change in protein or expression levels. This query will retrieve the lead variant, effect size, tissue, and study ID for QTL studies for which there is evidence of colocalisation:

## Query for QTL colocalisation
otg_qry$query(
  'qtl_query',
  'query qtlColocalisationVariantQuery($studyId: String!, $variantId: String!) {
  qtlColocalisation(studyId: $studyId, variantId: $variantId){
    qtlStudyName
    phenotypeId
    gene {
      id
      symbol
    }
    tissue {
      name
    }
    indexVariant {
      id
    }
    beta
    h4
  }
}'
)

fetch_qtl <- function(current_studyId, current_variantId) {
  variables = list(studyId = current_studyId, variantId = current_variantId)
  result <-
    fromJSON(otg_cli$exec(otg_qry$queries$qtl_query, variables, flatten = TRUE))$data
  l2g_result <- result$qtlColocalisation
  return(l2g_result)
}

variants <- top5_evidence %>%
  select(studyId, variantId) %>%
  unique()

variants_qtl <-
  variants %>%
  rowwise() %>%
  mutate(qtl = list(fetch_qtl(studyId, variantId))) %>%
  tidyr::unnest(qtl) %>%
  select(-qtl) %>%
  flatten()
##                         studyId        variantId qtlStudyName     phenotypeId
## 1 FINNGEN_R5_I9_CORATHER_EXNONE 4_155762333_AT_A      GTEX_v7 ENSG00000164116
## 2 FINNGEN_R5_I9_CORATHER_EXNONE 4_155762333_AT_A      GTEX_v7 ENSG00000260244
## 3 FINNGEN_R5_I9_CORATHER_EXNONE 4_155762333_AT_A      GTEX_v7 ENSG00000260244
## 4        FINNGEN_R5_I9_CORATHER 4_155762333_AT_A      GTEX_v7 ENSG00000164116
## 5        FINNGEN_R5_I9_CORATHER 4_155762333_AT_A      GTEX_v7 ENSG00000260244
##       beta        h4         gene.id gene.symbol   tissue.name indexVariant.id
## 1 0.154862 0.7625397 ENSG00000164116     GUCY1A1 Artery tibial 4_155694252_C_T
## 2 0.116993 0.4844574 ENSG00000260244  AC104083.1 Artery tibial 4_155694252_C_T
## 3 0.181015 0.9095855 ENSG00000260244  AC104083.1          Lung 4_155724361_C_A
## 4 0.154862 0.7935801 ENSG00000164116     GUCY1A1 Artery tibial 4_155694252_C_T
## 5 0.116993 0.4565324 ENSG00000260244  AC104083.1 Artery tibial 4_155694252_C_T

Wow, what else is there?

The Genetics Portal not only aggregates GWAS study data in one place, but performs a number of analyses that can help make sense of the vast genetic data available. In this blog post I’ve demonstrated a few ways this information can be used alongside the Genetics Platform to find and explore genetic links between targets and diseases, but this really just scratches the surface. Find out more about the Genetics Portal in the documentation and the Open Targets Community.

In the next and final part of this blog series, we’ll delve into how you can bypass the APIs completely and get straight to the good stuff with the data downloads.

A Crash Course in Open Targets, Part 3: the data downloads
Clare West navigates the data downloads for the Open Targets Platform and Genetics portal, including data access and querying using Spark.
The final part of the series dives into the data downloads