Presenting Open Targets to the World via Play, Elasticsearch, Clickhouse and GraphQL

Open Targets Platform May 27, 2021

The third part of our four-part series discussing the Open Targets Platform explores our GraphQL service and supporting application.

The GraphQL interface is how our front-end (the subject of the next post) team accesses and queries the data generated by the ETL and one more way in which users can interact with Open Targets.

In the previous version of the Open Targets platform this interface responsibility was provided a REST API service which exposed Open Targets' resources. There were a number of drawbacks to this approach, namely:

Whenever the back-end team changed the REST API, there was the potential to break the Open Targets Platform itself, and any third-party code using the REST API;
It often required many REST calls to obtain all the data needed for a query.

For the new platform, the back-end team decided to cut the millstone of REST from around our development necks and change to the approach provided by GraphQL. GraphQL is a query language developed by Facebook. It's a replacement for traditional REST APIs and is far more flexible. With GraphQL you can query multiple resources with a single request to the server, an ideal feature for web applications such as Open Targets platform.

The data within Open Targets reflects real world connections between things, and this relationship was hard to capture using REST. For example, suppose I am interested in information about drugs that have a known relationship with a disease. Using a REST service I would query the 'disease' object, and extract from that a list of drug entity identifiers. I would then execute a query to obtain each of those 'drug' objects. Using GraphQL I can query a disease, and in the same query specify exactly which fields I want so I can get all of the drugs which are related to that disease. What's more, I don't need to process all of the data on each object, I just state exactly which fields do interest me, and the object queries (an amalgamation of disease and drug entities) are returned to the user.

Accessing Open Targets data via GraphQL is covered in detail in the documentation, but a small example never goes astray:

query diseaseAndDrugs {
  disease(efoId: "EFO_0000249"){
    id
    name
    knownDrugs {
      uniqueDrugs
      rows {
        drug {
          id
          name
          isApproved
        }
      }
    }
  }
}

Executing this query in the Open Targets GraphQL playground will show you drugs known to be associated with Alzheimer's disease without having to query the disease, parse the results, generate a new query, and then query again for drugs. For a client application, you know that the data you receive will match the format of your query making it easier to interface with and explore the data.

The query can also be run as part of a program, or from the command line using a utility such as 'curl' using the endpoint https://api.platform.opentargets.org/api/v4/graphql.

Moreover, GraphQL is, broadly speaking, technology agnostic and didn't tie us into using any specific technology stack. As a user you can ask server for connected data and you'll get in response only what you've asked for. As a developer you can combine different back-end storage solutions to provide an optimal user experience.

To setup GraphQL back-end server we used the Play Framework with Sangria. Play is a very well supported and documented framework, and makes integration with other Scala or Java code easy.

The data that is produced by the ETL is variously stored in Elasticsearch or Clickhouse. The majority of our data is in Elasticsearch which provides excellent indexing and search capabilities. In cases where we execute very complex queries and Elasticsearch is too slow, we use Clickhouse, a column-orientated database.

Moving forward we want to improve our continuous integration / continuous deployment practices so that with the push of a button we can deploy a new GraphQL API backed by Elasticsearch and Clickhouse.