Rewriting the Open Targets Platform's Drug Index

Open Targets' mission is to provide world-class tools to the research community that facilitate the identification and development of new pharmaceutical products by prioritising targets for drug discovery. The project aims to link points on the genome (targets) to specific diseases, and help find drugs that can influence those targets.

One part of that project is maintaining an index of drugs: molecules with pharmaceutical properties. Open Targets uses ChEMBL as its base source of information on available molecules, and winnows the approximately two million compounds in ChEMBL down to almost 12 000 drugs. We define a drug as any molecule that meets one or more of the following criteria:

There is at least 1 known indication (disease);
There is at least 1 known mechanism of action (target); or
The ChEMBL ID can be mapped to a DrugBank ID.
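
The three criteria amount to a simple "any of" filter. The sketch below is illustrative only: the Molecule record and its field names are hypothetical, not the schema used by the Open Targets ETL.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Molecule:
    # Hypothetical record shape; the field names are assumptions for illustration.
    chembl_id: str
    indications: List[str] = field(default_factory=list)
    mechanisms_of_action: List[str] = field(default_factory=list)
    drugbank_id: Optional[str] = None


def is_drug(molecule: Molecule) -> bool:
    """Return True if the molecule meets at least one of the three criteria."""
    return (
        len(molecule.indications) >= 1
        or len(molecule.mechanisms_of_action) >= 1
        or molecule.drugbank_id is not None
    )


# Placeholder identifiers, not real ChEMBL or DrugBank IDs.
molecules = [
    Molecule("CHEMBL_A", indications=["disease 1"]),
    Molecule("CHEMBL_B", mechanisms_of_action=["target 1 inhibitor"]),
    Molecule("CHEMBL_C", drugbank_id="DB_EXAMPLE"),
    Molecule("CHEMBL_D"),  # meets none of the criteria
]
drugs = [m.chembl_id for m in molecules if is_drug(m)]
```

Here only CHEMBL_D is filtered out, because it has no indication, no mechanism of action and no DrugBank mapping.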

This blog post describes our recent work to rewrite the drug index.

Motivation for change

The two key motivations for rewriting the previous drug index were to make the data structure more flexible and to make it easier for stakeholders to extend it with their own data.

The new data structure is broken into three broad tranches: molecules, mechanisms of action, and indications. The three can be combined as necessary using a ChEMBL ID as a linking field. Currently the new drug dataset includes:

11 618 molecules;
6 633 indications; and
3 985 mechanisms of action.
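
As a rough illustration of how the three tranches can be recombined, the sketch below joins them on a shared ChEMBL ID in plain Python. The record shapes, key names and identifiers are assumptions made for the example; the real ETL performs this kind of work with Spark.

```python
from collections import defaultdict

# Hypothetical rows for each tranche; "id" stands in for the ChEMBL ID.
molecules = [
    {"id": "CHEMBL_A", "name": "example drug A"},
    {"id": "CHEMBL_B", "name": "example drug B"},
]
mechanisms = [
    {"id": "CHEMBL_A", "mechanism": "target 1 inhibitor"},
]
indications = [
    {"id": "CHEMBL_A", "disease": "disease 1"},
    {"id": "CHEMBL_A", "disease": "disease 2"},
    {"id": "CHEMBL_B", "disease": "disease 3"},
]


def group_by_id(rows, value_key):
    """Collect the values of value_key for each ChEMBL ID."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["id"]].append(row[value_key])
    return grouped


def combine(molecules, mechanisms, indications):
    """Left-join mechanisms and indications onto molecules by ChEMBL ID."""
    moa = group_by_id(mechanisms, "mechanism")
    ind = group_by_id(indications, "disease")
    return {
        m["id"]: {
            **m,
            "mechanismsOfAction": moa.get(m["id"], []),
            "indications": ind.get(m["id"], []),
        }
        for m in molecules
    }


drugs = combine(molecules, mechanisms, indications)
```

Because each tranche only needs the linking ID plus its own fields, a new set of indications or mechanisms can be added without touching the molecule records.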

A second motivation was to make the data easier for our users to extend. Breaking one large structure into three thematic groupings helps here, because fewer required fields must be supplied to add new data.

Users who run their own instances of the ETL and Platform can now also add data from external files. The ETL configuration has a field called drug-extensions which can be used to specify files containing additional synonyms and cross-references. The project's readme contains detailed instructions on the required fields and formats for this functionality.

Changes from previous data versions

Users may notice some minor changes compared to previous versions of the platform.

Most importantly, the old index collated related molecules into a single parent and discarded the child molecules. In the new version we retain this relationship information. In the GraphQL interface, Drug objects include both parentMolecule and childMolecules fields, which resolve to a drug's hierarchical parent or children respectively. The most obvious effect of this change is that the number of indications may decrease on some molecules, as those indications are now listed as belonging to the child molecule.
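
To see why indication counts can drop, consider a toy one-level hierarchy. The sketch below contrasts the two behaviours under our own simplifying assumptions: the record layout and identifiers are hypothetical, and the "old" function is only an approximation of the collation described above, not the previous pipeline's code.

```python
# Hypothetical input: each record names its parent (or None) and its own indications.
records = {
    "CHEMBL_PARENT": {"parent": None, "indications": ["disease 1"]},
    "CHEMBL_CHILD": {
        "parent": "CHEMBL_PARENT",
        "indications": ["disease 2", "disease 3"],
    },
}


def collate_into_parents(records):
    """Approximate the old behaviour: fold children into the parent and drop them."""
    collated = {}
    for chembl_id, record in records.items():
        root = record["parent"] or chembl_id
        collated.setdefault(root, []).extend(record["indications"])
    return collated


def keep_hierarchy(records):
    """Approximate the new behaviour: keep every molecule and link parent and children."""
    result = {}
    for chembl_id, record in records.items():
        children = sorted(
            cid for cid, r in records.items() if r["parent"] == chembl_id
        )
        result[chembl_id] = {
            "parentMolecule": record["parent"],
            "childMolecules": children,
            "indications": list(record["indications"]),
        }
    return result


old_view = collate_into_parents(records)
new_view = keep_hierarchy(records)
```

In this toy case the parent carries three indications in the old view but only one in the new view, while the other two remain visible on the child.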

The internalMolecule field has been deprecated and is no longer present.

Technical changes

This rewrite required changes throughout our project, from collecting the raw data through to presenting it graphically.

The drug index's raw ChEMBL inputs are now retrieved from an Elasticsearch instance rather than via a REST API. By selecting only the fields we require for Open Targets' ETL we increase the speed of data collection, reduce transmission and storage costs, and, importantly, make it clearer within our own programs which data is required for downstream processing.
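
Elasticsearch supports this kind of field selection through source filtering, where the search request lists the fields to return under "_source". The sketch below only builds such a request body; the field names and page size are illustrative assumptions rather than the list our ETL actually requests.

```python
import json


def build_search_body(fields, page_size=1000):
    """Build a search body that returns only the named fields for each hit."""
    return {
        "size": page_size,
        "query": {"match_all": {}},
        # Source filtering: only these fields are returned in each document.
        "_source": list(fields),
    }


# Illustrative field names; treat them as placeholders.
required_fields = ["molecule_chembl_id", "pref_name", "molecule_synonyms"]
body = build_search_body(required_fields)
print(json.dumps(body, indent=2))
```

Keeping the field list in one place also documents, in code, exactly which inputs the downstream steps depend on.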

For our own data processing we moved from Python to Scala and Spark. As well as executing faster, the new code benefits from Scala's type safety, which makes it easier to refactor and expand in future.

Lastly, the platform's GraphQL interface was updated to give access to the new data model. It provides an abstraction over the three-part data model described above, and gives the user access to a unified 'drug' entity.
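
A request for the unified entity might look something like the sketch below, which builds a standard GraphQL JSON payload. Only the parentMolecule and childMolecules fields are taken from this post; the drug(chemblId: ...) root field, the id and name fields, and the placeholder identifier are assumptions, and the interface is in beta, so check the live schema before relying on any of them.

```python
import json

# Query text is an assumption apart from parentMolecule and childMolecules.
DRUG_QUERY = """
query DrugWithHierarchy($chemblId: String!) {
  drug(chemblId: $chemblId) {
    id
    name
    parentMolecule { id name }
    childMolecules { id name }
  }
}
"""


def build_request(chembl_id):
    """Serialise the query and its variables as a GraphQL-over-HTTP JSON body."""
    return json.dumps({"query": DRUG_QUERY, "variables": {"chemblId": chembl_id}})


payload = build_request("CHEMBL_EXAMPLE")  # placeholder, not a real ChEMBL ID
```

The resulting string can be sent as the body of a POST request to the GraphQL endpoint with a JSON content type.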

Future focus

The usefulness of the drug index depends on its quality, scope and timeliness: new data must be added promptly as the scientific community creates it, and the work of other research groups must be incorporated. To this end our data team continues to work with our colleagues at ChEMBL to increase the amount of data available via Open Targets, as well as with our partners to prioritise new improvements and enhancements.

The new drug index will be included from the 20.11 data release onwards, and supports the new user interface and the associated GraphQL interface, both of which are currently in beta and subject to change.

To find out about other changes we've implemented in the beta release, take a look at the release notes, and let us know what you think.