Case study: Querying the Open Targets Platform using natural language

Case Studies May 25, 2023

This blog post is part of a series that explores applications and expansions of the Open Targets informatics ecosystem, particularly the Open Targets Platform and Open Targets Genetics, through conversations with our users.

Onuralp Soylemez was, until recently, the Head of Data Science at Global Blood Therapeutics. A research scientist with experience in human genetics-driven drug discovery, Onuralp has used the Open Targets informatics tools extensively to curate human genetics evidence in support of drug targets. He has also used Open Targets data to develop a method that integrates various types of human genetics evidence to prioritise clinically promising targets (Soylemez, 2022).

Inspired by BirdSQL, a search interface that allows users to navigate Twitter’s large datasets, Onuralp wanted to demonstrate the potential of GPT-based conversational search engines in biology, healthcare, and drug discovery to facilitate the flow of information between wet lab scientists and data scientists, geneticists, and machine learning scientists.

Using OpenAI’s Codex model, his search engine translates a natural language question (for example, “what are the top 3 diseases associated with ABCA4?”) into a GraphQL query, submits it to the Open Targets Platform API, and returns the answer (here: severe early-childhood-onset retinal dystrophy, cone rod dystrophy, and age-related macular degeneration).
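To make the idea concrete, here is a minimal Python sketch of the kind of GraphQL request such a question could be translated into. It is not Onuralp's implementation, and it skips the language-model step entirely. The endpoint URL, the Ensembl gene ID for ABCA4, and the query field names (`target`, `associatedDiseases`, `rows`) are assumptions based on the public Open Targets Platform schema and may change between releases; the helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: public GraphQL endpoint of the Open Targets Platform.
API_URL = "https://api.platform.opentargets.org/api/v4/graphql"

# Assumption: Ensembl gene ID for ABCA4 (check it in the Platform before relying on it).
ABCA4_ENSEMBL_ID = "ENSG00000198691"

# Assumption: field names follow the public Platform schema at the time of writing.
QUERY = """
query topDiseases($ensemblId: String!, $size: Int!) {
  target(ensemblId: $ensemblId) {
    approvedSymbol
    associatedDiseases(page: {index: 0, size: $size}) {
      rows {
        disease { name }
        score
      }
    }
  }
}
"""


def build_payload(ensembl_id, size=3):
    """Build the JSON body for a 'top N diseases for this target' request."""
    return {"query": QUERY, "variables": {"ensemblId": ensembl_id, "size": size}}


def extract_disease_names(response):
    """Pull the disease names out of a decoded GraphQL response."""
    rows = response["data"]["target"]["associatedDiseases"]["rows"]
    return [row["disease"]["name"] for row in rows]


def run_query(payload, url=API_URL):
    """POST the payload to the API and return the decoded JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as reply:
        return json.load(reply)
```

With network access, `extract_disease_names(run_query(build_payload(ABCA4_ENSEMBL_ID)))` should return the three highest-scoring disease names. The exact names and their order depend on the Platform data release.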

Access the model here.

Image: Onuralp Soylemez created a search engine to translate natural language into GraphQL queries submitted to the Open Targets Platform API.

What do you think the limits of this application are?

Text-to-GraphQL is a really interesting challenge for querying domain-specific datasets and has many useful applications in drug discovery. I think we are limited only by the creativity and ingenuity of the questions we can ask. I firmly believe that the best way to crowdsource the most helpful questions is to empower wet lab biologists (and therapeutic area specialists) to adopt these highly capable AI technologies and to feel comfortable conversing with their computational biology peers. This application highlights the opportunity to lower a technical barrier.

How do you think natural language models can help address drug discovery questions?

One use case that I am most excited about is text summarisation. These models have shown impressive performance when summarising long-form texts and highlighting the main ideas. I think that the success of drug discovery relies on the quality of therapeutic or biological hypotheses that we can generate and test effectively.

These language models can help us parse the vast biomedical literature and classify academic papers by the specific hypotheses they test. For example, we can ask a model to find papers that test a hypothesis of interest, such as whether statins are effective for lowering cholesterol levels. Querying the literature by specific hypotheses instead of literal keywords can ground the scientific communication and reasoning between scientists from different backgrounds.

Do you think longer dialogues with ChatGPT could help resolve more complex questions?

Absolutely. I think these natural language models can help decompose a complex question into intermediate steps, so that the user can follow the model’s reasoning as it generates an answer. Longer, more thoughtful dialogues with the model can reveal the relative utility of model-generated answers across a wide range of drug discovery tasks.

Unfortunately, finding the best way to prompt a conversational engine is currently very artisanal and requires lots of experimentation. Luckily, what better crowd than drug hunters to experiment and figure out the most helpful use cases?

Reference

Soylemez, O. (2022). Bayesian tensor factorization for predicting clinical outcomes using integrated human genetics evidence. arXiv preprint arXiv:2207.12538.