Explore more than 500 million connections
Today we release a new tool called LINK (LIterature coNcept Knowledgebase) that allows the exploration of half a billion relations between genes, diseases, drugs and key concepts extracted from PubMed abstracts using NLP (Natural Language Processing).
Once MEDLINE relaxed their license for obtaining and analysing publication data last year, we started looking for novel ways to mine their data. We wanted to exploit the biomedical knowledge often buried in the literature to help scientists find the information to generate new hypotheses for the identification of new drug targets.
For this purpose, we have built Library, an open source ecosystem comprising:
A pipeline that allows us to quickly run a large scale NLP analysis
An API that serves the resulting data
A web interface to explore this data
Our pipeline annotates genes, diseases and drugs present in PubMed abstracts, and extracts key concepts. Stay tuned for our following posts with more details.
We use the same framework to extract semantic relations between these entities in the form of subject-predicate-object. When we run this analysis on the current PubMed release, we generate over half a billion (!!) of these connections.
Taken together the relationships between genes, disease and drugs form a comprehensive graph of biomedical knowledge.
What we release today is a tool to explore the knowledge graph we built, and to allow our users to get to the tiny bit of detail of every sentence we found a relation for.
Why don't you try LINK now? Type in the entity or concept you are interested in, select it from the dropdown list and press GO.
You can also search for more than one entity at the same time, or any free text really.
You will get (after a little while) a graph of entities related to your queries, so that you can start to explore your hypotheses.
Get an overview of a topic
You can use LINK in many different ways. You can start by querying it with a broad topic such as BRCA1, a widely studied gene related to breast cancer.
In the graph above you can see the genes, diseases, drugs, and key some concepts related to your query. By default the size of the nodes in the graph is linked to the importance of the node (Pagerank score), and the width of the edge is linked to the amount of underlying information available.
The entities are also listed below the graph, for convenience, so that it is easier for you to read all the labels and select the ones you want to explore more in depth.
When you select a node, you will see more details about your query. The entities that are not directly connected to it will vanish from the graph, and will be greyed out in the list. If another node is selected, the shortest path connecting all of them will be highlighted.
If you want to validate a connection, you can drill down to the detail of each entity connection by clicking on the graph edges or the red arrows in the selected path.
This will prompt all the sentences that we have annotated with a semantic relation between the two nodes selected.
We have made the design decision to present only the most significant entities for each query by default. But you can easily customise the view, as LINK gives you the choice to increase the number of entities displayed, to show just some of them, or to change the scoring and the significance methods.
You can for example increase the number of nodes, and restrict the entity types to just show genes.
The output of this query would be a graph of gene-to-gene relations that will give you an overview of the genes involved, and could be further explored in detail to support or generate hypotheses.
Bottom up search
Another workflow that we found LINK rather useful was to start off with a detailed topic, like a publication that we know of, or a genetic variant we are interested in (e.g. a SNP ID). We can then understand the general landscape for that publication, SNP or any other specific topic.
You can start from a paper that you have just come across with, let’s say this one that discuss the role of the gene ADAM10 in brain diseases.
If you paste the PubMed id of the paper into the search box and press GO, LINK will start with the relations found in that paper and extend the search to those related to them.
By focusing just on genes and diseases (with the advanced search options), we can get a graph like the one below, describing the genes and diseases linked to ADAM10
By clicking on the Alzheimer's disease node, it is easy to understand that the ADAM10 involvement in Brain diseases is just part of its role in the disease space.
Looking at the full graph you might easily spot that ADAM10 has for example a role in melanoma as well.
Want to get into more details about it? Follow the link to the Open Targets platform to read everything we know about these associations.
LINK is released as a proof of concept. We want to hear your feedback, so please leave your comments below or send us an email instead.