Sharing is caring: why we need more freely available cancer GWAS summary statistics
The NHGRI-EBI GWAS Catalog is a central resource for the genome-wide association study (GWAS) community, demonstrating the benefit of expert data curation and of integrating full p-value GWAS summary statistics into a central repository for variant-trait associations.
For more than 10 years, the Catalog has aimed to make GWAS data FAIR (Findable, Accessible, Interoperable and Re-usable), while serving as a starting point for investigations to identify causal variants, calculate disease risk, understand disease mechanisms and establish targets for novel therapies.
Here we highlight the progress made in incorporating GWAS summary statistics into the GWAS Catalog, and reach out to the cancer genetics community in particular to promote submission of summary statistics for GWAS in cancer cohorts.
The GWAS Catalog summary statistics repository
Three years ago, the GWAS Catalog and Open Targets started to work together to expand the Catalog’s scientific scope to include full p-value summary statistics (aggregate p-values and association data for every variant analysed in a genome-wide association study) in addition to the manually curated top associations.
Summary statistics provide more detailed GWAS results than a lead SNP table, allowing other scientists to use the data for sophisticated downstream analysis while maintaining patient privacy. For example, summary statistics are integrated into the Open Targets Genetics Portal, where they are used to narrow down candidate loci and prioritise new drug targets through PheWAS, colocalisation analysis, fine-mapping and Mendelian randomisation.
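To make the contrast with a lead SNP table concrete, here is a minimal Python sketch. The column names, the toy values and the function name are hypothetical and do not represent the Catalog's actual file format. The 5e-8 cut-off is the conventional genome-wide significance threshold, assumed here for illustration. A real lead SNP table would usually also keep only one variant per locus, a step this sketch leaves out.

```python
import csv
import io

# Toy summary statistics: one row per variant analysed (all values made up).
TOY_SUMSTATS = """variant_id\tchromosome\tbase_pair_location\tp_value
rsA\t1\t1000\t3e-9
rsB\t1\t2000\t0.04
rsC\t2\t1500\t7e-12
rsD\t2\t9000\t0.61
"""

GENOME_WIDE_SIGNIFICANCE = 5e-8  # conventional threshold, assumed here


def significant_variants(tsv_text, threshold=GENOME_WIDE_SIGNIFICANCE):
    """Return the rows whose p-value is below the threshold."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if float(row["p_value"]) < threshold]


# The full table keeps every variant; the filtered table keeps only the hits.
all_rows = list(csv.DictReader(io.StringIO(TOY_SUMSTATS), delimiter="\t"))
top_hits = significant_variants(TOY_SUMSTATS)
print(len(all_rows), "variants in total;", len(top_hits), "below the threshold")
```

The rows that fall below the threshold are the only ones a top-associations table would report; the remaining rows are what downstream methods such as colocalisation need and can only get from the full summary statistics.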
The GWAS Catalog is now one of the largest, most visited and most frequently updated resources of freely available GWAS summary statistics. The repository includes summary statistics from 3,623 independent analyses (from 447 publications), accounting for a total of more than 22,000 datasets covering a wide variety of traits. Catalog users can easily access and download summary statistics from the GWAS Catalog FTP site or via a dedicated summary statistics API.
The increasing trend to share these datasets is reflected in the availability of summary statistics in the GWAS Catalog over time. There was also a significant increase in data downloads of summary statistics in 2020 compared to the previous year.
Why are so few open-access cancer GWAS summary statistics available?
This expansion of the GWAS Catalog summary statistics repository, while promising, required a considerable outreach effort from the Catalog data team and the main stakeholders, including Open Targets. Interestingly, the rate of summary statistics data sharing differs noticeably between genetics cohorts and research groups, with the lowest submission rate in cancer genetics.
For example, 33% of epilepsy and 21% of diabetes summary statistics are publicly available, compared to only 7.5% for cancer. The rate differs slightly between cancer groups, and we have observed a more positive trend for the few papers published in the last two years.
In a recent Twitter poll, we asked the community to tell us what they thought was the most important reason for the low rate of cancer summary statistics submission.
The results suggest that the main barriers to the sharing of summary statistics are:
- results are usually embargoed for use in future research,
- data privacy issues (e.g. patient confidentiality agreements), and
- lack of awareness or knowledge of the appropriate data repository.
We believe the genetics community needs a more comprehensive and transparent discussion about these barriers, starting with privacy. As highlighted in the recently revised NIH Genomic Data Sharing (GDS) policy, summary statistics do not include individual-level information and can empower researchers to determine which genomic variants potentially contribute to a disease.
How can we promote sharing?
As part of a wider plan to identify and remove barriers to data sharing among different research communities, we have launched a dedicated public engagement campaign to encourage the genetics community to make their summary statistics publicly available and make a difference in the field. In particular, following a pressing request from our users and stakeholders, we hope to understand why so few summary statistics are available for cancer studies. Open sharing of cancer GWAS data will help pave the way to new, improved therapies and shed light on the molecular mechanisms involved in such a complex disease.
We call on researchers, journals, funders and charities to support our cause and spread our message. The Nature Journals Group are already proactively supporting an open access policy for their published cancer GWAS data. We hope to get more journals on board, including the top cancer journals.
To facilitate data sharing and interoperability, the GWAS Catalog team has recently released a web-based deposition interface to support scalable author submission of summary statistics and metadata from published and pre-published GWAS (the latter submitted at the time of journal submission, upon request from the reviewers). The datasets are submitted in a standard format and harmonised against the latest genome build and the forward strand. This facilitates downstream analysis and integration into resources such as the Open Targets Genetics Portal.
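As a rough illustration of what reporting alleles on the forward strand involves, here is a minimal Python sketch. It is not the Catalog's harmonisation pipeline: the function names and the `strand` field are hypothetical, and it assumes the strand of each variant is already known. A real pipeline also has to map positions to the latest genome build and deal with ambiguous (palindromic) variants, neither of which is covered here.

```python
# Complement of each base; used to express an allele on the opposite strand.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(allele):
    """Return the allele as read from the opposite DNA strand."""
    return "".join(COMPLEMENT[base] for base in reversed(allele.upper()))


def to_forward_strand(effect_allele, other_allele, strand):
    """Report both alleles on the forward strand.

    `strand` is a hypothetical field: "+" for forward, "-" for reverse.
    """
    if strand == "+":
        return effect_allele.upper(), other_allele.upper()
    if strand == "-":
        return reverse_complement(effect_allele), reverse_complement(other_allele)
    raise ValueError(f"unknown strand: {strand!r}")


# A variant reported as A/G on the reverse strand is T/C on the forward strand.
print(to_forward_strand("A", "G", "-"))
```

Putting every dataset on the same strand and build means that effect alleles from different studies can be compared directly, which is what makes the harmonised files easier to reuse.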
Please submit your summary statistics to the GWAS Catalog.
What do you think are the most important barriers to the open sharing of summary statistics in cancer studies? Please let us know your thoughts by completing this short survey.
You can also reach out to us via Twitter (@GWASCatalog) or email (gwas-info@ebi.ac.uk).
Together, we can make a difference.