Graphic inspired by karyotype diagrams in the GWAS Catalog, in which the karyotypes are pie charts.

Sharing is caring: why we need more freely available cancer GWAS summary statistics

GWAS Catalog Mar 11, 2021

The NHGRI-EBI GWAS Catalog is a central resource for the Genome Wide Association Study (GWAS) community, demonstrating the benefit of expert data curation and integration of full p-value GWAS summary statistics into a central repository for variant-trait associations.

For more than 10 years, the Catalog has aimed to make GWAS data FAIR (Findable, Accessible, Interoperable and Re-usable), while serving as a starting point for investigations to identify causal variants, calculate disease risk, understand disease mechanisms and establish targets for novel therapies.

Here we aim to illuminate progress that has been made to incorporate GWAS summary statistics into the GWAS Catalog, and reach out to the cancer genetics community in particular to promote submission of summary statistics for GWAS studies in cancer cohorts.

The GWAS Catalog summary statistics repository

Three years ago, the GWAS Catalog and Open Targets started to work together to expand the Catalog’s scientific scope to include full p-value summary statistics (aggregate p-values and association data for every variant analysed in a genome-wide association study) in addition to the manually curated top associations.

Summary statistics provide more detailed GWAS results than a lead SNP table, which allows other scientists to make better use of the data for sophisticated downstream analysis, whilst maintaining patient privacy. For example, summary statistics are integrated into the Open Targets Genetics Portal, where they are used to narrow down candidate loci and prioritise new drug targets through PheWAS, colocalisation analysis, fine mapping and Mendelian randomisation.
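To make the contrast between a lead SNP table and full summary statistics concrete, here is a minimal Python sketch. The field names (`variant_id`, `chromosome`, `p_value`) and the toy rows are assumptions for illustration only, not the Catalog's file format. Picking one lead variant per chromosome is also a deliberate simplification; real analyses define leads per locus, typically using linkage disequilibrium. The 5e-8 cut-off is the conventional genome-wide significance threshold.

```python
# Conventional genome-wide significance threshold.
GENOME_WIDE_SIGNIFICANCE = 5e-8


def lead_variants(rows, threshold=GENOME_WIDE_SIGNIFICANCE):
    """Reduce full summary statistics to a small 'lead' table.

    Keeps, for each chromosome, the single most significant variant
    whose p-value is below the threshold (a simplification of
    per-locus lead SNP selection).
    """
    best = {}
    for row in rows:
        if row["p_value"] >= threshold:
            continue  # sub-threshold variants are lost in a lead table
        chrom = row["chromosome"]
        if chrom not in best or row["p_value"] < best[chrom]["p_value"]:
            best[chrom] = row
    return sorted(best.values(), key=lambda r: r["p_value"])


# Hypothetical toy data: every analysed variant has a row.
toy_summary_stats = [
    {"variant_id": "rsA", "chromosome": "1", "p_value": 3e-12},
    {"variant_id": "rsB", "chromosome": "1", "p_value": 4e-9},
    {"variant_id": "rsC", "chromosome": "2", "p_value": 0.03},
    {"variant_id": "rsD", "chromosome": "5", "p_value": 1e-8},
]

leads = lead_variants(toy_summary_stats)
```

In this toy example the lead table keeps only two of the four variants. The rows it discards are the ones that downstream methods such as colocalisation and fine mapping rely on, which is why the full per-variant data matters.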

Graphic with statistics about the GWAS Catalog: "The GWAS Catalog repository features 3,623 independent analyses from 447 publications, and a total of 22,000+ datasets."

The GWAS Catalog is now one of the largest, most visited and most frequently updated resources of freely available GWAS summary statistics. The repository includes summary statistics from 3,623 independent analyses (from 447 publications), accounting for a total of more than 22,000 datasets covering a wide variety of traits. Catalog users can easily access and download summary statistics from the GWAS Catalog FTP site or via a dedicated summary statistics API.

The increasing trend to share these datasets is reflected in the availability of summary statistics in the GWAS Catalog over time. There was also a significant increase in data downloads of summary statistics in 2020 compared to the previous year.

Bar graph showing that the proportion of independent GWAS studies with available summary statistics added to the Catalog every year has been increasing since 2017.

Bar graph showing that summary statistics downloads from the Catalog have also been increasing in the past couple of years. They are shown here sorted by the different data formats made available by the GWAS Catalog: harmonised, formatted, or raw.

Why are few open access cancer GWAS summary statistics available?

This expansion of the GWAS Catalog summary statistics repository, while promising, required a considerable outreach effort from the Catalog data team and the main stakeholders, including Open Targets. Interestingly, the rate of summary statistics data sharing noticeably differs among different genetics cohorts and research groups, with the lowest submission rate in cancer genetics.

Bar graph showing the percentage of GWAS studies with publicly available summary statistics for highly represented traits in the GWAS Catalog: cancer, diabetes, BMI, CAD, stroke, autism, epilepsy. The proportion of studies with summary statistics is significantly higher for diabetes, BMI, CAD, and epilepsy than for cancer.

For example, summary statistics are publicly available for 33% of epilepsy studies and 21% of diabetes studies, compared to only 7.5% of cancer studies. The trend differs slightly amongst cancer groups, and we have observed a more positive trend for the few papers published in the last two years.

Pie chart graphic showing the percentage of GWAS with publicly available summary statistics, sorted by the main cancer types. There is a positive trend in the last couple of years towards increased data sharing.

In a recent Twitter poll, we asked the community to tell us what they thought was the most important reason for the low rate of cancer summary statistics submission.

The results suggest that the main barriers to the sharing of summary statistics are:

  1. results are usually embargoed for use in future research,
  2. data privacy issues (e.g. patient confidentiality agreement), and
  3. lack of awareness/knowledge on the appropriate data repository.

We believe it is necessary for the genetics community to have a more comprehensive and transparent discussion about such barriers, starting with the privacy issues. As highlighted in the recently revised NIH genomic data sharing (GDS) policy, summary statistics do not include individual-level information and can empower researchers to determine which genomic variants potentially contribute to a disease.

How can we promote sharing?

As part of a wider plan to identify and remove barriers to data sharing among different research communities, we have launched a dedicated public engagement campaign to motivate and encourage the genetics community to make their summary statistics publicly available and make a difference in the field. In particular, following a pressing request from our users and stakeholders, we hope to understand why so few summary statistics are available for cancer studies. Open sharing of cancer GWAS data will lead the way to new, improved therapies and shed some light on the molecular mechanisms involved in such a complex disease.

Quote from the text: Open sharing of cancer GWAS data will lead the way to new, improved therapies and shed some light on the molecular mechanisms involved in such a complex disease.

We call on researchers, journals, funders and charities to support our cause and spread our message. The Nature Journals Group are already proactively supporting an open access policy for their published cancer GWAS data. We hope to get more journals on board, including the top cancer journals.

To facilitate data sharing and interoperability, the GWAS Catalog team has recently released a web-based deposition interface to support scalable author submission of summary statistics and metadata from published and pre-published GWAS (the latter submitted at the time of journal submission, upon request from the reviewers). The datasets are submitted in a standard format and harmonised against the latest genome build and the forward strand. This facilitates downstream analysis and integration into resources like Open Targets Genetics.
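As a rough illustration of what harmonising alleles to the forward strand involves, here is a minimal Python sketch. It is not the Catalog's harmonisation pipeline; the function names and the `"+"`/`"-"` strand encoding are assumptions made for this example. It shows the core operation (reverse-complementing an allele reported on the reverse strand) and why palindromic variants such as A/T or C/G are hard to harmonise: their alleles look the same on both strands.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(allele):
    """Return the reverse complement of an allele, e.g. 'AG' -> 'CT'."""
    return allele.upper().translate(COMPLEMENT)[::-1]


def to_forward_strand(allele, strand):
    """Express an allele on the forward strand.

    `strand` is '+' (already forward) or '-' (reverse); this encoding
    is an assumption of the sketch.
    """
    if strand == "+":
        return allele.upper()
    if strand == "-":
        return reverse_complement(allele)
    raise ValueError(f"unknown strand: {strand!r}")


def is_palindromic(effect_allele, other_allele):
    """True when the allele pair reads the same on both strands."""
    return reverse_complement(effect_allele) == other_allele.upper()


# Hypothetical variant reported on the reverse strand as T/C.
forward_alleles = (to_forward_strand("T", "-"), to_forward_strand("C", "-"))
```

Here `forward_alleles` is `("A", "G")`. For a palindromic pair such as A/T the strand cannot be inferred from the alleles alone, so a real pipeline needs extra information (for example, a reference genome) to resolve it.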

Please submit your summary statistics to the GWAS Catalog.

What do you think are the most important barriers to the open sharing of summary statistics in cancer studies? Please let us know your thoughts by completing this short survey.

You can also reach out to us via Twitter (@GWASCatalog) or email (

Together, we can make the difference.
