January 10, 2019
New GWAS method brings up some interesting results
By Rob Mitchum
In the period after the completion of the Human Genome Project, the genome-wide association study, or GWAS, promised to reveal the secrets of how genetics determine disease and other human phenotypes. The method was straightforward — compare a million or so common genetic variants against a given trait, such as baldness or schizophrenia, and see which variants are associated with a higher incidence of the trait, hopefully pointing the way to the specific genes that contribute to the condition.
But despite larger and larger studies, GWAS never quite delivered on its translational hype, due in part to the complexity of biology and the high statistical bar set by researchers. Very few gene variants were found to be strongly predictive of a phenotype, with most variants only mildly increasing the risk of the target disease or characteristic. Even when variants did clear the threshold for statistical significance, it wasn’t always straightforward to determine their biological effect or even the specific gene they regulated. As a result, GWAS projects created as much frustration as new biological and clinical knowledge.
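To make the per-variant test and its statistical bar concrete, here is a minimal sketch in Python. The allele counts are invented for illustration, and the chi-square test on a 2x2 allele-count table is one common textbook choice, not necessarily the test used in any particular study. The 5e-8 cutoff is the conventional genome-wide significance threshold, roughly 0.05 divided by a million tests.

```python
import math

GENOME_WIDE_ALPHA = 5e-8  # conventional threshold, about 0.05 / 1,000,000 tests


def allelic_chi_square(case_risk, case_other, control_risk, control_other):
    """Chi-square statistic (1 degree of freedom) for a 2x2 allele-count table."""
    a, b, c, d = case_risk, case_other, control_risk, control_other
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def chi_square_p_value_1df(statistic):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


# Hypothetical counts: the risk allele makes up 60% of alleles in cases
# and 50% of alleles in controls.
stat = allelic_chi_square(600, 400, 500, 500)
p_value = chi_square_p_value_1df(stat)
print(f"chi-square = {stat:.2f}, p = {p_value:.1e}")
print("genome-wide significant:", p_value < GENOME_WIDE_ALPHA)
```

With these made-up counts the difference would look convincing as a single test, yet the p-value stays above the genome-wide cutoff, which is the kind of near miss described above.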
In a recent paper for Nature Communications, University of Chicago researchers working with the Research Computing Center describe and apply a new statistical method that helps extract more clues from already-published GWAS results. A new “enrichment analysis,” published by Xiang Zhu and Matthew Stephens, uses predetermined gene sets to identify new pathways and genes involved in diseases and traits such as Alzheimer’s and high cholesterol, revealing promising new research leads using only a small amount of the data from already-completed studies.
The work builds upon previous research published in 2013 by Stephens with Peter Carbonetto, now a computational specialist with the RCC. Instead of blindly testing one million independent variants, the enrichment analysis created by Carbonetto and Stephens tested sets of known gene variants grouped by their associated genes’ presence in specific molecular pathways or expression in a particular tissue, such as liver or brain. When these ensembles are tested against a target phenotype, associations pop up where individual variants would have been lost in the statistical noise, with occasionally surprising results.
“It was a very natural idea to go looking at the associations you found in GWAS and then ask, well, what do they share in common,” said Stephens, a Professor in the Departments of Statistics, Human Genetics, and the College at UChicago. “What we're really trying to do is go a bit beyond that, to dig a bit further down. Even where the statistical signal is not so strong that we're necessarily confident in each individual result, we expect that the gene sets rising to the top are going to be enriched for real biological things going on.”
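The intuition that a group of weak signals can stand out where no single variant does can be sketched with a small simulation. This is not the method from the paper. It swaps in Stouffer's method for combining z-scores, treats variants as independent (real variants are often correlated), and uses a made-up set size and effect size.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the simulation is repeatable

SET_SIZE = 200           # hypothetical number of variants in one gene set
WEAK_EFFECT = 1.0        # hypothetical shift in each variant's z-score
SINGLE_VARIANT_Z = 5.45  # roughly the |z| that corresponds to p = 5e-8

# Simulated z-scores: each variant carries a small true effect plus unit noise.
set_z_scores = [rng.gauss(WEAK_EFFECT, 1.0) for _ in range(SET_SIZE)]


def stouffer_z(z_scores):
    """Combine z-scores into one set-level z-score (Stouffer's method)."""
    return sum(z_scores) / math.sqrt(len(z_scores))


combined = stouffer_z(set_z_scores)
strongest = max(abs(z) for z in set_z_scores)
print(f"strongest single variant |z| = {strongest:.2f} (bar: {SINGLE_VARIANT_Z})")
print(f"set-level z = {combined:.2f}")
```

In expectation the set-level score here is about the square root of 200, or roughly 14, far beyond what any single variant with a shift of 1.0 is likely to reach on its own.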
However, the first iteration of the enrichment analysis required access to raw, individual-level data from the original GWAS, a major hurdle for studies involving as many as millions of people, with data often spread across large multi-institutional consortia. In the new Nature Communications paper and a companion paper in The Annals of Applied Statistics, Zhu and Stephens describe a way to perform the analysis using only GWAS summary statistics, a much smaller data set easily acquired from virtually any published GWAS effort.
“To get your hands on raw data requires at the very least an applications process, and if the data are spread across 20 or 30 different groups, in practice there's no way to get it, because you would spend all your time filling in forms and then chasing them up, and getting the lawyers involved on contracts. It's a pain,” Stephens said. “But the summary data are often just downloadable on the internet, and anyone can apply this method to any case where they've got access to those broadly available data.”
To demonstrate the power of their analysis, Zhu and Stephens applied it to GWAS of 31 different human phenotypes, using over 4,000 gene sets. The method, reassuringly, returned some expected results; for example, a gene set based on pathways involved in inflammatory bowel disease (IBD) was found to be highly associated with Crohn’s disease, a subtype of IBD. But it also produced tantalizingly novel findings, such as an enrichment of the genes expressed in the liver for Alzheimer’s disease (a finding replicated in a Nature Genetics paper released this week).
“I don't think we know yet what the significance of this result ultimately is, but it seems potentially interesting,” Stephens said. “Hopefully, Alzheimer's researchers can dig into it and explain it for us.”
Meanwhile, their method (downloadable on GitHub) has already gained traction in the genetics research community and has been used by two large GWAS consortia, the Million Veteran Program and the Global Lipids Genetics Consortium.
Analyses for the paper were conducted using RCC resources, which allowed Zhu to refine and verify the method before applying it across all 31 traits. Even with the simplified summary statistics, the computational demands remain high: a typical analysis of 1.1 million SNPs and 3,913 gene sets takes 36 hours on 48 RCC nodes.
“The high-performance clusters and storage spaces at RCC allowed us to complete this large analytical task within a relatively short amount of time,” said Zhu, now a Stein Fellow at Stanford University. “The RCC team also provided excellent technical support while we were conducting this research on their clusters.”