September 29, 2014
by Darren Angle
The genetic code of living things has been likened to a blueprint for life. Unlike a real blueprint, though, the genome doesn’t explicitly lay out everything. Genomes can reveal the amino acid sequence of proteins, the molecules at the heart of biological functions like metabolizing food and fighting disease, but the truly valuable insights lie hidden in a protein's physical structure—a far more elusive piece of data.
The stakes are high. Proteins that fold incorrectly are now thought to play a critical role in diabetes, Alzheimer's, and cancer. If scientists can better understand how proteins fold, and why they “misfold,” their research could lead to revolutionary drugs. However, there’s no easy way to deduce the three-dimensional structure of a protein from a text-based amino acid sequence. The number of ways a protein might fold—the number of final structures even a single protein could achieve—is gargantuan.
Making accurate predictions is a computing task so large that researchers are scrambling for the processing power to attempt it. Limited by the wait times or low performance of departmental clusters, some teams borrow computing time from out-of-state supercomputers at other universities and labs. But a team led by Karl Freed, the Henry J. Gale Distinguished Professor Emeritus in chemistry, the Computation Institute, and the James Franck Institute, and Tobin Sosnick, chair of biochemistry and molecular biology, the Computation Institute, and the Institute for Biophysical Dynamics, has been taking advantage of the Midway Compute Cluster at the Research Computing Center (RCC).
To understand the 3D structure of proteins, current analytical methods rely on comparing new sequences to known protein structures. This means that comparative, or “homology-based,” methods are limited by the number of known structures available for comparison, a number dwarfed many times over by the volume of new sequence data. Aashish Adhikari, a former graduate student and postdoctoral scholar with the Freed/Sosnick team, developed a method that takes an amino acid sequence as input and produces a 3D model of a protein with promising accuracy.
Adhikari's algorithm does not rely on homology, so it is not limited to solving protein structures by comparing them to closely related structures that have already been solved. Instead, he tailored it to harness high-performance computing in a way that let him tackle proteins for which no structural information exists.
In the first round of his simulations, the algorithm allows the protein backbone (the main chain of the molecule) to move, scoring the resulting structures against a group of physical potential energies. Next, it rules out physically improbable outcomes, and then uses the observed secondary structure elements to restrain the protein backbone movements in the second round of simulations. After several such iterative rounds, Adhikari’s algorithm can predict a protein's final structure knowing only its amino acid sequence. Each round consists of hundreds of trajectories, a workload made feasible only by the computational power of Midway. Adhikari can also describe the discrete steps the protein takes as it folds, called “the pathway,” a problem usually solved separately.
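The round-by-round logic described above can be illustrated with a toy sketch. This is not Adhikari's code or his energy model: the three-state alphabet (helix, strand, coil), the propensity sets, the `toy_energy` function, and every parameter value below are hypothetical stand-ins chosen only to show the sample, score, filter, and restrain loop.

```python
import random
from collections import Counter

STATES = "HEC"  # helix, strand, coil: a simplified stand-in for backbone geometry

# Hypothetical propensity sets, for illustration only (not a real energy model).
HELIX_FORMERS = set("ALEMQK")
STRAND_FORMERS = set("VIYFWT")


def toy_energy(sequence, states):
    """Toy score (lower is better) standing in for physical potential energies."""
    energy = 0.0
    for i, (residue, state) in enumerate(zip(sequence, states)):
        if residue in HELIX_FORMERS:
            preferred = "H"
        elif residue in STRAND_FORMERS:
            preferred = "E"
        else:
            preferred = "C"
        if state == preferred:
            energy -= 1.0  # reward matching the residue's assumed preference
        if i > 0 and states[i - 1] == state:
            energy -= 0.5  # reward neighbors that share a state
    return energy


def run_trajectory(sequence, restraints, rng, steps=200):
    """One greedy Monte Carlo trajectory; restrained positions never move."""
    n = len(sequence)
    states = [restraints[i] if i in restraints else rng.choice(STATES) for i in range(n)]
    free = [i for i in range(n) if i not in restraints]
    energy = toy_energy(sequence, states)
    for _ in range(steps):
        if not free:
            break
        i = rng.choice(free)
        old = states[i]
        states[i] = rng.choice(STATES)
        new_energy = toy_energy(sequence, states)
        if new_energy <= energy:
            energy = new_energy  # accept moves that do not raise the energy
        else:
            states[i] = old  # reject the move and restore the old state
    return energy, states


def predict(sequence, rounds=4, trajectories=50, keep=0.2, agreement=0.9, seed=0):
    """Iterate: sample trajectories, keep the best, restrain where they agree."""
    rng = random.Random(seed)
    restraints = {}
    results = []
    for _ in range(rounds):
        results = sorted(
            (run_trajectory(sequence, restraints, rng) for _ in range(trajectories)),
            key=lambda r: r[0],
        )
        # Discard high-energy (improbable) outcomes, keeping the best fraction.
        survivors = results[: max(1, int(keep * trajectories))]
        for i in range(len(sequence)):
            state, count = Counter(s[i] for _, s in survivors).most_common(1)[0]
            if count / len(survivors) >= agreement:
                restraints[i] = state  # fix this position in later rounds
    best_states = results[0][1]  # lowest-energy structure from the final round
    return "".join(best_states), restraints


if __name__ == "__main__":
    prediction, fixed = predict("ALEVIYGPGALE")
    print(prediction, len(fixed))
```

Even in this toy, each round means dozens of independent trajectories, which hints at why a real version with physical energies and three-dimensional coordinates calls for a cluster: the trajectories within a round do not depend on one another and can run in parallel.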
Adhikari’s method marks an enormous improvement over the current state of the art. By comparison, a specially designed computer funded by billionaire and hobbyist David Shaw took 30 million CPU hours to determine the structure of a single protein. Adhikari used Midway to solve the same problem with comparable accuracy in just 600 CPU hours, a 50,000-fold reduction in computing time.
Another collaborator with the Freed team, Esmael Haddadian, lecturer in the Biological Sciences Collegiate Division, runs complex simulations through the RCC to study amyloid-beta, a protein linked to Alzheimer's disease. How individual proteins agglomerate to form amyloid-beta fibrils is difficult to observe experimentally. But using Midway, Haddadian is able to determine the minimum number of monomer proteins needed to start the formation of an amyloid fiber. “If I tried to run this simulation on this computer,” says Haddadian, referring to his iMac, “it would take years.”
Haddadian has also taken advantage of the RCC to teach: he has used Midway to show his students how to model the dynamics of proteins realistically in three dimensions by means of molecular dynamics simulations. He followed that up by using the Data Visualization Laboratory in the Kathleen A. Zar Room at the John Crerar Library to go over their trajectories in 3D and analyze the data.
There’s plenty more work to be done; efforts like the Human Genome Project, as well as projects to sequence the genomes of other species, have resulted in an exponential rise in the number of yet-to-be-studied proteins, explains Freed.
“I believe computing and data-intensive science will drive the discoveries of our time,” says H. Birali Runesha, assistant vice president for research computing and director of the RCC. If the work of Freed and Sosnick’s team is any example, this prediction may be coming true.