The Big (Data) Picture

May 18, 2015

by Benjamin Recchie

Andrey Rzhetsky, professor of medicine and human genetics, isn’t a computer scientist by trade. But the messy complexity of biomedicine is a problem that fairly cries out for analysis by computation. It was also the perfect springboard for him to discuss the overarching theme in his work in his talk for the Visualization Speaker Series, “Adventures in Analysis of Large Biomedical Datasets”: getting data for complex networks, combining data sets, and drawing from them some “non-obvious conclusions.”

One of the projects Rzhetsky discussed was an effort to classify the sense words—ones describing attributes like color, taste, or smell—used in scholarly works. By his estimation, there are roughly 1 trillion pages of scholarly work in existence, so investigating even a small subset of it required using computers to identify words in the six corpora his group studied. He found that, if sense words were life-forms, “scientific articles were like a desert,” mostly devoid of them, whereas literature was more like “a coral reef,” teeming with descriptions.

Rzhetsky noted that while the results were interesting, the real aim of the project was broader: training a robo-text evaluator “who doesn’t sleep or complain.” Such an evaluator could extend the human ability to make discoveries. “We are experiencing an explosion of innovation,” he said, but the price of that rapid innovation is unclear; asbestos and DDT, he pointed out, were hailed as major advances in their day until a better understanding of their costs led to their phasing out.

One major project Rzhetsky attempted along these lines used machine learning to extract relationships, such as “protein X binds to receptor Y,” from scientific papers in order to reveal previously undiscovered connections between genes and chemicals and their effects. He tailored the program to skip statements made in abstracts and conclusions, so as not to overweight them, and to ignore hypothetical statements that run contrary to observations. His analysis has suggested a few such connections, and he’s working with experimentally minded colleagues to test them.

Another project examined US public health records for the number of birth malformations in newborn boys; Rzhetsky reasoned that birth defects could be used as a proxy for the environmental load of dangerous chemicals. Comparing this distribution with factors like ethnicity, class, and education, he showed that a one percent increase in malformation was correlated with a 283 percent increase in autism diagnoses. This was not as straightforward as it seems; Rzhetsky had to account not just for “known unknowns,” such as the rigor of autism diagnoses, but also for “unknown unknowns,” factors he didn’t even know to control for.

In the spirit of data visualization, the theme of the speaker series, Rzhetsky closed with a 3D flythrough he had made of a map of connections between genes known to cause health problems in both flies (malformed eyes) and humans (renal cancer). Finding that combination of genes by strictly statistical methods is nigh-impossible; a set of 10 genes could have 10³⁷ possible combinations, far more than could ever be observed in nature, even if every human alive could be examined. Instead, he looked for network-connected combinations of genes, then tried to analyze which were linked together into a compact cluster.
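
For a rough sense of where a number that large comes from, the short Python sketch below counts the ways to pick 10 genes out of a whole genome. It is a back-of-the-envelope illustration, not Rzhetsky's own calculation: the genome size of about 20,000 genes and the world population of about 7 billion are round-number assumptions, and reading the figure as "choose 10 genes from the full genome" is an interpretation of his remark.

```python
import math

# Assumption (not from the talk): roughly 20,000 protein-coding human genes.
GENOME_SIZE = 20_000
# Size of the gene set mentioned in the talk.
SET_SIZE = 10
# Assumption: roughly 7 billion people alive, a round figure for 2015.
WORLD_POPULATION = 7_000_000_000


def count_gene_sets(genome_size: int, set_size: int) -> int:
    """Return the number of unordered gene sets of the given size."""
    return math.comb(genome_size, set_size)


combinations = count_gene_sets(GENOME_SIZE, SET_SIZE)
# How many distinct candidate sets there are for each living person.
sets_per_person = combinations // WORLD_POPULATION

print(f"distinct 10-gene sets: about 10^{math.log10(combinations):.1f}")
print(f"sets per living person: about 10^{math.log10(sets_per_person):.1f}")
```

Under these assumptions the count lands within an order of magnitude of the figure quoted in the talk, and it exceeds the number of living humans by more than 26 orders of magnitude, which is why testing every combination statistically is out of reach and a network-based shortcut is attractive.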

He likened the resulting 3D flythrough to the William Gibson novel Neuromancer, in which characters move through 3D data sets to gain a better understanding of them, an idea which is now making the transition from science fiction to reality. “I didn’t have to do this,” he admitted, referring to the visualization, “but it’s kind of fun.”
