Research Data Management for the 99%

March 13, 2013

by Rob Mitchum, Computation Institute

Big science projects can afford big cyberinfrastructure. For example, the Large Hadron Collider at CERN in Geneva generates 15 petabytes of data a year, but also boasts a sophisticated data management infrastructure for the movement, sharing and analysis of that gargantuan data flow. But big data is no longer an exclusive problem for these massive collaborations in particle physics, astronomy and climate modeling. Individual researchers, faced with new laboratory equipment and methods that can generate their own torrents of data, increasingly need their own data management tools, but lack the hefty budget large projects can dedicate to such tasks. What can the 99% of researchers doing big science in small labs do with their data?

That was how Computation Institute director Ian Foster framed the mission at hand for the Research Data Management Implementations Workshop, happening today and tomorrow in Arlington, VA. The workshop was designed to help researchers, collaborations and campuses deal with the growing need for high-performance data transfer, storage, curation and analysis -- while avoiding wasteful redundancy.

"The lack of a broader solution or methodology has led basically to a culture of one-off implementation solutions, where each institution is trying to solve their problem their way, where we don't even talk to each other, where we are basically reinventing the wheel every day," said H. Birali Runesha, director of the University of Chicago Research Computing Center, in his opening remarks.

The subsequent talks offered some promising options for serving the data needs of the research populace without falling into these traps. For a service to succeed, it must be frictionless, affordable and sustainable, Foster said. The best existing models for this combination are everyday services such as Netflix, Gmail or Flickr, which offer people instant access to movies, e-mail or photographs without having to install any software (other than a browser) on their computer. These examples of "software-as-a-service," or SaaS, rely upon intuitive user interfaces and cloud-hosted infrastructure to produce "an ease of use that leaves scientific tools in the dust," Foster said.

The motivation behind Globus Online is to merge that SaaS convenience with the ability to securely move and perform other tasks on large amounts of scientific data -- not gigabytes, but terabytes, Foster said. Unlike cloud storage services such as Dropbox, Globus Online doesn't store data, but instead facilitates its movement from one endpoint to another, helping scientists move data from a sequencing center or supercomputer facility back to their own laboratory servers, for instance. In its first few years of existence, Globus Online has signed up 8,000 registered users, who have moved over 10 petabytes of data and roughly one billion files. The service is now expanding beyond data transfer to more complex tasks, such as data sharing, analysis and archiving, that are normally handled at high expense by local campus IT services.

"Our vision for how research data management may be provided going forward is that storage and computing will sit in many different places within campuses," Foster said. "But a large fraction of the research data management functions that make those raw infrastructure resources usable can be provided more cost-effectively, efficiently and in a better manner by outsourcing them to software-as-a-service providers."

Richard Moore of the San Diego Supercomputer Center, however, described some of the cultural hurdles that stand in the way of widespread scientific adoption of these new tools. In late 2011, Moore and SDSC launched the largest academic cloud server in the United States, with 5.5 petabytes of storage and 8-gigabyte-per-second transfer speeds. The intention was to create easy access to data sharing and storage for individual researchers and collaborations, Moore said. But in the first year and change of the cloud's existence, the SDSC team found that the primary users were instead large-scale and commercial ones.

When the team surveyed individual researchers to determine why they weren't using the service, they found that researchers still preferred Dropbox (despite campus policy forbidding it) or off-the-shelf USB drives that could provide similar amounts of storage at a lower price. Data sharing, curation and long-term preservation were also not considered priorities by a majority of researchers, reflecting a scientific culture that has yet to shift towards demanding the advanced data management services that are now appearing.

"[Data sharing] is not a priority," Moore said. "For the 99 percent, the smaller labs, it's a cultural thing. People follow their motivations. As a domain researcher, will a terrific data management plan and a demonstrated history of successful data sharing through easy-to-use portals get you your next grant? That's where the funding agencies, when you start mandating and upping the importance of that in the review process will be important."

See original article at: https://ci.uchicago.edu/blog/research-data-management-99
