June 8, 2016
by Benjamin Recchie
Katherine Riley, Director of Science for the Argonne Leadership Computing Facility (ALCF), came to campus May 11 as part of the Research Computing Center’s Collaborations in Computation speaker series to encourage UChicago researchers to make use of ALCF’s resources. After all, she pointed out, not only is Argonne just 22 miles away, it’s also a direct descendant of UChicago’s participation in the Manhattan Project. And beyond the attraction of its Mira supercomputer, she said, ALCF has much more to offer researchers.
The Department of Energy has the largest collection of computing power under one umbrella in the world. Some of that power is in computers reserved for the National Nuclear Security Administration, for classified work on nuclear weapons. But DOE also maintains user facilities, open to any qualified researcher in academia or private industry, at ALCF and its peers, the National Energy Research Scientific Computing Center (NERSC) and the Oak Ridge Leadership Computing Facility (OLCF). (ALCF and OLCF are each built with a different architecture, to provide a diversity of platforms and manage risk.) The value of ALCF and the other user facilities isn’t just their raw computing power, Riley said, but what comes with it: a high-capacity network, data visualization clusters, and “experts on getting every last FLOP out of the system.” ALCF also provides training on getting started on its machines and webinars on best practices.
Time on the machines at ALCF, the Mira supercomputer as well as the Cetus and Vesta systems used for experimental and developmental projects, is allocated through three programs: INCITE (Innovative and Novel Computational Impact on Theory and Experiment), which covers “flagship” projects that wouldn’t be possible anywhere else; the ASCR Leadership Computing Challenge (ALCC), which covers projects central to DOE’s own mission; and Director’s Discretionary, a catchall for developmental projects that will later compete for time under the other two programs, training exercises, proof-of-concept simulations, and the like. ALCF also has the Early Science Program, which grants users time on newly deployed systems while the facility shakes them down, and the Data Science Program, which targets big data problems.
The traditional use model for supercomputing, said Riley, was that a researcher would run a simulation, generate data, and bring it back to their home institution. While many users still do that, others are using supercomputers in new ways, such as running complex workflows (like machine learning) or streaming experimental data. Riley shared some highlights of research done at ALCF, such as simulating combustion in an engine and studying superconductivity. All were consequential projects, but none were “traditional.”
It’s not just the ways researchers use the computers that are changing, either. “We’re seeing changes at all levels of HPC,” Riley said. Node capabilities are growing swiftly, but node interconnects are not. Memory architecture is changing for the first time in decades: future node architectures will have memory hierarchies to deliver the capacity that growing scientific demands require. Meanwhile, common programming languages like Fortran and C++ and models like MPI were designed for another era, when FLOPs, not bandwidth, were scarce, and power was something managed by facility operations, not by the software itself. New methods and approaches need to be developed, and outmoded paradigms need to be updated.
Adapting older methods and technology to the exascale will be difficult, she admitted, but ALCF is working hard to make sure applications can evolve and make that transition. “I promise you they can, and they will,” she said. “I absolutely promise you.”