Hosted Data

The Research Computing Center (RCC) supports data-intensive research at the University of Chicago by providing centrally managed storage for hosting large research data sets. By making commonly used data sets accessible through a centralized storage system, the RCC gives researchers the data they need without the overhead of storing and managing a repository themselves. Because the RCC stores these data sets in the same high-performance storage environment used by the Midway compute cluster, data and computational tools are tightly coupled, which allows analyses to run efficiently.

The RCC is able to host open-access as well as proprietary data sets on a case-by-case basis. Currently, the RCC hosts the following data sets:

The Community Earth System Model (CESM) data sets and models
Over 150,000 protein structures from the Protein Data Bank (PDB)
The Genome Taxonomy Database Toolkit (GTDB-Tk) reference database, derived from the Genome Taxonomy Database, with standardized taxonomy and reference trees for bacteria and archaea
The Nielsen Scanner and Panel full data sets
The IRI Marketing dataset
Consumer-level data linked to LPS McDash loan-level data for improved risk management (CRISM-McDash)
CoreLogic Loan-Level Market Analytics data
All Natural Language Toolkit (NLTK) corpora, models, and training sets
The full Corpus of Contemporary American English (COCA) and Corpus of Historical American English (COHA) datasets

To request hosting of an additional data set, or to learn how to access these resources, please contact the RCC.
