Using Midway

Now that you are able to log in to Midway (Connecting to RCC Resources), upload and access your files on the cluster (Data Storage and Data Transfer), and load software tools using the module system (Software), you are ready for the next step: scheduling access to the RCC compute cluster to perform computations. This is the topic of this section of the RCC User Guide.

Overview

The Midway compute cluster is a shared resource used by the entire University of Chicago community. Sharing computational resources creates unique challenges:

  1. Jobs must be scheduled in a way that is fair to all users.
  2. Consumption of resources needs to be recorded.
  3. Access to resources needs to be controlled.

The Midway compute cluster uses a scheduler to manage requests for access to compute resources. These requests are called jobs. In particular, we use the Slurm resource manager to schedule jobs as well as interactive access to compute nodes.

Here, we give the essential information you need to know to start computing on Midway. For more detailed information on running specialized compute jobs, see Running jobs on Midway.

Service Units and Allocations

Service Units (SUs) are a measure of the amount of computing resources consumed on a compute cluster. Computing resources in a compute cluster include processing units (also called CPUs or cores), memory, and Graphics Processing Units (GPUs). In standard settings, 1 SU equals usage of 1 processing unit for 1 hour, but the exact calculation will vary depending on the amount of memory requested, as well as additional factors such as the use of GPUs and the CPU architecture. The aim of the Service Unit (SU) is to provide a “fair” account of computing resources. For more information, please refer to the RCC Service Units webpage.
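For example, under the standard rate, a job that runs on 4 cores for 3 hours consumes roughly 4 x 3 = 12 SUs; jobs that request large amounts of memory per core, GPUs, or certain CPU architectures are typically charged at a higher rate.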

An “allocation” is a quantity of computing time (SUs) and storage resources that are granted to a group of users, usually a lab managed by a principal investigator (PI). Without an allocation, you cannot schedule and run jobs on the RCC compute cluster. For more information about SU allocations, see RCC Allocations.

Checking your account balance

The rcchelp tool can be used to check account balances. After logging into Midway, simply type:

$ rcchelp balance

If you are a member of multiple groups, this will display the allocations and usage for all your groups. The rcchelp balance command has a number of options for summarizing allocation usage. For information on these options, type

$ rcchelp balance --help

To see an overall summary of your usage, simply enter:

$ rcchelp usage

You can also get a more detailed breakdown of your usage by job using the --byjob option:

$ rcchelp usage --byjob

For more options available in the rcchelp tool, type

$ rcchelp --help

Types of Compute Nodes

The Midway compute cluster is made up of compute nodes with a variety of architectures and configurations. A partition is a collection of compute nodes that all have the same, or similar, architecture and configuration. Currently, Midway has the following partitions:

Cluster   Partition    Compute cores (CPUs)             Memory   Other configuration details
midway    broadwl      28 x Intel E5-2680v4 @ 2.4 GHz   64 GB    EDR and FDR Infiniband interconnect
          broadwl-lc   28 x Intel E5-2680v4 @ 2.4 GHz   64 GB    10G Ethernet interconnect
          bigmem2      28 x Intel E5-2680v4 @ 2.4 GHz   512 GB   FDR Infiniband interconnect
          gpu2         28 x Intel E5-2680v4 @ 2.4 GHz   64 GB    4 x Nvidia K80 GPU

You can also retrieve a summary of the partitions on Midway using the rcchelp sinfo command:

$ rcchelp sinfo shared

In the rcchelp sinfo shared summary, the “NODES” column gives the total number of nodes in each partition. This summary also lists partitions that are reserved for use by certain labs.
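If you prefer to query Slurm directly, the standard sinfo command (part of Slurm itself, not specific to RCC) also lists the partitions and the states of their nodes, although its output format differs from the rcchelp summary:

$ sinfo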

Interactive Jobs

After submitting an “interactive job” on Midway, the Slurm job scheduler will connect you to a compute node, and will load up an interactive shell environment for you to use on that compute node. This interactive session will persist until you disconnect from the compute node, or until you reach the maximum requested time. The default requested time is 2 hours.

sinteractive

The command sinteractive is the recommended Slurm command for requesting an interactive session. As soon as the requested resources become available, sinteractive will do the following:

  1. Log you in to the node.
  2. Change into the directory you were working in.
  3. Set up X11 forwarding for displaying graphics.
  4. Transfer your current shell environment, including any modules you have previously loaded.

To get started (with the default interactive settings), simply enter sinteractive in the command line:

$ sinteractive

By default, an interactive session times out after 2 hours. If you would like more than 2 hours, be sure to include a --time=HH:MM:SS flag to specify the necessary amount of time. For example, to request an interactive session for 6 hours, run the following command:

$ sinteractive --time=06:00:00

There are many additional options for the sinteractive command, including options to select the number of nodes, the number of cores per node, the amount of memory, and so on. For example, to request exclusive use of two compute nodes on the Midway broadwl partition for 8 hours, enter the following:

$ sinteractive --exclusive --partition=broadwl --nodes=2 --time=08:00:00

For more details about these and other useful options, read below about the sbatch command, and see Running jobs on Midway. Note that all options available with the sbatch command are also available with the sinteractive command.

There is a debug QoS set up on the broadwl partition to help users quickly access resources to debug or test their code before submitting jobs to the main broadwl partition. The debug QoS allows you to run one job and request up to 4 cores for 15 minutes. To use the debug QoS, you must specify --time, which must be 15 minutes or less. For example, to get 2 cores for 15 minutes, you could run:

$ sinteractive --qos=debug --time=00:15:00 --ntasks=2

srun

An alternative to the sinteractive command is the srun command:

$ srun --pty bash

Unlike sinteractive, this command does not set up X11 forwarding, so you cannot display graphics when using srun. Otherwise, srun accepts the same command options as sinteractive.
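For example, here is a sketch of requesting a 2-hour interactive shell on the broadwl partition with srun; the flags mirror those shown for sinteractive above:

$ srun --partition=broadwl --time=02:00:00 --ntasks=1 --pty bash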

Batch Jobs

The sbatch command is the command most commonly used by RCC users to request computing resources on the Midway cluster. Rather than specifying all the options on the command line, users typically write an “sbatch script” that contains all the commands and parameters necessary to run the program on the cluster.

In an sbatch script, each Slurm parameter is declared on a line beginning with #SBATCH, followed by the option and its value.

Here is an example of an sbatch script:

#!/bin/bash
#SBATCH --job-name=example_sbatch
#SBATCH --output=example_sbatch.out
#SBATCH --error=example_sbatch.err
#SBATCH --time=00:05:00
#SBATCH --partition=broadwl
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=14
#SBATCH --mem-per-cpu=2000

module load openmpi
mpirun ./hello-mpi

Here is an explanation of what each of these options does:

Option                                 Description
#SBATCH --job-name=example_sbatch     Assigns the label example_sbatch to the job.
#SBATCH --output=example_sbatch.out   Writes console output to the file example_sbatch.out.
#SBATCH --error=example_sbatch.err    Writes error messages to the file example_sbatch.err.
#SBATCH --time=00:05:00               Reserves the computing resources for 5 minutes (or less if the program completes before 5 min).
#SBATCH --partition=broadwl           Requests compute nodes from the broadwl partition on the Midway cluster.
#SBATCH --nodes=4                     Requests 4 compute nodes.
#SBATCH --ntasks-per-node=14          Requests 14 cores (CPUs) per node, for a total of 14 * 4 = 56 cores.
#SBATCH --mem-per-cpu=2000            Requests 2000 MB (2 GB) of memory (RAM) per core, for a total of 2 * 14 = 28 GB per node.

In this example, we have requested 4 compute nodes with 14 CPUs each. Therefore, we have requested a total of 56 CPUs for running our program. The last two lines of the script load the OpenMPI module and launch the MPI-based executable that we have called hello-mpi (see MPI jobs).

Continuing the example above, suppose this script is saved to a file called example.sbatch in the current directory. The script is submitted to the cluster using the following command:

$ sbatch ./example.sbatch
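If the submission succeeds, sbatch prints the id assigned to the job, which you can later use with the squeue and scancel commands described below. The output typically looks like the following (the number shown here is only an illustration):

Submitted batch job 8885121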

Many other options are available for submitting jobs using the sbatch command. For more specialized computational needs, see Running jobs on Midway. Additionally, for a complete list of the available options, see the Official SBATCH Documentation.
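As a further illustration, here is a minimal sketch of an sbatch script for a serial (single-core) job; my_program is a placeholder for your own executable:

#!/bin/bash
#SBATCH --job-name=serial_example
#SBATCH --output=serial_example.out
#SBATCH --time=01:00:00
#SBATCH --partition=broadwl
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=2000

# Run a single-core program (placeholder name)
./my_program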

Temporary File Storage

Many applications generate temporary or intermediate files that are written to /tmp. (These applications may write files to /tmp without you being aware that this is happening.) This folder typically resides on a local drive or on a RAM disk virtualized in system memory.

Files left in /tmp by a user’s job are not automatically purged until the node is rebooted, and can therefore affect jobs that later run on the same node. For this reason, RCC enforces a data purge policy for files written to /tmp on compute nodes:

1. For each running job, a special “job-protected” folder /tmp/jobs/${SLURM_JOB_ID} is created on each allocated node. Its contents are purged only upon termination of the job (when it successfully completes, is canceled, or is killed).

2. For each running job, the environment variables SLURM_TMPDIR and TMPDIR are set to /tmp/jobs/${SLURM_JOB_ID}. Whenever possible, write to the paths given by these environment variables rather than using /tmp explicitly. (Most applications should already use these environment variables by default, so in many cases this requires no change to your code.) A short sketch appears after this list.

3. In addition to using $TMPDIR, users should also verify that no additional files are being written to /tmp.

4. Note that upon termination of a job, any folders or files directly under /tmp that belong to the submitter of this job will be purged.

5. The contents of /tmp do not persist after jobs terminate. The RCC is not responsible for retrieving or recovering data stored there. Save critical outputs to the persistent file storage systems; see Data Storage and Data Transfer.
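As a sketch of points 2 and 5 above, the compute portion of an sbatch script might use the job-protected folder for scratch files and copy anything worth keeping back to persistent storage before the job ends (my_program, its --tmpdir flag, and results.dat are placeholders for your own tools and files):

# Write temporary files to the job-protected scratch folder
cd "$SLURM_TMPDIR"
"$SLURM_SUBMIT_DIR"/my_program --tmpdir="$SLURM_TMPDIR"

# Copy results you want to keep back to persistent storage
cp results.dat "$SLURM_SUBMIT_DIR"/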

Note

Folders or files created by users in /tmp outside $TMPDIR are NOT job-protected. For example, suppose a user has two jobs (A and B) running on the same node, and job B writes files directly to /tmp. If job A terminates before job B, those files in /tmp will also be purged, which in some cases may cause job B to fail. To avoid this, write any temporary data to the job-protected folder given by $SLURM_TMPDIR or $TMPDIR.

Managing Jobs

The Slurm job scheduler provides several command-line tools for checking on the status of your jobs and for managing them. For a complete list of Slurm commands, see the Slurm man pages. Here are a few commands that you may find particularly useful:

  • squeue: reports the status of jobs submitted by you and other users.
  • sacct: retrieves job history and statistics about past jobs (see the example after this list).
  • scancel: cancels jobs you have submitted.
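For instance, a quick way to review your recent jobs with sacct (the format fields shown are standard sacct fields; adjust them as needed):

$ sacct --user=$USER --format=JobID,JobName,Partition,Elapsed,State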

In the next two sections, we explain how to use squeue to find out the status of your submitted jobs and scancel to cancel jobs in the queue.

Checking your jobs

Use the squeue command to check the status of your jobs and other jobs running on Midway. The simplest invocation lists all jobs that are currently running or waiting in the job queue (“pending”), along with details about each job such as the job id and the number of nodes requested:

$ squeue
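The output is arranged in columns; for illustration, it might look something like this (the job ids, names, and users shown here are made up):

  JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
8885121   broadwl  example     jdoe  R   1:02      4 midway2-[0172-0175]
8885129   bigmem2  example     jdoe PD   0:00      1 (Resources)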

Any job with 0:00 under the TIME column is a job that is still waiting in the queue.

To view only the jobs that you have submitted, use the --user flag

$ squeue --user=$USER

This command has many other useful options for querying the status of the queue and getting information about individual jobs. For example, to get information about all jobs that are waiting to run on the bigmem2 partition, enter:

$ squeue --state=PENDING --partition=bigmem2

Alternatively, to get information about all your jobs that are running on the bigmem2 partition, type:

$ squeue --state=RUNNING --partition=bigmem2 --user=$USER

The last column of the output tells us which nodes are allocated for each job. For example, if it shows midway2-0172 for one of the jobs under your name, you may type ssh midway2-0172 to log in to that compute node and inspect the progress of your computation locally.

For more information, consult the command-line help by typing squeue --help, or visit the official online documentation.

Canceling your jobs

To cancel a job you have submitted, use the scancel command. This requires you to specify the id of the job you wish to cancel. For example, to cancel a job with id 8885128, do the following:

$ scancel 8885128

If you are unsure of the id of the job you would like to cancel, check the JOBID column in the output of squeue --user=$USER.

To cancel all jobs you have submitted that are either running or waiting in the queue, enter the following:

$ scancel --user=$USER
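scancel also accepts filters similar to those of squeue. For example, here is a sketch of canceling only your pending jobs on the broadwl partition:

$ scancel --user=$USER --state=PENDING --partition=broadwl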

Job Limits

To distribute computational resources fairly to all Midway users, the RCC sets limits on the amount of computing resources that may be requested by a single user at any given time.

The maximum run-time for an individual job is 36 hours. This applies to all batch and interactive jobs submitted to nodes in the general-access partitions (broadwl, broadwl-lc, bigmem2, and gpu2). Groups participating in the cluster partnership program may customize resource limits for their partitions.

Additional information on limits, such as the maximum number of CPUs that can be requested by a user at any one time, or the number of jobs that can be submitted concurrently on a given partition, can be found by entering the command rcchelp qos on any login or compute node on Midway. Note that these limits often differ from one partition to another.
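For example, from a Midway login node:

$ rcchelp qos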

Usage limits may change, so rcchelp qos will always give you the most up-to-date information.

If your research requires a temporary exception to a particular limit, you may apply for a special allocation. Special allocations are evaluated on an individual basis and may or may not be granted.