.. index:: pair: RCC; User Guide

.. _using-midway:

===============================================================
Using Midway
===============================================================

Now that you are able to log in to Midway (:ref:`connecting`), upload and
access your files on the cluster (:ref:`data-storage` and :ref:`data-transfer`),
and load software tools using the module system (:ref:`software`), you are
ready for the next step: scheduling access to the RCC compute cluster to
perform computations. This is the topic of this section of the RCC User Guide.

.. contents::
   :local:

Overview
========

The Midway compute cluster is a shared resource used by the entire University
of Chicago community. Sharing computational resources creates unique
challenges:

1. Jobs must be scheduled in a way that is fair to all users.

2. Consumption of resources needs to be recorded.

3. Access to resources needs to be controlled.

The Midway compute cluster uses a **scheduler** to manage requests for access
to compute resources. These requests are called **jobs**. In particular, we
use the Slurm_ resource manager to schedule jobs as well as interactive access
to compute nodes.

Here, we give the essential information you need to know to start computing on
Midway. For more detailed information on running specialized compute jobs, see
:ref:`running-jobs`.

.. _service_units:

Service Units and Allocations
=============================

Service Units (SUs) are a measure of the amount of computing resources
consumed on a compute cluster. Computing resources in a compute cluster
include processing units (also called CPUs or cores), memory, and Graphical
Processing Units (GPUs). In standard settings, 1 SU equals usage of 1
processing unit for 1 hour, but the exact calculation will vary depending on
the amount of memory requested, as well as additional factors such as the use
of GPUs and the CPU architecture. The aim of the Service Unit (SU) is to
provide a "fair" accounting of computing resources. For more information,
please refer to the `RCC Service Units`_ webpage.

An "allocation" is a quantity of computing time (SUs) and storage resources
that are granted to a group of users, usually a lab managed by a principal
investigator (PI). Without an allocation, you cannot schedule and run jobs on
the RCC compute cluster. For more information about SU allocations, see
`RCC Allocations`_.

.. _account_balance:

Checking your account balance
=============================

The :command:`rcchelp` tool can be used to check account balances. After
logging into Midway, simply type:

::

   $ rcchelp balance

If you are a member of multiple groups, this will display the allocations and
usage for all your groups. The :command:`rcchelp balance` command has a number
of options for summarizing allocation usage. For information on these options,
type:

::

   $ rcchelp balance --help

To see an overall summary of your usage, simply enter:

::

   $ rcchelp usage

You can also get a more detailed breakdown of your usage by job using the
:option:`--byjob` option:

::

   $ rcchelp usage --byjob

For more options available in the :command:`rcchelp` tool, type:

::

   $ rcchelp --help

.. _node_types:

Types of Compute Nodes
======================

The Midway compute cluster is made up of compute nodes with a variety of
architectures and configurations. A **partition** is a collection of compute
nodes that all have the same, or similar, architecture and configuration.
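Each partition can also be examined directly with standard Slurm commands from
a login node. Here is a minimal sketch, using ``broadwl`` (one of the
partitions listed below) as an example:

::

   # Show Slurm's configuration for one partition (time limit, defaults, node list).
   $ scontrol show partition broadwl

   # One line per node: hostname, CPU count, memory in MB, and current state.
   $ sinfo --partition=broadwl --format="%n %c %m %t"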
Currently, Midway has the following partitions:

+---------+-------------+---------------------------------+------------+--------------------------------------------+
| Cluster | Partition   | Compute cores (CPUs)            | Memory     | Other configuration details                |
+=========+=============+=================================+============+============================================+
| midway  | broadwl     | 28 x Intel E5-2680v4 @ 2.4 GHz  | 64 GB      | EDR and FDR Infiniband interconnect        |
+---------+-------------+---------------------------------+------------+--------------------------------------------+
|         | broadwl-lc  | 28 x Intel E5-2680v4 @ 2.4 GHz  | 64 GB      | 10G Ethernet interconnect                  |
+---------+-------------+---------------------------------+------------+--------------------------------------------+
|         | bigmem2     | 28 x Intel E5-2680v4 @ 2.4 GHz  | 512 GB     | FDR Infiniband interconnect                |
+---------+-------------+---------------------------------+------------+--------------------------------------------+
|         | gpu2        | 28 x Intel E5-2680v4 @ 2.4 GHz  | 64 GB      | 4 x Nvidia K80 GPUs                        |
+---------+-------------+---------------------------------+------------+--------------------------------------------+

You can also retrieve a summary of the partitions on Midway using the
:command:`rcchelp sinfo` command:

::

   $ rcchelp sinfo shared

In the :command:`rcchelp sinfo shared` summary, the "NODES" column gives the
total number of nodes in each partition. This summary also lists partitions
that are reserved for use by certain labs.

.. _interactive_jobs:

Interactive Jobs
================

After submitting an "interactive job" on Midway, the Slurm job scheduler will
connect you to a compute node, and will load up an interactive shell
environment for you to use on that compute node. This interactive session will
persist until you disconnect from the compute node, or until you reach the
maximum requested time. The default requested time is 2 hours.

sinteractive
------------

The command ``sinteractive`` is the recommended Slurm command for requesting
an interactive session. As soon as the requested resources become available,
``sinteractive`` will do the following:

1. Log in to the node.

2. Change into the directory you were working in.

3. Set up X11 forwarding for displaying graphics.

4. Transfer your current shell environment, including any modules you have
   previously loaded.

To get started (with the default interactive settings), simply enter
``sinteractive`` in the command line:

::

   $ sinteractive

By default, an interactive session times out after 2 hours. If you would like
more than 2 hours, be sure to include a ``--time=HH:MM:SS`` flag to specify
the necessary amount of time. For example, to request an interactive session
for 6 hours, run the following command:

::

   $ sinteractive --time=06:00:00

There are many additional options for the ``sinteractive`` command, including
options to select the number of nodes, the number of cores per node, the
amount of memory, and so on. For example, to request exclusive use of two
compute nodes on the Midway **broadwl** partition for 8 hours, enter the
following:

::

   $ sinteractive --exclusive --partition=broadwl --nodes=2 --time=08:00:00

For more details about these and other useful options, read below about the
``sbatch`` command, and see :ref:`running-jobs`. Note that all options
available in the ``sbatch`` command are also available for the
``sinteractive`` command.
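As an illustration of combining several of these options, the following sketch
requests a single node with 4 cores and 2,000 MB of memory per core for 3
hours on the ``broadwl`` partition; the specific values are arbitrary examples
rather than recommendations:

::

   $ sinteractive --partition=broadwl --nodes=1 --ntasks=4 --mem-per-cpu=2000 --time=03:00:00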
There is a ``debug`` QoS set up on the ``broadwl`` partition to help users
quickly access some resources to debug or test their code before submitting
their jobs to the main ``broadwl`` partition. The ``debug`` QoS allows you to
run one job and request up to 4 cores for 15 minutes. To use the ``debug``
QoS, you must specify ``--time``, which should be 15 minutes or less. For
example, to get 2 cores for 15 minutes, you could run:

::

   $ sinteractive --qos=debug --time=00:15:00 --ntasks=2

srun
----

An alternative to the ``sinteractive`` command is the ``srun`` command:

::

   $ srun --pty bash

Unlike ``sinteractive``, this command does not set up X11 forwarding, which
means you cannot display graphics using ``srun``. Both the ``srun`` and
``sinteractive`` commands accept the same command options.

.. _batch_jobs:

Batch Jobs
==========

The ``sbatch`` command is the command most commonly used by RCC users to
request computing resources on the Midway cluster. Rather than specify all the
options on the command line, users typically write an "sbatch script" that
contains all the commands and parameters necessary to run a program on the
cluster.

In an sbatch script, all Slurm parameters are declared with ``#SBATCH``,
followed by additional definitions. Here is an example of an sbatch script:

.. code:: bash

    #!/bin/bash
    #SBATCH --job-name=example_sbatch
    #SBATCH --output=example_sbatch.out
    #SBATCH --error=example_sbatch.err
    #SBATCH --time=00:05:00
    #SBATCH --partition=broadwl
    #SBATCH --nodes=4
    #SBATCH --ntasks-per-node=14
    #SBATCH --mem-per-cpu=2000

    module load openmpi
    mpirun ./hello-mpi

Here is an explanation of what each of these options does:

+---------------------------------------------+-----------------------------------------------------------------------------------------------+
| Option                                      | Description                                                                                     |
+=============================================+===============================================================================================+
| :code:`#SBATCH --job-name=example_sbatch`   | Assigns label :code:`example_sbatch` to the job.                                              |
+---------------------------------------------+-----------------------------------------------------------------------------------------------+
| :code:`#SBATCH --output=example_sbatch.out` | Writes console output to file :code:`example_sbatch.out`.                                     |
+---------------------------------------------+-----------------------------------------------------------------------------------------------+
| :code:`#SBATCH --error=example_sbatch.err`  | Writes error messages to file :code:`example_sbatch.err`.                                     |
+---------------------------------------------+-----------------------------------------------------------------------------------------------+
| :code:`#SBATCH --time=00:05:00`             | Reserves the computing resources for 5 minutes (or less if the program completes sooner).     |
+---------------------------------------------+-----------------------------------------------------------------------------------------------+
| :code:`#SBATCH --partition=broadwl`         | Requests compute nodes from the ``broadwl`` partition on the Midway cluster.                  |
+---------------------------------------------+-----------------------------------------------------------------------------------------------+
| :code:`#SBATCH --nodes=4`                   | Requests 4 compute nodes.                                                                     |
+---------------------------------------------+-----------------------------------------------------------------------------------------------+
| :code:`#SBATCH --ntasks-per-node=14`        | Requests 14 cores (CPUs) per node, for a total of 14 * 4 = 56 cores.                          |
+---------------------------------------------+-----------------------------------------------------------------------------------------------+
| :code:`#SBATCH --mem-per-cpu=2000`          | Requests 2000 MB (2 GB) of memory (RAM) per core, for a total of 2 * 14 = 28 GB per node.     |
+---------------------------------------------+-----------------------------------------------------------------------------------------------+

In this example, we have requested 4 compute nodes with 14 CPUs each.
Therefore, we have requested a total of 56 CPUs for running our program. The
last two lines of the script load the OpenMPI module and launch the MPI-based
executable that we have called ``hello-mpi`` (see :ref:`MPI_jobs`).

Continuing the example above, suppose that this script is saved to a file
called ``example.sbatch`` in the current directory. This script is submitted
to the cluster using the following command:

.. code:: bash

    $ sbatch ./example.sbatch

Many other options are available for submitting jobs using the
:command:`sbatch` command. For more specialized computational needs, see
:ref:`running-jobs`. Additionally, for a complete list of the available
options, see the `Official SBATCH Documentation`_.

.. _temporary_folder:

Temporary File Storage
======================

Many applications generate temporary or intermediate files that are written to
``/tmp``. (These applications may write files to ``/tmp`` even without you
being aware that this is happening.) This folder typically resides on a local
drive or on a RAM disk that is virtualized in system memory. Contents left in
``/tmp`` by a user's job are not automatically purged until the corresponding
node is rebooted, and therefore may affect other jobs that later run on the
same node. For this reason, RCC enforces a data purge policy for files written
to ``/tmp`` on compute nodes:

1. For each running job, a special "job-protected" folder
   ``/tmp/jobs/${SLURM_JOB_ID}`` is created on each allocated node. Its
   contents are purged only upon termination of the job (when it is
   successfully completed, canceled, or killed).

2. For any running job, the environment variables ``SLURM_TMPDIR`` and
   ``TMPDIR`` are set to ``/tmp/jobs/${SLURM_JOB_ID}``. Whenever possible,
   users should write to the paths specified by these environment variables
   rather than using ``/tmp`` explicitly. (Most applications should already be
   using these environment variables by default, so in many cases this will
   not require any change to your code.)

3. In addition to using ``$TMPDIR``, users should also verify that no
   additional files are being written to ``/tmp``.

4. Note that upon termination of a job, any folders or files directly under
   ``/tmp`` that belong to the submitter of that job will be purged.

5. The contents of ``/tmp`` do not persist after jobs terminate. The RCC is
   not responsible for retrieving or recovering data stored there. For
   critical outputs, please save them to the persistent file storage systems;
   see :ref:`data-storage` and :ref:`data-transfer`.

.. note:: Folders or files created by users in ``/tmp`` outside ``$TMPDIR``
   are NOT job-protected.
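As a concrete illustration of this policy, here is a minimal sketch of a batch
script that does its scratch work in the job-protected folder and copies its
results back to persistent storage before the job ends. The job name, the
``my_analysis`` program, and the ``results.csv`` file are hypothetical
placeholders:

.. code:: bash

    #!/bin/bash
    #SBATCH --job-name=tmpdir_example
    #SBATCH --output=tmpdir_example.out
    #SBATCH --time=01:00:00
    #SBATCH --partition=broadwl
    #SBATCH --ntasks=1

    # Work inside the job-protected folder; Slurm points TMPDIR at
    # /tmp/jobs/${SLURM_JOB_ID} on the allocated node.
    cd "$TMPDIR"

    # "my_analysis" is a hypothetical program that writes its temporary and
    # output files to the current working directory.
    "$SLURM_SUBMIT_DIR"/my_analysis

    # Copy the results you need back to persistent storage before the job
    # ends; everything left under $TMPDIR is purged when the job terminates.
    cp results.csv "$SLURM_SUBMIT_DIR"/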
To see why writing directly to ``/tmp`` is risky, consider the case where a
user has two running jobs (A and B) on the same node, and job B is writing
files directly to ``/tmp``. If job A terminates before job B, the user's files
directly under ``/tmp`` will also be purged, which in some cases may cause job
B to fail. To avoid such failures, always write temporary data to the
job-protected folder given by ``$SLURM_TMPDIR`` or ``$TMPDIR``.

.. _managing_jobs:

Managing Jobs
=============

The Slurm job scheduler provides several command-line tools for checking on
the status of your jobs and for managing them. For a complete list of Slurm
commands, see the `Slurm man pages`_. Here are a few commands that you may
find particularly useful:

* :command:`squeue`: finds out the status of jobs submitted by you and other
  users.

* :command:`sacct`: retrieves job history and statistics about past jobs.

* :command:`scancel`: cancels jobs you have submitted.

In the next couple of sections, we explain how to use :command:`squeue` to
find out the status of your submitted jobs, and :command:`scancel` to cancel
jobs in the queue.

Checking your jobs
------------------

Use the ``squeue`` command to check on the status of your jobs, and other jobs
running on Midway. The simplest invocation lists all jobs that are currently
running or waiting in the job queue ("pending"), along with details about each
job such as the job id and the number of nodes requested:

::

   $ squeue

Any job with ``0:00`` under the ``TIME`` column is a job that is still waiting
in the queue.

To view only the jobs that you have submitted, use the ``--user`` flag:

::

   $ squeue --user=$USER

This command has many other useful options for querying the status of the
queue and getting information about individual jobs. For example, to get
information about all jobs that are waiting to run on the ``bigmem2``
partition, enter:

::

   $ squeue --state=PENDING --partition=bigmem2

Alternatively, to get information about all your jobs that are running on the
``bigmem2`` partition, type:

::

   $ squeue --state=RUNNING --partition=bigmem2 --user=$USER

The last column of the output tells you which nodes are allocated to each job.
For example, if it shows ``midway2-0172`` for one of the jobs under your name,
you may type ``ssh midway2-0172`` to log in to that compute node and inspect
the progress of your computation locally.

For more information, consult the command-line help by typing
``squeue --help``, or visit the `official online documentation`_.

Canceling your jobs
-------------------

To cancel a job you have submitted, use the ``scancel`` command. This requires
you to specify the id of the job you wish to cancel. For example, to cancel a
job with id ``8885128``, do the following:

::

   $ scancel 8885128

If you are unsure of the id of the job you would like to cancel, see the
``JOBID`` column in the output of ``squeue --user=$USER``.

To cancel **all** jobs you have submitted that are either running or waiting
in the queue, enter the following:

::

   $ scancel --user=$USER

.. _job_limits:

Job Limits
==========

To distribute computational resources fairly to all Midway users, the RCC sets
limits on the amount of computing resources that may be requested by a single
user at any given time.

The maximum run-time for an individual job is 36 hours. This applies to all
batch and interactive jobs submitted to nodes in the general-access partitions
(``broadwl``, ``broadwl-lc``, ``bigmem2``, and ``gpu2``). Groups participating
in the `cluster partnership program`_ may customize resource limits for their
partitions.
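If you want to confirm the run-time limit that Slurm enforces for a particular
partition, you can also query it directly with a standard Slurm command. This
is a minimal sketch, with ``broadwl`` used as an example partition:

::

   # Print the partition name and its configured time limit.
   $ sinfo --partition=broadwl --format="%P %l"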
Additional information on limits, such as the maximum number of CPUs that can
be requested by a user at any one time, or the number of jobs that can be
submitted concurrently on a given partition, can be found by entering the
command ``rcchelp qos`` on any login or compute node on Midway. Observe that
these limits are often different depending on the partition. Usage limits may
change, so ``rcchelp qos`` will always give you the most up-to-date
information.

If your research requires a temporary exception to a particular limit, you may
apply for a `special allocation`_. Special allocations are evaluated on an
individual basis and may or may not be granted.