Slurm Quickstart Guide

Prerequisites:

A CSAIL account

Which nodes you can access depends on research group membership.

Data Storage

For most users, the best choice is an NFS filesystem.

If you do not have a dedicated filesystem, you may use a temporary scratch filesystem.

See: https://tig.csail.mit.edu/data-storage/nfs/
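
As a minimal sketch, assuming the temporary scratch filesystem is mounted at /data/scratch (the path used by the example job later in this guide), you can make yourself a working directory there:

# Make a personal directory on the scratch filesystem
mkdir -p /data/scratch/$USER

# Check how much space is free there
df -h /data/scratch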

Job requirements

To run a Slurm job, you must specify the following things:

Memory

specified in megabytes

CPU

specified in number of cores

Partition

the group of systems that will be eligible to run your job

all CSAIL members have access to partition “tig” and the associated QoS “tig-main” (commands for listing the partitions and QoSes available to you are sketched after this list)

QoS

the “quality of service”, which determines job priority, resource limits, and pre-emptability

Some QoSes allow you to run jobs on idle nodes belonging to other groups, with the caveat that your job may be pre-empted, i.e. killed, if the system’s owner needs the resources.

Time

the maximum time your job will be allowed to run; Slurm terminates the job when this limit is reached

Different partitions have different maximum time limits, typically between 24 hours and 7 days.
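
To find out which partitions and QoSes your account can actually use, the standard Slurm tools sinfo and sacctmgr are a reasonable starting point. This is a minimal sketch; the exact columns and QoS fields shown depend on the site’s Slurm configuration:

# List partitions visible to you, with their time limits and node states
sinfo

# Show QoS definitions, including priority and preemption settings
sacctmgr show qos format=Name,Priority,Preempt,MaxWall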

Submitting your job

First, connect to the Slurm login server with ssh slurm-login.csail.mit.edu. All jobs must be submitted from a login server.

One way to submit a job is with the tool sbatch. This allows you to run multiple commands in one job, and the job will keep running even if you disconnect from the login node.

Copy this script to my_slurm_example.sbatch:

#!/bin/bash
#
#SBATCH --job-name=my_very_own_job
#SBATCH --partition=tig
#SBATCH --qos=tig-main
#SBATCH --time=00:50:00 # 50 minutes
#SBATCH --output=/data/scratch/%u/job_output_%j.log
#SBATCH --error=/data/scratch/%u/job_error_%j.log
#SBATCH --gpus=0
#SBATCH --cpus-per-task=1
#SBATCH --mem=4000M

echo "My Job ID is $SLURM_JOB_ID"
echo "The time is $(date)"
echo "This job is running on $(hostname)"

python -c "print('Hello, world!')"

Now run it with: sbatch my_slurm_example.sbatch.
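
After submitting, you can check on the job with the standard Slurm tools squeue and sacct; <jobid> below is a placeholder for the ID that sbatch prints when you submit:

# Check your queued and running jobs
squeue -u $USER

# Show accounting details for a job, running or finished (replace <jobid>)
sacct -j <jobid>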

The output of your job will be written to the paths specified with --output and --error; in these patterns, Slurm expands %u to your username and %j to the job ID. (Note that environment variables like $USER are not expanded inside #SBATCH directives, which is why the script uses %u.) In this example, the logs are written within the /data/scratch filesystem. If your directory there does not yet exist, create it within /data/scratch before submitting.
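
For example, assuming the per-user directory layout used above:

# Create your scratch directory so Slurm can write the log files
mkdir -p /data/scratch/$USER

# Follow a running job's output (replace <jobid> with your job's ID)
tail -f /data/scratch/$USER/job_output_<jobid>.log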

Interactive jobs

Because a Slurm job’s environment differs in a few ways from a non-Slurm system, it can be useful to run commands interactively to confirm how they behave before putting them in an sbatch script.

Here’s an example srun command to quickly launch a single process:

srun --partition=tig --qos=tig-main --time=5:00 --cpus-per-task=1 --mem=4000M python -c "print('Hello, world!')"

The output of this command will print to your terminal.

You can also get an interactive shell by adding the --pty flag:

srun --partition=tig --qos=tig-main --time=5:00 --cpus-per-task=1 --mem=4000M --pty /bin/bash
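
Once the shell starts, you are on a compute node rather than the login server. For example:

# You are now on a compute node, not slurm-login
hostname

# The Slurm job backing this shell
echo $SLURM_JOB_ID

# Exiting the shell ends the job and releases its resources
exit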