Slurm Quickstart Guide
Prerequisites:
A CSAIL account
Which nodes you can access depends on research group membership.
Data Storage
For most users, the best choice is an NFS filesystem.
If you do not have a dedicated filesystem, you may use a temporary scratch filesystem.
Job requirements
To run a Slurm job, you must specify the following (example commands for checking what is available to you appear after this list):
Memory
specified in megabytes
CPU
specified in number of cores
Partition
the group of systems that will be eligible to run your job
all CSAIL members have access to the partition “tig” and the associated QoS “tig-main”
QoS
the “quality of service”, which determines job priority, resource limits, and pre-emptability
Some QoSes allow you to run jobs on idle nodes belonging to other groups, with the caveat that your job may be pre-empted, i.e. killed, if the system owner needs the node.
Time
the maximum time that your job will run
Different partitions have different maximum time limits, typically between 24 hours and 7 days.
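On the command line, these requirements map onto the options used in the examples below: --mem, --cpus-per-task, --partition, --qos, and --time. If you want to see what is available to you before submitting, the standard Slurm commands sinfo and sacctmgr can help; this is only a sketch, and the partitions and QoS names you see will depend on your group:

sinfo                                              # lists partitions, their time limits, and their nodes
sacctmgr show qos format=Name,Priority,MaxWall     # lists QoS definitions with priority and wall-time limits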
Submitting your job
First, connect to the Slurm login server with ssh slurm-login.csail.mit.edu. All jobs must be submitted from a login server.
One way to submit a job is with the tool sbatch. This allows you to run multiple commands in one job, and your job keeps running even if you disconnect from the login node.
Copy this script to my_slurm_example.sbatch:
#!/bin/bash
#
#SBATCH --job-name=my_very_own_job
#SBATCH --partition=tig
#SBATCH --qos=tig-main
#SBATCH --time=00:50:00 # 50 minutes
#SBATCH --output=/data/scratch/%u/job_output_%j.log # %u = username, %j = job ID ($USER is not expanded in #SBATCH lines)
#SBATCH --error=/data/scratch/%u/job_error_%j.log
#SBATCH --gpus=0
#SBATCH --cpus-per-task=1
#SBATCH --mem=4000M
echo "My Job ID is $SLURM_JOB_ID"
echo "The time is $(date)"
echo "This job is running on $(hostname)"
python -c "print('Hello, world!')"
Now run it with: sbatch my_slurm_example.sbatch
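sbatch prints the ID of the newly submitted job. If you want to check whether the job is still pending or already running, the standard squeue command works here; for example:

squeue -u $USER    # shows your queued and running jobs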
The output of your job will be written to the paths specified with --output and --error. In this example, it's within the /data/scratch filesystem. If this directory does not yet exist, you may create it within /data/scratch.
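For example, assuming your per-user scratch directory follows the /data/scratch/<username> pattern used above, you can create it from the login node with:

mkdir -p /data/scratch/$USER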
Interactive jobs
Because a Slurm job's environment differs in a few ways from non-Slurm systems, it can be useful to run commands interactively to confirm how they behave as you build your sbatch script.
Here’s an example srun command to quickly launch a single process:
srun --partition=tig --qos=tig-main --time=5:00 --cpus-per-task=1 --mem=4000M python -c "print('Hello, world!')"
The output of this command will print to your terminal.
You can also have an interactive shell by adding the --pty flag:
srun --partition=tig --qos=tig-main --time=5:00 --cpus-per-task=1 --mem=4000M --pty /bin/bash
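Once the shell starts, you are on a compute node rather than the login node, and the usual Slurm environment variables are set. A quick, illustrative sanity check:

hostname              # prints the compute node's name, not slurm-login
echo $SLURM_JOB_ID    # the ID of your interactive job
exit                  # ends the job and releases the allocated resources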