Basic Usage / Quick Start

What is Slurm?

Slurm is a cluster scheduler used to share compute resources in a managed queue.

Our cluster consists of the GPU nodes vcuda-[0-4] and tig-k80-0, and the CPU nodes groenig-[0-15].

At the time of this writing, all but one of these nodes are hosted on OpenStack. The systems vcuda-[0-4] each have PCI pass-through access to eight Titan XP cards (12G video memory each) and have been assigned 16 virtual CPU cores (on the hypervisors' Xeon E5 processors) and 128G of memory. Nodes groenig-[0-15] are also virtual, with 16 virtual cores and 64G of memory each. tig-k80-0 is our one physical host, with eight Tesla K80s (12G video memory each), 32 Xeon E5 cores, and 128G of memory.

Getting Started

Prerequisites

You will need a CSAIL membership and credentials.

Your program or script must be located in an NFS filesystem. See the NFS documentation for information about requesting a filesystem. (Note: AFS tokens cannot be used within the cluster, and this includes your CSAIL home directory.)

To run a job, write an sbatch script describing it (see below) and submit the script with sbatch from the login node, tig-slurm.csail.mit.edu.

Data placed in /home will not be accessible across the cluster, and may cause your program to behave unexpectedly.
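
One quick way to check whether your current directory will be visible to the cluster is to ask df for its filesystem type (a type of nfs or nfs4 indicates an NFS mount):

% df -hT .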

Writing an sbatch script

An sbatch script contains a shell interpreter line, #SBATCH directives describing the resources the job needs, and the commands to run (typically launched with srun).

Here is a simple example.

% cat hello.sbatch

#!/bin/bash
#SBATCH --time=00:00:10
#SBATCH --partition=cpu
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=128
srun echo 'Hello, world!'

The script is submitted to the cluster as a single job, and output from its srun tasks is written to a file named after the job ID.

% sbatch hello.sbatch
Submitted batch job 127314

% cat slurm-127314.out
Hello, world!

Submit multiple jobs

We can request an array of jobs with sbatch.

% cat dice.sbatch

#!/bin/bash
#SBATCH --time=00:00:10
#SBATCH --partition=cpu
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=128
srun echo $(( $RANDOM % 6 + 1))

Each #SBATCH directive applies to all srun commands in the script, and when the script is submitted with --array, each task in the array receives its own copy of the requested resources.

% sbatch --array=0-5 dice.sbatch
Submitted batch job 129942


% ls slurm-129942* && cat slurm-129942*
slurm-129942_0.out  slurm-129942_1.out  slurm-129942_2.out  slurm-129942_3.out  slurm-129942_4.out  slurm-129942_5.out
1
6
4
3
2
6

Use the array index

A task in an array can read its index from the SLURM_ARRAY_TASK_ID environment variable.

% cat nlog.sbatch

#!/bin/bash
#SBATCH --time=00:00:10
#SBATCH --partition=cpu
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=128
srun python -c 'import math,os; \
  my_index = int(os.getenv("SLURM_ARRAY_TASK_ID"));\
  print(math.log( my_index ))'

% sbatch --array=1-8 nlog.sbatch

Submitted batch job 128222

% cat slurm-128222_*
0.0
0.69314718056
1.09861228867
1.38629436112
1.60943791243
1.79175946923
1.94591014906
2.07944154168
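
A common pattern is to use the index to select a different input for each task. Below is a minimal sketch of that pattern; the input directory and processing script are hypothetical placeholders for your own data and code.

% cat process.sbatch

#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --partition=cpu
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=128
# Pick the Nth input file, where N is this task's array index.
# /path/to/nfs/inputs and process.py are placeholders.
INPUT=$(ls /path/to/nfs/inputs/*.txt | sed -n "$((SLURM_ARRAY_TASK_ID + 1))p")
srun python process.py "$INPUT"

% sbatch --array=0-9 process.sbatch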

Request GPU resources

If you require use of a GPU, submit your job to the “gpu” partition and request GPUs with --gres (short for “generic resource”).

% cat gpu.sbatch

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1

srun nvidia-smi -q

To request a specific GPU type, add it to the --gres directive, e.g. --gres=gpu:titan:1.

% sbatch gpu.sbatch
Submitted batch job 129931

% head slurm-129931.out

==============NVSMI LOG==============

Timestamp                           : Tue Jun  2 09:00:00 2020
Driver Version                      : 430.50
CUDA Version                        : 10.1

Attached GPUs                       : 8
GPU 00000000:04:00.0
    Product Name                    : Tesla K80

Additional options

For advanced usage of sbatch, please refer to https://slurm.schedmd.com/sbatch.html
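
A few commonly used directives (standard sbatch options, shown here as an illustrative sketch) give the job a name and control where its output goes; %j is replaced by the job ID.

#SBATCH --job-name=hello
#SBATCH --output=hello_%j.out
#SBATCH --error=hello_%j.err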

Interactive shell

For debugging, it can be useful to open an interactive shell on the cluster.

% srun --pty --cpus-per-task=1 --mem-per-cpu=2000M --time=00:45:00 /bin/bash
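
The same approach works for debugging GPU code; for example, combining the interactive options above with the GPU partition and --gres request described earlier:

% srun --pty --partition=gpu --gres=gpu:1 --cpus-per-task=1 --mem-per-cpu=2000M --time=00:45:00 /bin/bash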

Job Priority / QoS

When a job is submitted without a --qos option, the default QoS limits the resources you can claim. Current limits can be seen on the login banner at tig-slurm.csail.mit.edu.

This quota can be bypassed by submitting with --qos=low. This is useful when the cluster is mostly idle and you would like to make use of available resources beyond your quota. However, if these resources are required for someone else's job, your job may be terminated.
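
For example, to submit the hello-world script from earlier beyond your normal quota:

% sbatch --qos=low hello.sbatch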

Your job and the queue

sinfo

sinfo provides basic information about available cluster resources.

    % sinfo

    PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
    cpu          up 4-00:00:00      2   idle groenig-[0-2]
    gpu*         up 4-00:00:00      1   resv vcuda-4
    gpu*         up 4-00:00:00      1    mix vcuda-0
    gpu*         up 4-00:00:00      3   idle vcuda-[1-3]

Additional details about a node are available with ‘scontrol show node’.

% scontrol show node groenig-0
NodeName=groenig-0 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=16 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=groenig-0 NodeHostName=groenig-0 Version=17.11
   OS=Linux 4.15.0-72-generic #81-Ubuntu SMP Tue Nov 26 12:20:02 UTC 2019
   RealMemory=55000 AllocMem=0 FreeMem=58731 Sockets=16 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=cpu
   BootTime=2019-12-05T17:19:47 SlurmdStartTime=2020-05-31T06:25:10
   CfgTRES=cpu=16,mem=55000M,billing=16
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

squeue

You may list the contents of the queue with squeue.

% squeue

    JOBID  PARTITION     NAME     USER    ST       TIME  NODES NODELIST(REASON)
    129959       cpu     bash     erin     R       0:05      1 groenig-0
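
To list only your own jobs, pass your username to squeue:

% squeue -u $USER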

scontrol show job

Detailed information about your job is available with ‘scontrol show job’.

% scontrol show job 129959

   JobId=129959 JobName=bash
   UserId=erin(23372) GroupId=erin(23372) MCS_label=N/A
   Priority=1 Nice=0 Account=csail QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:01:40 TimeLimit=1-00:00:00 TimeMin=N/A
   SubmitTime=2020-06-02T12:12:45 EligibleTime=2020-06-02T12:12:45
   StartTime=2020-06-02T12:12:45 EndTime=2020-06-03T12:12:45 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2020-06-02T12:12:45
   Partition=cpu AllocNode:Sid=slurm-control-0:29786
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=groenig-0
   BatchHost=groenig-0
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=2000M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/bin/bash
   WorkDir=/home/erin
   Power=

scancel

You can use scancel to terminate your job early, for example:

% scancel -v 129960
scancel: Terminating job 129960
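
scancel can also cancel all of your pending and running jobs at once by specifying your username:

% scancel -u $USER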