Slurm Concepts

Node

A node is a computer configured to run jobs.

Job

A job is composed of a request, an allocation, and one or more tasks. Use sbatch, srun, or salloc to submit a job.
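
As a rough illustration of the request part of a job, here is a minimal batch script of the kind you might pass to sbatch. The job name and output file name are hypothetical, and the directive values are placeholders rather than site recommendations.

```shell
#!/bin/bash
# Hypothetical job name, shown in the queue.
#SBATCH --job-name=hello
# One task, the simplest case.
#SBATCH --ntasks=1
# Requested run time: 5 minutes.
#SBATCH --time=00:05:00
# %j expands to the job ID in the output file name.
#SBATCH --output=hello-%j.out

# Slurm sets SLURM_JOB_ID inside a job; fall back to a placeholder elsewhere.
job_id="${SLURM_JOB_ID:-none}"
message="Job ${job_id} running on $(uname -n)"
echo "$message"
```

Assuming the script is saved as hello.sh (a hypothetical name), it would be submitted with `sbatch hello.sh`. The #SBATCH lines are ordinary comments to bash, so the script also runs outside Slurm for a quick check.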

Task

A task is a running process within a job. In the simplest case, a job has one task.

Resource

A resource is one or more CPUs or GPUs, some amount of memory, or any other computing resource that Slurm knows about.
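
The sketch below shows how resources might be requested in a batch script. The amounts are arbitrary examples, and the GPU line assumes the site defines a generic resource named "gpu", which may not be true on every system.

```shell
#!/bin/bash
# Four CPU cores for the single task.
#SBATCH --cpus-per-task=4
# 8 GB of memory per node.
#SBATCH --mem=8G
# One GPU, assuming a generic resource named "gpu" exists on the site.
#SBATCH --gres=gpu:1
#SBATCH --time=00:10:00

# Slurm exports SLURM_CPUS_PER_TASK when --cpus-per-task is set;
# default to 1 so the script also runs outside a job.
threads="${SLURM_CPUS_PER_TASK:-1}"
echo "Using ${threads} thread(s)"
```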

Allocation

An allocation is a set of resources with some constraints specified at the time of submission. If you submit a job with srun or sbatch, it waits in the queue for its allocation, then runs when that allocation is ready.

Use salloc to obtain an allocation and run tasks interactively within it.

Partition

A partition is a group of nodes. A partition must be selected when submitting a job.

Use sinfo to see the status of available systems and partitions.

Which partitions you can access depends on your research group.
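
A short sketch of inspecting partitions with sinfo follows. The partition name "general" is hypothetical; substitute one that your group can access. The guard only exists so the snippet does nothing harmful on a machine without Slurm.

```shell
# Hypothetical partition name.
partition="general"

if command -v sinfo >/dev/null 2>&1; then
    # One summary line per partition.
    sinfo --summarize
    # Node-oriented long listing for a single partition.
    sinfo --partition="$partition" --Node --long
else
    echo "sinfo not found; run this on a Slurm login node"
fi
```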

QoS

A QoS (quality of service) is a set of resource restrictions combined with a priority.

Which QoSes you can access depends on your research group.

Lower-priority QoSes generally allow more resources. This means you can use a low-priority QoS to run on resources that would otherwise sit idle, with the caveat that your jobs may be preempted.
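
To show how a partition and a QoS are chosen together at submission time, the sketch below builds an sbatch command line and prints it instead of running it. The names "general", "low", and job.sh are hypothetical; the real names depend on your site and research group.

```shell
# Hypothetical names; substitute the partition and QoS your group can access.
partition="general"
qos="low"

# job.sh is a hypothetical batch script. The command is printed, not run.
submit_cmd="sbatch --partition=${partition} --qos=${qos} job.sh"
echo "$submit_cmd"
```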

Priority

A job submitted under a higher-priority QoS can preempt lower-priority jobs, which guarantees that it gets scheduled.

Preemption

When a job is preempted by a higher-priority job, its tasks receive a SIGTERM, followed shortly by a SIGKILL.

If you submit with sbatch and the --requeue flag, a preempted job is automatically put back in the queue and starts over when resources are available again.
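
One way a batch script might react to the SIGTERM is sketched below. It assumes the signal actually reaches the batch shell, which can depend on site configuration, and the three-step loop is only a stand-in for real work. The handler name and variables are made up for this example.

```shell
#!/bin/bash
# Ask Slurm to put the job back in the queue if it is preempted.
#SBATCH --requeue
#SBATCH --time=00:10:00

stop_requested=0

# On SIGTERM, only record the request; the loop below reacts to it.
on_sigterm() {
    stop_requested=1
}
trap on_sigterm TERM

# Stand-in for real work: short steps, checking the flag between steps.
step=0
while [ "$step" -lt 3 ] && [ "$stop_requested" -eq 0 ]; do
    step=$((step + 1))
    echo "finished step ${step}"
done

if [ "$stop_requested" -eq 1 ]; then
    # A real job would save its state here, before the SIGKILL arrives.
    echo "SIGTERM received after step ${step}; exiting"
fi
```

Bash runs a trap only after the current foreground command returns, so each step should be short, or the work should run in the background with the script waiting on it.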

Low-priority jobs are more likely to complete if their requested run time is short.

Scheduler

The Slurm scheduler periodically walks the queue and attempts to schedule all waiting jobs. If a block of time sufficient to run the job is available, the job will be allocated and run. If multiple jobs are eligible to be scheduled, the scheduler takes into account factors including job age, requested run time, and job size.

To improve the chances of your job being scheduled as soon as possible, request a shorter run time with --time.
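
As a small illustration of setting --time, the sketch below converts a run time estimate in minutes to the hours:minutes:seconds form that --time accepts, then prints the resulting command. The helper to_slurm_time and the script name job.sh are hypothetical.

```shell
# Convert whole minutes to the hours:minutes:seconds time format.
to_slurm_time() {
    minutes="$1"
    printf '%02d:%02d:00\n' $((minutes / 60)) $((minutes % 60))
}

# A 90-minute estimate becomes 01:30:00.
limit="$(to_slurm_time 90)"

# job.sh is a hypothetical batch script. The command is printed, not run.
echo "sbatch --time=${limit} job.sh"
```

Requesting only slightly more than the job is expected to need keeps the request short without risking a timeout.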