Priority and QoS

When submitting a job, you must specify both the --partition and --qos (quality of service) options. These choices determine the priority of your job and limit the quantity of resources you can request at one time.
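As a minimal sketch of the rule above, the following Python helper assembles a submission command that always carries both options. It assumes the cluster is driven through Slurm's sbatch command; the helper name build_sbatch_command and the script name train.sh are hypothetical, while the tig and tig-main values are taken from the listing further down this page.

```python
import shlex


def build_sbatch_command(script, partition, qos):
    """Return the sbatch argument list for a batch script (hypothetical helper)."""
    # Refuse to build a command that is missing either required option.
    if not partition or not qos:
        raise ValueError("both partition and qos must be given")
    return ["sbatch", f"--partition={partition}", f"--qos={qos}", script]


# "train.sh" is a placeholder script name, not a file provided by the cluster.
command = build_sbatch_command("train.sh", partition="tig", qos="tig-main")
print(shlex.join(command))
# prints: sbatch --partition=tig --qos=tig-main train.sh
```

To actually submit the job, the list could be passed to subprocess.run(command, check=True) on a login node.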

In general, each group has a partition with two QoS levels: a “main” QoS and a “free-cycles” QoS. The main QoS limits the resources you can request. The free-cycles QoS has no such limit, but if a job in the corresponding main QoS needs resources that a free-cycles job is using, the free-cycles job will be preempted (killed).

If you use the free-cycles QoS, it is recommended that you submit individual jobs that request only as much time and as few resources as necessary, to decrease the chance of preemption and increase the chance of completion. Another strategy is to write programs that are resilient to being killed by periodically saving their state, so that they can resume from the last save.
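The state-saving strategy can be sketched as follows. This is an illustrative pattern, not a prescribed implementation: the function names, the JSON checkpoint format, the file name checkpoint.json, and the placeholder "work" (summing step numbers) are all assumptions made for the example.

```python
import json
import os
import tempfile


def load_state(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def save_state(path, state):
    """Write the checkpoint atomically so a kill mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    # Renaming within one directory replaces the old checkpoint in one step.
    os.replace(tmp_path, path)


def run(path, num_steps, checkpoint_every=10):
    """Do num_steps units of placeholder work, resuming from the checkpoint."""
    state = load_state(path)
    while state["step"] < num_steps:
        state["total"] += state["step"]  # stand-in for the real computation
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_state(path, state)
    save_state(path, state)
    return state


if __name__ == "__main__":
    # "checkpoint.json" is a hypothetical file name.
    print(run("checkpoint.json", num_steps=100))
```

If the job is killed partway through, rerunning the same command picks up at the last saved step, so at most checkpoint_every steps of work are repeated.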

If you submit a batch job with the --requeue option, it will be put back into the queue if it is preempted.
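A requeued job starts its script again, so the program may want to know whether it is running for the first time. A minimal sketch, assuming the scheduler is Slurm and that it sets the SLURM_RESTART_COUNT environment variable for restarted jobs on this cluster; the helper name restart_count is hypothetical.

```python
import os


def restart_count(environ=os.environ):
    """Return how many times the job has been restarted (0 on the first run)."""
    # Assumption: the variable is absent on the first run of a job.
    return int(environ.get("SLURM_RESTART_COUNT", "0"))


if restart_count() > 0:
    print("Job was requeued; resuming from the last saved state.")
else:
    print("First run; starting from scratch.")
```

Passing the environment as a parameter keeps the helper easy to exercise outside the cluster, for example restart_count({"SLURM_RESTART_COUNT": "2"}).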

The partitions and QoSes available to you depend on your research group.

Available to all lab members

Partition: tig

QoS:

tig-main - current limit: 4 GPUs per user.

tig-free-cycles - can be preempted by jobs in tig-main.

Available to the drl group

Partition: drl

QoS:

drl-main - current limit: 8 GPUs per user.

drl-free-cycles - can be preempted by jobs in drl-main.