Priority and QoS
When you submit a job, you must specify the --partition and --qos (quality of service) options. This choice affects the priority of your job, as well as the limits on the quantity of resources you can request at one time.
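As an illustration, a minimal batch script that sets both options might look like the sketch below. The partition and QoS names are taken from the lists on this page; the job name, the single-GPU request via --gres, and the one-hour time limit are assumptions chosen only for the example.

```shell
#!/bin/bash
#SBATCH --job-name=example      # hypothetical job name
#SBATCH --partition=tig         # partition, as listed on this page
#SBATCH --qos=tig-main          # QoS within that partition
#SBATCH --gres=gpu:1            # assumed request: one GPU
#SBATCH --time=01:00:00         # assumed time limit: one hour

# Slurm sets SLURM_JOB_ID inside a job; fall back to "local" when run outside Slurm.
job_id="${SLURM_JOB_ID:-local}"
echo "Job ${job_id} running on $(uname -n)"
```

The same options can also be given on the command line, for example `sbatch --partition=tig --qos=tig-main job.sh`, where job.sh is a placeholder for your own script.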
In general, each group has a “main” partition and a “free-cycles” partition. The main partition limits the resources you can request. The free-cycles partition has no such limit, but if a job in the corresponding main partition requires those resources, a job in the free-cycles partition will be preempted (killed).
If you use the free-cycles partition, it is recommended that you submit individual jobs that request only as much time and as few resources as they need, to decrease the chance of preemption and increase the chance of completion. Another strategy is to write programs that periodically save their state, so that they are resilient to being killed.
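One way to save state periodically is sketched below. The checkpoint file name, the step count, and the loop body are all hypothetical; a real program would save whatever it needs to resume (model weights, iteration counters, and so on), not just a step number.

```shell
#!/bin/bash
# Hypothetical checkpoint file; keep it on storage that outlives the job.
state_file="${STATE_FILE:-checkpoint.txt}"
total_steps=5

# Resume from the last saved step, or start from 0 on a fresh run.
if [ -f "$state_file" ]; then
    start=$(cat "$state_file")
else
    start=0
fi

step=$start
while [ "$step" -lt "$total_steps" ]; do
    # ... one unit of real work would go here ...
    step=$((step + 1))
    # Write to a temporary file, then rename it, so that a kill
    # in the middle of a write cannot leave a corrupt checkpoint.
    echo "$step" > "${state_file}.tmp"
    mv "${state_file}.tmp" "$state_file"
done
echo "finished at step $step"
```

If the job is killed partway through, running the same script again picks up from the last completed step instead of starting over.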
If you submit a batch job with the --requeue option, it will be put back into the queue if preempted.
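A sketch of a requeueable batch script follows, using the tig-free-cycles QoS from the lists on this page. The two-hour time limit is an assumption. SLURM_RESTART_COUNT is an environment variable that Slurm sets for jobs that have been restarted or requeued; the script treats it as 0 when it is unset.

```shell
#!/bin/bash
#SBATCH --partition=tig
#SBATCH --qos=tig-free-cycles
#SBATCH --requeue               # put the job back in the queue if it is preempted
#SBATCH --time=02:00:00         # assumed time limit: two hours

# Number of times Slurm has restarted this job; 0 on the first run.
restarts="${SLURM_RESTART_COUNT:-0}"
if [ "$restarts" -gt 0 ]; then
    echo "Requeued run number ${restarts}: resume from saved state"
else
    echo "First run: start from scratch"
fi
```

A requeued job starts its script again from the top, so --requeue is most useful when the program can resume from saved state.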
The partitions and QoSes available to you depend on your research group.
Available to all lab members
Partition: tig
QoS:
tig-main - current limits: 4 GPUs per user.
tig-free-cycles - can be preempted by jobs in tig-main.
Available to the drl group
Partition: drl
QoS:
drl-main - current limits: 8 GPUs per user.
drl-free-cycles - can be preempted by jobs in drl-main.
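To check which partitions and QoSes your own account can use, one option is to query the Slurm accounting database with sacctmgr. This is a sketch that assumes sacctmgr is installed and that accounting is enabled on the cluster; the columns chosen are only one reasonable selection.

```shell
#!/bin/bash
# Use $USER if it is set, otherwise ask the system for the current user name.
user_name="${USER:-$(id -un)}"

if command -v sacctmgr >/dev/null 2>&1; then
    # List the account, partition, and QoS values associated with this user.
    sacctmgr show associations user="$user_name" format=Account,Partition,QOS
else
    echo "sacctmgr not found; run this on a cluster login node"
fi
```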