Queuing
The queuing system is the highest-level feature of our cluster plans
and the level at which users will interact with the cluster.
Condor is a very mature and flexible queuing system;
http://www.cs.wisc.edu/condor/ has more
detail about it than you're likely to want.
The queuing system is, by design, abstracted from the underlying
hardware, so it is possible to add the queuing feature to
a preexisting cluster regardless of the underlying architecture.
Condor provides robust per-machine
user priorities and resource specification, which will allow us to
provide three distinct levels of availability:
- General, in which everyone has equal priority
- Priority, in which everyone has access but priority users will preempt non-priority users
- Private, in which only specified users have access to the resource
The current plan is to support "General" access on systems purchased
and run by TIG, "Priority" on systems purchased by research groups but
maintained by TIG, and "Private" on systems which are neither owned nor
operated by TIG. Groups may, of course, opt to allow more open access if
they feel particularly generous.
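The default mapping from ownership to access level can be written out as a small Python sketch. It only restates the plan above; the names `Access` and `default_access` are made up for illustration and are not part of Condor or of any TIG tool. The case of a system purchased by TIG but maintained by someone else is not covered by the plan, so the sketch rejects it rather than guessing.

```python
from enum import Enum


class Access(Enum):
    GENERAL = "General"    # everyone has equal priority
    PRIORITY = "Priority"  # everyone may run, priority users preempt others
    PRIVATE = "Private"    # only specified users may run


def default_access(purchased_by_tig: bool, maintained_by_tig: bool) -> Access:
    """Return the default access level described in the current plan.

    Hypothetical helper: groups may still choose a more open level.
    """
    if purchased_by_tig and maintained_by_tig:
        return Access.GENERAL
    if not purchased_by_tig and maintained_by_tig:
        return Access.PRIORITY
    if not purchased_by_tig and not maintained_by_tig:
        return Access.PRIVATE
    # Purchased by TIG but maintained elsewhere: the plan does not say.
    raise ValueError("combination not covered by the current plan")
```

For example, `default_access(False, True)` gives `Access.PRIORITY`, the case of a research group's machine that TIG maintains.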
Users' jobs will be matched to the most specific available resource, so
if you have private and priority systems your jobs will be matched to
resources in the following order:
- your private resources
- your priority resources
- general resources
- resources on which others have priority
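The ordering above can be sketched in Python as a simple ranking. This is an illustration of the policy only, not Condor's matchmaking code; the `Resource` class, the `privileged_users` field and both functions are hypothetical names introduced for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    name: str
    level: str  # "general", "priority", or "private"
    privileged_users: set = field(default_factory=set)


def match_rank(user, res):
    """Lower rank is tried first; None means the user has no access."""
    if res.level == "private":
        return 0 if user in res.privileged_users else None
    if res.level == "priority":
        # Own priority systems come second, other groups' systems last.
        return 1 if user in res.privileged_users else 3
    return 2  # general resources


def match_order(user, resources):
    """Return the names of usable resources in the order they are tried."""
    ranked = [(match_rank(user, r), r) for r in resources]
    usable = [(rank, r) for rank, r in ranked if rank is not None]
    # sorted() is stable, so resources of equal rank keep their input order.
    return [r.name for rank, r in sorted(usable, key=lambda pair: pair[0])]
```

Given one resource of each kind, a user listed on a private and a priority system is offered those first, then general systems, then other groups' priority systems. Other groups' private systems never appear in the result.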
Note that your job may be preempted in favor of a higher-priority user
when you are running as a non-priority user on someone else's resource,
or if you've accumulated a lot of hours on the cluster.
Condor has methods for
preempting and restarting jobs and, under certain conditions, automated
checkpointing, so jobs that are preempted pick up where they left off.
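To make "pick up where they left off" concrete, here is a minimal Python sketch of a job that checkpoints itself. It shows the general idea at the application level and is not Condor's automated mechanism; the function name, the JSON checkpoint file and the `preempt_at` parameter (which only simulates a preemption) are assumptions made for the example.

```python
import json
import os
import tempfile


def run_job(total_steps, checkpoint_path, preempt_at=None):
    """Sum the step numbers 0..total_steps-1, saving progress after each step.

    Returns the total when finished, or None if (simulated) preemption
    stopped the job early. A later call resumes from the checkpoint.
    """
    state = {"step": 0, "total": 0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)  # resume from the last saved step

    while state["step"] < total_steps:
        if preempt_at is not None and state["step"] == preempt_at:
            return None  # simulated preemption: stop without finishing
        state["total"] += state["step"]
        state["step"] += 1
        # Write to a temporary file first so a kill mid-write cannot
        # leave a corrupt checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)
    return state["total"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        ckpt = os.path.join(workdir, "job.ckpt")
        print(run_job(10, ckpt, preempt_at=3))  # None: preempted at step 3
        print(run_job(10, ckpt))                # 45: resumed, not restarted
```

The second call does not redo steps 0 to 2; it reads the checkpoint and continues from step 3, which is the behavior a preempted job needs in order to avoid losing work.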
More details on the specifics of using the queuing system will be
available on the
CondorIntro page.
One of the central features of
Condor is its ability to do cycle
harvesting on idle workstations, suspending or migrating jobs when the
workstations become active. The
Infolab group is currently doing preliminary testing of this configuration.
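As a rough picture of what a cycle-harvesting policy decides, the Python sketch below chooses an action from how long the workstation's keyboard has been idle. It is not Condor configuration and not the setup being tested; the function name, the action strings and both thresholds (`idle_threshold`, `max_suspend`) are hypothetical values chosen for illustration.

```python
def harvest_action(has_job, suspended_seconds, keyboard_idle_seconds,
                   idle_threshold=900, max_suspend=600):
    """Decide what to do with a workstation under a cycle-harvesting policy.

    has_job               -- a harvested job is currently on this machine
    suspended_seconds     -- how long that job has been suspended (0 if running)
    keyboard_idle_seconds -- time since the owner last used the machine
    """
    owner_active = keyboard_idle_seconds < idle_threshold

    if not has_job:
        # Only start harvesting once the workstation has been idle a while.
        return "wait" if owner_active else "start"
    if owner_active:
        # The owner is back: pause first, move the job if they stay active.
        return "migrate" if suspended_seconds >= max_suspend else "suspend"
    return "run"  # owner still away, keep (or resume) running
```

With these assumed thresholds, a job is suspended as soon as the owner returns and is migrated elsewhere only if it has already been suspended for ten minutes, so a brief visit to the keyboard does not cost the job its place.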
--
JonProulx - 14 Feb 2007