Using The Condor Queuing System

Please feel free to add content here as you see fit.

For the gory details see the Condor site at UWisc: http://www.cs.wisc.edu/condor/

DistributedComputingWithMATLAB is currently broken for Condor cluster work (it still works on single systems with multiple cores): the required submitfcn throws errors under 2012a, and the last version it worked with now has issues with the current system C++ libraries.

Basic Steps

NOTE: Jobs must be run from NFS (/data/scratch, for example), not AFS; see details below.

  1. Add condor to your .software file in your AFS home directory, then log out and log back in for it to take effect.
    echo condor >> ~/.software
  2. Login to a submission node (note you must have a valid Kerberos ticket to submit jobs). Condor submission nodes are currently:
    • borg-login-1.csail.mit.edu
    • borg-login-2.csail.mit.edu
    • borg-login-3.csail.mit.edu
    • borg-login-4.csail.mit.edu
    • EXCEPTIONS:
      • if your workstation is a member of the cluster you may submit directly from your workstation
      • if you are part of the Infolab group and are submitting a job that requires isInfolab==True, you should submit from an Infolab system
      • if your group has a private "sub-cluster" where group members have priority, you may have different submit nodes as well.
  3. Submit your job using
    condor_submit <submit file>

Due to the preemptive scheduling model used by Condor, your job may start running and then be stopped and requeued. Therefore it is important not to make any assumptions about the state of files that may be modified by your job. In practice this isn't usually an issue; it just means you should clear any output files before writing to them rather than assuming they are empty or nonexistent. More detailed information follows.
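
For example, if your job appends to a results file, a small wrapper script can truncate that file before doing any work, so a requeued run starts clean (a minimal sketch; run_analysis and results.txt are hypothetical names standing in for your own program and output file):

#!/bin/sh

# truncate the output file so a requeued run doesn't inherit stale data
: > results.txt

# then do the real work, appending as it goes
./run_analysis >> results.txt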

Basic Job Submission

This is my version of "hello world" for Condor. It needs to be run from local disk or NFS space, as Condor does not play well with AFS. I keep copies in my home directory, copy them to /data/scratch/ for the run, then collect the data back to my home directory afterwards. See the dire warnings in /data/scratch/README and don't leave anything there you can't live without.

The executable for this is "echo.sh", which is simply:

#!/bin/sh

echo $HOSTNAME

Make sure the executable bit is set (e.g. "chmod +x echo.sh").

The submit file (arbitrarily called echo.submit) is where the action happens (detailed docs are in the Condor manual, or look at our partial listing of CondorSubmitVariables):

###standard condor headers for CSAIL###

# preserve your environment variables
GetEnv = True

# use the plain nothing special universe
Universe = vanilla

# only send email if there's an error 
Notification = Error

# Allows you to run on different "filesystem domains" 
#by copying the files around if needed
should_transfer_files = IF_NEEDED
WhenToTransferOutput = ON_EXIT

###END HEADER###

###job specific bits###
Executable = echo.sh
#Arguments =
# queue log (doesn't like to be on NFS due to locking needs) 
Log = /tmp/echo.$ENV(USER).log

#What to do with stdin,stdout,stderr
# $(PROCESS) is replaced by the sequential
# run number (zero based) of this submission
# see "queue" below
#Input = input.$(PROCESS)
Error = err.$(PROCESS)
Output = out.$(PROCESS)

# how many copies of this job to queue
queue 3
####END job  specific bits###

Running this job with condor_submit echo.submit will put three files called out.<0,1,2> in the current directory, each containing the name of the system the job ran on.
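
A typical run of this example from scratch space looks something like the following (a sketch; the directory names are illustrative, and heed the warnings in /data/scratch/README):

# run from NFS scratch space, not AFS
cd /data/scratch/$USER
cp ~/condor-demo/echo.sh ~/condor-demo/echo.submit .
condor_submit echo.submit
# once the jobs finish (see "Basic Job Monitoring" below), each out.N
# holds the name of the execute host and can be copied back home
cat out.0 out.1 out.2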

-- JonProulx - 14 Feb 2007

Using both 32 bit and 64 bit systems

Currently (Nov 2007) there are 56 64-bit Linux nodes and 16 32-bit Linux nodes; as more workstations join the cluster these numbers are likely to fluctuate.

By default jobs will only run on systems with the same operating system (OpSys) and architecture (Arch) as the node you submit from. This makes sense when you consider that binary executables will only run on the specific operating system and architecture they are compiled for.

It is possible to run the same code across different operating systems and architectures, either by submitting a script that determines the system it's executing on and calls the right executables, or by specifying different executables using the $$(OpSys) and/or $$(Arch) submit file macros as part of the executable name or path. We'll look at examples of each.

architecture independent script

The echo.submit script above is architecture independent: it doesn't matter whether a 32-bit or 64-bit version of /bin/sh is called. In this case simply overriding the implicit requirements statement is enough to get it to run on both 32-bit and 64-bit systems, regardless of which architecture it was submitted from. To do this we add one line to the top of echo.submit:

Requirements = Arch == "INTEL" || Arch == "X86_64"

per architecture executables

For this example we will have two versions of a helloworld program: one compiled for 64-bit systems and one for 32-bit systems. We will name the 32-bit version "helloworld.INTEL" and the 64-bit version "helloworld.X86_64". These extensions match the values of $$(ARCH).

To use this in a submit file, we add the requirements line as in our last example to say which architectures we will run on:

Requirements = Arch == "INTEL" || Arch == "X86_64"

The other change is in the "Executable" line of the submit file:

Executable = helloworld.$$(ARCH)

Everything else follows the same pattern as the basic submission format we used in the original example. Though the only operating system in our cluster is Linux, similar things could be done with the $$(OpSys) macro if we had multiple operating systems.
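
Putting those two changes together with the basic example, the job specific portion of the submit file might look like this (a sketch; the standard CSAIL header lines from echo.submit are unchanged and omitted here):

###job specific bits###
# run on either 32-bit or 64-bit Linux nodes
Requirements = Arch == "INTEL" || Arch == "X86_64"

# pick the binary matching the architecture of the machine we land on
Executable = helloworld.$$(ARCH)

Log = /tmp/helloworld.$ENV(USER).log
Error = err.$(PROCESS)
Output = out.$(PROCESS)

queue 3
####END job specific bits###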

Basic Job Monitoring

  1. To check the status of all queued jobs use
    condor_q -global
  2. To check the status of jobs submitted from the current machine use
    condor_q 
  3. Either of the above can be limited to your jobs by appending $USER to the command line
  4. Adding
    -better-analyze <job number>
    to the command line will help determine why your job isn't running
  5. View the overall cluster activity at http://condor-view.csail.mit.edu
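
For example, combining the options above (1234 stands in for a real job number from the queue listing):

condor_q -global $USER            # all of your jobs, anywhere in the pool
condor_q -better-analyze 1234     # explain why job 1234 has not started running
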
-- Main.rshetty - 18 Apr 2007

Killing Running Jobs

  1. To check the list of currently running jobs and find the job ID, use
    condor_q
  2. To end a particular job, use
    condor_rm <ID>
  3. To end ALL of your own jobs that are currently in the queue, use
    condor_rm <yourusername>
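
For example (1234.0 is a placeholder ID taken from the condor_q listing):

condor_q              # note the ID of the job you want to stop
condor_rm 1234.0      # remove just that job
condor_rm $USER       # or remove every job you own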

See the Condor manual for details.

-- Main.rshetty - 18 Apr 2007

Dealing with Preemption

To maintain fair access to resources, users with high recent usage, or guest users on private resources, may have their jobs preempted in favor of higher priority users. This method allows opportunistic use of spare compute cycles on other groups' resources and prevents a batch of very long running jobs from monopolizing the cluster. Without it, if someone submitted a large number of long jobs while the cluster was idle or nearly so, all of those jobs would need to complete before any other jobs could start. Some jobs have run for more than a week, and others could run longer, so this would significantly undermine the usefulness of the cluster as a shared resource without some means of inserting other jobs before these complete.

Current Requirements for Preemption:
  • Job has been running for > 1hr
  • AND Queued job has at least 50% greater priority than running job
  • OR running job has set 'NiceUser = True'
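
If you are happy for your jobs to always yield to other users, you can mark them as "nice" in the submit file; nice_user is the submit command behind the NiceUser attribute mentioned above:

# run at very low priority, yielding to other users' jobs
nice_user = True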

If you relink your code with the Condor libraries and use the "Standard Universe", your job will be checkpointed so it can restart where it left off. Most people are using the "Vanilla Universe" (largely because you can't relink Matlab), so you will have to start over unless your job does its own checkpointing and can figure out where it left off.
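
For code you can relink, the outline looks roughly like this (a sketch; myprog.c and myprog.condor are hypothetical names):

# relink with the Condor libraries (run once, before submitting):
#   condor_compile gcc -o myprog.condor myprog.c
# then request the checkpointing universe in the submit file:
Universe = standard
Executable = myprog.condor
queue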

Here is what happens when your job is preempted:

  • If you're running as a guest and there are no other resources available for the resource owner's job to go to, your job will have 5 minutes of wall clock time to complete on its own. You can decrease this by setting MaxJobRetirementTime (in seconds) in your submit file; you may want to set this to 0 if you think it unlikely your job will complete in 5 minutes.

  • If it's still running after that period it will get SIGTERM. Usually this kills your job, but you can trap that signal to do interesting things if you like (see the sketch after this list).

  • 10 seconds after that, if your job is still there, it will be killed hard with SIGKILL.

  • If your job is killed before it completes it will be requeued.
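
For example, a vanilla-universe job can use a small wrapper script that catches SIGTERM and records some state before the hard kill arrives (a minimal sketch; do_real_work and checkpoint.log are hypothetical names, and remember SIGKILL follows about 10 seconds later):

#!/bin/sh

# note where we got to when Condor asks us to stop
save_state() {
    echo "preempted on $HOSTNAME at $(date)" >> checkpoint.log
    exit 1
}
trap save_state TERM

# run the real work in the background and wait, so the trap fires promptly
./do_real_work &
wait $!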

Condor is configured to preempt the jobs with the shortest runtime (all else being equal; user priority and machine rank are also considered) so that the least forward progress is lost.