Running codes with Condor
- Useful links
- Condor in a nutshell
- Best practices for codes
- Preparing your job for Condor
- Submitting your job
- Checking the queue
- Checking status of the nodes
- Using Condor DAGman and DAGs on the Nemo cluster
- FAQ
- Future directions
Useful links
Condor in a nutshell
Condor is a sophisticated batch and queuing system running on the Nemo cluster.
Using Condor, LSC scientists can manage and monitor the submission and
execution of many, many independent jobs on the cluster.
Condor jobs on the cluster are intended to be "stand-alone" executables, though
it is possible to manage some MPI jobs using Condor (send mail to the
Nemo administrators for
more details).
We have chosen Condor as the batch and queuing system on the cluster because
- Condor plays nicely and will "vacate" a node if it is being used for another
purpose, such as running LDAS jobs.
- Condor can manage tens of thousands of jobs.
- Condor will not "lose" a job. If a node or even the whole cluster goes
down, Condor will restart your job (from a recent checkpoint if available).
- Condor checkpoints your jobs periodically so that if a node must
be vacated or goes down for some reason, your work is not lost.
- Condor has a simple, command-line user interface.
Best practices for codes
Ideal codes for running on Hydra under Condor are codes which
- run for a substantial amount of time (roughly one hour or more)
- take parameters on the command-line or via stdin
- output mostly to stdout or stderr
- have little I/O
- require no interaction once started
The above are recommendations; Condor will effectively manage
many different types of codes and jobs including everything from
a simple shell script to the most complex analysis,
but if at all possible codes should be designed to meet the
recommendations above.
In addition it is strongly preferred that codes running on the Nemo cluster
run in the Condor "standard" universe (see the
Condor
manual for definition and details) so that checkpointing is available and
the cluster resources are used as efficiently as possible. To run in the
standard universe codes must meet the following restrictions:
- Multi-process jobs are not allowed. This includes system calls such as fork(), exec(), and system().
- Interprocess communication is not allowed. This includes pipes, semaphores, and shared memory.
- Network communication must be brief. A job may make network connections using system calls such as socket(), but a
network connection left open for long periods will delay checkpointing and migration.
- Sending or receiving the SIGUSR2 or SIGTSTP signals is not allowed. Condor reserves these signals for its own use.
Sending or receiving all other signals is allowed.
- Alarms, timers, and sleeping are not allowed. This includes system calls such as alarm(), getitimer(), and
sleep().
- Multiple kernel-level threads are not allowed. However, multiple user-level threads are allowed.
- Memory mapped files are not allowed. This includes system calls such as mmap() and munmap().
- File locks are allowed, but not retained between checkpoints.
- All files must be opened read-only or write-only. A file opened for both reading and writing will cause trouble
if a job must be rolled back to an old checkpoint image. For compatibility reasons, a file opened for both reading and
writing will result in a warning but not an error.
- Your executable must be statically linked.
Preparing your job for Condor
- Once you have your fully tested and completely bug-free analysis code running
optimally on your desktop Linux workstation, use scp to transfer your
source files to your home directory on hydra.phys.uwm.edu.
- Build your code as you would normally:
[user@hydra]$ cat hello.c
#include <stdlib.h>
#include <stdio.h>
int main(void){
    int myNumber;
    int err;
    /* Read one integer from stdin. */
    err = fscanf(stdin, "%d", &myNumber);
    if (err != 1){
        fprintf(stderr, "Couldn't find my number!\n");
        return EXIT_FAILURE; /* myNumber was never set, so stop here */
    }
    fprintf(stdout, "My number is %d\n", myNumber);
    return 0;
}
[user@hydra]$ gcc -c hello.c
- Next, link your code by putting condor_compile in front of the compiler
command gcc (it is not necessary to recompile):
[user@hydra temp]$ condor_compile gcc hello.o -o hello
LINKING FOR CONDOR : /usr/bin/ld -Bstatic -m elf_i386 -dynamic-linker
/lib/ld-linux.so.2 -o hello
/home/condor/medusa-installation/lib/condor_rt0.o
/usr/lib/crti.o
/usr/lib/gcc-lib/i386-redhat-linux/egcs-2.91.66/crtbegin.o
-L/home/condor/medusa-installation/lib
-L/usr/lib/gcc-lib/i386-redhat-linux/egcs-2.91.66
-L/usr/i386-redhat-linux/lib hello.o
/home/condor/medusa-installation/lib/libcondorzsyscall.a
/home/condor/medusa-installation/lib/libz.a
-lgcc -L/home/condor/medusa-installation/lib -lc -lnss_files
-lnss_dns -lresolv -lc -lgcc
/usr/lib/gcc-lib/i386-redhat-linux/egcs-2.91.66/crtend.o
/usr/lib/crtn.o
/home/condor/medusa-installation/lib/libcondorc++support.a
- Run your code to verify that it built correctly. Your code will automatically
detect that it is not being run under Condor and will print two simple
warning messages:
[user@hydra]$ echo "1" | ./hello
Condor: Notice: Will checkpoint to ./hello.ckpt
Condor: Notice: Remote system calls disabled.
My number is 1
- If your jobs require separate inputs (that is, a separate input to
stdin for each instance of your code) then make a separate file for
each job containing the necessary input (it is handy to use a python
or perl script for this). Number the files in a systematic
way:
[user@hydra]$ ls in.*
in.0 in.18 in.27 in.36 in.45 in.54 in.63 in.72 in.81 in.90
in.1 in.19 in.28 in.37 in.46 in.55 in.64 in.73 in.82 in.91
in.10 in.2 in.29 in.38 in.47 in.56 in.65 in.74 in.83 in.92
in.11 in.20 in.3 in.39 in.48 in.57 in.66 in.75 in.84 in.93
in.12 in.21 in.30 in.4 in.49 in.58 in.67 in.76 in.85 in.94
in.13 in.22 in.31 in.40 in.5 in.59 in.68 in.77 in.86 in.95
in.14 in.23 in.32 in.41 in.50 in.6 in.69 in.78 in.87 in.96
in.15 in.24 in.33 in.42 in.51 in.60 in.7 in.79 in.88 in.97
in.16 in.25 in.34 in.43 in.52 in.61 in.70 in.8 in.89 in.98
in.17 in.26 in.35 in.44 in.53 in.62 in.71 in.80 in.9 in.99
[user@hydra]$ cat in.45
45
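The numbered input files above can be produced by a short script. The following is a minimal sketch in Python, assuming each job reads a single integer equal to its own index; the function name write_input_files is made up for this illustration and is not part of Condor.

```python
from pathlib import Path


def write_input_files(directory, count):
    """Write in.0 .. in.<count-1>, each holding the number its job reads."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    paths = []
    for process in range(count):
        path = directory / f"in.{process}"
        # hello reads one integer from stdin, so a single line is enough.
        path.write_text(f"{process}\n")
        paths.append(path)
    return paths
```

Calling write_input_files(".", 100) in the directory you submit from would create in.0 through in.99. If your code needs different input, change the line that writes the file contents.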
Submitting your job
You submit jobs to Condor using a Condor submit script. The submit script
contains a list of Condor commands and assigned values of the form
command = value
The most often used commands are executable, universe, input, output, error, log, and queue.
For a complete list of commands see the
manual page for
condor_submit.
A submit script for the hello.c job outlined above might look like this:
[user@hydra]$ cat hello.sub
universe = standard
executable = hello
input = in.$(Process)
output = out.$(Process)
error = err.$(Process)
log = log.$(Process)
queue 100
When you submit this file to Condor it will queue up 100 instances of your
executable hello.
Note the use of the $(Process) macro. Condor will substitute
"0" for the first job it queues up, "1" for the second job, and so on. Thus
the file in.0 will be used as the stdin for the first job,
out.0 as stdout for the first job, and err.0 as stderr for
the first job.
To submit your jobs to Condor using your prepared submit file
use the condor_submit command:
[user@hydra temp]$ condor_submit hello.sub
Submitting job(s)............................................................
........................................
100 job(s) submitted to cluster 1.
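If you drive submissions from a script, it can be useful to record which cluster number Condor assigned. The sketch below assumes that condor_submit prints a line in the format shown above ("N job(s) submitted to cluster M."); the helper names parse_cluster_id and submit are invented for this example.

```python
import re
import subprocess


def parse_cluster_id(submit_output):
    """Return (n_jobs, cluster_id) from condor_submit output, or None."""
    match = re.search(r"(\d+) job\(s\) submitted to cluster (\d+)",
                      submit_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def submit(submit_file):
    """Run condor_submit on submit_file and return (n_jobs, cluster_id)."""
    result = subprocess.run(
        ["condor_submit", submit_file],
        capture_output=True, text=True, check=True,
    )
    return parse_cluster_id(result.stdout)
```

For the output shown above, parse_cluster_id would return (100, 1). The submit function only works on a machine where condor_submit is installed, such as the login node.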
Checking the queue
You can check the Condor queue to see the status of your jobs using the
condor_q command:
[user@hydra]$ condor_q
-- Submitter: hydra.phys.uwm.edu : <129.89.201.232:35435> : hydra.phys.uwm.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
16.19 someuser 1/21 16:53 0+00:00:00 R 0 2.7 hello
16.20 someuser 1/21 16:53 0+00:00:00 R 0 2.7 hello
16.21 someuser 1/21 16:53 0+00:00:00 R 0 2.7 hello
16.22 someuser 1/21 16:53 0+00:00:00 R 0 2.7 hello
16.24 someuser 1/21 16:53 0+00:00:00 R 0 2.7 hello
16.27 someuser 1/21 16:53 0+00:00:00 R 0 2.7 hello
16.30 someuser 1/21 16:53 0+00:00:00 R 0 2.7 hello
16.35 someuser 1/21 16:53 0+00:00:00 R 0 2.7 hello
16.37 someuser 1/21 16:53 0+00:00:01 R 0 2.7 hello
16.40 someuser 1/21 16:53 0+00:00:00 R 0 2.7 hello
16.45 someuser 1/21 16:53 0+00:00:00 R 0 2.7 hello
16.48 someuser 1/21 16:53 0+00:00:00 R 0 2.7 hello
16.55 someuser 1/21 16:53 0+00:00:00 R 0 2.7 hello
16.60 someuser 1/21 16:53 0+00:00:00 R 0 2.7 hello
16.64 someuser 1/21 16:53 0+00:00:00 R 0 2.7 hello
16.70 someuser 1/21 16:53 0+00:00:00 R 0 2.7 hello
16.71 someuser 1/21 16:53 0+00:00:00 R 0 2.7 hello
16.72 someuser 1/21 16:53 0+00:00:00 R 0 2.7 hello
16.73 someuser 1/21 16:53 0+00:00:00 R 0 2.7 hello
16.75 someuser 1/21 16:53 0+00:00:00 R 0 2.7 hello
16.79 someuser 1/21 16:53 0+00:00:00 R 0 2.7 hello
16.80 someuser 1/21 16:53 0+00:00:00 R 0 2.7 hello
16.91 someuser 1/21 16:53 0+00:00:00 R 0 2.7 hello
16.93 someuser 1/21 16:53 0+00:00:00 R 0 2.7 hello
16.95 someuser 1/21 16:53 0+00:00:00 R 0 2.7 hello
16.96 someuser 1/21 16:53 0+00:00:00 R 0 2.7 hello
16.97 someuser 1/21 16:53 0+00:00:00 R 0 2.7 hello
16.98 someuser 1/21 16:53 0+00:00:00 R 0 2.7 hello
16.99 someuser 1/21 16:53 0+00:00:00 R 0 2.7 hello
29 jobs; 0 idle, 29 running, 0 held
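With many jobs in the queue, a per-state tally is easier to read than the full listing. The following sketch counts the values in the ST column. It assumes the column layout shown above (ID, OWNER, submission date and time, RUN_TIME, ST, ...), which may differ between Condor versions; count_job_states is a name chosen for this example.

```python
import re
from collections import Counter

# A job line starts with an ID of the form cluster.process, e.g. 16.19
JOB_LINE = re.compile(r"^\s*\d+\.\d+\s")


def count_job_states(condor_q_output):
    """Tally the ST column (e.g. R for running) over all job lines."""
    states = Counter()
    for line in condor_q_output.splitlines():
        if not JOB_LINE.match(line):
            continue  # skip the header, the summary line and blank lines
        fields = line.split()
        # Fields: ID OWNER date time RUN_TIME ST PRI SIZE CMD
        if len(fields) > 5:
            states[fields[5]] += 1
    return states
```

The text to pass in is whatever condor_q printed, for example captured with subprocess.run(["condor_q"], capture_output=True, text=True).stdout.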
Checking status of the nodes
You can check the status of the cluster nodes using the condor_nemo
or condor_status command:
[user@hydra]$ condor_nemo
Name State Activity LoadAv
hydra.phys.uwm.e Unclaimed Idle 0.020000
nemo-slave0001. Unclaimed Idle 0.050000
nemo-slave0002. Unclaimed Idle 0.180000
nemo-slave0003. Unclaimed Idle 0.040000
nemo-slave0004. Owner Idle 0.530000
nemo-slave0005. Unclaimed Idle 0.000000
nemo-slave0006. Unclaimed Idle 0.000000
...
nemo-slave0291. Unclaimed Idle 0.000000
nemo-slave0292. Unclaimed Idle 0.000000
nemo-slave0293. Unclaimed Idle 0.000000
nemo-slave0294. Unclaimed Idle 0.000000
nemo-slave0295. Unclaimed Idle 0.000000
nemo-slave0296. Unclaimed Idle 0.000000
Machines Owner Claimed Unclaimed Matched Preempting
INTEL/LINUX 297 13 0 284 0 0
Total 297 13 0 284 0 0
A node in the "Unclaimed" state is ready to run Condor jobs. A node in
the "Owner" state is currently being used by jobs outside of Condor's
control (possibly LDAS) and will not accept Condor jobs at this time.
A node in the "Claimed" state has been matched with a Condor job.
Using Condor DAGman and DAGs on the Nemo cluster
From section
2.11 of the Condor Manual:
A directed acyclic graph (DAG) can be used to represent a set of programs where the input, output, or execution of
one or more programs is dependent on one or more other programs.
In other words, DAGs can be used to represent an analysis pipeline, and Condor DAGman can be used to ensure that
the jobs making up the DAG are executed with the correct dependencies. Please see the link above for more details.
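As an illustration of what a DAG description looks like, the sketch below writes the text of a small DAG input file using the JOB and PARENT ... CHILD keywords that DAGman reads. The node names and submit file names are hypothetical, and make_dag is a helper invented for this example; it does not check that the graph is acyclic.

```python
def make_dag(submit_files, dependencies):
    """Return the text of a DAG input file.

    submit_files: dict mapping a node name to its submit description file.
    dependencies: list of (parent, child) node-name pairs.
    """
    lines = [f"JOB {name} {path}" for name, path in submit_files.items()]
    for parent, child in dependencies:
        if parent not in submit_files or child not in submit_files:
            raise ValueError(f"unknown node in {parent} -> {child}")
        # The child node is started only after the parent has finished.
        lines.append(f"PARENT {parent} CHILD {child}")
    return "\n".join(lines) + "\n"
```

For example, make_dag({"A": "a.sub", "B": "b.sub"}, [("A", "B")]) describes a two-step pipeline in which B runs after A. The resulting text would be saved to a file and handed to condor_submit_dag.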
Condor users running DAGman jobs on Nemo should use a directory under /people
on the login node hydra.phys.uwm.edu for the log files that DAGman monitors as part of
the DAG. This is necessary since file locking via Linux NFS is not robust. The /people
filesystem is local to the login node.
Please send email to nemo-admin@gravity.phys.uwm.edu
if you need a directory in /people created for you.
FAQ
- Condor is sending me a lot of mail. How can I stop that?
Use the notification command in your Condor submit description file.
See the documentation for
the condor_submit
command.
- My jobs are in the queue but they will not start. Why?
There are a number of reasons why queued jobs may not
start:
- There may not be any resources (i.e., nodes) currently available.
Try running condor_q -analyze for details about why your jobs
are not running.
- Your jobs may have unusual or impossible requirements. Condor uses
the concepts of Matchmaking
and ClassAds to match jobs with requirements to resources that can meet those
requirements. Try running condor_q -long jobid for one of your jobs
to see the full set of requirements being advertised for your jobs, and look for anything
that you might not have intended.
- How do I pass unique command-line arguments to each job?
Use the arguments command in your Condor submit description file.
See the documentation for
the condor_submit
command.
- My program needs certain environment variables to be set. How do
I get Condor to do that?
Use the environment command in your Condor submit description file.
See the documentation for the condor_submit
command.
- Do I really need a separate input file for each instance of my code?
No. If each instance of your program reads the same input from stdin then you
can simply set
input = myInput
in your submit file to have each program read from the file myInput.
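Several of the answers above come down to adding one more command to the submit description file. The sketch below assembles such a file with the arguments, environment and notification commands. The option --seed and the variable MY_DATA_DIR are hypothetical, make_submit_description is a name invented for this example, and the double-quoted environment syntax is assumed to be accepted by the installed Condor version (see the condor_submit documentation).

```python
def make_submit_description(executable, n_jobs, arguments=None,
                            environment=None, notification=None):
    """Return submit description text with optional extra commands."""
    lines = ["universe = standard", f"executable = {executable}"]
    if arguments is not None:
        # $(Process) is expanded by Condor, giving each job its own value.
        lines.append(f"arguments = {arguments}")
    if environment:
        # Values containing spaces or quotes need extra escaping (not shown).
        pairs = " ".join(f"{k}={v}" for k, v in environment.items())
        lines.append(f'environment = "{pairs}"')
    if notification is not None:
        lines.append(f"notification = {notification}")
    lines += [
        "output = out.$(Process)",
        "error = err.$(Process)",
        "log = log.$(Process)",
        f"queue {n_jobs}",
    ]
    return "\n".join(lines) + "\n"
```

For example, make_submit_description("hello", 100, arguments="--seed $(Process)", environment={"MY_DATA_DIR": "/tmp"}, notification="Never") gives every job a different seed, sets one environment variable, and turns off the mail Condor sends.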
Future directions
The Condor system includes the notion of jobs "flocking" from one Condor pool (such as Nemo)
to another pool (perhaps at another LIGO Tier II site). This enables one to exploit as many
resources (CPUs) as are available with no extra effort. At this time, however, flocking can
only happen between pools (clusters) which do not live behind a firewall since the "submit machine"
(the machine from which jobs are submitted) must be able to communicate directly with
the machine on which a job runs.
The UWM-LSC group is helping test a development version of Condor which will enable
flocking to proceed even if one or both of the machines involved in a Condor matchmaking
are behind a firewall. When in place this version of Condor will allow jobs to flock from
Nemo to other pools automatically, increasing job throughput.
In addition, this new version of Condor will allow LSC scientists to submit Condor
jobs to the UWM-LSC Nemo cluster
directly from their workstations using the
Condor-G client. Hence LSC scientists
will be able to develop codes and run them in production all from their desktop workstations.
$Id: running_codes_with_condor.html,v 1.6 2006/08/28 21:12:37 gskelton Exp $