Condor is a sophisticated batch and queuing system running on the Nemo cluster. Using Condor, LSC scientists can manage and monitor the submission and execution of many, many independent jobs on the cluster.
Condor jobs on the cluster are intended to be "stand-alone" executables, though it is possible to manage some MPI jobs using Condor (send mail to the Nemo administrators for more details).
We have chosen Condor as a batch and queuing system on the cluster because
Ideal codes for running on Hydra under Condor are codes which
The above are recommendations; Condor will effectively manage many different types of codes and jobs including everything from a simple shell script to the most complex analysis, but if at all possible codes should be designed to meet the recommendations above.
In addition it is strongly preferred that codes running on the Nemo cluster run in the Condor "standard" universe (see the Condor manual for definition and details) so that checkpointing is available and the cluster resources are used as efficiently as possible. To run in the standard universe codes must meet the following restrictions:
[user@hydra]$ cat hello.c
#include <stdlib.h>
#include <stdio.h>

/* Read one integer from stdin and echo it back. */
int main(void){
    int myNumber;
    int err;

    err = fscanf(stdin, "%d", &myNumber);
    if (err != 1){
        fprintf(stderr, "Couldn't find my number!\n");
        return EXIT_FAILURE;  /* myNumber is unset; do not print it */
    }
    fprintf(stdout, "My number is %d\n", myNumber);
    return 0;
}
[user@hydra]$ gcc -c hello.c
[user@hydra temp]$ condor_compile gcc hello.o -o hello
LINKING FOR CONDOR : /usr/bin/ld -Bstatic -m elf_i386 -dynamic-linker /lib/ld-linux.so.2 -o hello /home/condor/medusa-installation/lib/condor_rt0.o /usr/lib/crti.o /usr/lib/gcc-lib/i386-redhat-linux/egcs-2.91.66/crtbegin.o -L/home/condor/medusa-installation/lib -L/usr/lib/gcc-lib/i386-redhat-linux/egcs-2.91.66 -L/usr/i386-redhat-linux/lib hello.o /home/condor/medusa-installation/lib/libcondorzsyscall.a /home/condor/medusa-installation/lib/libz.a -lgcc -L/home/condor/medusa-installation/lib -lc -lnss_files -lnss_dns -lresolv -lc -lgcc /usr/lib/gcc-lib/i386-redhat-linux/egcs-2.91.66/crtend.o /usr/lib/crtn.o /home/condor/medusa-installation/lib/libcondorc++support.a
[user@hydra]$ echo "1" | ./hello
Condor: Notice: Will checkpoint to ./hello.ckpt
Condor: Notice: Remote system calls disabled.
My number is 1
[user@hydra]$ ls in.*
in.0   in.18  in.27  in.36  in.45  in.54  in.63  in.72  in.81  in.90
in.1   in.19  in.28  in.37  in.46  in.55  in.64  in.73  in.82  in.91
in.10  in.2   in.29  in.38  in.47  in.56  in.65  in.74  in.83  in.92
in.11  in.20  in.3   in.39  in.48  in.57  in.66  in.75  in.84  in.93
in.12  in.21  in.30  in.4   in.49  in.58  in.67  in.76  in.85  in.94
in.13  in.22  in.31  in.40  in.5   in.59  in.68  in.77  in.86  in.95
in.14  in.23  in.32  in.41  in.50  in.6   in.69  in.78  in.87  in.96
in.15  in.24  in.33  in.42  in.51  in.60  in.7   in.79  in.88  in.97
in.16  in.25  in.34  in.43  in.52  in.61  in.70  in.8   in.89  in.98
in.17  in.26  in.35  in.44  in.53  in.62  in.71  in.80  in.9   in.99
[user@hydra]$ cat in.45
45
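Input files like those listed above could be generated with a short shell loop. This is only a minimal sketch: it assumes that each file in.N should contain nothing but the number N, as the in.45 example suggests, and the bound of 100 simply matches the listing.

```shell
# Create in.0 ... in.99 in the current directory.
# Assumption: each file holds just its own index (as in.45 holds "45").
i=0
while [ "$i" -lt 100 ]; do
    echo "$i" > "in.$i"
    i=$((i + 1))
done
```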
You submit jobs to Condor using a Condor submit script. The submit script contains a list of Condor commands and assigned values of the form command = value.
The most often used commands are executable, universe, input, output, error, log, and queue. For a complete list of commands see the manual page for condor_submit.
A submit script for the hello.c job outlined above might look like this:
[user@hydra]$ cat hello.sub
universe = standard
executable = hello
input = in.$(Process)
output = out.$(Process)
error = err.$(Process)
log = log.$(Process)
queue 100
When you submit this file to Condor it will queue up 100 instances of your executable hello.
Note the use of the $(Process) macro. Condor will substitute "0" for the first job it queues up, "1" for the second job, and so on. Thus the first job will use in.0 as its stdin, out.0 as its stdout, and err.0 as its stderr.
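Because each queued job reads in.N for its own process number N, a missing input file only shows up once that job runs. The sketch below checks for the files beforehand. Note that count_missing_inputs is a hypothetical helper, not part of Condor, and it assumes the in.N naming used in the submit script above.

```shell
# count_missing_inputs N: print how many of in.0 ... in.(N-1) are absent
# from the current directory. Hypothetical helper, not a Condor command.
count_missing_inputs() {
    total=$1
    missing=0
    p=0
    while [ "$p" -lt "$total" ]; do
        # in.$p is the file that "input = in.$(Process)" names for job p
        if [ ! -f "in.$p" ]; then
            echo "missing: in.$p" >&2
            missing=$((missing + 1))
        fi
        p=$((p + 1))
    done
    echo "$missing"
}
```

Running count_missing_inputs 100 before submitting prints 0 when every input file is present, and lists the missing names on stderr otherwise.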
To submit your jobs to Condor using your prepared submit file use the condor_submit command:
[user@hydra temp]$ condor_submit hello.sub
Submitting job(s)....................................................................................................
100 job(s) submitted to cluster 1.
You can check the Condor queue to see the status of your jobs using the condor_q command:
[user@hydra]$ condor_q

-- Submitter: hydra.phys.uwm.edu : <129.89.201.232:35435> : hydra.phys.uwm.edu
 ID      OWNER       SUBMITTED     RUN_TIME  ST PRI SIZE CMD
 16.19   someuser    1/21 16:53   0+00:00:00 R  0   2.7  hello
 16.20   someuser    1/21 16:53   0+00:00:00 R  0   2.7  hello
 16.21   someuser    1/21 16:53   0+00:00:00 R  0   2.7  hello
 16.22   someuser    1/21 16:53   0+00:00:00 R  0   2.7  hello
 16.24   someuser    1/21 16:53   0+00:00:00 R  0   2.7  hello
 16.27   someuser    1/21 16:53   0+00:00:00 R  0   2.7  hello
 16.30   someuser    1/21 16:53   0+00:00:00 R  0   2.7  hello
 16.35   someuser    1/21 16:53   0+00:00:00 R  0   2.7  hello
 16.37   someuser    1/21 16:53   0+00:00:01 R  0   2.7  hello
 16.40   someuser    1/21 16:53   0+00:00:00 R  0   2.7  hello
 16.45   someuser    1/21 16:53   0+00:00:00 R  0   2.7  hello
 16.48   someuser    1/21 16:53   0+00:00:00 R  0   2.7  hello
 16.55   someuser    1/21 16:53   0+00:00:00 R  0   2.7  hello
 16.60   someuser    1/21 16:53   0+00:00:00 R  0   2.7  hello
 16.64   someuser    1/21 16:53   0+00:00:00 R  0   2.7  hello
 16.70   someuser    1/21 16:53   0+00:00:00 R  0   2.7  hello
 16.71   someuser    1/21 16:53   0+00:00:00 R  0   2.7  hello
 16.72   someuser    1/21 16:53   0+00:00:00 R  0   2.7  hello
 16.73   someuser    1/21 16:53   0+00:00:00 R  0   2.7  hello
 16.75   someuser    1/21 16:53   0+00:00:00 R  0   2.7  hello
 16.79   someuser    1/21 16:53   0+00:00:00 R  0   2.7  hello
 16.80   someuser    1/21 16:53   0+00:00:00 R  0   2.7  hello
 16.91   someuser    1/21 16:53   0+00:00:00 R  0   2.7  hello
 16.93   someuser    1/21 16:53   0+00:00:00 R  0   2.7  hello
 16.95   someuser    1/21 16:53   0+00:00:00 R  0   2.7  hello
 16.96   someuser    1/21 16:53   0+00:00:00 R  0   2.7  hello
 16.97   someuser    1/21 16:53   0+00:00:00 R  0   2.7  hello
 16.98   someuser    1/21 16:53   0+00:00:00 R  0   2.7  hello
 16.99   someuser    1/21 16:53   0+00:00:00 R  0   2.7  hello

29 jobs; 0 idle, 29 running, 0 held
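For a long queue, a count of jobs in one state can be easier to read than the full listing. The following is a rough sketch that filters condor_q output with awk. It assumes the column layout shown above (job ID in the first field, state in the sixth), which may differ between Condor versions, and count_jobs_in_state is a hypothetical helper name.

```shell
# count_jobs_in_state STATE: read condor_q-style rows on stdin and print
# how many jobs are in STATE (for example R for running).
# Assumes the layout: ID OWNER DATE TIME RUN_TIME ST PRI SIZE CMD.
count_jobs_in_state() {
    awk -v st="$1" '
        # Only rows whose first field looks like cluster.process are jobs;
        # this skips the banner, the column header, and the summary line.
        $1 ~ /^[0-9]+\.[0-9]+$/ && $6 == st { n++ }
        END { print n + 0 }
    '
}
```

It would typically be used as condor_q | count_jobs_in_state R.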
You can check the status of the cluster nodes using the condor_nemo or condor_status command:
[user@hydra]$ condor_nemo
Name State Activity LoadAv
hydra.phys.uwm.e Unclaimed Idle 0.020000
nemo-slave0001. Unclaimed Idle 0.050000
nemo-slave0002. Unclaimed Idle 0.180000
nemo-slave0003. Unclaimed Idle 0.040000
nemo-slave0004. Owner Idle 0.530000
nemo-slave0005. Unclaimed Idle 0.000000
nemo-slave0006. Unclaimed Idle 0.000000
...
nemo-slave0291. Unclaimed Idle 0.000000
nemo-slave0292. Unclaimed Idle 0.000000
nemo-slave0293. Unclaimed Idle 0.000000
nemo-slave0294. Unclaimed Idle 0.000000
nemo-slave0295. Unclaimed Idle 0.000000
nemo-slave0296. Unclaimed Idle 0.000000

             Machines  Owner  Claimed  Unclaimed  Matched  Preempting
INTEL/LINUX       297     13        0        284        0           0
      Total       297     13        0        284        0           0
A node in the "Unclaimed" state is ready to run Condor jobs. A node in the "Owner" state is currently being used by jobs outside of Condor's control (possibly LDAS) and will not accept Condor jobs at this time. A node in the "Claimed" state has been matched with a Condor job.
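With several hundred nodes, a tally per state gives a quick picture of how much of the cluster is available. The sketch below is one way to produce it with awk. It assumes the four-column per-node layout shown above (name, state, activity, load average), and tally_node_states is a hypothetical helper name.

```shell
# tally_node_states: read condor_status-style rows on stdin and print
# one "State count" line per state, sorted by state name.
# Assumes per-node rows of the form: Name State Activity LoadAv.
tally_node_states() {
    awk '
        # A node row has exactly four fields ending in a numeric load
        # average; this skips headers, "..." and the summary table.
        NF == 4 && $4 ~ /^[0-9]+\.[0-9]+$/ { count[$2]++ }
        END { for (s in count) print s, count[s] }
    ' | sort
}
```

It would typically be used as condor_status | tally_node_states.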
From section 2.11 of the Condor Manual:
A directed acyclic graph (DAG) can be used to represent a set of programs where the input, output, or execution of one or more programs is dependent on one or more other programs.
In other words, DAGs can be used to represent an analysis pipeline, and Condor DAGMan can be used to ensure that the jobs making up the DAG are executed with the correct dependencies. Please see the link above for more details.
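As an illustration, a two-stage pipeline in which the second job must wait for the first could be described as in the sketch below. The JOB and PARENT ... CHILD keywords are DAGMan's own; the node names A and B and the file names pipeline.dag, stage1.sub, and stage2.sub are hypothetical.

```shell
# Write a minimal DAG description file (file and node names are hypothetical).
cat > pipeline.dag <<'EOF'
# Each JOB line names a node and the Condor submit file that runs it.
JOB A stage1.sub
JOB B stage2.sub
# B starts only after A has completed successfully.
PARENT A CHILD B
EOF
```

Such a file would then be handed to DAGMan with condor_submit_dag pipeline.dag, provided the two submit files exist.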
Condor users running DAGMan jobs on Nemo should use a directory under /people on the login node hydra.phys.uwm.edu for the log files that DAGMan monitors as part of the DAG. This is necessary because file locking over Linux NFS is not robust. The /people filesystem is local to the login node.
Please send email to nemo-admin@gravity.phys.uwm.edu if you need a directory in /people created for you.
Use the notification command in your Condor submit description file. See the documentation for the condor_submit command.
There are any number of reasons why jobs that are queued may not start:
Use the arguments command in your Condor submit description file. See the documentation for the condor_submit command.
Use the environment command in your Condor submit description file. See the documentation for the condor_submit command.
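A submit description file that uses the arguments and environment commands (and, for completeness, notification) might look like the sketch below. The executable name, argument values, and variable names are hypothetical. The quoted, space-separated environment form is the one documented for recent Condor releases; older releases used a semicolon-separated list, so check the condor_submit manual page for the installed version.

```shell
# Write a submit file that passes arguments and environment variables.
# Executable, arguments, and variable names are hypothetical examples.
cat > args.sub <<'EOF'
universe     = standard
executable   = analyze
arguments    = --segment $(Process) --verbose
environment  = "DATA_DIR=/tmp/data RUN_LABEL=test"
notification = Error
output       = out.$(Process)
error        = err.$(Process)
log          = analyze.log
queue 10
EOF
```

Here each of the ten jobs receives its own process number as the value of the hypothetical --segment option.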
No. If each instance of your program reads the same input from stdin then you can simply set
input = myInput
in your submit file to have each program read from the file myInput.
The Condor system includes the notion of jobs "flocking" from one Condor pool (such as Nemo) to another pool (perhaps at another LIGO Tier II site). This enables one to exploit as many resources (CPUs) as are available with no extra effort. At this time, however, flocking can only happen between pools (clusters) which do not live behind a firewall since the "submit machine" (the machine from which jobs are submitted) must be able to communicate directly with the machine on which a job runs.
The UWM-LSC group is helping test a development version of Condor which will enable flocking to proceed even if one or both of the machines involved in a Condor matchmaking are behind a firewall. When in place this version of Condor will allow jobs to flock from Nemo to other pools automatically, increasing job throughput.
In addition, this new version of Condor will allow LSC scientists to submit Condor jobs to the UWM-LSC Nemo cluster directly from their workstations using the Condor-G client. Hence LSC scientists will be able to develop codes and run them in production all from their desktop workstations.