Configuring and Deploying Condor C
This page documents efforts to tune Condor-C for the best performance on the LIGO Data Grid.
First test and numbers from P. Armor
submit file, executable script
Date: Mon, 25 Jun 2007 08:48:31 -0500 (CDT)
From: Paul Armor

Hi Scott,

here's a summary of the test I mentioned to you, that I'd run last week.

Basically, with the same users clobbering the cluster, I submitted the same vanilla test job (hostname ; sleep 180); I submit with the sleep of 180 to try to get as many as possible running at the same time, and I submit 1600 to try to get as many of the 780 machines as possible to run on at least one of the 2 cores. In both cases, my jobs ran as user parmor, whose prio was 1 (this was reflected in condor_userprio, and all other users having prio of 10).

It took MUCH longer for the condor-c jobs to run; only ~50 at a time running, and I/O back seemed "spurty", and slow (it seemed as though C-GAHP would cache up a bunch of results (the output of hostname), then do them at some threshold (either time or qty?) or something?).

In summary, it took the same jobs submitted via marlin ~9 minutes, and via ldg-portal ~161 minutes?

---------- Forwarded message ----------
Date: Wed, 20 Jun 2007 15:05:53 -0500 (CDT)

ldg-portal file access times:

logs range from 10:57 - 13:36

1600 jobs, logs updated with total of 42 uniq hour:minute timestamps (updated in "clumps" of 26, 74, 29, 21, 58, 21, 16, 62, 27, 76, 24, 74, 2, 24, 20, 60, 20, 18, 63, 19, 81, 19, 68, 12, 22, 26, 50, 24, 20, 60, 20, 76, 24, 75, 25, 20, 60, 20, 20, 60, 20, 64)

1600 jobs, outputs generated with a total of 39 uniq hour:minute timestamps (updated in "clumps" of 3, 23, 74, 29, 79, 21, 78, 27, 50, 26, 24, 21, 55, 24, 80, 20, 81, 19, 50, 31, 19, 80, 22, 76, 24, 80, 20, 50, 26, 24, 9, 66, 25, 80, 20, 80, 20, 50, 14)

condor_q on marlin showed many jobs in "C" at any time, may show jobs in "H" while staging files (input or output? Not sure), no more than 50 ever running concurrently, some samples (no "C" included):

  17 jobs; 9 idle, 3 running, 5 held
  58 jobs; 8 idle, 50 running, 0 held
  19 jobs; 0 idle, 19 running, 0 held
  11 jobs; 0 idle, 11 running, 0 held
  40 jobs; 5 idle, 31 running, 4 held
  57 jobs; 0 idle, 47 running, 10 held
  5 jobs; 0 idle, 0 running, 5 held

condor_userprio showed user parmor as running, my prio is 1, all others are 10

marlin file access times:

logs range from 14:02 - 14:11
  26, 96, 172, 181, 189, 207, 214, 257, 257, 1

outputs range from 14:00 - 14:08
  40, 106, 221, 119, 194, 191, 214, 257, 258
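The clump sizes reported in the message above can be cross-checked with a few lines of Python. This is only a sketch: the lists are copied from the message, and the helper name `summarize` is introduced here for illustration. It confirms that each list accounts for all 1600 jobs and shows how many jobs landed on each distinct timestamp on average.

```python
# Clump sizes copied from the message above: number of files sharing each
# distinct hour:minute timestamp.
ldg_portal_logs = [
    26, 74, 29, 21, 58, 21, 16, 62, 27, 76, 24, 74, 2, 24, 20, 60, 20, 18,
    63, 19, 81, 19, 68, 12, 22, 26, 50, 24, 20, 60, 20, 76, 24, 75, 25, 20,
    60, 20, 20, 60, 20, 64,
]
ldg_portal_outputs = [
    3, 23, 74, 29, 79, 21, 78, 27, 50, 26, 24, 21, 55, 24, 80, 20, 81, 19,
    50, 31, 19, 80, 22, 76, 24, 80, 20, 50, 26, 24, 9, 66, 25, 80, 20, 80,
    20, 50, 14,
]
marlin_logs = [26, 96, 172, 181, 189, 207, 214, 257, 257, 1]
marlin_outputs = [40, 106, 221, 119, 194, 191, 214, 257, 258]


def summarize(clumps):
    """Return (number of timestamps, total jobs, mean jobs per timestamp)."""
    total = sum(clumps)
    return len(clumps), total, total / len(clumps)


for name, clumps in [
    ("ldg-portal logs", ldg_portal_logs),
    ("ldg-portal outputs", ldg_portal_outputs),
    ("marlin logs", marlin_logs),
    ("marlin outputs", marlin_outputs),
]:
    count, total, mean = summarize(clumps)
    print(f"{name}: {count} timestamps, {total} jobs, {mean:.1f} jobs per timestamp")
```

The per-timestamp means make the contrast concrete: the ldg-portal runs spread 1600 files over about 40 timestamps, while the marlin runs needed only 9 or 10.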
Suggestions from A. De Smet
Date: Tue, 3 Jul 2007 17:54:21 -0500
From: Alan De Smet <xxxxxxxx@cs.wisc.edu>
Subject: Re: Next Condor/LIGO phone call on Monday (July 2)

< 2) Initial performance numbers from condor-c efforts at UWM.
<    Discussion of what enhancements are planned/needed to Condor-C to
<    help solve automatic job distribution across different pools.

Per the call yesterday and Scott's reports of initial performance numbers looking bad, the two most important settings to change are GRIDMANAGER_MAX_PENDING_SUBMITS_PER_RESOURCE (GMPSPR) and C_GAHP_CONTACT_SCHEDD_DELAY (CGCSD). For all practical purposes, you'll only get GMPSPR jobs transferred from your initial submit machine to the second schedd every 2xCGCSD seconds.

The defaults are

  GRIDMANAGER_MAX_PENDING_SUBMITS_PER_RESOURCE=5
  C_GAHP_CONTACT_SCHEDD_DELAY=20

So you'll only get 5 jobs every 40 seconds, which will explain most of the slowdown you're seeing.

The second setting (CGCSD) is how often the gridmanager (via the GAHP) is willing to bother the remote schedd. Every 20 seconds seems a reasonable default to me, but if you're interested in optimizing for responsiveness, you can make it more aggressive at the cost of keeping the remote schedd more busy.

The first setting is how many jobs can be in the process of being transferred over. Since we won't know that a job has been accepted until the next poll in CGCSD*, this is a big problem. I'd suggest making it much larger. I suggest 100 as a starting point.

*As mentioned above, actually 2 CGCSD intervals, as there is some back-and-forth going on.

--
Alan De Smet
Condor Project Research
xxxxxxxx@cs.wisc.edu
http://www.condorproject.org/
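The rule of thumb in the message above (at most GMPSPR jobs handed over every 2 x CGCSD seconds) can be turned into a back-of-the-envelope estimate. The sketch below is an assumption-laden model, not a description of how the gridmanager is implemented: the function name `transfer_estimate_s` is made up for illustration, and it ignores job run time, file staging, and scheduling on the remote pool, so it is not expected to reproduce measured wall-clock times.

```python
def transfer_estimate_s(n_jobs, max_pending, contact_delay_s):
    """Rough time in seconds to hand n_jobs over to the remote schedd.

    Assumes, per the rule of thumb quoted above, that at most `max_pending`
    jobs are transferred every 2 * `contact_delay_s` seconds.
    """
    cycle_s = 2 * contact_delay_s
    cycles = -(-n_jobs // max_pending)  # ceiling division
    return cycles * cycle_s


# Defaults quoted in the message: 5 pending submits, 20 second delay.
default_s = transfer_estimate_s(1600, max_pending=5, contact_delay_s=20)

# Suggested starting point of 100 pending submits, with a 5 second delay.
tuned_s = transfer_estimate_s(1600, max_pending=100, contact_delay_s=5)

print(f"defaults: {default_s} s ({default_s / 60:.0f} min)")
print(f"tuned:    {tuned_s} s ({tuned_s / 60:.1f} min)")
```

Under this model the hand-off of 1600 jobs drops from hours to a few minutes, which suggests that with the larger setting some other limit would dominate the total run time.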
Simple test by S. Koranda after A. De Smet's suggestions
I edited the local config file /opt/condor/home/condor_config.log on ldg-portal.phys.uwm.edu (the stand-alone Condor pool) and added:
  GRIDMANAGER_MAX_PENDING_SUBMITS_PER_RESOURCE=100
  C_GAHP_CONTACT_SCHEDD_DELAY=5
To compare apples to apples, I then ran the exact same suite of jobs that P. Armor ran. The jobs are simple: they call hostname and sleep for 3 minutes.
The first job (of any) was submitted to a "grid" resource at 09:24:58, as indicated in the job's log file. The last job (of all) terminated at 10:51:07. So the whole run took about 86 minutes. This is definitely an improvement over the ~161 minutes reported above for ldg-portal.
Note that during the test no more than 100 jobs ran on the cluster at any given time. Often there would be 100 running, then these would "drain" and be replaced by another 100.
Output from ps auwwwwwx | grep condor on ldg-portal.phys.uwm.edu while jobs were being serviced:
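The timing figures for this test can be checked with a short calculation. The sketch below uses only numbers stated on this page (the start and end times, 1600 jobs, 3 minute sleep, and the observed ceiling of 100 concurrent jobs). It assumes, as a simplification, that jobs run in back-to-back waves of 100 with no gaps; the variable names are illustrative.

```python
from datetime import datetime

# Start and end times taken from the job log files, as reported above.
start = datetime.strptime("09:24:58", "%H:%M:%S")
end = datetime.strptime("10:51:07", "%H:%M:%S")
elapsed_min = (end - start).total_seconds() / 60

# Simplifying assumption: jobs run in full waves of `max_concurrent`,
# each wave lasting exactly the sleep time, with no idle gaps between waves.
n_jobs = 1600
max_concurrent = 100
sleep_min = 3
waves = -(-n_jobs // max_concurrent)  # ceiling division
floor_min = waves * sleep_min

print(f"measured run time: {elapsed_min:.1f} min")
print(f"floor with {max_concurrent} concurrent jobs: {floor_min} min in {waves} waves")
```

If the 100 job ceiling holds for the whole run, the floor computed here accounts for a large share of the measured time, with the remainder spent between waves.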
skoranda  7944  1.5  3.2 47964 30820 ?     S  09:24 0:23 condor_gridmanager -f -C (Owner=?="skoranda"&&JobUniverse==9) -S /tmp/condor_g_scratch.0xb0ec90.7928
skoranda  7946  0.0  0.3 20020  3720 ?     S  09:24 0:00 /opt/condor/sbin/condor_c-gahp -f -s marlin.phys.uwm.edu -P condorcm.phys.uwm.edu
skoranda  7947  1.9  0.4 20256  3960 ?     S  09:24 0:29 /opt/condor/sbin/condor_c-gahp_worker_thread -f -s marlin.phys.uwm.edu -P condorcm.phys.uwm.edu
skoranda  7948  2.2  0.4 20576  4252 ?     S  09:24 0:33 /opt/condor/sbin/condor_c-gahp_worker_thread -f -s marlin.phys.uwm.edu -P condorcm.phys.uwm.edu
root     10160  0.0  0.0 55932   708 pts/0 S+ 09:49 0:00 grep condor
$Id: tuning01.html,v 1.4 2007/07/16 20:55:19 skoranda Exp $