Configuring and Deploying Condor C

This page documents efforts to tune Condor-C for the best performance on the LIGO Data Grid.


First test and numbers from P. Armor

submit file
executable script

Date: Mon, 25 Jun 2007 08:48:31 -0500 (CDT)
From: Paul Armor 

Hi Scott,
here's a summary of the test I mentioned to you, that I'd run last week. 
Basically, with the same users clobbering the cluster, I submitted the 
same vanilla test job (hostname ; sleep 180); I submit with the sleep of 
180 to try to get as many as possible running at the same time, and I 
submit 1600 to try to get as many of the 780 machines as possible to run 
on at least one of the 2 cores.  In both cases, my jobs ran as user 
parmor, whose prio was 1 (this was reflected in condor_userprio, 
and all other users having prio of 10).  It took MUCH longer for the 
condor-c jobs to run; only ~50 at a time running, and I/O back seemed 
"spurty", and slow (it seemed as though C-GAHP would cache up a bunch of 
results (the output of hostname), then do them at some threshold (either 
time or qty?) or something?).

In summary, it took the same jobs submitted via marlin ~9 minutes, and via 
ldg-portal ~161 minutes.


---------- Forwarded message ----------
Date: Wed, 20 Jun 2007 15:05:53 -0500 (CDT)

ldg-portal
 	file access times:
 		logs range from 10:57 - 13:36
 		1600 jobs, logs updated with total of 42 uniq hour:minute
 			timestamps (updated in "clumps" of 26, 74, 29, 21,
 			 58, 21, 16, 62, 27, 76, 24, 74, 2, 24, 20, 60,
 			 20, 18, 63, 19, 81, 19, 68, 12, 22, 26, 50, 24,
 			 20, 60, 20, 76, 24, 75, 25, 20, 60, 20, 20, 60,
 			 20, 64)
 		1600 jobs, outputs generated with a total of 39
 			uniq hour:minute timestamps (updated in
 			"clumps" of 3, 23, 74, 29, 79, 21, 78, 27, 50, 26,
 			 24, 21, 55, 24, 80, 20, 81, 19, 50, 31, 19, 80,
 			 22, 76, 24, 80, 20, 50, 26, 24, 9, 66, 25, 80,
 			 20, 80, 20, 50, 14)

 		condor_q on marlin showed many jobs in "C" at any time,
 		 may show jobs in "H" while staging files (input or
 		 output?  Not sure), no more than 50 ever running
 		 concurrently, some samples (no "C" included):
 			17 jobs; 9 idle, 3 running, 5 held
 			58 jobs; 8 idle, 50 running, 0 held
 			19 jobs; 0 idle, 19 running, 0 held
 			11 jobs; 0 idle, 11 running, 0 held
 			40 jobs; 5 idle, 31 running, 4 held
 			57 jobs; 0 idle, 47 running, 10 held
 			5 jobs; 0 idle, 0 running, 5 held

 	condor_userprio showed user parmor as running, my prio is 1, all
 		others are 10

marlin
 	file access times:
 		logs range from 14:02 - 14:11
 			26, 96, 172, 181, 189, 207, 214, 257, 257, 1

 		outputs range from 14:00 - 14:08
 			40, 106, 221, 119, 194, 191, 214, 257, 258
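
For reference, the vanilla test described above (hostname followed by a 3-minute sleep, queued 1600 times) corresponds to a submit description roughly like the sketch below. The actual submit file and script are linked at the top of this section; the names used here (sleep_test.sub, sleep_test.sh) are illustrative assumptions.

# sleep_test.sub -- hypothetical vanilla-universe submit file for the
# test above; sleep_test.sh is assumed to contain "hostname ; sleep 180"
universe     = vanilla
executable   = sleep_test.sh
output       = sleep_test.$(Cluster).$(Process).out
error        = sleep_test.$(Cluster).$(Process).err
log          = sleep_test.$(Cluster).log
notification = never
queue 1600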

Suggestions from A. De Smet

Date: Tue, 3 Jul 2007 17:54:21 -0500
From: Alan De Smet <xxxxxxxx@cs.wisc.edu>
Subject: Re: Next Condor/LIGO phone call on Monday (July 2)

> 2) Initial performance numbers from condor-c efforts at UWM.
>  	Discussion of what enhancements are planned/needed to Condor-C to 
>  	help solve automatic job distribution across different pools.

Per the call yesterday and Scott's reports of initial performance
numbers looking bad, the two most important settings to change
are GRIDMANAGER_MAX_PENDING_SUBMITS_PER_RESOURCE (GMPSPR) and
C_GAHP_CONTACT_SCHEDD_DELAY (CGCSD).  For all practical purposes, you'll
only get GMPSPR jobs transferred from your initial submit machine
to the second schedd every 2xCGCSD seconds.  The defaults are:

GRIDMANAGER_MAX_PENDING_SUBMITS_PER_RESOURCE=5
C_GAHP_CONTACT_SCHEDD_DELAY=20

So you'll only get 5 jobs every 40 seconds, which will explain
most of the slowdown you're seeing.

The second setting (CGCSD) is how often the gridmanager (via the GAHP) is
willing to bother the remote schedd.  Every 20 seconds seems a
reasonable default to me, but if you're interested in optimizing
for responsiveness, you can make it more aggressive at the cost
of keeping the remote schedd more busy.

The first setting is how many jobs can be in the process of being
transferred over.  Since we won't know that a job has been
accepted until the next poll in CGCSD*, this is a big problem.
I'd suggest making it much larger.  I suggest 100 as a starting
point.

*As mentioned above, actually 2 CGCSD intervals, as there is some
back-and-forth going on.

-- 
Alan De Smet                              Condor Project Research
xxxxxxxx@cs.wisc.edu                 http://www.condorproject.org/
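
As a rough sanity check on these numbers (this back-of-the-envelope estimate is not from the original email): with the defaults, only about GMPSPR = 5 jobs move to the remote schedd every 2 x CGCSD = 40 seconds, so pushing all 1600 test jobs takes on the order of

	(1600 / 5) x 40 s = 12800 s, or roughly 213 minutes

which is the same order of magnitude as the ~161 minutes observed on ldg-portal above, compared with ~9 minutes for direct submission to marlin.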

Simple test by S. Koranda after A. De Smet's suggestions

I edited the local config file /opt/condor/home/condor_config.log on ldg-portal.phys.uwm.edu (the stand-alone Condor pool) and added:

GRIDMANAGER_MAX_PENDING_SUBMITS_PER_RESOURCE=100
C_GAHP_CONTACT_SCHEDD_DELAY=5
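
(A generic Condor admin aside, not part of the original notes: after editing the local config file, condor_reconfig asks the running daemons to re-read their configuration, and condor_config_val can confirm the values actually in effect; freshly started gridmanager/C-GAHP processes pick up the new settings in any case.)

condor_reconfig
condor_config_val GRIDMANAGER_MAX_PENDING_SUBMITS_PER_RESOURCE
condor_config_val C_GAHP_CONTACT_SCHEDD_DELAY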

To compare apples to apples, I then ran the exact same suite of jobs that P. Armor ran. The jobs are simple: they call hostname and then sleep for 3 minutes.

The first job (of any) was submitted to a "grid" resource at 09:24:58, as indicated in the job's log file. The last (of all) terminated at 10:51:07, so the whole run took about 86 minutes. This is definitely an improvement.

Note that during the test no more than 100 jobs ran on the cluster at any given time. Often there would be 100 running; these would then "drain" and be replaced by another 100.
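
For completeness, a Condor-C submission like the one described above would look roughly like the grid-universe sketch below. The remote schedd and pool names are taken from the ps output that follows; the executable and file names are the same illustrative assumptions used earlier.

# hypothetical grid-universe (Condor-C) submit file
universe      = grid
grid_resource = condor marlin.phys.uwm.edu condorcm.phys.uwm.edu
executable    = sleep_test.sh
output        = sleep_test.$(Cluster).$(Process).out
error         = sleep_test.$(Cluster).$(Process).err
log           = sleep_test.$(Cluster).log
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
queue 1600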

Output from ps auwwwwwx | grep condor on ldg-portal.phys.uwm.edu while jobs were being serviced:

skoranda  7944  1.5  3.2  47964 30820 ?        S    09:24   0:23 condor_gridmanager -f -C (Owner=?="skoranda"&&JobUniverse==9) -S /tmp/condor_g_scratch.0xb0ec90.7928
skoranda  7946  0.0  0.3  20020  3720 ?        S    09:24   0:00 /opt/condor/sbin/condor_c-gahp -f -s marlin.phys.uwm.edu -P condorcm.phys.uwm.edu
skoranda  7947  1.9  0.4  20256  3960 ?        S    09:24   0:29 /opt/condor/sbin/condor_c-gahp_worker_thread -f -s marlin.phys.uwm.edu -P condorcm.phys.uwm.edu
skoranda  7948  2.2  0.4  20576  4252 ?        S    09:24   0:33 /opt/condor/sbin/condor_c-gahp_worker_thread -f -s marlin.phys.uwm.edu -P condorcm.phys.uwm.edu
root     10160  0.0  0.0  55932   708 pts/0    S+   09:49   0:00 grep condor
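
(Another generic aside, not part of the original test notes: the gridmanager command line above shows that these jobs have JobUniverse == 9, so the Condor-C jobs can be watched on the submit side and on the remote schedd with something like the following.)

# grid-universe (Condor-C) jobs still queued on the local schedd
condor_q -constraint 'JobUniverse == 9'

# the corresponding jobs on the remote schedd behind Condor-C
condor_q -name marlin.phys.uwm.edu -pool condorcm.phys.uwm.edu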




Supported by the National Science Foundation. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).