LIGO Condor telecon September 24, 2007

Condor: Peter Keller, Kent Wegner, Alain Roy
LIGO: Stuart Anderson, Scott Koranda, Greg Mendell, Vladimir Dergachev

General
-------

1) Call for volunteer secretary (take and post minutes)
    Scott, when on the call, will take notes.

2) Discuss switching to an earlier time to facilitate participation from Europe
    It appears that the time slot of Friday at 11AM Pacific is a possibility for everyone on the call today and matches the requirements for Carsten to be able to join.

VDT
---

3) I believe Ed Maros has generated a list of additional Globus client libraries needed to run the current LIGO TCLGlobus client applications. Where does this currently stand?

4) Release schedule for VDT 1.8.1
    Tuesday or Wednesday of this week most likely.

5) Verify that vdt-version will return a unique string when 3-digit VDT releases are patched.

New Condor points of interest
-----------------------------

6) Stagger starting large DAGs that have expensive transient start-up costs. For example, DAGMan to allow dynamic changing of MAXJOBS? Allowing nodes to change color with time? Other ideas from Scott's meeting in Madison
    Scott explained an idea that he discussed with Miron and Todd: have jobs scheduled and resources claimed, but have the Starter not actually start the job and instead let it wait for a configurable timeout. Stuart pointed out that the cluster would not be used efficiently. Scott agreed but is willing to pay that cost to have the jobs staggered so that NFS is not killed.
    A long discussion ensued about whether a dynamic DAGMan MAXJOBS would be sufficient and would work. It turns out that sleep is not implemented in the Condor standard universe, because sleep is sometimes implemented using signals. Stuart wondered about nanosleep, since it may not use any signals.
    Pete then came up with what might be the winning short-term solution, and Kent will investigate what it would take to implement.
    In particular, Condor already has a crontab-like start functionality that matches jobs and gets everything ready to start, but delays the Starter until a fixed point in time. The idea is that, for the short term, rather than getting a token from some new yet-to-be-developed Condor service or using a dynamic MAXJOBS, DAGMan could stagger job starts by passing "do not start before" times to the existing crontab Starter functionality.

7) Managing multi-threaded applications.
    Pete to investigate what Condor can do today. Vladimir offered the additional option that his code could run on a Condor-specified number of threads for opportunistic running.

8) [condor-admin #16017] condor_q analyze support for Local Universe
    Pete to open an internal ticket at very low priority to clean up the output.

9) [condor-users 10 Sep 2007 email] Local universe scheduling
    Pete to investigate whether the Negotiator really needs to be involved in scheduling/starting Local Universe jobs.

Condor-DAGMan
-------------

10) [condor-admin #15811] LIGO: intra-DAG node prioritization and throttling
    Coloring nodes within a DAG for priority and "maxjobs"
    Kent is 75% done and should be finished this week. Will just require an updated DAGMan binary to deploy on existing LIGO Condor pools.

11) [condor-admin #15836] LIGO: enhancement request to dagman
    Merging of sub-DAGs for easier management by DAGMan and users.
    Peter to work on this, time permitting.

12) [condor-admin #15848] LIGO: enhancement request to dagman
    Problems with rescue DAGs when running with sub-DAGs
    Not started yet.

13) [condor-admin #16010] LIGO: condor_hold shouldn't kill DAGMan
    Not started yet.

Condor-C
--------

14) Any update on multi-homed condor-c setup at UWM with 6.9.4?
    No update; Scott had to leave the call early.

Condor-misc
-----------

15) Any problems with clipped port to RHEL5 or initial work on full port?
    Clipped port after 1/2 day of Pete's time; ~2 months for full port.

16) [condor-admin #15277] LIGO DAGMan spool directory efficiency.
    How to avoid making O(10^5-10^6) copies of executables.
    No update.

17) [condor-admin #15669] LIGO RFE to optionally delete stdout/stderr files automatically
    No update.

18) [condor-admin #14006] LIGO: append to stdout/err files on re-execution
    No update.

19) [condor-admin #15287] LIGO X509 certificate management enhancement request.
    No update.
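
Illustrative sketch for item 6 (not discussed on the call): the staggered-start idea amounts to giving each job, or each small batch of jobs, a "do not start before" time that is offset from the previous one. The Python below is a minimal sketch of that arithmetic only. It assumes the crontab-like Starter functionality is driven by the `deferral_time` and `deferral_window` submit commands; the function names, the batch size, and the spacing values are hypothetical, and how DAGMan would actually pass these times along is exactly what Kent is to investigate.

```python
from typing import List


def staggered_start_times(first_start: int, n_jobs: int,
                          spacing_s: int, batch_size: int = 1) -> List[int]:
    """Return one not-before time (Unix epoch seconds) per job.

    Jobs are released in batches of `batch_size`; each batch starts
    `spacing_s` seconds after the previous one. All names and values
    here are hypothetical, chosen for illustration.
    """
    if n_jobs < 0 or spacing_s < 0 or batch_size < 1:
        raise ValueError("n_jobs and spacing_s must be >= 0, batch_size >= 1")
    return [first_start + (i // batch_size) * spacing_s for i in range(n_jobs)]


def submit_fragment(start_time: int, window_s: int) -> str:
    """Format the two submit-description lines for one job.

    Assumption: the pool honors `deferral_time` (epoch seconds) and
    `deferral_window` (seconds a late job may still start).
    """
    return f"deferral_time = {start_time}\ndeferral_window = {window_s}\n"


if __name__ == "__main__":
    # Example: 5 jobs, 2 at a time, one batch per minute.
    for t in staggered_start_times(1000, 5, 60, batch_size=2):
        print(submit_fragment(t, window_s=30))
```

The trade-off is the one Stuart raised: claimed slots sit idle until their start time arrives, in exchange for spreading the start-up load on NFS.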