LIGO Condor telecon February 15, 2008

Condor:
    Todd Tannenbaum
    Kent Wenger

LIGO Scientific Collaboration:
    Stuart Anderson
    Carsten Aulbert
    Steffen Grunewald
    Greg Mendell

Added during meeting:

A) 7.0.0 bug found that may be of interest to LIGO:
    The ckpt server must be of the same word size as the shadow for
    checkpointing to work, e.g., no mixing of x86 and x86_64 between
    submit machine and checkpoint servers. I am not certain, but I
    believe Todd indicated this was fixed for 7.0.1.
B) [condor-admin #17465] condor_master core dump
    Mistakenly filed without the email subject prefix of "LIGO" so not
    on the LIGO support tickets page. This only happens on a
    misconfigured machine, but Todd plans on investigating since it is
    easily reproducible.

VDT
---
1) Recent Debian Etch issues to consider for the next VDT release.
    Does Alain have all the feedback needed from Carsten and Steffen?
    Carsten reports: I think yes, if not please tell me what you need
    (or if you want to have access to the box).
    https://n0.aei.uni-hannover.de/LIGO/ldg-install.bz2
    This is the complete log of the install script, mostly pacman -v all
1.5) Schedule for 1.9 release with Debian support.
    Approximately March 17th, assuming no serious obstacles surprise
    us. It will probably be labeled as VDT 1.8.2.

Condor-DAGMan
-------------
2) Stagger starting large DAGs that have expensive transient start up costs.
    Testing DAGMAN_SUBMIT_DELAY to solve this problem at UWM or AEI?
    Scott reports: I have asked Matthew Pitkin to test this for us
    since he was recently submitting DAGs that met these criteria. He
    has agreed to try it. I will report what he/we find...
3) [condor-admin #15848] LIGO: enhancement request to dagman
    Problems with rescue DAGs when running with sub-DAGs.
    Timeline for PR 598 and 788?
    Kent predicts "a couple of weeks" for this work.
4) [condor-admin #16010] LIGO: condor_hold shouldn't kill DAGMan
    Additional reports from LIGO users that very large DAGs do not
    always recover from being put on hold and restarted. Not clear yet
    if this is a Condor bug, but general agreement this enhancement is
    worth doing, if for no other reason than that it takes several
    hours to restart a large LIGO DAG (apparently limited by file I/O).
    Development not scheduled yet.
5) [condor-admin #15836] LIGO: enhancement request to dagman
    Merging of sub-dags for easier management by DAGMan and users.
    Pete to investigate after RHEL ports are completed.
5.5) Any technical feedback from ISI/Pegasus meeting earlier this week
    on where the integrated Pegasus/DAGMan development effort is headed?
    Todd thinking about pushing workflow all the way down to the startd
    on execute machine to handle a large number of very short running
    jobs.
    Todd thinking about submitting jobs via updates to a table in a
    database.

Condor-release
--------------
6) Status of RHEL5/CentOS5 x86_64 full port including any initial
    feedback from Syracuse or Hannover on x86 7.0.1 pre-release testing.
    This code now "Franks", i.e., code complete/compiles/links. Even
    better it passes the Standard Universe test suite and is now
    running on NMI to look for unexpected and undesirable side effects
    on other regression tests.
7) 7.0.1 release still scheduled for Feb 20?
    Code freeze is today (Feb 15) so still on schedule.
8) Packaging plans (/lib64 and root location, /opt/condor-x.y.z, /usr).
    Is it required to set LIB when using system default locations?
    RH packaging does not yet include Standard Universe but initial
    package will set RELEASE_DIR to /usr and put perl and java support
    files in /usr/share/condor. It appears that once RH packages
    standard universe there may be a split of libraries as some will go
    in the traditional LIB location but others will likely be installed
    in a condor specific location.
    It is not yet clear which version of condor RH will first start
    bundling that includes the standard universe. Note in passing, Todd
    estimates that approximately 20% of condor sites use the Standard
    Universe.
9) How to avoid thousands of jobs "flushing through a funky node"?
    e.g., if a job fails on a slot consider automatically releasing the
    machine claim for that user?
    Waiting for Zach's best practices recipes for dealing with "black
    hole machines". In general, the most robust method of dealing with
    this appears to be flagging bad execute machines based on
    anomalously short job run times, for example <1 sec jobs are a
    strong indicator of a black hole (missing shared library, missing
    NFS mounted filesystem, ...).
    Todd discussed a Hawkeye solution that publishes black hole status
    into a ClassAd that jobs can match against. Another possibility
    that may be more LIGO friendly is to modify the startd policy to
    keep a black list of users whose jobs run too fast on a particular
    machine and not accept any more of them. Todd will circulate an
    example expression statement for this.
9.5) Regression testing.
    What can/should LIGO provide to help with regression testing of
    future Condor releases for LIGO workflows, e.g., binary file I/O,
    large DAGs, backfill jobs (issues like ticket 2146), non-inspiral
    search codes, ...
    Is there a common test environment that both LIGO and Condor could
    adopt to allow for dual-testing?
    If LIGO writes up additional regression tests using the Metronome
    software (perl scripts) Condor would be willing to run short jobs
    (< few minutes) as part of the nightly test suite run in NMI or
    longer jobs (< few hours) as part of the weekly tests. Example test
    scripts are distributed with condor, e.g.,
    src/condor_tests/job_core_onexithold_van.run (more generally any
    *.run file in this directory). A longer-term solution may be for
    LIGO to get an account at NMI and set up/run its own tests there.
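
Illustration for item 9 (not from the meeting): a minimal Python sketch
of the "anomalously short job run times" heuristic, assuming job history
is already available as (machine, runtime in seconds) pairs. The
function name find_black_holes, the record layout, and the thresholds
min_jobs and min_fraction are hypothetical; only the <1 sec cutoff comes
from the discussion above. This is not the startd expression Todd plans
to circulate, and it does not call any Condor API.

```python
from collections import defaultdict


def find_black_holes(job_records, max_runtime_s=1.0, min_jobs=5, min_fraction=0.8):
    """Return the set of machines whose jobs mostly ended suspiciously fast.

    job_records: iterable of (machine_name, runtime_seconds) pairs.
    max_runtime_s: runs shorter than this count as "too fast" (<1 sec above).
    min_jobs, min_fraction: hypothetical thresholds so that a single short
    job does not flag an otherwise healthy machine.
    """
    total = defaultdict(int)
    short = defaultdict(int)
    for machine, runtime_s in job_records:
        total[machine] += 1
        if runtime_s < max_runtime_s:
            short[machine] += 1
    return {
        machine
        for machine in total
        if total[machine] >= min_jobs
        and short[machine] / total[machine] >= min_fraction
    }


# Made-up sample data: node01 eats jobs instantly, node02 runs them normally,
# node03 has too few jobs to judge.
records = (
    [("node01", 0.2)] * 6
    + [("node02", 3600.0)] * 6
    + [("node03", 0.1), ("node03", 0.3)]
)
print(sorted(find_black_holes(records)))  # prints ['node01']
```

The same counting could be kept per (user, machine) pair to approximate
the per-user black list idea; that variant is likewise only an
assumption about how such a list might be built.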
Previously delayed tasks for next development branch
----------------------------------------------------
10) [condor-admin #15277] LIGO DAGMan spool directory efficiency.
    How to avoid making O(10^5-10^6) copies of executables.
11) [condor-admin #15669] LIGO RFE to optionally delete stdout/stderr
    files automatically.
12) [condor-admin #14006] LIGO: append to stdout/err files on
    re-execution.
13) [condor-admin #15287] LIGO X509 certificate management enhancement
    request.
14) [condor-admin #17092] LIGO: Local universe scheduling latencies.

    No work yet on items 10-14. LIGO will provide a prioritized list of
    these (and other) open RFE tickets.

Recent Condor issues
-----------------
15) [condor-support #2158] LIGO: multiple schedd core dumps
    Investigate or close?
    Should be closed.
16) condor_ckpt_server related issues.
    Todd to consider short term and cost effective patches short of new
    daemon core porting and sooner than generic file storage tool?
    Did patch for auto shutdown on disappearing parent make 7.0.1?
    Patch made it for 7.0.1.
17) [condor-admin #17291] LIGO: job on hold without a reason.
    Not reproducible at UW.
18) [condor-admin #17283] ancillary suggestion for a crash report tool.
    General agreement, but no action taken.

Condor-C
--------
19) Any update on multi-homed condor-c setup at UWM?
    No update.

Previous Condor issues
----------------------
20) [condor-admin #17168] LIGO: Shadow failures to connect to schedd
    Todd to consider a short term fix for the stable branch while
    Tristan starts work on a completely new Shadow that supports all
    Universes equally well for the development branch.
    A strong candidate for being near the top of LIGO's prioritized RFE
    wish list.
21) [condor-admin #17237] LIGO: remote file IO despite WantRemoteIO = FALSE
    No update.
22) [condor-admin #17159] LIGO: fcntl 64bit bug
    Should be closed with comments from Peter at the last meeting.
    Closed.
23) [condor-admin #17239] LIGO: condor_submit stuck in CPU spin-loop
    No update.
24) [condor-admin #17219] LIGO: stdout occasionally lost for
    jobmanager-condor
25) [condor-admin #17136] LIGO: condor_run intermittently returning NULL
    results
    Email exchange in RUST on additional possibilities but no
    definitive plan yet.
26) [condor-admin #17143] LIGO: ImageSize update problem
    Peter investigating what it would take to have the same level and
    frequency of reporting for the Standard Universe as for the Local
    Universe. Corollary: if this is too much for the Standard Universe
    it is probably too much for vanilla, so how do we scale back on
    large pools?
    No update.