LIGO Condor telecon
February 1, 2008

Condor:
  Peter Keller
  Zach Miller

LIGO Scientific Collaboration:
  Stuart Anderson
  Carsten Aulbert
  Steffen Grunewald
  Greg Mendell

VDT
---

1) Recent Debian Etch issues to consider for the next VDT release.

   Carsten has sent some of the requested information to the condorligo
   mailing list.

Condor-DAGMan
-------------

2) Stager starting large DAGs that have expensive transient start up costs.
   testing DAGMAN_SUBMIT_DELAY to solve this problem at UWM?

   Scott and Steffen have recently run into additional users where this
   may be useful but it has not been tried yet. See the condorligo
   mailing list for a few recent suggestions on how to test this.

3) [condor-admin #15848] LIGO: enhancement request to dagman
   Problems with rescue DAGs when running with sub-DAGs.

   Kent added post facto:
   I'm working on a fix for this (see Gnats PRs 598 and 788). Basically,
   we want DAGMan to be more sophisticated in dealing with the rescue
   DAGs in general, and part of that would enable us to automatically
   run the rescue DAGs in the nested case. I'm not sure about the
   timeline, though. Also, this would 99+% certainly be seen in the 7.1
   series, not the 7.0 series.

4) [condor-admin #16010] LIGO: condor_hold shouldn't kill DAGMan

   No work on this, but it came up again after the meeting when a user
   lost a week's worth of work due to the incomplete recovery of a
   DAGMan process after it was killed for being put on hold by the user.

5) [condor-admin #15836] LIGO: enhancement request to dagman
   Merging of sub-dags for easier management by DAGMan and users.

   Peter getting ready to work on this now that the RHEL5 port is
   feature complete.

Condor-release
--------------

6) Status/schedule of RHEL5/CentOS5 full port including any initial
   feedback from Syracuse on 7.0.1pre release testing.

   This is now feature complete for x86, and has been submitted for code
   review, followed by some additional testing before being checked in
   to the stable branch.
   There is every expectation that this will make the 7.0.1 release
   scheduled for Feb 20. Peter provided an updated 7.0.1 pre-release for
   testing by Duncan and Steffen.

   It was also discussed after the meeting that it should be relatively
   easy to port to RHEL5 x86_64 now that the x86 port is done. Not yet
   clear what the schedule is for that, but both x86 and x86_64 are
   equally important for LIGO.

7) Packaging plans (/lib64 and root location, /opt/condor-x.y.z, /usr, ...)
   See posting to this list for a few questions, i.e.,
   http://lists.aei.mpg.de/cgi-bin/mailman/private/condorligo/2008-January/000074.html

   The plan is for RedHat to distribute 7.0.1 as their first bundled
   release, and that they will provide additional packaging updates
   up-stream to the Condor team (that expects to accept the changes) for
   this release. No details available yet on what changes RedHat is
   planning on.

8) Trivial example configuration file typo:
   condor-7.0.0/etc/examples/condor_config.generic:## HOSTALLW_WRITE, or else your GlideIns won't be able to join your pool.

   No discussion.

9) How to avoid thousands of jobs "flushing through a funky node"?
   e.g., if a job fails on a slot consider automatically releasing the
   machine claim for that user?

   The technical term for this is "Black Hole machine". While Black
   Holes are one of the prime targets for LIGO research we are hoping to
   avoid creating them in our computing systems. Zach agreed to send
   some of the best practices recipes for dealing with this situation.

Previously delayed tasks for next development branch
----------------------------------------------------

10) [condor-admin #15277] LIGO DAGMan spool directory efficiency.
    How to avoid making O(10^5-10^6) copies of executables.

11) [condor-admin #15669] LIGO RFE to optionally delete stdout/stderr
    files automatically.

12) [condor-admin #14006] LIGO: append to stdout/err files on re-execution.

13) [condor-admin #15287] LIGO X509 certificate management enhancement
    request.
14) [condor-admin #17092] LIGO: Local universe scheduling latencies.

    No action on 10)-14), though they are still considered worth working
    on.

Recent Condor issues
--------------------

15) [condor-support #2158] LIGO: multiple schedd core dumps
    Investigate or close?

    Up to the Condor team to close or not.

16) condor_ckpt_server related issues.
    Todd to consider short term and cost effective patches short of new
    daemon core porting and sooner than generic file storage tool?

    Todd reported prior to the meeting that there is a patch going in
    for 7.0.1 to have the checkpoint server shut itself down if its
    parent goes away.

17) [condor-admin #17291] LIGO: job on hold without a reason

    This has been looked at by the Condor team, but not reproduced yet.

18) [condor-admin #17283] ancillary suggestion for a crash report tool:
    Please consider adding a Condor-specific postmortem data collection
    tool to make bug reports easier, e.g.,

    # condor_crash_report node174 -event "12/6 14:59:06" -history 3600

    Could aggregate the following and possibly phone home with the
    information:
    * Configuration files of interest (submit machine, central pool
      manager, and specified node174).
    * All log files for 1 hour before the specified time from these
      machines.
    * Run /usr/bin/ident on all Condor binaries.
    * Search for and include any core files found, possibly running a
      few standard gdb commands on the local machine first.
    * Basic information on the host computer (OS, patch level, memory,
      CPU, ...)
    ...

    Still sounds like a good idea for someone to do...

Condor-C
--------

19) Any update on multi-homed Condor-C setup at UWM?

    No report.

Previous Condor issues
----------------------

20) [condor-admin #17168] LIGO: Shadow failures to connect to schedd
    Todd to consider a short term fix for the stable branch while
    Tristan starts work on a completely new Shadow that supports all
    Universes equally well for the development branch.

    No development work.
21) [condor-admin #17237] LIGO: remote file IO despite WantRemoteIO = FALSE

    No work.

22) [condor-admin #17159] LIGO: fcntl 64bit bug
    Todd suspects a 3rd argument call will work in the new glibc branch
    but that needs to be confirmed by Peter.

    Peter reported that this will not automatically be enabled by the
    new glibc port required for RHEL5 support and it was agreed this
    should be closed without any further work.

23) [condor-admin #17239] LIGO: condor_submit stuck in CPU spin-loop

    No work.

24) [condor-admin #17219] LIGO: stdout occasionally lost for
    jobmanager-condor
    Reported in Rust that this is indeed due to the same filesystem
    semantic assumptions as next item.

25) [condor-admin #17136] LIGO: condor_run intermittently returning NULL
    results

    Reported to be likely fixed for 7.0.1, but not worked on yet.

26) [condor-admin #17143] LIGO: ImageSize update problem
    Peter investigating what it would take to have the same level and
    frequency of reporting for the Standard Universe as for the Local.
    Corollary--if this is too much for the standard it is probably too
    much for vanilla and how do we scale back on large pools.

    On hold until the higher priority porting work by Peter is completed.

Note, please also see the posting to the condorligo mailing list for
additional comments from Steffen,
http://lists.aei.mpg.de/cgi-bin/mailman/private/condorligo/2008-February/000084.html
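
Addendum to item 18: the "all log files for 1 hour before the specified time" step could be prototyped roughly as below. This is only a sketch, not an existing Condor tool. The function names, the explicit year argument, and the assumption that each log entry starts with an "M/D HH:MM:SS" stamp (the same shape as the -event argument in the example command) are all hypothetical.

```python
import re
from datetime import datetime, timedelta

# Assumed (hypothetical) stamp layout: "M/D HH:MM:SS" at the start of a line,
# matching the shape of the -event argument shown in item 18.
STAMP = re.compile(r"^(\d{1,2})/(\d{1,2}) (\d{2}):(\d{2}):(\d{2})\b")


def parse_stamp(text, year):
    """Return a datetime for a leading "M/D HH:MM:SS" stamp, or None."""
    match = STAMP.match(text)
    if match is None:
        return None
    month, day, hour, minute, second = (int(g) for g in match.groups())
    # The stamp carries no year, so the caller has to supply one.
    return datetime(year, month, day, hour, minute, second)


def lines_before_event(lines, event, history_seconds, year):
    """Keep log lines stamped within history_seconds up to and including event.

    Lines without a stamp (e.g. continuation lines) follow the decision
    made for the most recent stamped line.
    """
    end = parse_stamp(event, year)
    if end is None:
        raise ValueError("event must look like 'M/D HH:MM:SS'")
    start = end - timedelta(seconds=history_seconds)

    selected = []
    keep = False
    for line in lines:
        stamp = parse_stamp(line, year)
        if stamp is not None:
            keep = start <= stamp <= end
        if keep:
            selected.append(line)
    return selected
```

For the example in item 18 this would be called as lines_before_event(log_lines, "12/6 14:59:06", 3600, year). A window that crosses a year boundary is not handled, since the stamp itself has no year.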