LIGO Condor telecon
January 18, 2008

Condor:
  Todd Tannenbaum
  Kent Wenger

LIGO Scientific Collaboration:
  Stuart Anderson
  Scott Koranda
  Carsten Aulbert (late)

Pegasus:

Condor-release
--------------

1) Status/schedule of RHEL5/CentOS5 full port.

   Pete has it working with the standard universe. Some unsurprising
   issues have cropped up with cross-platform (statically linked)
   binaries because gcc, glibc, and the like are not forward
   compatible. LIGO can get a snapshot for testing (by Duncan on the
   Syracuse cluster). It is a 7.0.1 pre-release.

2) 7.0 release status/schedule.

   7.0.0 should be on the web today. The source tarball will also be
   available. Red Hat has made a good start on packaging, but it is
   not complete enough yet to roll it back in.

   Red Hat MRG is a product, the first Red Hat release of Condor. Red
   Hat takes Condor, changes the packaging, and re-releases it. They
   send patches upstream.

   Condor will also show up in RHEL in the next major release, but
   that is probably a year or so away. Condor should, however, show up
   in the next Fedora Core.

   Steffen asked in email about Debian/Ubuntu, but Debian Etch is what
   we are more interested in.

Recent Condor issues
-----------------

3) [condor-support #2158] LIGO: multiple schedd core dumps

   6 crashes in 29 hours, and none in the last 2 weeks after removing
   some grid jobs that were on hold in the queue. The user was unable
   to re-submit the same jobs, so the 7.0.0 pre-release schedd was not
   tested for this bug. The problem appears to go away when jobs on
   hold are removed. Nick LeRoy was waiting for more information.

4) Set of inter-related 6.9.5 issues:

   a) condor_master core dumps after a few startd crashes (admin 17268)

      Waiting for Greg to mark this ticket resolved. Note, the last
      email exchange requesting this be closed does not show up in the
      ticket history itself--problem with rust? Looks closed now.

   b) condor_ckpt_server is left orphaned (admin 14515)

      Old ticket.
      If the master crashes it leaves the ckpt_server orphaned. It
      pre-dates daemonCore.

   c) subsequent condor_master restart gets stuck trying to restart
      ckpt server (admin 17266) after a workaround for (support 1750)
      is in place.

      See (b) above. LIGO uses a ckpt_server on each machine. In the
      long term, the hope is that Condor can work with a generic file
      storage tool, since the ckpt_server is just a rather crude file
      storage device (for checkpoint files only). LIGO would like more
      robust checkpointing. Maybe a workaround is possible in the
      meantime. Todd can look at this. In general, the team is trying
      not to spend a lot of time on the ckpt_server since it will go
      away.

5) [condor-admin #17291] LIGO: job on hold without a reason

   No update.

Condor-DAGMan
-------------

6) Stagger starting large DAGs that have expensive transient start up
   costs. Testing DAGMAN_SUBMIT_DELAY to solve this problem at UWM?

   No progress to report from Scott.

7) Confusion over plan to put some bug fixes in 7.0 and others only in
   7.1.

   Kent didn't want to change two branches. It has to do with the CVS
   mechanics. Stuart points out this is just a philosophical issue,
   not an immediate practical problem. From now on bug fixes go into
   the stable series. Kent can send Stuart the list of three bugs, and
   if Stuart thinks they are important they can be backported.

   Will the 7.1 branch of DAGMan have all the fixes? Yes.

Scheduled for post 7.0 development but perhaps some of this has started?
------------------------------------------------------------------------

8) [condor-admin #16010] LIGO: condor_hold shouldn't kill DAGMan

   Nobody has started this yet.

9) [condor-admin #15836] LIGO: enhancement request to dagman

   Merging of sub-dags for easier management by DAGMan and users. Pete
   is somewhere in the middle of that, but is also occupied with the
   standard universe port.

10) [condor-admin #15848] LIGO: enhancement request to dagman

    Problems with rescue DAGs when running with sub-DAGs.
    Partial "dir" fix has been confirmed to work.

    Kent is partway done with a setup that will automatically run the
    rescue DAGs at all levels (if the one at top level is run). He has
    started coding but has not completed it. This is driven by Onasys
    and also the LIGO/Pegasus work (in particular the iHope inspiral
    workflow). Probably not available until March due to travel.

Condor-C
--------

11) Any update on multi-homed condor-c setup at UWM.

    Scott worked with Jaime last week and made some progress. He will
    visit Madison again soon and try to finish.

Previous Condor issues
----------------------

12) [condor-admin #17168] LIGO: Shadow failures to connect to schedd

    Enhancing the standard universe the same as the vanilla universe.
    Still open question regarding adding to stable branch?

    Tristan is probably 4 weeks away from starting an effort to move
    towards the new shadow. The old standard universe shadow should
    disappear, and all the existing shadow code for non-standard
    universes will be made to handle standard as well. Todd will look
    to see if there is a short-term fix that might help.

13) [condor-admin #17237] LIGO: remote file IO despite WantRemoteIO = FALSE

    No update.

14) [condor-admin #17159] LIGO: fcntl 64bit bug

    How does this relate to the major glibc update for CentOS5
    support? Will it now be possible for users to add the extra
    argument to fcntl? Todd will have to ask Peter, though it seems
    like with the glibc update it should be doable.

15) [condor-admin #17239] LIGO: condor_submit stuck in CPU spin-loop

    No update.

16) [condor-admin #17219] LIGO: stdout occasionally lost for jobmanager-condor

    Relative timing between submit and execute machines for shared
    file system. fclose() on one doesn't mean the results are
    available immediately elsewhere.

17) [condor-admin #17136] LIGO: condor_run intermittently returning NULL results

    Related to above.
18) [condor-admin #17143] LIGO: ImageSize update problem

    Peter is investigating what it would take to have the same level
    and frequency of reporting for the Standard Universe as for the
    Local. Corollary--if this is too much for the standard it is
    probably too much for vanilla, and how do we scale back on large
    pools? Todd will look into it.

19) [condor-admin #17225] LIGO: schedd log file management

    Observations by Stuart. Todd doesn't recall why the structure is
    the way it is.

20) [condor-admin #16017] LIGO: condor_q analyze support for Local Universe

    "Target of opportunity"? It is a feature request. Next development
    in condor_q analyze is to get condor_better_analyze working on
    more platforms. Getting it working for local universe has not had
    any thought put into it.

Added by Carsten and Steffen:

    Issue of many-core architectures and how to scale the Bologna
    batch system?

    For the development series, the team is exploring allowing the
    startd to act as a pull workload manager; this should be in the
    first development release after 7.x. Derek will also work on
    dynamic slots for the startd. The startd would advertise a lump
    sum of available resources, and this would change dynamically as
    running jobs consume resources. Jobs can advertise how many cores
    they want (or are willing to be charged for). For now the
    advertised number will be assumed to be static and to be the
    maximum. If a job uses more resources than claimed, the startd
    could have a policy to suspend or preempt or whatever.

    What if a job needs a bunch of threads to start something, but
    later throttles back (Stuart asks)?

Delayed tasks for next development branch
-----------------------------------------

A) [condor-admin #15277] LIGO DAGMan spool directory efficiency.
   How to avoid making O(10^5-10^6) copies of executables.
B) [condor-admin #15669] LIGO RFE to optionally delete stdout/stderr
   files automatically

C) [condor-admin #14006] LIGO: append to stdout/err files on re-execution

D) [condor-admin #15287] LIGO X509 certificate management enhancement
   request.

E) [condor-admin #17092] LIGO: Local universe scheduling latencies

F) [condor-support #1663] LIGO: condor_submit problem and a couple of
   feature requests

   Several fixed-size buffer issues have been fixed, and the ball is
   in Duncan's court to re-initiate Parallel Universe issues if any
   remain.

G) [condor-support #1750] LIGO condor_starter core dumps

   Derek confirmed there are corner cases that would be nice to fix
   someday. Not currently a priority since it has not happened for a
   year, but when it did, there were 10k core dumps a day.