LIGO Condor telecon
October 18, 2007

Condor: Peter Keller, Todd Tannenbaum, Zach Miller, Kent Wegner
LIGO: Stuart Anderson, Scott Koranda

0) Is the new Condorligo@aei.mpg.de mailing list working?

   Yes.

VDT
---

1) Schedule for 1.8.1b patch to support conditionally using native TCL?

   No update.

New Condor points of interest
-----------------------------

2) [condor-support #2123] LIGO: Corrupt file output in the Standard
   Universe:

   a) Many thanks to Peter for such a quick turnaround on a fully tested
      patch, so we did not have to revert our upgraded Condor pools.

   b) However, why did it take 3 LIGO users reporting a problem over a
      one-month period before we opened a Condor ticket? Did we let down
      our users?

      Wrong judgment call in weighing whether to suspect Condor.

   c) Impact and status for GEO clusters?

      They have patched.

   d) Is LIGO really unique in writing binary files with 6.9.3-4 in the
      Standard Universe?

      Only 20% of sites using Unix use the standard universe, and only a
      percentage use the developer release.

   e) Status of providing updated Inspiral regression test that performs
      binary file output.

      Want to get a newer version that catches this problem. LIGO is
      working on this.

   f) What is the schedule for 6.9.5, i.e., should we continue to deploy
      and patch 6.9.4 on additional clusters or wait for 6.9.5?

      Should be code-complete next week and on the web a week after
      that.

   g) Any useful RFEs for built-in Condor data integrity checks?

      Can't think of anything clever at this time. A very general
      question. CERN has done some studies and shown that for very large
      amounts of data processing the number of silent errors is larger
      than suspected. Wondering what role Condor can play in helping
      with this problem?

3) Managing multi-threaded applications.
   Peter, what was learned about current Condor capabilities?

   Peter did look. One of the team members had thought about this and
   wrote up a document about how it could be done.
   The particular idea is that when a job is submitted, an attribute is
   put into the job ad and the start policy is adjusted so that only the
   first slot can start these jobs, and the other slots can start jobs
   or not based on other exported attributes. This document could be
   cleaned up and then distributed to the list; Peter is going to do it.
   There are plans to put in more control after the next development
   cycle is over.

4) [condor-users 10 Sep 2007 email] Local universe scheduling
   Peter, what did you discover about whether the Negotiator really
   needs to be involved in scheduling these jobs?

   Zach reports this is an annoyance (so not a feature). His testing has
   verified that jobs will wait the length of a negotiation cycle. Will
   look and see if this annoyance can be removed soon. Zach shows that
   if a lot of jobs are being submitted you won't see this latency, only
   when you submit a giant batch at just the "wrong" time. It is not how
   full the queue is, but how often the submits happen. Jobs only get
   triggered on the next external event (like the next submit). Still
   annoying, and will talk about how to fix it.

NEW ITEM: ticket 15795: Stuart is still seeing this although it is
   marked fixed. Stuart would like it confirmed that the fix did not
   make 6.9.4.

   These were shadow fixes to keep the shadows from making so many
   stat() calls. "Shadow bloat fixes" is how Peter refers to them. Turns
   out they missed 6.9.4. Will have to wait until 6.9.5.

Condor-DAGMan
-------------

5) Stagger starting large DAGs that have expensive transient start-up
   costs.
   Is there a ticket open yet for automatic DAGMan staggered crontab
   start?

   Did some investigating. One thing that came up, depending on how
   smart a solution is needed: did a test combining job start deferral
   with random choice in the submit file, so that the jobs do not
   actually start up for a random amount of time (within certain
   limits). The limitation is that one has to profile the deferral times
   according to how many of these jobs one thinks there are. This would
   slow down the entire pool?
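
The sizing limitation of the random-deferral approach can be put in numbers. The sketch below is only an illustration in Python, assuming the goal is to keep the average start rate under some limit. The function names, the job count of 500, and the limit of 20 starts per minute are all hypothetical. It models the arithmetic only and is not Condor submit-file syntax or any Condor API.

```python
import random


def deferral_window_seconds(num_jobs, max_starts_per_minute):
    """Smallest window (in seconds) over which num_jobs, spread evenly,
    would start at an average rate no higher than max_starts_per_minute."""
    # Ceiling division using integers only.
    return -(-(num_jobs * 60) // max_starts_per_minute)


def random_deferrals(num_jobs, window_seconds, seed=None):
    """Draw one random start delay per job, uniform over the window.

    Returns the delays sorted, i.e. in the order the jobs would start.
    """
    rng = random.Random(seed)
    return sorted(rng.randint(0, window_seconds) for _ in range(num_jobs))


if __name__ == "__main__":
    # Hypothetical numbers, chosen only for illustration.
    jobs = 500
    window = deferral_window_seconds(jobs, max_starts_per_minute=20)
    delays = random_deferrals(jobs, window, seed=1)
    print("window (s):", window)
    print("last job starts after (s):", delays[-1])
```

With these assumed numbers the window is 1500 seconds (25 minutes). The window grows linearly with the number of jobs, which is why the deferral limits have to be re-tuned whenever the job count changes.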
   Really need to have a throttle on nodes in the startup phase. What
   about just (for now) limiting the rate at which certain jobs in a DAG
   can be submitted? Kent is going to think about that a little more.
   Internally DAGMan has submission cycles and the rate would have to be
   controlled across these cycles, so this may be a bit tricky.

6) [condor-admin #15811] LIGO: intra-DAG node prioritization and
   throttling
   Coloring nodes within a DAG for priority and "maxjobs"
   Any feedback from initial LIGO (or other) testers?

   No feedback yet. Kent mentions an idea Miron had: assigning a weight
   to nodes so that, rather than just saying maxjobs = 10, different
   nodes could count differently against a resource. Doesn't seem hard
   to implement, and Stuart agrees it could be useful. Kent has a PR for
   it but doesn't know the number offhand. Few people outside of LIGO
   know about this feature.

7) [condor-admin #15836] LIGO: enhancement request to dagman
   Merging of sub-dags for easier management by DAGMan and users.

   Peter has not gotten to it yet. One question: when a sub-DAG is
   requested to be merged, do we want a phantom parent and child, or
   should it be spliced directly into the parent DAG so that phantom
   nodes do not exist? No impact from phantoms. Need input from Duncan
   and Steve, so Peter will email the list and ask.

8) [condor-admin #15848] LIGO: enhancement request to dagman
   Problems with rescue DAGs when running with sub-DAGs.
   "dir" command not making it into rescue command.

   A partial fix, if not in 6.9.4, will be in 6.9.5. Still need more
   work to really fix the problem. Later we hear it did make it into
   6.9.4.

9) [condor-admin #16010] LIGO: condor_hold shouldn't kill DAGMan

   No action.

Condor-C
--------

10) Any update on multi-homed condor-c setup at UWM with 6.9.4?
    Note, the Caltech pool is also now running 6.9.4.

    No update.

Condor-misc
-----------

11) 6.9 documentation on upgrading Quill is insufficient, any progress
    on a migration guide?
    LIGO may have found a need to re-enable Quill even with the improved
    schedd performance (to replace BOSS in Onasys).

    Have put a lot of work into documentation in the manual. Have worked
    with newbie users and iterated again. A snapshot is up on the web
    site in place for the 6.9.x manual (updated just an hour or so ago).
    The instructions are "complete", though they could still use more
    detail. All needed variables/macros are now documented.

12) Status of RHEL5/CentOS5 port (clipped and full)?

    The clipped part is pretty much working; should be another few
    days, and then on to the standard universe part. That will probably
    take 4 weeks.

13) [condor-admin #15277] LIGO DAGMan spool directory efficiency.
    How to avoid making O(10^5-10^6) copies of executables.

    On the roadmap for the development release after the 6.9.x branch.

14) [condor-admin #15669] LIGO RFE to optionally delete stdout/stderr
    files automatically

    Same.

15) [condor-admin #14006] LIGO: append to stdout/err files on
    re-execution

    Same.

16) [condor-admin #15287] LIGO X509 certificate management enhancement
    request.

    Same.