LIGO Condor telecon
November 30, 2007

Condor:
    Peter Keller
    Zach Miller
    Kent Wenger

LIGO:
    Stuart Anderson
    Scott Koranda
    Kipp Cannon
    Carsten Aulbert
    Duncan Brown (late)

Pegasus:
    Karan Vahi

0) Does the mail header for this mailing list now provide a valid return
   address?
    Yes, fixed.

VDT
---

1) Schedule for CentOS 5 x86_64 and Debian Etch support.
    Scott found that CentOS 5 isn't supported for x86_64?

2) [condor-admin #12696] Stupid question?
    rpath set to /home/condor for VDT build of GT

New Condor issues
-----------------

3) Carsten's question regarding scaling to 5000 slots with 4 submit machines
    Will have 1350 machines, each with 4 cores. Plan is to have 4 submit
    machines. 5 to 6 thousand is the current limit for a schedd with
    authentication; without it, can get to 10K.

4) [condor-admin #17239] LIGO: condor_submit stuck in CPU spin-loop
    Stuart sent in a simple test case. Ideally Condor would notice the
    infinite recursion and throw an error. Short of that, it would be nice
    if it did not require SIGKILL. Also, condor_rm isn't working when trying
    to remove the job (probably blocking SIGTERM and SIGQUIT).

5) [condor-admin #17237] LIGO: remote file IO despite WantRemoteIO = FALSE
    Appear to be getting remote IO even with WantRemoteIO = FALSE.
    Pete has not had time to look yet.

6) [condor-admin #17226] LIGO: shadow assertion error in pseudo_ops.C
    Not happening anymore; not sure whether it indicates a serious problem.
    Greg Quinn has the ticket. If a prerelease of 6.9.5 was in use, the file
    transfer wire protocol may not have been compatible. If running a
    legitimate 6.9.5, it should not be a problem. The file transfer change
    went through several iterations between 6.9.4 and 6.9.5.

7) [condor-admin #17225] LIGO: schedd log file management
    Assigned to Todd.

8) [condor-admin #17219] LIGO: stdout occasionally lost for jobmanager-condor
    Is this a race condition?
    Assuming that once a Condor daemon has closed a file on one machine, the
    log reports it and other things assume it is available from all machines
    (on a shared file system). Noticed with short-running example jobs.
    Karan wonders if this is something the Pegasus team has already seen?
    Not known, but could check. Stuart put in a quick fix using flock(), but
    it may be too heavy-handed?

New item:
---------

Kent's email regarding rescue DAGs and how LIGO would prefer they are handled.
    Kipp Cannon says there is already an existing rescue flag for DAGMan, so
    a new flag should have a different name. Another problem: because you
    can force the rescue DAG to have whatever name you want, DAGMan can't
    know what files to look for? Running a top-level rescue DAG does not run
    a lower-level rescue DAG; it runs the full DAG. Kipp likes Kent's basic
    idea; it just needs to be adjusted a bit.

New item:
---------

Peter's email about splicing together of DAGs.
    Problem is: will a human being or a higher-level program be writing
    them? To clarify, a splice is a part of a DAG, kept in a different file,
    that will be put into a larger DAG. The current method is clumsy because
    you have to run condor_submit_dag -nosubmit to generate a submit file for
    the sub-DAG; would like to be able to treat a sub-DAG like just another
    job in the larger DAG.

    Peter asks: if any of the nodes in a sub-DAG fail, do we want to restart
    just that node or the whole sub-DAG? Duncan just wants that one node to
    be rerun. Peter now sees more of the very fine detail on how Duncan
    wants things treated, so he can make some decisions now. Duncan wants
    this to be a "hash include", i.e. "#include".

    Kipp adds that solving this problem means that Kent's handling of the
    rescue DAGs would be solved, so do we need both? Stuart reminds us that
    Kent wanted to solve that anyway for non-LIGO users; Kent agrees.

    Peter asks about the case where the same splice is put multiple times
    into a DAG?
    Yes, Duncan says that could happen. Peter will mail back onto that
    thread with a proposed syntax and ask Duncan if that is what he is
    looking for.

Previous Condor issues
----------------------

9) [condor-admin #17205] LIGO: condor_rm cleanup of Local and Scheduler jobs
    Did the Local universe cleanup make it into 6.9.5?
    Trying to confirm with Greg Thain but not sure...

10) [condor-admin #17209] LIGO: reduce cost of catalog building
    Did this fix make it into 6.9.5? Minor issue about making 2 stat calls
    when could do one? This is on the execute machines. Nick was going to
    look at this. Doesn't look like it made it into 6.9.5.

11) [condor-admin #17168] LIGO: Shadow failures to connect to schedd
    Enhancing the standard universe the same as the vanilla universe.
    Also not fixed yet; on the "would be nice" list. Will it have to go into
    the next development branch? Not sure, could possibly be a bug fix.
    Assigned to Greg Quinn.

12) [condor-admin #17159] LIGO: fcntl 64bit bug
    Peter, Vladimir, have you closed the loop on this?
    Peter thought he had a workaround, but realized he needed to know if
    Vladimir was locking regions of a file or the whole file? If locking
    just the whole file, Vladimir can use flock() instead.

13) [condor-admin #17143] LIGO: ImageSize update problem
    Peter investigating what it would take to have the same level and
    frequency of reporting for the Standard Universe as for the Local.
    Corollary: if this is too much for standard, it is probably too much for
    vanilla, and how do we scale back on large pools?
    Peter hasn't had time to look at this.

14) [condor-admin #17136] LIGO: condor_run intermittently returning NULL results
    Discussion of updated flock patch that preserves semantics.
    Not done, but understood what is going on.

15) [condor-support #1789] LIGO: Condor re-running old jobs
    Derek was going to investigate whether the remaining corner cases
    mentioned in this resolved ticket have been resolved in 6.9.4.
    Zach confirmed this is completely resolved.

16) [condor-support #1750] LIGO condor_starter core dumps
    Derek to confirm that items a) and b) have been taken care of so this
    ticket can be closed?
    A lot more work to fix for a somewhat minor corner case, so not resolved.
    Stuart: when it happened, got 10K core dumps a day, so not so minor.
    Zach checked with Derek; he says someday it would be nice to fix.
    Stuart says it happened a year ago but has not happened again.

17) [condor-support #1663] LIGO: condor_submit problem and a couple of feature requests
    condor_submit segfaults on too long of an expression fixed.
    Several parallel job RFEs by Duncan.
    That was donkey's years ago, Duncan says. There is an effort to go
    through and fix fixed-size buffer issues. Peter might have accidentally
    fixed this when just cleaning code. Kipp adds he has been running jobs
    recently with really long class ads and has not seen a problem. Duncan
    had some other requests in that ticket having to do with running
    parallel jobs. After his cluster is up he will check on those again.

18) [condor-admin #16017] LIGO: condor_q analyze support for Local Universe

Condor-DAGMan
-------------

19) Stagger starting large DAGs that have expensive transient start up costs.
    Any update? Scott, have you tested DAGMAN_SUBMIT_DELAY to solve this
    problem?
    No update from Scott.

20) [condor-admin #15811] LIGO: intra-DAG node prioritization and throttling
    Coloring nodes within a DAG for priority and "maxjobs".
    An updated pre-release has been installed at CIT for testing.
    Steve, is this still good to go for the next stable branch from LIGO's
    perspective?
    This is working for LIGO now using the actual 6.9.5 binaries, and it
    just worked.

Condor-C
--------

21) Any update on multi-homed condor-c setup at UWM with 6.9.4.
    Going to upgrade to 6.9.5 next week and then try at that time.

Condor-misc
-----------

22) Status/schedule of RHEL5/CentOS5 full port.
    c++ runtime library and thread handling for exceptions? other issues?
    Peter says there was a bad showstopper, so he has to do a full port now
    and that will take a while. Still hoping for two months or less from
    today. Still working on it.
    Are the limitations limited to only C++, Stuart asks? Peter says no. The
    compiler does something forcing thread local storage registers. If you
    use any C++, all this becomes active. So LIGO could still compile on FC4
    and then run on a CentOS5 pool. DMT does use C++. Root is not using
    standard universe.

23) 6.9.5 release status/schedule.
    Released!

New item:
---------

Ticket 14576: want to keep failing jobs from thrashing the pool; did that
make it into 6.9.5?
    4 new job class ad counters, but can't find them in the documentation.
    Version history has the names but no information, so need updated docs.
    17258 is the ticket on this documentation problem.

SCHEDD_QUERY_WORKERS? What are the queries talking about there? What is this
option really?
    Schedd forks a thread to handle condor_q. This controls the number of
    processes allowed simultaneously to do this.

Delayed tasks for next development branch
-----------------------------------------

A) [condor-admin #15277] LIGO DAGMan spool directory efficiency.
    How to avoid making O(10^5-10^6) copies of executables.

B) [condor-admin #15669] LIGO RFE to optionally delete stdout/stderr files automatically

C) [condor-admin #14006] LIGO: append to stdout/err files on re-execution

D) [condor-admin #15287] LIGO X509 certificate management enhancement request.

E) [condor-admin #17092] LIGO: Local universe scheduling latencies

F) [condor-admin #16010] LIGO: condor_hold shouldn't kill DAGMan

G) [condor-admin #15836] LIGO: enhancement request to dagman
    Merging of sub-dags for easier management by DAGMan and users.

H) [condor-admin #15848] LIGO: enhancement request to dagman
    Problems with rescue DAGs when running with sub-DAGs.
    Partial "dir" fix has been confirmed to work.