LIGO Condor telecon May 09, 2008

Condor:
Kent Wegner
Peter Keller
Todd Tannenbaum

LIGO Scientific Collaboration:
Stuart Anderson
Greg Mendell
Scott Koranda
Duncan Brown

Friday May 9 at 11AM PDT/1PM CDT.
1-800-704-9896 (562487#).

VDT/LDG
-------

1) Any issues from VDT 1.10 testing via LDG? Schedule for next VDT release for OSG 1.0?
Email from Alain: The VDT 1.10.1 release will hopefully be Monday, otherwise Tuesday.

Condor-DAGMan
-------------

2) [condor-admin #15848] LIGO: enhancement request to dagman
Do we have a consensus on all the details associated with automatic rescue DAG files? If not, what remains to decide?
Conclusion agreed to: the -f flag to condor_submit_dag disables automatic running of rescue DAGs if they exist. Any existing rescue DAGs should be deleted, or at least renamed and moved out of the way.
Duncan would like DAGMan, when automatically running a rescue DAG, to write to foo.dag.rescuedag.dagman.out/log, but DaemonCore prevents this right now? The problem is that the desired file name depends on things learned only after DAGMan is already running, combined with the command-line arguments. Todd wonders where you log things until it has been figured out where to log things. In DaemonCore there are callbacks that happen earlier, at various points, to deal specifically with issues like this, but Kent is not sure one can get to the command-line arguments at that time. Todd thinks one might be able to, though they will have to be parsed. Kent will look into this further.
Kent asks Duncan how important this is. Duncan thinks it would be nice, but it is much more important to have the right handling of rescue DAGs.

3) [condor-admin #17531] LIGO: condor_dagman startup performance limited by DAGMAN_LOG
Priority increased after realizing this is still an issue for DAGs put on hold and then released.
Question: have we confirmed that the slowness is in writing the .out file rather than in the parsing?
Yes, writing to a local file system is much faster. Same old "NFS sucks" problem. Does the file have to be closed after every line is printed? Have to look at the dprint stuff and optimize.

4) [condor-admin #16010] LIGO: condor_hold shouldn't kill DAGMan
Formerly of reduced priority given the initial fix for 17531, but since 17531 still shows a very slow restart on condor_release, this is back to moderate priority.
The process gets killed. The reason has to do with how one communicates with DAGMan under Windows... other things might have changed since. Hold should not mean kill, so Todd thinks a solution needs to be found for this.

5) [condor-admin #15836] LIGO: enhancement request to dagman
Merging of sub-DAGs for easier management by DAGMan and users.
Peter is writing a small collection of tests for this and is finding small bugs, but he just needs to document this and it will go into trunk in the middle of next week.

6) [condor-admin #17526] LIGO: dagman crashes when strace'd.
Has this been reproduced on CentOS 5?
No update yet. DAGMan dies with a segmentation fault. This is hard to debug and will take real time.

Condor-general
--------------

7) [condor-admin #17975] LIGO: CentOS 5 condor jobs are not checkpointing
With 7.0.1 at Syracuse the standard universe jobs do not seem to checkpoint. The starter on the node gets the signal, but the process does not seem to respond; it just carries on, then gets the SIGKILL and goes away. Can try running it by hand and sending the signal, but need to review setarch first or the job will segfault. Duncan to also send ShadowLog.

8) [condor-admin #17983] LIGO: stderr not captured in 7.0.1 standard universe
Tried a 'hello world' from Jamie and that worked, so Duncan will try again with the LIGO code and see what happens. Will look into it further. Duncan to send ShadowLog.

9) Any problems reported in the wider condor universe that may be of interest to LIGO?
None with standard universe. Quill is being trail blazed by Purdue.
They are finding issues on a 7000 core pool, so hopefully in a few weeks it will be more scalable and robust. Most of the fixes will go into the stable branch, since they are scaling fixes rather than new features. Kipp wants to use Quill to replace the BOSS database.
Todd reports on starting to do submission via database. An inserted row in a database results in a submission, and when the job completes Condor submits another row.
Started work on number 14 below. Have been getting useful comments.
In the next developer release the startd will be able to pull work from wherever, and not just receive work from the schedd and the shadow.

10) Status/Issues associated with standing up a new 5000 core Condor pool at Hannover?

11) [condor-admin #17748] LIGO: noop assertion error
No update.

12) Interesting ideas from Condor Week.
See above for item (9).

13) LIGO condor-c and GT4 progress.
Probably want to have the schedd report to two collectors instead of just the normal collector; the second would be a collector that holds all the schedds for the LIGO Data Grid. Scott will revert to the previous production configuration and send copies of the existing configuration files to Todd. Scott and Todd will plan the next visit to continue.

* Call ended here due to running out of time.

Previously delayed tasks for next development branch
----------------------------------------------------

14) [condor-admin #15277] LIGO DAGMan spool directory efficiency. How to avoid making O(10^5-10^6) copies of executables.
15) [condor-admin #15287] LIGO X509 certificate management enhancement request. Any impact on this based on recent LIGO draft plans for using myproxy?
16) [condor-admin #17168] LIGO: Shadow failures to connect to schedd. Todd to consider a short-term fix for the stable branch while Tristan starts work on a completely new Shadow, for the development branch, that supports all Universes equally well.
17) [condor-admin #15669] LIGO RFE to optionally delete stdout/stderr files automatically.
18) [condor-admin #14006] LIGO: append to stdout/err files on re-execution.
19) How to avoid thousands of jobs "flushing through a funky node"? E.g., if a job fails on a slot, consider automatically releasing the machine claim for that user? Hoping for Todd's example startd policy expression to black-list individual users on "black hole machines".
20) [condor-admin #17219] LIGO: stdout occasionally lost for jobmanager-condor
21) [condor-admin #17136] LIGO: condor_run intermittently returning NULL results