LIGO Condor telecon
January 4, 2008

Condor:  Nick LeRoy, Zach Miller, Alain Roy, Kent Wenger
LIGO:    Stuart Anderson
Pegasus: Karan Vahi

VDT
---

1) Discussion of dropping FC4 support for the next release on the
   ~March timescale. Note, this is tied to the Condor port schedule
   for CentOS 5.
   Looks reasonable, but Alain will track the Condor port status and
   Stuart will double check with the rest of the LIGO collaboration.

VDT-CONDOR
----------

2) Distributing Condor in / rather than /opt/condor and impact on VDT.
   The default installation location will not be changed for the 7.0.0
   release. Instead, RedHat will take the Condor RPM, make whatever
   changes they think are reasonable for integration with RHEL, and
   send the changes back to the Condor team for consideration in a
   subsequent release. However, the Condor RPMs should be fully
   relocatable, so there is no need for LIGO to modify/rebuild the
   existing RPMs anyway.

New Condor issues
-----------------

3) [condor-admin #17339] LIGO: schedd core dump in
   GridUniverseLogic::StartOrFindGManager
   Should be closed as a precursor of the following support ticket.

4) [condor-support #2158] LIGO: multiple schedd core dumps
   6 crashes in 29 hours, and none in the last 39 hours after removing
   some grid home that were on hold in the queue. Nick and Zach will
   look into providing Caltech with a 7.0.0 pre-release to see if this
   problem still exists in that code base, given all the memory issues
   fixed since the 6.9.5 release with the Coverity tools. If the
   problem still exists, it may be desirable to hold off on the 7.0.0
   release until this is fixed.

Recent Condor issues
--------------------

5) Set of 4 inter-related 6.9.5 issues:
   a) condor_startd crashes on startup (support 2146)
      This was fixed in a post-6.9.5 binary patch that has been
      running stably at Caltech.
   b) condor_master core dumps after a few startd crashes (admin 17268)
   c) condor_ckpt_server is left orphaned (admin 14515)
   d) subsequent condor_master restart gets stuck trying to restart
      ckpt server (admin 17266) after a workaround for (support 1750)
      is in place.
   No action on b-c).

6) [condor-admin #17291] LIGO: job on hold without a reason
   No information.

Condor-DAGMan
-------------

7) Stagger starting large DAGs that have expensive transient startup
   costs. Testing DAGMAN_SUBMIT_DELAY to solve this problem at UWM?
   No information.

Previous Condor issues
----------------------

8) [condor-admin #17239] LIGO: condor_submit stuck in CPU spin-loop

9) [condor-admin #17237] LIGO: remote file IO despite
   WantRemoteIO = FALSE

10) [condor-admin #17168] LIGO: Shadow failures to connect to schedd
    Enhancing the standard universe the same as the vanilla universe.
    Still open question regarding adding to stable branch?

11) [condor-admin #17219] LIGO: stdout occasionally lost for
    jobmanager-condor

12) [condor-admin #17136] LIGO: condor_run intermittently returning
    NULL results

13) [condor-admin #17159] LIGO: fcntl 64bit bug
    How does this relate to the major glibc update for CentOS5 support?

14) [condor-admin #17143] LIGO: ImageSize update problem
    Peter investigating what it would take to have the same level and
    frequency of reporting for the Standard Universe as for the Local.
    Corollary: if this is too much for the standard universe, it is
    probably too much for vanilla, and how do we scale back on large
    pools?

15) [condor-admin #17225] LIGO: schedd log file management

16) [condor-admin #16017] LIGO: condor_q analyze support for Local
    Universe

No information on 9-16).

Condor-C
--------

17) Any update on multi-homed condor-c setup at UWM?
    No information.

Condor-misc
-----------

18) Status/schedule of RHEL5/CentOS5 full port. Still estimating
    mid-Jan?
    Significant progress continues to be made and checkpointing in
    Condor is now working (but not in standalone mode).

19) 7.0 release status/schedule.
    New release candidate will be installed on UW pools on Monday and,
    if all goes well, be released as 7.0.0 next Wed (1/9). However,
    see comments regarding agenda item 4) above.

Delayed tasks for next development branch
-----------------------------------------

A) [condor-admin #15277] LIGO DAGMan spool directory efficiency.
   How to avoid making O(10^5-10^6) copies of executables.

B) [condor-admin #15669] LIGO RFE to optionally delete stdout/stderr
   files automatically

C) [condor-admin #14006] LIGO: append to stdout/err files on
   re-execution

D) [condor-admin #15287] LIGO X509 certificate management enhancement
   request.

E) [condor-admin #17092] LIGO: Local universe scheduling latencies

F) [condor-admin #16010] LIGO: condor_hold shouldn't kill DAGMan

G) [condor-admin #15836] LIGO: enhancement request to dagman
   Merging of sub-dags for easier management by DAGMan and users.

H) [condor-admin #15848] LIGO: enhancement request to dagman
   Problems with rescue DAGs when running with sub-DAGs.
   Partial "dir" fix has been confirmed to work.

I) [condor-support #1663] LIGO: condor_submit problem and a couple of
   feature requests
   Several fixed-size buffer issues have been fixed and the ball is in
   Duncan's court to re-initiate Parallel Universe issues if any
   remain.

J) [condor-support #1750] LIGO condor_starter core dumps
   Derek confirmed there are corner cases that would be nice to fix
   someday. Not currently a priority since it has not happened for a
   year, but when it did, there were 10k core dumps a day.