LIGO Condor telecon
December 14, 2007

Condor: Kent Wenger, Peter Keller, Zach Miller
LIGO: Stuart Anderson, Scott Koranda

VDT
---

1) Any remaining issues regarding LIGO requests and their relative priority?
   Note, there are two independent threads to this:
   a) LIGO production grid (LDG: LIGO Data Grid) discussed in this forum
   b) LIGO research into using OSG discussed in the LIGO-Pegasus call and
      other OSG forums.
   Alain not on call. See email list for an earlier response.

VDT-CONDOR
----------

1.5) Distributing Condor in / rather than /opt/condor and impact on VDT.
     When will the transition be made so that RPM installs go into /usr or
     something other than /opt?
     Nobody on call has info.

New Condor issues
-----------------

2) Set of 4 inter-related 6.9.5 issues
   a) condor_startd crashes on startup (support 2146)
   b) condor_master core dumps after a few startd crashes (admin 17268)
      No patch yet.
   c) condor_ckpt_server is left orphaned (admin 14515)
      Nick LeRoy is looking into it; not a high priority if other issues
      are fixed. Thinking of rewriting it from scratch, but that will not
      happen for a while. For now the old code base will just be evolved.
   d) subsequent condor_master restart gets stuck trying to restart ckpt
      server (admin 17266) after a workaround for (support 1750) is in place.
      Zach will ping Dan and have him send a patch to Stuart.

3) [condor-admin #17283] LIGO: condor_procd failure
   Plan is to avoid GID tracking for now since this enables PROCD, which is
   not yet as stable as the old scheme.
   Tried using GID tracking but this exposed the bug, so LIGO will not try
   to use it for now. Will go ahead with 7.0 even if procd has some existing
   bugs. Procd is necessary for the privilege separation features.

4) [condor-admin #17291] LIGO: job on hold without a reason
   Sounds like a bug. Assigned to Todd right now? Stuart will leave the jobs
   in the queue for a while, though Zach doesn't think they will need them.

5) [condor-admin #17258] LIGO: New 6.9.5 job ClassAd counters missing from
   documentation
   Closed out. Documentation was updated.

Condor-DAGMan
-------------

6) Update on discussion of DAGMan development from last week:
   * Kent's idea of deterministic rescue dag naming for automatic discovery
   * Peter and Duncan in agreement on syntax for splicing?
   No objections to post by Kent on condor-users. Doubtful many users
   messing around with the naming conventions for rescue dags.
   Peter has what he needs and the work is about 1/3 finished in the code
   base. He has been focused on RH5 work.

7) Stager starting large DAGs that have expensive transient start up costs.
   Testing DAGMAN_SUBMIT_DELAY to solve this problem at UWM?
   No update from Scott.

8) [condor-admin #15811] LIGO: intra-DAG node prioritization and throttling
   Now deployed across the LDG, close ticket?
   Can close this out. It is being used and tested.

Previous Condor issues
----------------------

9) [condor-admin #17239] LIGO: condor_submit stuck in CPU spin-loop
   No update.

10) [condor-admin #17237] LIGO: remote file IO despite WantRemoteIO = FALSE
    No update.

11) [condor-admin #17168] LIGO: Shadow failures to connect to schedd
    Enhancing the standard universe the same as the vanilla universe
    Still open question regarding adding to stable branch?
    No update.

12) [condor-admin #17219] LIGO: stdout occasionally lost for jobmanager-condor
    No update.

13) [condor-admin #17136] LIGO: condor_run intermittently returning NULL
    results
    Discussion of updated flock patch that preserves semantics.

14) [condor-admin #17205] LIGO: condor_rm cleanup of Local and Scheduler jobs
    Waiting for Greg Thain to close.
    This ticket can be closed.

15) [condor-admin #17159] LIGO: fcntl 64bit bug
    Peter, Vladimir have you closed the loop on this?
    Nothing back from Vladimir. Scott will ping him.

16) [condor-admin #17143] LIGO: ImageSize update problem
    Peter investigating what it would take to have the same level and
    frequency of reporting for the Standard Universe as for the Local.
    Corollary--if this is too much for the standard it is probably too much
    for vanilla and how do we scale back on large pools.
    No update.

17) [condor-admin #17226] LIGO: shadow assertion error in pseudo_ops.C
    Should be closed as un-reproducible?
    We can close this out and open a new one if something similar happens
    again.

18) [condor-admin #17225] LIGO: schedd log file management
    No update.

19) [condor-admin #17209] LIGO: reduce cost of catalog building
    If this made it into some development branch this can be closed.
    Closed.

20) [condor-admin #16017] LIGO: condor_q analyze support for Local Universe
    This is low priority.

Condor-C
--------

21) Any update on multi-homed condor-c setup at UWM with 6.9.4.
    No update.

Condor-misc
-----------

22) New LIGO pool up and running at Syracuse with CentOS 5 clipped port.

23) Status/schedule of RHEL5/CentOS5 full port.
    Still estimating 2 Months - fortnight for full glibc update?
    Got it linking with glibc 2.5 (standard for RH5) and using the default
    compiler that comes with RH5, so next step is the checkpointing and
    remote I/O. "Not a crusty port but an actual port".
    Maybe two weeks until finished? (Two business weeks). So maybe 4 weeks
    due to holidays. "Really cruising along...".

24) 7.0 release status/schedule.
    Still on track for end of 2007. Running on UW pool with priv sep, no
    showstoppers.

Delayed tasks for next development branch
-----------------------------------------

A) [condor-admin #15277] LIGO DAGMan spool directory efficiency.
   How to avoid making O(10^5-10^6) copies of executables.

B) [condor-admin #15669] LIGO RFE to optionally delete stdout/stderr files
   automatically

C) [condor-admin #14006] LIGO: append to stdout/err files on re-execution

D) [condor-admin #15287] LIGO X509 certificate management enhancement request.

E) [condor-admin #17092] LIGO: Local universe scheduling latencies

F) [condor-admin #16010] LIGO: condor_hold shouldn't kill DAGMan

G) [condor-admin #15836] LIGO: enhancement request to dagman
   Merging of sub-dags for easier management by DAGMan and users.

H) [condor-admin #15848] LIGO: enhancement request to dagman
   Problems with rescue DAGs when running with sub-DAGs.
   Partial "dir" fix has been confirmed to work.

I) [condor-support #1663] LIGO: condor_submit problem and a couple of
   feature requests
   Several fixed-size buffer issues have been fixed and the ball is in
   Duncan's court to re-initiate Parallel Universe issues if any remain.

J) [condor-support #1750] LIGO condor_starter core dumps
   Derek confirmed there are corner cases that would be nice to fix someday.
   Not currently a priority since it has not happened for a year, but when
   it did, there were 10k core dumps a day.

Other
-----

Kent will let us know when there is a pre-release DAGMan that is "worth
looking at". That has worked out well for LIGO, so comfortable with that
approach.