LIGO Condor telecon November 02, 2007

Condor:
Kent Wegner
Todd Tannenbaum
Peter Keller
Zach Miller

LIGO:
Stuart Anderson
Scott Koranda

VDT
---

1) VDT and PYTHONPATH issue now reported as VDT-Environment package
line 26-28, http://vdt.cs.wisc.edu/vdt_181_cache/VDT-Environment.pacman

2) Schedule for 1.8.1b patch to support conditionally using native TCL?

3) OS X support for swig

No principals available, so no updates.

New Condor points of interest
-----------------------------

4) [condor-admin #17159] LIGO: fcntl 64bit bug

Peter has just looked at the ticket and has a suspicion of what is
wrong; he will get to it soon.

5) [condor-admin #17147] LIGO: Incorrect DAG execution when file server rebooted

After a file server reboot, Condor reported a job as completed and
DAGMan moved on, but the user claims the job had not really finished.
Kent looked, and it is not really a DAGMan error: an error in Condor
makes the log say the job completed successfully. The starter made a
call to rename a file and hit an error, but still exited saying
everything was OK? The "failed to rename" messages refer to core files;
the starter always checks for a core file. This happened over NFS, at a
time when the file server was stuck and got rebooted. It looks like the
job did run for over an hour. Is there really nothing wrong? But there
was an NFS hang... The job was a Python script, and the Python
interpreter does not return 0 on an unhandled exception.

6) [condor-admin #17143] LIGO: ImageSize update problem

These are standard universe jobs, and the Caltech checkpoint interval
is 24 hours. At initial submission the imagesize is the size of the
executable. The imagesize is only updated on checkpoint, and because
the periodic checkpoint interval is so long it never gets updated. The
policy on imagesize may be honored, but that is not reflected in the
log file; would have to look in detail. Please confirm that the starter
is honoring the policy on a 5-minute refresh.
It might be nice to have Condor get the information back so that
condor_q can display it, but that would be a lower-priority enhancement
request, if it is not too hard.

7) [condor-admin #17136] LIGO: condor_run intermittently

Some users like to run it. It assumes a shared file system, and that
when the shadow writes to the log file the job's output file is also
immediately available. Sent in a couple-line patch to condor_run. Not
sure if it is OK to change the semantics, but at the very least there
could be a command-line argument.

8) [condor-admin #17092] LIGO: Local universe scheduling latencies

Zach, any update on what it would take to avoid blocking on the
Negotiator?

Pretty well understood. DAGMan sends the jobs in bursts, but the
effective throughput was substantially slowed down. The only workaround
is to run condor_reschedule periodically to trigger the scheduling.
Will consider fixing for a later version; expect something to happen in
the next development series.

9) [condor-support #1678] LIGO condor-6.8.1 schedd exit status 4

What is the current recommendation for MAX_PID_COLLISION_RETRY in 6.9?

The need for this in 6.9.x is not as high because the schedd is better
able to keep up, but the fundamental problem is still there. So keep
the option for now.

10) [condor-support #1789] LIGO: Condor re-running old jobs

Have the remaining corner cases mentioned in this resolved ticket been
resolved in 6.9.x?

Needed to ask Derek. Derek doesn't know; he will have to look back and
see if and how it was fixed.

Condor-DAGMan
-------------

11) Stager starting large DAGs that have expensive transient start up costs.

Any update? No updates, just wanted to ask if a DAGMan submit delay
would help. Yes, probably a bit useful, Scott says. Right now it would
apply to every node in a DAG, but it could be tied to colored nodes in
a DAG.

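Note on item 8: the condor_reschedule workaround could be scripted
roughly as below. This is only a sketch, not something discussed on the
call. The function name, the interval, and the number of rounds are
hypothetical; the only real command it invokes is condor_reschedule,
run with no arguments.

```python
import subprocess
import time


def reschedule_periodically(interval_s, rounds,
                            run=subprocess.call, sleep=time.sleep):
    """Run condor_reschedule `rounds` times, pausing between calls.

    `run` and `sleep` are injectable so the loop can be exercised
    without a Condor installation (hypothetical helper, not part of
    Condor).
    """
    exit_codes = []
    for i in range(rounds):
        # Ask the schedd to request a new negotiation cycle.
        exit_codes.append(run(["condor_reschedule"]))
        if i + 1 < rounds:
            sleep(interval_s)
    return exit_codes


# Example (assumed values): once a minute for ten minutes.
# reschedule_periodically(interval_s=60, rounds=10)
```

A cron entry calling condor_reschedule directly would do the same job;
the loop form is shown only because it can be tied to the lifetime of a
particular DAG.
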
12) [condor-admin #15811] LIGO: intra-DAG node prioritization and throttling

Coloring nodes within a DAG for priority and "maxjobs".

Steve, Duncan, any feedback on testing this?

No feedback from Steve and Duncan.

13) [condor-admin #15836] LIGO: enhancement request to dagman

Merging of sub-DAGs for easier management by DAGMan and users.

Steve, Duncan, any feedback on Peter's request for comment on whether
or not to use phantom nodes to accomplish this?

No feedback from Steve and Duncan.

14) [condor-admin #15848] LIGO: enhancement request to dagman

Problems with rescue DAGs when running with sub-DAGs.

Steve, have you confirmed/started using the partial "dir" command fix?

No feedback from Steve and Duncan. Kent will try to get a new DAGMan
out to Stuart.

15) [condor-admin #16010] LIGO: condor_hold shouldn't kill DAGMan

Not in 6.9.5, so pushed into the next development branch (7.1 or
whatever that turns out to be).

Condor-C
--------

16) Any update on multi-homed condor-c setup at UWM with 6.9.4?
Note, all US LIGO pools are now running 6.9.4.

No update.

Condor-misc
-----------

17) Status of RedHat/Condor collaboration.

See below. Todd cannot talk too much about RedHat release plans.

18) Status of RHEL5/CentOS5 port (clipped and full).

Checked in the clipped port. Slowly moving forward on the full support;
working on it.

Requirement for stable release? No, not a requirement, but it doesn't
have to wait for the next development release since it is a "port" to a
stable release.

19) 6.9.5 release status/schedule.

Slipped a bit, but hopefully not too much. Confident of having a stable
series by the end of the year. Todd is distracted by some other things,
and that is holding up the next development series.

A change to the license is coming, prompted by the partnership with
RedHat. The license will be dual: the Condor Public License and GPL v2.
Nothing changes unless we want it to. Shooting for an end-of-year
release with RedHat that includes source packages.

6.9.5 is the 7.0 release candidate.
If Todd had to guess, it is more like 2 weeks away.

Delayed tasks for next development branch
-----------------------------------------

A) [condor-admin #15277] LIGO DAGMan spool directory efficiency.
How to avoid making O(10^5-10^6) copies of executables.

B) [condor-admin #15669] LIGO RFE to optionally delete stdout/stderr files automatically

C) [condor-admin #14006] LIGO: append to stdout/err files on re-execution

D) [condor-admin #15287] LIGO X509 certificate management enhancement request.
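
Note on item 5: the point about the Python interpreter's exit status
can be checked with a short sketch like the one below. It was not run
on the call and is not the user's job script; the helper name and the
simulated failure are hypothetical. It starts a fresh interpreter twice
and compares the exit status of a clean run with one that ends in an
unhandled exception.

```python
import subprocess
import sys


def exit_status(snippet):
    """Run `snippet` in a fresh Python interpreter; return its exit status."""
    result = subprocess.run(
        [sys.executable, "-c", snippet],
        stderr=subprocess.DEVNULL,  # hide the traceback of the failing run
    )
    return result.returncode


ok = exit_status("pass")
failed = exit_status("raise RuntimeError('simulated failure')")

# A clean run exits with 0; an unhandled exception gives a nonzero status.
print("clean run:", ok)
print("unhandled exception:", failed)
```

So if the job script had died on an exception during the NFS hang, its
exit status alone should have been enough for DAGMan to see the node as
failed.
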