LIGO Condor telecon
November 16, 2007

Condor:
    Peter Keller
    Alain Roy
    Todd Tannenbaum

LIGO:
    Stuart Anderson
    Scott Koranda
    Steffen Grunewald
    Greg Mendell
    Duncan Brown (late)

Pegasus:
    Karan Vahi

0) Does this mailing list now provide a valid return address?

    Steffen has been told this is now working. Alain says that "lists"
    is still in the To: address, i.e. condorligo@lists.aei.mpg.de, and
    this is verified by others. Alain will forward headers to Steffen.

VDT
---

1) VDT support for OS X 10.5 (pyglobus, openssh bug)

    Support for Mac OS X 10.5 is not as important as Debian.

2) [condor-admin #12696] Stupid question? rpath set to /home/condor for VDT build of GT

    When Stuart runs 'strings' on Condor 6.9.4 he still sees
    /home/condor. This is a problem in Condor (not VDT) that is only
    seen when using SSL auth. It has been fixed. In the VDT, the problem
    was only removed from the Globus binaries, so there may still be
    other problems; need to check in the latest VDT release (1.8.1).

New Condor points of interest
-----------------------------

3) [condor-admin #17205] LIGO: condor_rm cleanup of Local and Scheduler jobs

    Greg Thain confirmed the problem in the local universe and has a fix
    going in asap. Not so sure the fix will cover the scheduler universe.

    What are the differences between local and scheduler? The local
    universe has a starter. The starter can monitor the process tree,
    CPU usage, and memory usage of the job. The scheduler universe has
    no starter, so it is lighter weight, but it is hard to apply
    policies. The scheduler universe is for well known helpers like
    DAGMan, not necessarily untrusted jobs. The local universe is more
    like vanilla.

    What is the overhead to fork off starter jobs in local? Todd is not
    sure.

4) [condor-admin #17209] LIGO: reduce cost of catalog building

    Nick's change might make it in sooner (possibly today?), since this
    was maybe also broken on Windows, so might as well fix it now. Main
    point is that LIGO now has shadow binaries that don't build a
    catalog at all, so this is less important.

5) [condor-admin #17191] LIGO: LOCAL_UNIV_RENICE_INCREMENT RFE

    Greg came back with an if-then-else solution that would work. Todd
    is resistant to adding knobs if generic knobs already work, so he
    would prefer to see LIGO use some kind of expression rather than
    another knob. Stuart agrees.

6) [condor-admin #17168] LIGO: Shadow failures to connect to schedd

    Enhancing the standard universe the same way as the vanilla
    universe. Fixed and closed, but the fix is only in the vanilla
    universe. It is needed in standard as well. Todd is not familiar
    with this and needs to investigate. Greg Quinn was working on that
    ticket.

7) [condor-admin #17159] LIGO: fcntl 64bit bug

    Update from Peter on his initial suspicion of what is wrong.
    Designed so that any fcntl call that requires more than 2 arguments
    will not work and is not supported. The Condor standard universe
    captures calls into glibc and has support for fcntl(), but the
    support is limited and does not cover the full fcntl()
    functionality (not the 3rd argument). Peter can look at what it
    means to fix it. If Peter can update the ticket, LIGO can ask
    Vladimir, and Peter can also update the ticket with a possible
    workaround. Stuart suggests more explicit documentation of this
    limitation.

8) [condor-admin #17143] LIGO: ImageSize update problem

    Confirmation that the starter honors policy on a 5 min timescale
    for the Standard universe. Is there a way to efficiently get the
    image size update back to the scheduler? It will require Pete to
    change the code. Not a priority for LIGO, just asking if it is
    worth doing. It would be convenient for debugging to get that
    information back.

9) [condor-admin #17136] LIGO: condor_run intermittently

    Discussion of the updated flock patch that preserves semantics.
    Stuart sent in a different patch and is wondering if anybody has
    looked at it yet. Todd cannot say? When Condor writes to .log it is
    not guaranteed that the flush has been done.
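
    To illustrate the flush concern in item 9, here is a minimal
    sketch, assuming a POSIX system, of appending to a log file under
    an advisory lock and forcing the data out before the lock is
    released. This is not Condor's code and not either of the patches
    discussed; the function name append_event is hypothetical.

```python
import fcntl
import os


def append_event(path, text):
    """Append text to path while holding an exclusive advisory lock.

    Hypothetical helper for illustration only.
    """
    with open(path, "a") as log:
        # Block until this process holds the exclusive lock.
        fcntl.flock(log, fcntl.LOCK_EX)
        try:
            log.write(text)
            # Push Python's userspace buffer down to the kernel.
            log.flush()
            # Ask the kernel to commit the data to storage.
            os.fsync(log.fileno())
        finally:
            # Release the lock only after the flush has been requested.
            fcntl.flock(log, fcntl.LOCK_UN)
```

    A reader that takes the same lock before reading would then not
    see a partially buffered event, provided every writer follows the
    same lock, write, flush, unlock order.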

10) [condor-support #1789] LIGO: Condor re-running old jobs

    Derek was going to investigate whether the remaining corner cases
    mentioned in this resolved ticket have been resolved in 6.9.4.

Condor-DAGMan
-------------

11) Stagger starting large DAGs that have expensive transient start up costs.

    Any update? Scott, have you tested DAGMAN_SUBMIT_DELAY to solve
    this problem? Scott has not tested yet.

12) [condor-admin #15811] LIGO: intra-DAG node prioritization and throttling

    Coloring nodes within a DAG for priority and "maxjobs". An updated
    pre-release has been installed at CIT for testing. Steve, Duncan,
    any feedback on this? Steve has tried these and found that they
    work and are easy to use. Drew agrees that it does what he wants,
    is convenient, and works.

13) [condor-admin #15836] LIGO: enhancement request to dagman

    Merging of sub-dags for easier management by DAGMan and users.
    Steve, Duncan, any feedback on Peter's request for comment on
    whether or not to use phantom nodes to accomplish this? Have not
    heard back about this from Steve or Duncan. Peter will just do it
    anyway (both with and without phantom nodes), not a big deal. This
    is for a later development branch (not in 6.9.5). "Phantom" nodes
    will tie up the nodes in the sub-dags with parents and children in
    the higher-level dag. Peter sees how you would want both with and
    without this type of functionality. Duncan asked for something by
    12/17 since he is visiting UWM; Peter says he will try.

14) [condor-admin #15848] LIGO: enhancement request to dagman

    Problems with rescue DAGs when running with sub-DAGs. Steve, have
    you confirmed/started using the partial "dir" command fix? Steve
    has verified. Not all of this is finished; the rest will be punted
    to the next development branch.

Condor-C
--------

15) Any update on multi-homed condor-c setup at UWM with 6.9.4?

    UWM has an inspiral workshop and Scott is assigned a bunch of work.

Condor-misc
-----------

16) Status of RHEL5/CentOS5 full port.

    Currently doing RHEL5; it is compiling as of yesterday, and today
    they are trying to test it. There is one known problem, but they
    want to see if other problems exist before tackling that one. The
    known problem involves the C++ runtime library and thread handling
    for exceptions. The proposed solution is to check on thread-local
    storage and see if it is usable, or whether thread-local storage
    can be managed "manually". Glibc is 2.3.2 from RedHat circa "a
    while ago", so it has other revision numbers. Condor hasn't updated
    it for many years since it takes so much effort; would only do so
    if that is the last option.

17) 6.9.5 release status/schedule.

    Todd hopes to have it out before Thanksgiving. It should have been
    on the local pool already but is not. Trying to fix two known
    blockers today.

Delayed tasks for next development branch
-----------------------------------------

A) [condor-admin #15277] LIGO DAGMan spool directory efficiency.

B) [condor-admin #15669] LIGO RFE to optionally delete stdout/stderr files automatically

C) [condor-admin #14006] LIGO: append to stdout/err files on re-execution

D) [condor-admin #15287] LIGO X509 certificate management enhancement request.

E) [condor-admin #17092] LIGO: Local universe scheduling latencies

F) [condor-admin #16010] LIGO: condor_hold shouldn't kill DAGMan