LIGO Condor telecon
October 08, 2007

Condor:
  Kent Wegner
  Zach Miller

LIGO:
  Stuart Anderson
  Scott Koranda
  Duncan Brown

VDT
---

1) I believe Ed Maros has generated a list of additional Globus client
   libraries needed to run the current LIGO TclGlobus client
   applications. Where does this currently stand?

   Email from Alain Roy:

     VDT 1.8.1a contains a TclGlobus-Client package that meets his
     request. We haven't released 1.8.1a yet, but we expect to within
     a day. (We're just waiting resolution on one other issue.)

     While working on this, Ed made another request: using a
     pre-installed Tcl instead of using the one from the VDT. This has
     been deferred to the next VDT update.

     -alain

New Condor points of interest
-----------------------------

2) Stagger starting large DAGs that have expensive transient start-up
   costs. Is staggered crontab start still the leading contender for a
   short-term fix?

   Yes, still the best contender, but need more time to look into it.
   Kent will try to get to it before the next call.

3) Managing multi-threaded applications. Peter, what was learned about
   current Condor capabilities?

   Working towards a new system of using slots where a job can say it
   needs N slots on a machine, but it is far from ready. At the moment
   a startd policy can be used so that a slot stops accepting jobs once
   a job is running on another slot, but this requires that no job is
   running on the second slot, or that the job there can be pre-empted.
   This is the Bologna batch system. New functionality will be on the
   next development branch.

4) [condor-users 10 Sep 2007 email] Local universe scheduling

   Peter, what did you discover about whether the Negotiator really
   needs to be involved in scheduling these jobs?

   Will check into it, because didn't think the Negotiator was involved
   in the local universe.

   Update: checked during call, do not need the Negotiator. So Stuart
   will have someone try it again. Don't *need* the Negotiator, but it
   makes things go faster. Needed to run condor_reschedule to start
   them.
   Schedd should be able to do this by itself.

Condor-DAGMan
-------------

5) [condor-admin #15811] LIGO: intra-DAG node prioritization and
   throttling

   Coloring nodes within a DAG for priority and "maxjobs".

   Working but not fully documented yet. If we want a beta version,
   Condor could get something to us soon. Prioritization has been in
   the code for a while (but not in 6.9.4), and coloring is now there
   too. Stuart would like to get a beta for x86_64 RHEL 3 (since there
   is no FC4).

6) [condor-admin #15836] LIGO: enhancement request to dagman

   Merging of sub-DAGs for easier management by DAGMan and users.

   Pete Keller to work on this, but no update right now.

7) [condor-admin #15848] LIGO: enhancement request to dagman

   Problems with rescue DAGs when running with sub-DAGs.

   With nested DAGs, if a child DAG fails, then running the top level
   again doesn't automatically run the child's rescue DAG. Kent will
   work on that next, once done with throttling. One issue is that
   preserving in the rescue DAG all of the configuration used to run a
   DAG adds complications, so Kent is thinking about it. Merging is
   higher priority, says Duncan, but that is on Peter's plate.

8) [condor-admin #16010] LIGO: condor_hold shouldn't kill DAGMan

   Kent has not had time to do anything on that. The problem is that
   on release DAGMan has to create all of its internal state; with a
   few hundred thousand nodes that takes a while. Have not looked
   again after 6.9.4.

Condor-C
--------

9) Any update on the multi-homed Condor-C setup at UWM with 6.9.4?
   Note, the Caltech pool was upgraded to 6.9.4.

   Tried to move the UWM pool over to GSI authentication but found two
   bugs. The first is in wildcard handling of authorization; the second
   is a problem causing the starter to be killed. So backed it out.

   Host authorization is not going away, but it is not considered good
   security practice. Right now shared secret only works daemon to
   daemon.

   Scott should file a bug report on the wildcard problem. Derek
   already fixed the backfill problem (edge case).
   Scott will look for a log snippet of the problem having to do with
   the startd crashing.

Condor-misc
-----------

10) The 6.9 documentation on upgrading Quill is insufficient, so we as
    a group failed when we dropped Erik's previous request for a
    migration guide. Caltech could not upgrade Quill, and we dropped
    the ball on planning by not working harder on the migration.

    Erik Paulson has done some work and it is better, and probably not
    wrong, but not complete. Greg Thain is tasked with making it
    better.

11) Any problems with the clipped port to RHEL5 or initial work on the
    full port?

    In progress, still not done. Hoping to do this with full support
    from Peter, but have not upgraded LIGO clusters yet.

12) [condor-admin #15277] LIGO DAGMan spool directory efficiency. How
    to avoid making O(10^5-10^6) copies of executables.

    No update.

13) [condor-admin #15669] LIGO RFE to optionally delete stdout/stderr
    files automatically

    No update.

14) [condor-admin #14006] LIGO: append to stdout/err files on
    re-execution

    No update.

15) [condor-admin #15287] LIGO X509 certificate management enhancement
    request.

    No update.
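
Illustration for item 2: the staggered crontab start could be sketched
roughly as below. This is only a sketch under assumptions that were not
discussed on the call. The function name, the DAG file names, the start
time and the ten-minute gap are all hypothetical. It only builds crontab
lines that call condor_submit_dag at offset times; it does not install
them. The entries as written would fire every day, so a one-off start
would need the date fields filled in or the entries removed afterwards.

```python
def staggered_crontab(dag_files, start_hour=1, start_minute=0, gap_minutes=10):
    """Return one crontab line per DAG file, each offset by gap_minutes.

    All parameter values are illustrative assumptions, not settings
    agreed on the call.
    """
    lines = []
    for i, dag in enumerate(dag_files):
        # Minutes since midnight for this DAG's start time.
        total = start_hour * 60 + start_minute + i * gap_minutes
        hour, minute = divmod(total, 60)
        if hour > 23:
            raise ValueError("staggered schedule runs past midnight")
        # crontab field order: minute hour day-of-month month day-of-week
        lines.append(f"{minute} {hour} * * * condor_submit_dag {dag}")
    return lines


if __name__ == "__main__":
    # Hypothetical DAG file names, for illustration only.
    for line in staggered_crontab(["a.dag", "b.dag", "c.dag"]):
        print(line)
```

The gap would be chosen to be longer than the transient start-up cost of
one DAG, so that the expensive phases do not overlap.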