LIGO Condor telecon February 29, 2008

Condor:
Todd Tannenbaum
Peter Keller
Kent Wenger

LIGO Scientific Collaboration:
Stuart Anderson
Duncan Brown
Steffen Grunewald
Greg Mendell

VDT
---
1) Any open issues?

Greg: VDT is adding support for Debian Etch and keeping it for KX509. An email from Kent Blackburn says VDT is working on these for the next release.

[Post meeting note from Greg: Kent's email forwards an email from Alain Roy saying they are on track to add Debian Etch support. KX509 is broken after allowing Globus to link against external OpenSSL, and "it may take quite a while to fix".]

Condor
------
1.5) Any feedback on the prioritized LIGO list?

Duncan: Support for sub-DAGs. Merging of sub-DAGs, adding a keyword to make handling of sub-DAGs cleaner.
Condor: That's pretty high on Stuart's list.
Stuart: Anything at Hannover?
Steffen: Nothing at the moment. We had a problem with a DAG job that used a lot of memory.

Condor-DAGMan
-------------
2) Stagger starting large DAGs that have expensive transient start up costs.
Testing DAGMAN_SUBMIT_DELAY to solve this problem at UWM or AEI?
Any report from Mathew Pitkin?

Stuart: Now on Item 2. The problem of starting a large number of jobs simultaneously?

3) [condor-admin #15848] LIGO: enhancement request to dagman
Problems with rescue DAGs when running with sub-DAGs.
Progress on PR 598 and 788?

Duncan: Time scale of Alpha vs Beta?
Condor: Within a week...
Duncan: Wait until Monday afterwards, can then discuss at Tues. inspiral telecon.
Duncan: Fixing the rescue DAG thing is the highest priority.
Greg: Can Condor please introduce themselves?
Stuart: We have Todd Tannenbaum working on item 3, Pete Keller working on item 5, and Kent Wenger working on item 4.

4) [condor-admin #16010] LIGO: condor_hold shouldn't kill DAGMan

5) [condor-admin #15836] LIGO: enhancement request to dagman
Merging of sub-dags for easier management by DAGMan and users.
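
[Illustrative sketch for item 2, not from the call: assuming DAGMAN_SUBMIT_DELAY makes DAGMan pause a fixed number of seconds between consecutive job submissions, the staggering idea looks roughly like the Python below. The function submit_with_delay, its parameters, and the 2-second value in the usage comment are hypothetical; none of them are Condor APIs.]

```python
import time
from typing import Callable, Iterable, List, TypeVar

Job = TypeVar("Job")
Result = TypeVar("Result")


def submit_with_delay(
    jobs: Iterable[Job],
    submit: Callable[[Job], Result],
    delay_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Result]:
    """Submit jobs one at a time, pausing between consecutive submissions.

    Hypothetical helper: `submit` stands in for whatever actually hands a
    job to the scheduler. No pause is taken before the first job.
    """
    results: List[Result] = []
    for index, job in enumerate(jobs):
        if index > 0:
            # Spread out the expensive start-up phase of each job.
            sleep(delay_seconds)
        results.append(submit(job))
    return results


# Usage (hypothetical): stagger three submissions by 2 seconds each.
# submit_with_delay(["a.sub", "b.sub", "c.sub"], print, delay_seconds=2)
```

The trade-off is the one implied by item 2: a longer delay smooths the start-up load but lengthens the time needed to get a large DAG fully submitted.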

Condor-release
--------------
6) Any problems reported in the wider condor universe that may be of interest to LIGO?

7) Initial reports on 7.0.1 testing.

Stuart: Condor 7.0.1 came out.
Todd: Checkpoint server to fix in 7.0.2.
Duncan: Did clean sweep with 7.0.1.
Todd: Make sure the checkpoint server is running 64 bit. Test 7.0.2 beta?
Duncan, Stuart: We are homogeneous, so no.
Stuart: Red Hat support for standard universe?
Todd: Will happen. No timetable right now. Red Hat is hiring.

7.5) [condor-admin #17092] LIGO: Local universe scheduling latencies.
Is this resolved by 7.0.1?

8) Packaging plans (/lib64 and root location, /opt/condor-x.y.z, /usr,
Is it required to set LIB when using system default locations?

Stuart: Steffen's question: would it do any harm to set the LIB variable?

[Here is Steffen's email:]

Steffen Grunewald On Fri Feb 29, 2008 wrote:
> On Thu, Feb 28, 2008 at 09:59:33PM -0800, Stuart Anderson wrote:
>> Our next LIGO/Condor meeting is Friday Feb 29 at 11AM PST/1PM CST.
>> The usual phone number applies: 1-800-704-9896 (562487#).
>
> OK - I will try to join via Skype from home (just tested the audio path,
> and it seems to be OK this time - I suspect there have been conflicts
> between Skype's way to open audio devices, and other programs.
>
>> VDT
>> ---
>> 1) Any open issues?
>
> for VDT toolkit, avoid extra inclusions of /usr/{bin,sbin} if Condor (or
> other parts) have been placed under /usr (which would prevent the gsi-
> enabled ssh to be first in $PATH)
>
>> 8) Packaging plans (/lib64 and root location, /opt/condor-x.y.z, /usr,
>> Is it required to set LIB when using system default locations?
>
> Would it harm?
>
>> 17) [condor-admin #17291] LIGO: job on hold without a reason.
>
> What has happened to this one? After upgrading from 6.9.3, I haven't seen
> this anymore, but there also has been a change in the user base...
>
> Cheers,
> Steffen

Condor: The LIB variable is set in the condor config. Are you asking us to set it during the install or...?
Stuart: Steffen, do you have a question?
Steffen: Is it required to test for the lib64 location?
Stuart: Condor's bundled version of ld fails to find some stuff. Used the condor LIB setting. Will condor put stuff in the canonical location?
Condor: Leaving that up to Red Hat and rpm.
Stuart: If in the standard place, do you need to set the condor LIB variable?
Condor: Yes, but ld is just a script you can look at.
Stuart: Not asking for a change, just trying to understand.

9) How to avoid thousands of jobs "flushing through a funky node"?
e.g., if a job fails on a slot consider automatically releasing the machine claim for that user?
Hoping for Todd's example startd policy expression to black-list individual users on "black hole machines".

Stuart: Progress on Item 9, black hole machines?
Todd: Will send that around.

Previously delayed tasks for next development branch
----------------------------------------------------
10) [condor-admin #15277] LIGO DAGMan spool directory efficiency.
How to avoid making O(10^5-10^6) copies of executables.

11) [condor-admin #15669] LIGO RFE to optionally delete stdout/stderr files automatically.

12) [condor-admin #14006] LIGO: append to stdout/err files on re-execution.

13) [condor-admin #15287] LIGO X509 certificate management enhancement request.

14) [condor-admin #17092] LIGO: Local universe scheduling latencies.

Recent Condor issues
-----------------
15) [condor-support #2158] LIGO: multiple schedd core dumps
Should be closed.

17) [condor-admin #17291] LIGO: job on hold without a reason.

18) [condor-admin #17283] ancillary suggestion for a crash report tool.

[Not exactly sure which topic these comments go with:]

Todd: Fix for condor_history -constraint, if using quill with condor_history.
Stuart: Not using quill now, but will go back to using it.
Todd: condor_history -backwards and condor_history -constraint will work in the next release.

Condor-C
--------
19) Any update on multi-homed condor-c setup at UWM.

Previous Condor issues
----------------------
20) [condor-admin #17168] LIGO: Shadow failures to connect to schedd
Todd to consider a short term fix for the stable branch while Tristan starts work on a completely new Shadow that supports all Universes equally well for the development branch.

Todd: Condor admin shadow failures; fails to connect to the Schedd. Happens often or rarely, depending. Send the Schedd log and Shadow log when it happens again. Have a guess why it happens. If the Shadow fails to connect to the Schedd, it waits and tries again; the old Shadow kind of does the same thing. It keeps trying for a certain amount of time (default 300 s); it tries 2 connects. Does it ring a bell?
Stuart: No, not the number of times it tried to connect.
Todd: The easy thing is to not hard code it to 5 minutes, but make it a parameter. The longer you set it, the more patient it will be. The down side is that the Shadow sits waiting around. That would be the easiest thing to do.
Stuart: Communication, instead of waiting until the end. The Schedd can start the Shadow. Front load as much as you can.
Todd: Right. We do not keep communication between the Schedd and the Shadow open; was to establish key. Authorization failed, cannot establish a session key. Burned people at Fermilab; the Kerberos server got overloaded. The easiest thing would be to increase the number from 5 minutes to a longer period of time, and get a full log when it happens again.
Stuart: When the filesystem at hand talks to home userID gets changed; blocks the Schedd if someone is abusing the filesystem.
Todd: The Schedd falls behind if there are too many 1 s jobs, or if it is writing user logs etc. and the system is unresponsive.
Stuart: Can the 2nd case be addressed?
Todd: Shadows cannot get a word in. Startd not responding and time outs were addressed by making communication non-blocking. The Schedd uses a socket and blocks. A patch will address this. Workaround: do not lock the user log before writing to it. Not a problem if every job has its own user log.
DAGMan jobs can write to a single user log, or to one user log per job.
Stuart: Duncan?
Greg: Think there is a separate log for each job.
Todd: Cannot look at a job's ClassAd and see the history of the job.
Condor: For recovery mode. Scaling...
Stuart: What if DAGMan were the only one that held the lock on the file, when all are writing to the same log?
Todd: There is a patch, with a few corner cases to be ironed out, that obtains file locks locally. When the Shadow or Schedd attempts to lock a file on NFS, the locking happens locally in /tmp.
Greg: OK, my DAGs write to a single log, on the local file system.
Condor: Should be OK.
Greg: Summarize the problem? My DAGs work OK.
Stuart: Jobs finish, but communication fails and the Schedd restarts the job.
Greg: Maybe I have seen that.
Stuart: Probably should not speculate.
Todd: Do you want to hold off until we see a complete log?

21) [condor-admin #17237] LIGO: remote file IO despite WantRemoteIO = FALSE

22) [condor-admin #17159] LIGO: fcntl 64bit bug
Should be closed.

23) [condor-admin #17239] LIGO: condor_submit stuck in CPU spin-loop

24) [condor-admin #17219] LIGO: stdout occasionally lost for jobmanager-condor

25) [condor-admin #17136] LIGO: condor_run intermittently returning NULL results

26) [condor-admin #17143] LIGO: ImageSize update problem
Peter investigating what it would take to have the same level and frequency of reporting for the Standard Universe as for the Local. Corollary: if this is too much for the standard universe it is probably too much for vanilla, and how do we scale back on large pools?

Stuart: Keep going on delta update...
Todd: ... That's about it for my delta...
Stuart: Anything else? OK, see you in 2 weeks.