Summary of roles for slave nodes
services running on slave nodes - NFS, condor, LDAS
NFS - each node contains some amount of LIGO data*
condor - each node is a condor worker node
LDAS - each node is an LDAS worker node**
Working on slave nodes
Stopping condor on each node
1. /etc/init.d/condor stop
DOES NOT CHECKPOINT JOB!
2. One of these is preferable; needs testing
a. /opt/condor/bin/condor_vacate ; /etc/init.d/condor stop
OR
b. /opt/condor/sbin/condor_off
3. Either option in 2 is desired; each causes jobs to
checkpoint. 2a results in all condor services being
stopped. 2b results in all but condor_master being
stopped, after which the node can be started remotely.
2a sequence - loopRangebg condor_vacate, then condor stop.
Do work. loopRangebg condor start.
2b sequence - from a master (or any condor pool member?)
condor_off {hostname or all}. Do work. condor_on
{hostname or all}.
4. On restart, does the slave node join the condor pool?
On the node, running the command
"condor_medusa | grep `hostname -s`" will return output
(consisting of hostname, status with regard to condor, and
load) if the machine has joined the pool; it will return
nothing if the node has not joined the pool.
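The sequences in 2a and 2b and the pool check in 4 could be wrapped roughly as follows. This is a minimal sketch, not a tested procedure: the helper names (run, stop_2a, stop_2b, start_2b, in_pool) and the DRY_RUN switch are hypothetical, the condor_on path is assumed to sit beside condor_off, and in_pool assumes the pool listing contains the node's short hostname on its line.

```shell
#!/bin/sh
# Hypothetical helpers for the drain/restart steps above.

# Print the command instead of running it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# 2a, on the node itself: vacate (checkpoint) first, then stop all condor
# services. '&&' skips the stop if condor_vacate fails (a choice of this
# sketch; the note above uses ';').
stop_2a() {
    run /opt/condor/bin/condor_vacate && run /etc/init.d/condor stop
}

# 2b, from a master: $1 is a hostname. condor_master stays up on the node,
# so condor_on can restart it remotely.
stop_2b() {
    run /opt/condor/sbin/condor_off "$1"
}

# Assumption: condor_on is installed next to condor_off.
start_2b() {
    run /opt/condor/sbin/condor_on "$1"
}

# Step 4: succeed if the listing on stdin mentions host $1 as a whole word,
# so that node1 does not match node10.
in_pool() {
    grep -qwF -- "$1"
}
```

With DRY_RUN=1 set, stop_2a and stop_2b only print what they would run, which may be useful while the "needs testing" question in 2 is still open. The pool check would then be: condor_medusa | in_pool "$(hostname -s)" && echo joined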
Shutting down nodes makes frame files unavailable! Avoid doing this
if jobs are running!!!
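Before shutting a node down, one might first ask condor whether anything in the pool is running. The following is a minimal sketch, assuming condor_status is on the PATH and can reach the collector; the function names are hypothetical, and the check only sees condor jobs, not LDAS jobs.

```shell
#!/bin/sh
# Hypothetical guard: is any machine in the pool currently claimed by a job?

# List the names of all claimed machines, one per line.
claimed_machines() {
    condor_status -constraint 'State == "Claimed"' -format '%s\n' Name
}

# Return 0 if nothing is claimed, 1 if something is, 2 if the query failed
# (a failed query is not treated as "idle").
safe_to_shutdown() {
    claimed=$(claimed_machines) || return 2
    [ -z "$claimed" ]
}
```

Usage would look like: safe_to_shutdown && echo "pool idle, OK to proceed". The check is pool-wide on purpose, since a job on any node may be reading frame files served by the node being shut down.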
* As all nodes contain LIGO data, each is an NFS server; for LIGO jobs to run,
ALL nodes should theoretically be up!
** Not necessarily true now, and will not necessarily be true long term