Shutting Down a Condor Pool Gracefully
It is possible to shut down a Condor pool gracefully so that a minimal amount of work is lost.
For jobs that run in the Standard universe and can checkpoint, this procedure loses very little work.
Jobs that run in the Vanilla universe, and so cannot checkpoint, will lose any work they have accumulated, but with this procedure they will stay in the queue and restart when the pool is brought up again.
- Determine your HOSTALLOW_ADMINISTRATOR machine
Examine the condor_config for your pool and find the machine or machines listed in the HOSTALLOW_ADMINISTRATOR option. You should run most of the commands that follow from this machine.
At UWM, for example, this machine is condor.phys.uwm.edu. It is also the central manager for the Condor pool.
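As a minimal sketch of the lookup, the function below pulls the setting out of a config file. It assumes the option is written on a single line as "HOSTALLOW_ADMINISTRATOR = host[, host...]"; the function name admin_hosts is made up for this example. Macro references such as $(CONDOR_HOST) are printed unexpanded, so on a live installation running condor_config_val HOSTALLOW_ADMINISTRATOR is likely the more reliable check.

```shell
#!/bin/sh
# Print the value assigned to HOSTALLOW_ADMINISTRATOR in a Condor config file.
# Assumption: the setting occupies one line of the form
#   HOSTALLOW_ADMINISTRATOR = host[, host...]
# Macros in the value are NOT expanded by this sketch.
admin_hosts() {
    config_file="$1"
    sed -n 's/^[[:space:]]*HOSTALLOW_ADMINISTRATOR[[:space:]]*=[[:space:]]*//p' "$config_file"
}
```

Call it with the path to your own condor_config, for example: admin_hosts /path/to/condor_config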
- Create a script to list worker nodes
It is handy to have a simple script that lists the hostnames of the cluster nodes, one name per line. For example at UWM we have
[root@condor tools]# ./loopNamesOnly | head
medusa-slave001
medusa-slave002
medusa-slave003
medusa-slave004
medusa-slave005
medusa-slave006
medusa-slave007
medusa-slave008
medusa-slave009
medusa-slave010
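The contents of loopNamesOnly are not shown here, so the following is only a guess at what such a script might look like. It assumes the nodes follow the medusa-slaveNNN pattern seen in the output above, with a zero-padded three-digit index; the function name list_nodes and its count argument are invented for this sketch.

```shell
#!/bin/sh
# Print worker-node hostnames, one per line.
# Assumption: nodes are named medusa-slaveNNN, numbered from 001 upward.
list_nodes() {
    count="$1"                              # how many nodes to list
    i=1
    while [ "$i" -le "$count" ]; do
        printf 'medusa-slave%03d\n' "$i"    # zero-pad the index to three digits
        i=$((i + 1))
    done
}
```

If your node names do not follow a pattern, a plain text file with one hostname per line and a script that runs cat on it serves the same purpose.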
- Tell jobs to checkpoint if possible
Before shutting down the pool, run the following command on the HOSTALLOW_ADMINISTRATOR machine:
/opt/condor/sbin/condor_checkpoint -all
This will cause a signal to be sent to all running jobs (in the Standard universe) that they should checkpoint.
You can monitor the progress on any individual node by logging into it and watching the CkptServerLog and StarterLog in the Condor log directory.
- Bring down the worker nodes gracefully
Using your worker-node name script and condor_off on the HOSTALLOW_ADMINISTRATOR machine, shut down the worker nodes gracefully. Do NOT shut down the nodes that people submit jobs from yet, just the worker nodes.
For example, at UWM we do
./loopNamesOnly | xargs -i /opt/condor/sbin/condor_off -name {} -master -graceful
The -master option tells Condor to shut down all the daemons on the node including the condor_master.
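If you would rather see which commands will run before anything is shut down, the xargs one-liner can be written as a loop with a dry-run switch. This is a sketch, not part of Condor: the function name shutdown_nodes and the DRY_RUN variable are conventions invented here, and the condor_off path is the one used in the examples in this document.

```shell
#!/bin/sh
# Path taken from the examples in this document; override it for your installation.
CONDOR_OFF="${CONDOR_OFF:-/opt/condor/sbin/condor_off}"

# Read hostnames from stdin and shut each node down gracefully.
# With DRY_RUN=1 (a convention of this sketch only) the commands are printed, not run.
shutdown_nodes() {
    while read -r node; do
        [ -n "$node" ] || continue      # skip blank lines
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "$CONDOR_OFF -name $node -master -graceful"
        else
            # </dev/null keeps the command from consuming the hostname list
            "$CONDOR_OFF" -name "$node" -master -graceful </dev/null ||
                echo "condor_off failed for $node" >&2
        fi
    done
}
```

Pipe the node list into it, first with DRY_RUN=1 set to review the commands, then without it to perform the shutdown.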
- Bring down the submit nodes
If you have a submit node or nodes (the machines on which people actually run condor_submit) and each is distinct from the central manager of your pool then read on. If your submit node is your central manager then skip to the next step.
Again from the HOSTALLOW_ADMINISTRATOR machine run condor_off to shut down each submit machine. For example at UWM we would do
/opt/condor/sbin/condor_off -name hydra -master -graceful
/opt/condor/sbin/condor_off -name contra -master -graceful
/opt/condor/sbin/condor_off -name nest -master -graceful
- Bring down the central manager
The central manager should be the last machine on which Condor is running. You can shut it down in one of several ways, depending on how your cluster is set up:
/etc/init.d/condor stop
or
/opt/condor/sbin/condor_off -master
or
kill -QUIT pid
where pid is the pid for the condor_master daemon.
- Save state if necessary
If you are upgrading or doing other work and want to save the state of your pool (jobs, job history, user priorities, ...) then on the central manager and each submit machine be sure to save the contents of the execute and spool directories along with the file permissions (tar 'em up is best).
After you have the new or upgraded installation in place, copy the contents from the old directories that you saved into the new directories, being sure to preserve the permissions (cp -a is good).
If you are upgrading the nodes and if the nodes act as their own checkpoint servers (like at UWM), then be sure to save the contents (the checkpoints!) of the checkpointing directory. (At UWM the checkpointing directory is outside of the Condor directory in /checkpoint.)
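The save and restore of the spool and execute directories can be sketched with tar, whose -p flag keeps file permissions. The function names and both path arguments are placeholders invented for this example, and it assumes spool and execute sit side by side under one directory, which may not match your configuration. Ownership is only restored when the archive is unpacked as root.

```shell
#!/bin/sh
# Archive the spool and execute directories, preserving permissions.
# Assumption: both directories live directly under the directory given as $1.
save_state() {
    local_dir="$1"      # directory that contains spool/ and execute/
    archive="$2"        # tar file to create
    tar -cpf "$archive" -C "$local_dir" spool execute
}

# Unpack a saved archive into the new installation's directory,
# restoring the recorded permissions.
restore_state() {
    archive="$1"
    new_local_dir="$2"  # must already exist
    tar -xpf "$archive" -C "$new_local_dir"
}
```

Run save_state on the central manager and on each submit machine before the upgrade, and restore_state on the same machines afterwards. A separate checkpoint directory, if you have one, can be archived the same way.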
- Start up the pool
To start up the pool again just work in reverse.
First start Condor on the central manager.
Next, if you have submit machines that are distinct from your central manager, start those up. You should see jobs preserved if you do a condor_q.
Lastly, start Condor on all the worker nodes.
$Id: condor-shutdown.html,v 1.3 2007/11/06 03:42:12 patrick Exp $