Shutting Down a Condor Pool Gracefully

It is possible to shut down a Condor pool gracefully so that a minimal amount of work is lost.

For jobs that run in the Standard universe and can checkpoint, this procedure allows very little work to be lost.

Jobs that run in the Vanilla universe, and so cannot checkpoint, will lose any work they have accumulated, but with this procedure they will stay in the queue and restart when the pool is brought back up.

  1. Determine your HOSTALLOW_ADMINISTRATOR machine

    Examine the condor_config for your pool and determine the machine or machines listed in the HOSTALLOW_ADMINISTRATOR option. This is the machine from which you should run most of the commands that follow.

    At UWM, for example, this machine is condor.phys.uwm.edu. It is also the central manager for the Condor pool.
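
    You can also query the running configuration directly with condor_config_val; for example (a sketch; adjust the path if your Condor binaries are not under /opt/condor):

    /opt/condor/bin/condor_config_val HOSTALLOW_ADMINISTRATOR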



  2. Create a script to list worker nodes

    It is handy to have a simple script that lists the hostnames of the cluster nodes, one name per line. For example, at UWM we have

    [root@condor tools]# ./loopNamesOnly | head
    medusa-slave001
    medusa-slave002
    medusa-slave003
    medusa-slave004
    medusa-slave005
    medusa-slave006
    medusa-slave007
    medusa-slave008
    medusa-slave009
    medusa-slave010
    
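    A minimal sketch of such a script (not the actual UWM loopNamesOnly; it simply assumes worker nodes follow a numbered medusa-slaveNNN naming pattern, so substitute your own hostnames and node count):

    #!/bin/sh
    # Print one worker-node hostname per line, zero-padded to three digits.
    n=1
    while [ $n -le 10 ]; do
        printf "medusa-slave%03d\n" "$n"
        n=$((n + 1))
    done
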
  3. Tell jobs to checkpoint if possible

    Before shutting down the pool, run the following command on the HOSTALLOW_ADMINISTRATOR machine:

    /opt/condor/sbin/condor_checkpoint -all
    

    This sends a signal to all running Standard universe jobs telling them to checkpoint.

    You can monitor the progress on any individual node by logging into it and watching the CkptServerLog and StarterLog in the Condor log directory.
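
    For example (assuming the node's Condor log directory is /opt/condor/log; check the LOG setting in that node's condor_config if it lives elsewhere):

    ssh medusa-slave001 tail -f /opt/condor/log/StarterLog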



  4. Bring down the worker nodes gracefully

    Using your worker-node name script and condor_off on the HOSTALLOW_ADMINISTRATOR machine, shut down the worker nodes gracefully. Do NOT shut down the nodes that people submit jobs from yet, just the worker nodes.

    For example, at UWM we do

    ./loopNamesOnly | xargs -i /opt/condor/sbin/condor_off -name {} -master -graceful
    

    The -master option tells Condor to shut down all the daemons on the node including the condor_master.
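
    As an optional sanity check, you can watch the workers leave the pool from the central manager; their slots should disappear from the output as the daemons exit (this assumes condor_status lives under /opt/condor/bin):

    /opt/condor/bin/condor_status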

  5. Bring down the submit nodes

    If you have a submit node or nodes (the machines on which people actually run condor_submit) and each is distinct from the central manager of your pool, then read on. If your submit node is your central manager, then skip to the next step.

    Again, from the HOSTALLOW_ADMINISTRATOR machine, run condor_off to shut down each submit machine. For example, at UWM we would do

    /opt/condor/sbin/condor_off -name hydra -master -graceful
    /opt/condor/sbin/condor_off -name contra -master -graceful
    /opt/condor/sbin/condor_off -name nest -master -graceful
    
  6. Bring down the central manager

    The central manager should be the last machine on which Condor is running. You can shut it down in a couple of ways, depending on how your cluster is set up:

    /etc/init.d/condor stop
    
    or
    
    /opt/condor/sbin/condor_off -master
    
    or 
    
    kill -QUIT pid
    

    where pid is the pid for the condor_master daemon.
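
    If you use the kill approach, one way to find that pid (a sketch, assuming pgrep is available on the central manager) is:

    kill -QUIT $(pgrep -x condor_master)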

  7. Save state if necessary

    If you are upgrading or doing other work and want to save the state of your pool (jobs, job history, user priorities, ...), then on the central manager and each submit machine be sure to save the contents of the execute and spool directories along with the file permissions (tarring them up is best).

    After you have the new or upgraded installation in place, copy the contents from the old directories that you saved into the new directories, being sure to preserve the permissions (cp -a is good).

    If you are upgrading the nodes and if the nodes act as their own checkpoint servers (like at UWM), then be sure to save the contents (the checkpoints!) of the checkpointing directory. (At UWM the checkpointing directory is outside of the Condor directory in /checkpoint.)
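
    For example (a sketch only; the spool, execute, and checkpoint locations depend on your local installation):

    cd /opt/condor
    tar czpf /root/condor-state-backup.tar.gz spool execute
    cd /
    tar czpf /root/condor-checkpoints.tar.gz checkpoint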

  8. Start up the pool

    To start up the pool again, just work in reverse.

    First start Condor on the central manager.

    Next, if you have submit machines that are distinct from your central manager, start those up. You should see jobs preserved if you do a condor_q.

    Lastly, start Condor on all the worker nodes.
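
    For example, at a site laid out like UWM this might look like the following (a sketch; it assumes each machine has the /etc/init.d/condor init script and that root can ssh from the HOSTALLOW_ADMINISTRATOR machine to the other nodes):

    # 1. On the central manager:
    /etc/init.d/condor start

    # 2. On each submit machine (hydra, contra, nest in the UWM example):
    ssh hydra /etc/init.d/condor start
    ssh contra /etc/init.d/condor start
    ssh nest /etc/init.d/condor start

    # 3. Finally, start all the worker nodes:
    ./loopNamesOnly | xargs -i ssh -n {} /etc/init.d/condor start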

Supported by the National Science Foundation. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).