
Configuring and Deploying Condor

Update: Jan/10/2012


The default installation and configuration is most likely not what your cluster needs.

A machine running Condor can play (basically) three roles:

  • Central Manager [CM]: runs the collector and negotiator daemons that match jobs to machines.
  • Submission node: runs the schedd daemon that accepts and queues user jobs.
  • Execution node: runs the startd daemon that executes jobs.

Below are instructions for deploying Condor onto your cluster using one particular method and configuration style, and making some basic assumptions [a cluster configured with 1 CM node, 1 Condor submission node, and several execution nodes]. Condor is very flexible, so you may choose to install, configure, and deploy it in a variety of ways. If you find that the instructions below do not suit your needs, please see the latest Condor manual.

  1. Install the rpm/deb Condor package on ALL the machines of your cluster. The package will automatically add a 'condor' user/group if one does not already exist. Cluster sites with a specific security policy should add the 'condor' user/group manually before performing the installation (see the deployment sketch after this list).
  2. Choose the machine that will play the role of the CM, and edit the main configuration file /etc/condor/condor_config (a worked example of these settings appears after this list):
    • Set CONDOR_HOST to the FQDN [Fully Qualified Domain Name] of the machine that will play the role of CM. If this machine has a separate network interface just for access to the cluster nodes, use that FQDN or the equivalent IP address. For example, if that machine has FQDN universe.sverige.kth.edu and IP=192.0.2.111, then set CONDOR_HOST=universe.sverige.kth.edu or, equivalently, CONDOR_HOST=192.0.2.111.
    • Set RELEASE_DIR = /usr
    • Set LOCAL_DIR = /var
    • Set LOCAL_CONFIG_FILE = /etc/condor/condor_config.local
    • Set CONDOR_ADMIN to an appropriate email address
    • Set UID_DOMAIN to the subnet domain for your cluster. For example, at UWM a typical node has FQDN medusa-slave001.medusa.phys.uwm.edu, so the subnet domain is medusa.phys.uwm.edu; that is, UID_DOMAIN=medusa.phys.uwm.edu.
    • Set FILESYSTEM_DOMAIN = $(FULL_HOSTNAME) if your cluster does NOT have a shared filesystem for users, or set it to the subnet domain if it does have a shared filesystem for users.
    • Set USE_NFS = True if your cluster has a shared filesystem for users.
  3. Create a tar file containing the configuration file, which you can then deploy onto each node of your cluster:
    tar -cf condor.tar /etc/condor/condor_config 
    
  4. Deploy the tar file onto each node of your cluster, making sure that the deployed file is in the correct path [that is, condor_config will overwrite the old installed config file].
  5. On each node execute chkconfig --add condor.
  6. Back on the CM machine, edit the file /etc/condor/condor_config.local to add the following:
    COLLECTOR_NAME = $(CONDOR_HOST)
    DAEMON_LIST    = MASTER, COLLECTOR, NEGOTIATOR
    
    where the CONDOR_HOST value is taken from the same variable in the /etc/condor/condor_config file.
  7. If this machine has a separate network interface for the cluster nodes, also add the line
    NETWORK_INTERFACE = <your IP address>
    
    where <your IP address> is the IP address of that separate network interface.
  8. Choose a machine to be the condor-submission node, and edit /etc/condor/condor_config.local to add the following:
    COLLECTOR_NAME = $(CONDOR_HOST)
    DAEMON_LIST    = MASTER, SCHEDD
    
    where the CONDOR_HOST value is taken from the same variable in the /etc/condor/condor_config file.
  9. Choose the machines that will be the condor-execution nodes, and edit /etc/condor/condor_config.local to add the following:
    COLLECTOR_NAME = $(CONDOR_HOST)
    DAEMON_LIST    = MASTER, STARTD
    
    where the CONDOR_HOST value is taken from the same variable in the /etc/condor/condor_config file.
  10. Start Condor first on the CM machine, with:
    /etc/init.d/condor start
    
  11. Start Condor on the condor submission node by doing the same.
  12. Start Condor on the condor execution nodes by doing the same on each node.
  13. On the Condor Central Manager machine, run /usr/bin/condor_status to see the status of your Condor pool (see also the verification sketch after this list). You should see something similar to
    Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
    
    nemo-slave0001.nem LINUX      X86_64 Claimed   Busy     1.000  3960  0+06:42:47
    nemo-slave0003.nem LINUX      X86_64 Claimed   Busy     1.000  3960  0+09:14:30
    nemo-slave0004.nem LINUX      X86_64 Claimed   Busy     1.000  3960  0+06:41:30
    nemo-slave0005.nem LINUX      X86_64 Claimed   Busy     1.000  3960  0+03:47:21
    nemo-slave0006.nem LINUX      X86_64 Claimed   Busy     0.990  3960  0+00:48:46
    nemo-slave0007.nem LINUX      X86_64 Claimed   Busy     0.990  3960  0+00:36:10
    nemo-slave0008.nem LINUX      X86_64 Claimed   Busy     0.990  3960  0+00:31:38
    nemo-slave0009.nem LINUX      X86_64 Claimed   Busy     1.000  3960  0+02:47:22
    nemo-slave0010.nem LINUX      X86_64 Claimed   Busy     0.990  3960  0+05:15:34
    nemo-slave0011.nem LINUX      X86_64 Claimed   Busy     0.990  3960  0+06:41:03
    ...
    slot1_2@nemo-slave LINUX      X86_64 Claimed   Busy     1.000    25  1+01:02:19
    slot1_3@nemo-slave LINUX      X86_64 Claimed   Busy     1.000    25  1+01:00:46
    slot1_4@nemo-slave LINUX      X86_64 Claimed   Busy     1.000    25  1+00:59:34
    slot1_7@nemo-slave LINUX      X86_64 Claimed   Busy     1.000    25  1+00:56:14
    slot1_8@nemo-slave LINUX      X86_64 Claimed   Busy     1.000    25  0+10:49:23
    slot1_9@nemo-slave LINUX      X86_64 Claimed   Busy     1.000    25  0+10:47:48
    slot2@nemo-slave11 LINUX      X86_64 Unclaimed Idle     0.270  48397  2+03:25:40
                         Total Owner Claimed Unclaimed Matched Preempting Backfill
    
            X86_64/LINUX  5171     0    4931       234       0          6        0
    
                   Total  5171     0    4931       234       0          6        0
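
As a concrete illustration of the edits in step 2, this is roughly how the relevant lines of /etc/condor/condor_config might look for a hypothetical cluster whose CM has the FQDN cm.example.edu and whose nodes share the subnet domain cluster.example.edu. The hostnames and email address are placeholders, not values from a real site:

    # /etc/condor/condor_config (excerpt) -- illustrative values only
    CONDOR_HOST       = cm.example.edu
    RELEASE_DIR       = /usr
    LOCAL_DIR         = /var
    LOCAL_CONFIG_FILE = /etc/condor/condor_config.local
    CONDOR_ADMIN      = condor-admin@example.edu
    UID_DOMAIN        = cluster.example.edu
    # use $(FULL_HOSTNAME) here instead if the cluster has no shared filesystem for users
    FILESYSTEM_DOMAIN = cluster.example.edu
    USE_NFS           = True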
    
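A minimal shell sketch of steps 1 and 3-5, assuming RPM-based nodes reachable as root over ssh and a hypothetical file nodes.txt listing every cluster host; the package name 'condor', the host list, and the yum/apt-get commands are assumptions about your site, not part of the instructions above:

    # pre-create the 'condor' user/group only if your site security policy requires it;
    # otherwise the package creates it for you (step 1)
    for node in $(cat nodes.txt); do
        ssh root@$node "groupadd -r condor; useradd -r -g condor -d /var/lib/condor condor"
        ssh root@$node "yum -y install condor"        # or 'apt-get install condor' on deb-based nodes
    done

    # package the edited configuration on the CM and push it to every node (steps 3-5)
    tar -cf condor.tar /etc/condor/condor_config
    for node in $(cat nodes.txt); do
        scp condor.tar root@$node:/tmp/
        ssh root@$node "tar -xf /tmp/condor.tar -C /"  # overwrites the old /etc/condor/condor_config
        ssh root@$node "chkconfig --add condor"
    done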

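Once Condor has been started (steps 10-12), a few standard Condor commands can confirm that each node is playing its intended role before you read the full pool listing; the exact output format varies between Condor versions:

    # on any node: show which daemons the local configuration will start
    condor_config_val DAEMON_LIST

    # on the CM: list the condor_master daemons that have reported to the collector
    condor_status -master

    # on the CM: confirm that the submission node's schedd is visible
    condor_status -schedd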
This completes a basic deployment and configuration of Condor. You are strongly encouraged to read the Condor Manual and learn how to configure Condor in the way that best suits your particular cluster.
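
As a final smoke test, you may want to submit a trivial job from the submission node and watch it match against an execution slot. The file name test.sub and the choice of /bin/sleep are arbitrary for this sketch:

    # test.sub -- a minimal vanilla-universe test job
    universe   = vanilla
    executable = /bin/sleep
    arguments  = 60
    output     = sleep.out
    error      = sleep.err
    log        = sleep.log
    queue

Submit it with condor_submit test.sub and monitor it with condor_q; when the job completes, the events recorded in sleep.log confirm that it matched and ran on an execution node.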



Supported by the National Science Foundation. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).