LSC Data Grid

Documentation of Pegasus-related issues

Acronyms and terms

This section gives an overview of common acronyms and terms used in the Pegasus documentation.

  • LFN Logical FileName:
  • The name of a file without regard to where it is located or to any specific path. Within the LIGO Data Grid we strongly encourage LFNs to be unique. Example: H2-TMPLTBANK-755943365-2048.xml

  • PFN Physical FileName:
  • The name of a specific file at a specific path, usually given as a URL. A file often has multiple PFNs. Example:

    file:/home/alex/H2-TMPLTBANK-755943365-2048.xml
    file://localhost/home/alex/H2-TMPLTBANK-755943365-2048.xml
    gsiftp://dietz.phys.lsu.edu/home/alex/H2-TMPLTBANK-755943365-2048.xml

    In the example above, all three URLs (PFNs) point to the same LFN and to the same file on a single disk. In the following example,
    gsiftp://dietz.phys.lsu.edu/home/alex/H2-TMPLTBANK-755943365-2048.xml
    gsiftp://hydra.phys.uwm.edu/home/alex/H2-TMPLTBANK-755943365-2048.xml

    the two URLs (PFNs) point to the same LFN, but each refers to a copy on a different filesystem, located in different parts of the U.S.

  • LRC Local Replica Catalog.
  • This is a catalog that contains the mappings from logical filenames (LFNs) to physical filenames (PFNs).

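  • RLS Replica Location Service.
  • This service consists of one or more LRCs together with one or more Replica Location Indexes (RLIs), which index the LFNs held in the LRCs so that replicas can be found across sites.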

  • Globus:
  • Globus is an open-source software toolkit for building Grid systems and applications. It is used to transfer files between locations and to run executables remotely on other sites, so it comes into play whenever a job runs on a remote site.
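To make the LFN/PFN/LRC relationship concrete, here is a minimal bash sketch of what a replica catalog lookup does: given an LFN, it returns every registered PFN. The flat "LFN PFN" list and the function name lookup_pfns are hypothetical stand-ins for illustration only; a real LRC is a database queried through the RLS, not a shell variable.

```bash
#!/usr/bin/env bash
# Hypothetical flat-file stand-in for a Local Replica Catalog (LRC):
# one "LFN PFN" pair per line. A real LRC is a server-side database
# that is queried through RLS.
catalog='H2-TMPLTBANK-755943365-2048.xml gsiftp://dietz.phys.lsu.edu/home/alex/H2-TMPLTBANK-755943365-2048.xml
H2-TMPLTBANK-755943365-2048.xml gsiftp://hydra.phys.uwm.edu/home/alex/H2-TMPLTBANK-755943365-2048.xml'

# lookup_pfns LFN: print every PFN registered for LFN, one per line.
lookup_pfns() {
  printf '%s\n' "$catalog" | awk -v lfn="$1" '$1 == lfn { print $2 }'
}

# Prints both replicas (the dietz and the hydra PFN) of the same logical file.
lookup_pfns H2-TMPLTBANK-755943365-2048.xml
```

One LFN mapping to several PFNs is exactly the situation in the two-site gsiftp example above.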


Details on the file sites.txt

In this section the contents of the sites.txt file are explained in more detail. This file describes the sites (clusters) that are used, including the path to the working directory, the paths to required libraries and the GridFTP server.
This configuration file must be converted to XML with genpoolconfig:

genpoolconfig -f sites.txt -o sites.xml
The file contains an entry for each site, following the template below:

pool siteid {
    profile namespace "key" "value"
    gridlaunch "path_to_$VDS_HOME/bin/kickstart"
    lrc "URL_to_Replica_catalog"
    gridftp "GridFTP-server_URL_to_storage_location" "GT_Version"
    workdir "base_path_to_working_directory"
    universe type "jobmanager_URL" "GT_Version"
}

Example:

pool cit {
  profile env "GLOBUS_LOCATION" "/ldcg/ldg/globus"
  profile env "LD_LIBRARY_PATH" "/ldcg/ldg/globus/lib"
  gridlaunch "/archive/home/dietz/Install/vds/bin/kickstart"
  lrc "rlsn://ldas-cit.ligo.caltech.edu"
  gridftp "gsiftp://ldas-grid.ligo.caltech.edu/archive/home/dietz/pegasus" "2.2.4"
  workdir "/archive/home/dietz/pegasus"
  universe transfer "ldas-grid.ligo.caltech.edu/jobmanager-fork" "2.2.4"
  universe vanilla "ldas-grid.ligo.caltech.edu/jobmanager-condor" "2.2.4"
}

The following is a short explanation of each of those lines:
  • profile env "GLOBUS_LOCATION" "/ldcg/ldg/globus"
    profile env "LD_LIBRARY_PATH" "/ldcg/ldg/globus/lib"

    These lines set environment variables on the remote cluster. In this example they are equivalent to the commands:
    export GLOBUS_LOCATION=/ldcg/ldg/globus
    export LD_LIBRARY_PATH=/ldcg/ldg/globus/lib
  • gridlaunch "/archive/home/dietz/Install/vds/bin/kickstart"
    This line specifies the location of the kickstart executable that is used to start the job on the remote site.
  • lrc "rlsn://ldas-cit.ligo.caltech.edu"
    This line specifies the LRC to be used.
  • gridftp "gsiftp://ldas-grid.ligo.caltech.edu/archive/home/dietz/pegasus" "2.2.4"
    This line specifies the gridftp server and points to the permanent storage location available on this site.
  • workdir "/archive/home/dietz/pegasus"
    Specifies the working directory on the remote site.

  • universe transfer "ldas-grid.ligo.caltech.edu/jobmanager-fork" "2.2.4"
    universe vanilla "ldas-grid.ligo.caltech.edu/jobmanager-condor" "2.2.4"
    These lines specify which jobmanager to use in each universe (the transfer universe is used to transfer data, the vanilla universe to run the jobs).
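The equivalence between the profile env lines and shell export commands can be sketched mechanically. The bash function below is a minimal sketch, not a VDS tool: profile_env_exports is a hypothetical name, and it assumes the one-entry-per-line layout of the cit example, with values that contain no spaces or quotes.

```bash
#!/usr/bin/env bash
# profile_env_exports FILE SITE: print "export NAME=VALUE" for every
# 'profile env "NAME" "VALUE"' line inside "pool SITE { ... }" of a sites.txt file.
# Sketch only: assumes one entry per line and values without spaces or quotes.
profile_env_exports() {
  awk -v site="$2" '
    $1 == "pool" && $2 == site { inpool = 1; next }   # start of the wanted pool
    inpool && $1 == "}"        { inpool = 0 }         # end of that pool
    inpool && $1 == "profile" && $2 == "env" {
      split($0, q, "\"")                             # q[2] = NAME, q[4] = VALUE
      printf "export %s=%s\n", q[2], q[4]
    }
  ' "$1"
}
```

For the cit example above, `profile_env_exports sites.txt cit` prints the two export commands for GLOBUS_LOCATION and LD_LIBRARY_PATH; entries belonging to other pools are ignored.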


Details on the file tc.data

In this section I will explain the entries of the file tc.data. tc stands for transformation catalog, where a transformation is essentially an executable. The file consists of six columns, whose names are given in the following table:


SiteID LogicalTX PhysicalTX Type SystemInfo Profiles

  1. SiteID: This is an identifier for the site as specified in the sites.txt file. It is the name of a site on which the executable is installed or from which it is available via GridFTP or HTTP.
  2. LogicalTX: This is the logical name (LFN) of the transformation (executable). This name is written in the format:
NAMESPACE::NAME:VERSION

where VERSION does not refer to the actual version of the executable. Example:

ligo::lalapps_inspiral:1.0
  3. PhysicalTX: This is the physical file name (PFN) of the transformation (executable). It is either a full path to the executable (if it is installed on the remote site) or a GridFTP URL pointing to the executable on another site. Examples:
/archive/home/dietz/LAL/bin/lalapps_inspiral
gsiftp://ldas-grid.ligo.caltech.edu/archive/home/dietz/LAL/bin/lalapps_inspiral
  4. Type: This specifies the type of the transformation (executable). At present two types are supported:
INSTALLED: the executable is installed on the remote site
STATIC_BINARY: the executable is a static binary that can be transferred from another site
  5. SystemInfo: This specifies the architecture, OS and glibc version for which the transformation is compiled. The default is INTEL32::LINUX.
  6. Profiles: The profiles for a transformation can be defined in the format:
NAMESPACE::KEY="VALUE"

where VALUE must be enclosed in double quotes.
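As an illustration of the LogicalTX naming, the bash sketch below splits a logical transformation name into namespace, name and version. It assumes the single-colon version separator used in the tc.data example that follows (ligo::lalapps_inspiral:1.0); tc_split_name is a hypothetical helper, not part of VDS or Pegasus.

```bash
#!/usr/bin/env bash
# tc_split_name NAME: split a logical transformation name of the form
# NAMESPACE::NAME:VERSION into its three parts. Missing parts are printed empty.
tc_split_name() {
  local full="$1" ns="" rest name version=""
  if [[ "$full" == *::* ]]; then
    ns="${full%%::*}"      # everything before the first "::"
    rest="${full#*::}"     # everything after the first "::"
  else
    rest="$full"           # no namespace given
  fi
  name="${rest%%:*}"       # everything before the version separator
  if [[ "$rest" == *:* ]]; then
    version="${rest#*:}"   # everything after the first ":"
  fi
  printf 'namespace=%s name=%s version=%s\n' "$ns" "$name" "$version"
}

tc_split_name ligo::lalapps_inspiral:1.0
# prints: namespace=ligo name=lalapps_inspiral version=1.0
```

Names without a namespace or version, such as transfer, come back with those fields left empty.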

The following shows an example of tc.data. The first two lines give the locations of the transfer and dirmanager executables on the remote machine, the third line specifies the RLS client, which is always required, and the remaining lines specify the locations of the LALApps executables.

cit   transfer                    /archive/home/dietz/Install/vds/bin/transfer                      INSTALLED INTEL32::LINUX vds::bundle_stagein=1
cit   dirmanager                  /archive/home/dietz/Install/vds/bin/dirmanager                    INSTALLED INTEL32::LINUX
local RLS_Client                  /opt/ldg-3.5/vds/bin/rls-client                                   INSTALLED INTEL32::LINUX
local ligo::lalapps_tmpltbank:1.0 gsiftp://dietz.phys.lsu.edu/home/alex/Executables/lalapps_tmpltbank STATIC_BINARY INTEL32::LINUX
local ligo::lalapps_inspiral:1.0  gsiftp://dietz.phys.lsu.edu/home/alex/Executables/lalapps_inspiral  STATIC_BINARY INTEL32::LINUX
local ligo::lalapps_inca:1.0      gsiftp://dietz.phys.lsu.edu/home/alex/Executables/lalapps_inca      STATIC_BINARY INTEL32::LINUX
local ligo::lalapps_thinca:1.0    gsiftp://dietz.phys.lsu.edu/home/alex/Executables/lalapps_thinca    STATIC_BINARY INTEL32::LINUX
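Since tc.data is a whitespace-separated table, looking up where a transformation lives on a given site is a one-line awk filter. This is a minimal sketch under the assumption that no column contains embedded spaces, as in the example above; tc_lookup is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# tc_lookup FILE SITE LOGICAL_TX: print the PhysicalTX and Type columns of every
# tc.data entry whose SiteID is SITE and whose LogicalTX is LOGICAL_TX.
# Sketch only: assumes whitespace-separated columns with no spaces inside a column.
tc_lookup() {
  awk -v site="$2" -v tx="$3" '$1 == site && $2 == tx { print $3, $4 }' "$1"
}
```

With the example file, `tc_lookup tc.data local ligo::lalapps_inspiral:1.0` prints the gsiftp URL of lalapps_inspiral followed by STATIC_BINARY, and an empty result means the transformation is not registered for that site.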



FAQ

This incomplete section summarizes some frequently asked questions about problems encountered when running on the grid.

Problems while creating a concrete DAG with gencdag:
  • A different sites.xml file is used instead of the one specified in the properties file:
Make sure that the VDS_HOME variable is set correctly, then unset CLASSPATH and source the setup script again:
echo ${VDS_HOME}
unset CLASSPATH
source ${VDS_HOME}/setup-user-env.sh
  • Error: "Can't determine an location to transfer input file for lfn"
Check the path to the RLS server, make sure that only absolute path names are used, and verify that the file exists.
  • Error: "Could not authenticate against any site. Probably your credentials were not generated or have expired"
Check that your credential is valid (for example with grid-proxy-info). Also check that the credential is located in the directory /etc/grid-security/. Try to run a simple Globus job like this:
globus-job-run hydra.phys.uwm.edu/jobmanager-condor -l /bin/hostname
  • Error: "java.lang.OutOfMemoryError"
Set the following variables:
export VDS_JAVA_HEAPMIN=512
export VDS_JAVA_HEAPMAX=1024
  • Error: "org.globus.replica.rls.RLSException: IO timeout: globus_io_register_read() timed out after 30 seconds"
Check that the RLS server is up and running.
Problems while the DAG is running:
  • There is no out-file returned to my local machine, so I cannot check the status of the job:
This may be caused by a well-known problem with NFS-mounted file systems, and it tends to occur when the head node of the remote machine is heavily loaded. Other causes are also possible.
  • Jobs seem to run for a very long time, but nothing happens:
This can happen when kickstart cannot be accessed, so check the path to kickstart on the remote machine (the gridlaunch entry in sites.txt).
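One way to see which kickstart path is configured for a site is to read the gridlaunch entry back out of sites.txt. The bash sketch below does this; site_field is a hypothetical helper (not part of VDS or Pegasus), and it assumes the simple one-entry-per-line layout of the sites.txt example rather than the full file grammar.

```bash
#!/usr/bin/env bash
# site_field FILE SITE KEY: print the first quoted value of KEY inside the
# "pool SITE { ... }" block of a sites.txt file (e.g. KEY = gridlaunch).
# Sketch only: assumes one entry per line, as in the sites.txt example.
site_field() {
  awk -v site="$2" -v key="$3" '
    $1 == "pool" && $2 == site { inpool = 1; next }   # start of the wanted pool
    inpool && $1 == "}"        { inpool = 0 }         # end of that pool
    inpool && $1 == key {
      split($0, q, "\"")                             # q[2] is the first quoted value
      print q[2]
      exit
    }
  ' "$1"
}
```

For the cit example, `site_field sites.txt cit gridlaunch` prints /archive/home/dietz/Install/vds/bin/kickstart, which can then be compared with the actual location of kickstart on the remote machine. The same call with workdir or gridftp returns those entries.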


Where to go for help

If you have further questions or problems, consult the full documentation or write to the griphynligo mailing list.

Supported by the National Science Foundation. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).