Documentation of Pegasus related issues
Acronyms and terms
This section gives an overview of common acronyms and terms used in the Pegasus documentation.
- LFN Logical FileName:
The name of a file without regard to where it is located or any specific path. Within the LIGO Data Grid we strongly encourage LFNs to be unique. Example:
H2-TMPLTBANK-755943365-2048.xml
- PFN Physical FileName:
The name of a specific file having a specific path, usually a URL. A file often has multiple PFNs. Example:
file:/home/alex/H2-TMPLTBANK-755943365-2048.xml
file://localhost/home/alex/H2-TMPLTBANK-755943365-2048.xml
gsiftp://dietz.phys.lsu.edu/home/alex/H2-TMPLTBANK-755943365-2048.xml
In the example above all 3 URLs or PFNs point to the same LFN and the same file residing on a disk. In the following example
gsiftp://dietz.phys.lsu.edu/home/alex/H2-TMPLTBANK-755943365-2048.xml
gsiftp://hydra.phys.uwm.edu/home/alex/H2-TMPLTBANK-755943365-2048.xml
the two URLs or PFNs point to the same LFN, but now each is found on a different filesystem located in a different part of the U.S.
- LRC Local Replica Catalog:
This is a catalog that contains the mapping of logical filenames (LFN) to physical filenames (PFN).
- RLS Replica Location Service:
This service describes the set of RLI and one or more LRCs.
- Globus: Globus is an open source software toolkit used for building Grid systems and applications. It is used to transfer files between different locations and to run executables remotely on different sites. This service is actually used when running a job on a remote site.
- Pegasus: Pegasus is a flexible framework that enables the mapping of complex scientific workflows onto the grid. It is used to check the different sites, to check where the files and executables are available and from where they have to be transferred, and finally to create a concrete DAG that can be started by Condor.
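The relationship between LFNs, PFNs and a replica catalog described above can be illustrated with a small Python sketch. It is only a toy model, not the RLS or LRC interface: the names lfn_from_pfn and ToyReplicaCatalog are hypothetical, and it assumes that the LFN is simply the last path component of each PFN, as in the examples above.

```python
import posixpath
from collections import defaultdict
from urllib.parse import urlparse


def lfn_from_pfn(pfn: str) -> str:
    """Return the last path component of a PFN (assumed here to be the LFN)."""
    return posixpath.basename(urlparse(pfn).path)


class ToyReplicaCatalog:
    """In-memory stand-in for an LRC: maps each LFN to its known PFNs."""

    def __init__(self) -> None:
        self._replicas = defaultdict(list)

    def register(self, pfn: str) -> None:
        """Record a PFN under the LFN derived from it, skipping duplicates."""
        lfn = lfn_from_pfn(pfn)
        if pfn not in self._replicas[lfn]:
            self._replicas[lfn].append(pfn)

    def lookup(self, lfn: str) -> list:
        """Return a copy of all PFNs registered for the given LFN."""
        return list(self._replicas.get(lfn, []))


catalog = ToyReplicaCatalog()
for pfn in [
    "file:/home/alex/H2-TMPLTBANK-755943365-2048.xml",
    "gsiftp://dietz.phys.lsu.edu/home/alex/H2-TMPLTBANK-755943365-2048.xml",
    "gsiftp://hydra.phys.uwm.edu/home/alex/H2-TMPLTBANK-755943365-2048.xml",
]:
    catalog.register(pfn)

# One LFN, several PFNs on different hosts.
replicas = catalog.lookup("H2-TMPLTBANK-755943365-2048.xml")
for pfn in replicas:
    print(pfn)
```

A real LRC is queried over the network and may hold replicas whose file names differ from the LFN; the basename rule here is purely an illustration.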
Details on the file sites.txt
In this section the contents of the sites.txt file are explained in more detail. This file is used to give information on the different sites (clusters) that are used, such as the path to the working directory, paths to needed libraries, and the GridFTP server.
This config file has to be transformed into an XML version by using genpoolconfig. Each site is described by an entry of the following form:
pool siteid {
profile namespace "key" "value"
gridlaunch "path_to_$VDS_HOME/bin/kickstart"
lrc "URL_to_Replica_catalog"
gridftp "GridFTP-server_URL_to_storage_location" "GT_Version"
workdir "base_path_to_working_directory"
universe type "jobmanager_URL" "GT_Version"
}
Example:
pool cit {
profile env "GLOBUS_LOCATION" "/ldcg/ldg/globus"
profile env "LD_LIBRARY_PATH" "/ldcg/ldg/globus/lib"
gridlaunch "/archive/home/dietz/Install/vds/bin/kickstart"
lrc "rlsn://ldas-cit.ligo.caltech.edu"
gridftp "gsiftp://ldas-grid.ligo.caltech.edu/archive/home/dietz/pegasus" "2.2.4"
workdir "/archive/home/dietz/pegasus"
universe transfer "ldas-grid.ligo.caltech.edu/jobmanager-fork" "2.2.4"
universe vanilla "ldas-grid.ligo.caltech.edu/jobmanager-condor" "2.2.4"
}
The following is a short explanation of each of those lines:
- profile env "GLOBUS_LOCATION" "/ldcg/ldg/globus"
profile env "LD_LIBRARY_PATH" "/ldcg/ldg/globus/lib"
These lines specify environment variables on the remote cluster. In this example they are equivalent to the commands:
export GLOBUS_LOCATION=/ldcg/ldg/globus
export LD_LIBRARY_PATH=/ldcg/ldg/globus/lib
- gridlaunch "/archive/home/dietz/Install/vds/bin/kickstart"
This line specifies the location of the kickstart executable that is used to start the job on the remote site.
- lrc "rlsn://ldas-cit.ligo.caltech.edu"
This line specifies the LRC to be used.
- gridftp "gsiftp://ldas-grid.ligo.caltech.edu/archive/home/dietz/pegasus" "2.2.4"
This line specifies the GridFTP server and points to the permanent storage location available on this site.
- workdir "/archive/home/dietz/pegasus"
Specifies the working directory on the remote site.
- universe transfer "ldas-grid.ligo.caltech.edu/jobmanager-fork" "2.2.4"
- universe vanilla "ldas-grid.ligo.caltech.edu/jobmanager-condor" "2.2.4"
These lines specify which jobmanagers to use when running in the different universes (the transfer universe is used to transfer data, the vanilla universe is used to run the jobs).
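To make the structure of a pool entry concrete, the following Python sketch reads the example entry from this section into a dictionary. It is not part of VDS or Pegasus and does not replace genpoolconfig; the function name parse_pools is hypothetical, and the sketch assumes one keyword per line with arguments separated by whitespace and optionally quoted, as in the example.

```python
import shlex

SITES_TXT = '''
pool cit {
  profile env "GLOBUS_LOCATION" "/ldcg/ldg/globus"
  profile env "LD_LIBRARY_PATH" "/ldcg/ldg/globus/lib"
  gridlaunch "/archive/home/dietz/Install/vds/bin/kickstart"
  lrc "rlsn://ldas-cit.ligo.caltech.edu"
  gridftp "gsiftp://ldas-grid.ligo.caltech.edu/archive/home/dietz/pegasus" "2.2.4"
  workdir "/archive/home/dietz/pegasus"
  universe transfer "ldas-grid.ligo.caltech.edu/jobmanager-fork" "2.2.4"
  universe vanilla "ldas-grid.ligo.caltech.edu/jobmanager-condor" "2.2.4"
}
'''


def parse_pools(text):
    """Parse 'pool <siteid> { ... }' blocks into {siteid: {keyword: args}}."""
    pools = {}
    current = None
    for raw in text.splitlines():
        tokens = shlex.split(raw)  # splits on whitespace, strips the quotes
        if not tokens:
            continue
        if tokens[0] == "pool" and tokens[-1] == "{":
            # Keywords that may occur several times are collected in lists.
            current = pools.setdefault(tokens[1], {"profile": [], "universe": []})
        elif tokens[0] == "}":
            current = None
        elif current is not None:
            keyword, args = tokens[0], tokens[1:]
            if keyword in ("profile", "universe"):
                current[keyword].append(tuple(args))
            else:
                current[keyword] = args
    return pools


pools = parse_pools(SITES_TXT)
cit = pools["cit"]

# Print the environment profiles as the equivalent shell commands.
for namespace, key, value in cit["profile"]:
    if namespace == "env":
        print(f"export {key}={value}")
```

Storing profile and universe lines as lists reflects that they may appear more than once per pool, while the other keywords appear once in the example.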
Details on the file tc.data
In this section I will explain the entries of the file tc.data. tc stands for transformation catalog, where a transformation is basically an executable. The file consists of six columns, whose names are given in the following table:
| SiteID | LogicalTX | PhysicalTX | Type | SystemInfo | Profiles |
- SiteID: This is an identifier for the site as specified in the sites.txt file. It is the name of a site on which the executable is installed or available via GridFTP or http.
- LogicalTX: This is the logical name (LFN) of the transformation (executable). This name is written in the format:
namespace::name:version
where version does not refer to the actual version of the executable. Example:
ligo::lalapps_inspiral:1.0
- PhysicalTX: This is the physical file name (PFN) of the transformation (executable). It is either a full path to that executable (if installed on the remote site) or a GridFTP path leading to the executable on another site. Example of a GridFTP path:
gsiftp://ldas-grid.ligo.caltech.edu/archive/home/dietz/LAL/bin/lalapps_inspiral
- Type: This specifies the type of the transformation (executable). At this time two types are supported:
INSTALLED: If the executable is installed on the remote site
STATIC_BINARY: If the executable can be transferred as a static binary from another site
- SystemInfo: This parameter contains the architecture, the OS and the glibc version for which the transformation is compiled. The default is INTEL32::LINUX.
- Profiles: The profiles for a transformation can be defined in the format:
namespace::key="value"
where double quotes are to be used for the value.
Example of a tc.data file:
cit transfer /archive/home/dietz/Install/vds/bin/transfer INSTALLED INTEL32::LINUX vds::bundle_stagein=1
cit dirmanager /archive/home/dietz/Install/vds/bin/dirmanager INSTALLED INTEL32::LINUX
local RLS_Client /opt/ldg-3.5/vds/bin/rls-client INSTALLED INTEL32::LINUX
local ligo::lalapps_tmpltbank:1.0 gsiftp://dietz.phys.lsu.edu/home/alex/Executables/lalapps_tmpltbank STATIC_BINARY INTEL32::LINUX
local ligo::lalapps_inspiral:1.0 gsiftp://dietz.phys.lsu.edu/home/alex/Executables/lalapps_inspiral STATIC_BINARY INTEL32::LINUX
local ligo::lalapps_inca:1.0 gsiftp://dietz.phys.lsu.edu/home/alex/Executables/lalapps_inca STATIC_BINARY INTEL32::LINUX
local ligo::lalapps_thinca:1.0 gsiftp://dietz.phys.lsu.edu/home/alex/Executables/lalapps_thinca STATIC_BINARY INTEL32::LINUX
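As an illustration of the six-column layout, the following Python sketch splits tc.data lines into named fields. It is a minimal sketch and not the parser used by Pegasus: the names TCEntry and parse_tc_line are hypothetical, and it assumes that columns are separated by whitespace, that the Profiles column is optional, and that no column contains spaces.

```python
from typing import NamedTuple, Optional


class TCEntry(NamedTuple):
    """One row of tc.data, using the column names from the table above."""
    site_id: str
    logical_tx: str
    physical_tx: str
    tx_type: str
    system_info: str
    profiles: Optional[str]


def parse_tc_line(line: str) -> TCEntry:
    """Split one tc.data line into its five or six columns."""
    fields = line.split()
    if len(fields) not in (5, 6):
        raise ValueError(f"expected 5 or 6 columns, got {len(fields)}: {line!r}")
    profiles = fields[5] if len(fields) == 6 else None
    return TCEntry(*fields[:5], profiles)


TC_DATA = """\
cit transfer /archive/home/dietz/Install/vds/bin/transfer INSTALLED INTEL32::LINUX vds::bundle_stagein=1
cit dirmanager /archive/home/dietz/Install/vds/bin/dirmanager INSTALLED INTEL32::LINUX
local ligo::lalapps_inspiral:1.0 gsiftp://dietz.phys.lsu.edu/home/alex/Executables/lalapps_inspiral STATIC_BINARY INTEL32::LINUX
"""

entries = [parse_tc_line(line) for line in TC_DATA.splitlines() if line.strip()]

# List the executables that have to be staged in from another site.
for entry in entries:
    if entry.tx_type == "STATIC_BINARY":
        print(entry.site_id, entry.logical_tx, entry.physical_tx)
```

A profile value that contains spaces inside its double quotes would need a quote-aware split instead of str.split.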
FAQ
This section is an incomplete section that summarizes some frequently asked questions about problems with running on the grid.
Problems while creating a concrete DAG with gencdag:
- Some other sites.xml file is used, not the file that is specified in the properties file!
echo ${VDS_HOME}
unset CLASSPATH
source ${VDS_HOME}/setup-user-env.sh
- Error: "Can't determine an location to transfer input file for lfn"
- Error: "Could not authenticate against any site. Probably your credentials were not generated or have expired"
globus-job-run hydra.phys.uwm.edu/jobmanager-condor -l /bin/hostname
- Error: "java.lang.OutOfMemoryError"
export VDS_JAVA_HEAPMIN=512
export VDS_JAVA_HEAPMAX=1024
- Error: "org.globus.replica.rls.RLSException: IO timeout: globus_io_register_read() timed out after 30 seconds"
- There is no out-file returned to my local machine, so I cannot check the status of the job:
- Jobs seem to run for a very long time, but nothing happens:
Where to go for help
If you have further questions or problems, you can take a look at the full documentation or mail the griphynligo mailing list.