LSC Data Grid

Results of OSG Demo

Summary of Problems

Authentication

  • Several pools failed the gsiftp test with the error "Timeout experienced while reading from ip stream".
  • Most pools rejected my DN when I tried to authenticate with their job manager, leaving 13 working pools.
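The gsiftp timeouts above can be separated from authentication problems with a plain TCP probe before running the full test. The following is a minimal sketch, not part of the original test harness: the function name `port_reachable` and the 10 second default timeout are assumptions, and port 2811 is taken from the error messages in the notes below. It only checks that the port accepts a connection; it does not perform any GSI handshake.

```python
import socket


def port_reachable(host, port=2811, timeout=10.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper: it tests network reachability only, not GSI authentication.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Timeouts, refused connections and DNS failures are all OSError subclasses.
        return False
```

A pool whose gsiftp port fails this probe is likely blocked or down, whereas a pool that passes but still fails the gsiftp test points to a problem at a higher layer.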

Pegasus/VDS

  • The VDS program exitcode treats an empty output file as a success. This is a bug: several failed jobs created empty .out files, and exitcode then incorrectly told DAGMan that those jobs had succeeded.
  • The job inspiral_0_UTA_DPCC_cdir failed Globus submission. Condor retried it several times and then aborted it, creating an empty .out file.
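One way to guard against the empty-file problem is a stricter check in front of the POST script's verdict. The sketch below is an assumption about how such a guard could look, not the VDS exitcode implementation: `strict_status` is a hypothetical name, and it only inspects the file size, whereas the real tool also parses the contents of a non-empty .out file.

```python
import os


def strict_status(out_path, reported_status):
    """Return the status to hand to DAGMan, treating a missing or empty
    output file as a failure instead of a success.

    Hypothetical wrapper: `reported_status` is whatever the existing
    POST script would have returned for this job.
    """
    if reported_status != 0:
        # The existing check already failed; keep its status.
        return reported_status
    try:
        size = os.path.getsize(out_path)
    except OSError:
        # The output file does not exist or cannot be read.
        return 1
    if size == 0:
        # Zero-length output: the job never wrote a result record.
        return 1
    return 0
```

With a guard like this, a job such as inspiral_0_UTA_DPCC_cdir, which was aborted and left a zero-length .out file, would be reported to DAGMan as failed.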

Data transfer

  • Several sites failed to transfer data from UWM and CIT to the local pools, possibly because of firewall issues.

Running Job

  • The ATLAS sites accepted the jobs, but the jobs disappeared into the void (possibly into a long queue?).
  • Several sites (including UWM) gave the error "No such file or directory" when they tried to execute the tmpltbank job.

Pool Name | gencdag | auth test | make workdir | stage data | run tmpltbank
PROD_SLAC (7)
BNL_ATLAS_1 (27)
BNL_ATLAS_2 (27)
Purdue_ITaP (2)
GRASE_CCR_U2 (13)
NERSC_PDSF (9)
USCMS_FNAL_WC1_CE (10)
UCSandiegoOSG_Prod (11)
IU_ATLAS_Tier2 (12)
OSG_LIGO_PSU (22)
UWMilwaukee (24)
UTA_DPCC (14)
CIT_CMS_PG (4)
UFlorida_PG (1)
GRASE_CCR_ACDC (23)
Purdue_Physics (17)
Nebraska (25)
agt_bu_edu (21)
TTU_ANTAEUS (6)
GRASE_BINGHAMTON (18)
FNAL_GPFARM (5)
OUHEP_OSG (26)
ASCC_OSG (8)
GRASE_CCR_MAMA (14)
GRASE_ALBANY (3)
UIOWA_OSG_PROD (19)
FNAL_FERMIGRID (15)
UC_ATLAS_Tier2 (27)
FNAL_DDS2 (16)
  • (1) Timeout experienced while reading from ip stream of ufloridapg.phys.ufl.edu:2811
  • (2) Could not authenticate against jobmanager osg.rcac.purdue.edu/jobmanager-condor because Authentication with the remote server failed
  • (3) Could not authenticate against jobmanager grid.rit.albany.edu/jobmanager-condor because Authentication with the remote server failed
  • (4) Could not authenticate against jobmanager tier2b.cacr.caltech.edu/jobmanager-condor because Authentication with the remote server failed
  • (5) Could not authenticate against jobmanager fngp-osg.fnal.gov/jobmanager-condor because Authentication with the remote server failed
  • (6) Timeout experienced while reading from ip stream of antaeus.hpcc.ttu.edu:2811
  • (7) Could not authenticate against jobmanager osgserv01.slac.stanford.edu/jobmanager-lsf because Authentication with the remote server failed
  • (8) Could not authenticate against jobmanager osgc01.grid.sinica.edu.tw/jobmanager-condor because Authentication with the remote server failed
  • (9) Could not authenticate against jobmanager pdsfgrid2.nersc.gov/jobmanager-sge because Authentication with the remote server failed
  • (10) Could not authenticate against jobmanager cmsosgce.fnal.gov/jobmanager-condor because Authentication with the remote server failed
  • (11) Could not authenticate against jobmanager t2cms02.sdsc.edu/jobmanager-condor because Authentication with the remote server failed
  • (12) Timeout experienced while reading from ip stream of atlas.iu.edu:2811
  • (13) Could not authenticate against jobmanager u2-grid.ccr.buffalo.edu/jobmanager-fork because Authentication with the remote server failed
  • (14) Event: ULOG_GLOBUS_SUBMIT_FAILED for Condor Job inspiral_0_UTA_DPCC_cdir (1326.0)
    Event: ULOG_JOB_ABORTED for Condor Job inspiral_0_UTA_DPCC_cdir (1326.0)
    Running POST script of Job inspiral_0_UTA_DPCC_cdir...
    POST Script of Job inspiral_0_UTA_DPCC_cdir completed successfully.
    2005.07.21 10:20:48.238 CDT: [app] will use /Users/dbrown/projects/grid/vds/vds-1.3.6/etc/iv-1.4.xsd
    2005.07.21 10:20:48.254 CDT: [app] file has zero length inspiral_0_UTA_DPCC_cdir.out, assuming success
    2005.07.21 10:20:48.263 CDT: [app] exit status = 0
  • (15) Could not authenticate against jobmanager fermigrid1.fnal.gov/jobmanager-mis because Authentication with the remote server failed
  • (16) Could not authenticate against jobmanager cmsp4.fnal.gov/jobmanager-condor because Authentication with the remote server failed
  • (17) Could not authenticate against jobmanager grid.physics.purdue.edu/jobmanager-mis because Authentication with the remote server failed
  • (18) POST Script of Job rc_tx_GRASE_BINGHAMTON_0 failed with status 1
  • (19) POST Script of Job rc_tx_UIOWA_OSG_PROD_0 failed with status 1
  • (20) Running POST script of Job lalapps_tmpltbank_ID000113...
    Event: ULOG_POST_SCRIPT_TERMINATED for Condor Job lalapps_tmpltbank_ID000113 (1356.0)
    POST Script of Job lalapps_tmpltbank_ID000113 completed successfully.
    2005.07.21 08:46:46.584 PDT: [app] will use /archive/home/dbrown/projects/grid/vds/vds/etc/iv-1.4.xsd
    2005.07.21 08:46:46.587 PDT: [app] file has zero length lalapps_tmpltbank_ID000113.out, assuming success
    2005.07.21 08:46:46.587 PDT: [app] exit status = 0
  • (21) POST Script of Job lalapps_tmpltbank_ID000153 failed with status 2
  • (22) POST Script of Job rc_tx_OSG_LIGO_PSU_0 failed with status 1
  • (23) POST Script of Job rc_tx_GRASE_CCR_ACDC_0 failed with status 1
  • (24) POST Script of Job lalapps_tmpltbank_ID000221 failed with status 2
  • (25) Event: ULOG_GLOBUS_SUBMIT_FAILED for Condor Job lalapps_tmpltbank_ID000057 (1433.0)
  • (26) Event: ULOG_JOB_TERMINATED for Condor Job lalapps_tmpltbank_ID000037 (1423.0)
    Job lalapps_tmpltbank_ID000037 completed successfully.
    Running POST script of Job lalapps_tmpltbank_ID000037...
    ULOG_POST_SCRIPT_TERMINATED for Condor Job lalapps_tmpltbank_ID000037 (1423.0)
    POST Script of Job lalapps_tmpltbank_ID000037 completed successfully.
    2005.07.21 09:38:16.318 PDT: [app] will use /archive/home/dbrown/projects/grid/vds/vds/etc/iv-1.4.xsd
    2005.07.21 09:38:16.321 PDT: [app] file has zero length lalapps_tmpltbank_ID000037.out, assuming success
    2005.07.21 09:38:16.322 PDT: [app] exit status = 0
  • (27) No apparent job failure, but no successful return. The jobs may have been queued at the sites, and I killed them before they got CPU time.
    000 (1468.000.000) 07/21 09:04:05 Job submitted from host: <131.215.115.58:53691>
        pool:BNL_ATLAS_1
    ...
    017 (1468.000.000) 07/21 09:04:15 Job submitted to Globus
        RM-Contact: gridgk01.racf.bnl.gov/jobmanager-condor
        JM-Contact: https://gridgk01.racf.bnl.gov:20001/26629/1121961849/
        Can-Restart-JM: 1
    ...
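The notes above follow a few recurring log patterns, so they can be tallied mechanically. The sketch below is an illustration only: the function `classify` and the labels `post_failed`, `submit_failed`, and `empty_out` are hypothetical, and the regular expressions are derived solely from the log lines quoted in this page, so other DAGMan or VDS versions may word these messages differently.

```python
import re

# Patterns copied from the message wording quoted in the notes above.
PATTERNS = (
    ("post_failed", re.compile(r"POST Script of Job (\S+) failed with status \d+")),
    ("submit_failed", re.compile(r"ULOG_GLOBUS_SUBMIT_FAILED for Condor Job (\S+)")),
    ("empty_out", re.compile(r"file has zero length (\S+?)\.out, assuming success")),
)


def classify(lines):
    """Return a dict mapping each job name to the set of problem labels
    found for it in the given log lines."""
    problems = {}
    for line in lines:
        for label, pattern in PATTERNS:
            match = pattern.search(line)
            if match:
                problems.setdefault(match.group(1), set()).add(label)
    return problems
```

Jobs labelled `empty_out` without `post_failed` are the ones for which a zero-length output file was counted as a success.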
    
Supported by the National Science Foundation. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).