Updated LDR planning document from Jan 07 meeting
These notes build on the
previous LDR planning document from the April 23, 2004 meeting.
This revision is based largely on discussion among the attendees of the January 2007 meeting
at Caltech.
A Note about LDR's Core Mission and Requests for Enhancements and
Features
The group agreed that LDR's core mission needs to be replicating LIGO and GEO data
sets from source sites to analysis sites quickly and robustly. The highest priority has to be
making LDR as robust and fault tolerant as possible so as to require as little intervention
from administrators as is possible, and no functionality or enhancement should be
contemplated if doing so detracts from this goal.
Opening Up LDR?
There once existed pressure to "open LDR" so that LDR is not only used by administrators
to replicate bulk data such as interferometer data and SFTs, but also to allow any LSC user
to publish any files into LDR and have them replicated (if so desired) to other LSC sites.
This use of LDR was discussed at a previous meeting. There is much concern that the
requirements of replicating user data and bulk "collaboration" data are different enough
that a second product, similar to LDR but specialized for users, should be developed
and deployed. At the same time it is recognized that the LSC probably doesn't have the
FTEs available to develop and deploy such a product and it might have to rely on LDR
for this task.
This remains an idea to be thought about...
Speeding up the transfer of small files
Due to the fundamental limitations of TCP, transferring small (less than 100
MB) files over the WANs we use is much less efficient than transferring large files. The
difference in transfer rates can easily be a factor of ten.
The solution to this problem appears to be pipelining in GridFTP. Details
here: http://www-unix.mcs.anl.gov/~bresnaha/LOSF/
This work should appear in pyGlobus soon and will then be easily harnessed by
LDR.
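
A rough way to see where the factor of ten comes from is a toy model that lumps everything paid per file (control-channel round trips, TCP ramp-up) into one fixed cost. The sketch below is only that model; the 10 MB/s link rate and 2 s overhead are assumed, illustrative numbers, not measurements, and treating pipelining as "pay the fixed cost once per batch" is a simplification.

```python
def effective_rate(size_mb, link_mb_per_s, overhead_s):
    """Effective throughput (MB/s) when each file pays a fixed per-file overhead."""
    return size_mb / (overhead_s + size_mb / link_mb_per_s)


def pipelined_rate(n_files, size_mb, link_mb_per_s, overhead_s):
    """Same model, but the fixed cost is paid once for a batch of n_files."""
    total_mb = n_files * size_mb
    return total_mb / (overhead_s + total_mb / link_mb_per_s)


# Assumed, illustrative numbers (not measured on any LDR link).
LINK = 10.0      # achievable WAN rate in MB/s
OVERHEAD = 2.0   # fixed per-file cost in seconds

small = effective_rate(2.0, LINK, OVERHEAD)         # one 2 MB file
large = effective_rate(1000.0, LINK, OVERHEAD)      # one 1 GB file
batch = pipelined_rate(500, 2.0, LINK, OVERHEAD)    # 500 small files, cost paid once
```

With these assumed numbers the small file moves at under 1 MB/s while the large file and the pipelined batch both approach the link rate.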
Authorization for data access at the LDR level
At this time authentication within LDR is done with Globus GSI but authorization is
handled only at the UNIX filesystem level. A `datarobot' agent from a site such as UWM
uses an X.509 certificate and GSI proxy to authenticate to a GridFTP server and then the
certificate is mapped to a local UNIX account such as `uwmrobot'. If the `uwmrobot'
account has the correct UNIX permissions to access data then access is authorized.
Ideally authorization to access various data sets should also be handled within the GSI
security infrastructure. A `datarobot' presenting a certificate from UWM for example
could be given access to only certain data sets and not to others without any dependence
on the details of the underlying filesystem.
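
As a minimal sketch of what a per-data-set decision could look like once the certificate has been authenticated, the check below maps a certificate subject to the collections it may read. The policy table, the subject string, and the function name are all hypothetical; this is not an existing GSI or LDR interface.

```python
# Hypothetical policy table: certificate subject (DN) -> collections it may read.
# The DN below is made up for illustration.
POLICY = {
    "/O=Example/OU=UWM/CN=datarobot": {"S5 RDS L3 LHO"},
}


def is_authorized(subject_dn, collection, policy=POLICY):
    """Return True if the already-authenticated subject may access the collection."""
    return collection in policy.get(subject_dn, set())
```

The point of the sketch is only that the decision depends on the certificate and the collection name, not on UNIX permissions.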
A large amount of research is being done within the Grid community to make
authorization via GSI a reality and there are a number of products available for us to test.
Integrating this into the LDR architecture and code base, however, would require a large
effort. At this time we do not see this as necessary for S4-S5.
Smooth upgrades with migration of state in MySQL
This is mostly the case. Some scripts should be written to make this even easier.
"Green/Red light at-a-glance" LDR status web page
The green/red light pages used in the control rooms at LHO and LLO and also at
CIT are popular and something similar should be available for LDR in order for admins
to quickly determine if LDR problems exist. We did not have time to draft a particular
design but did discuss some of the information that might be incorporated on such a
page. In general such a page should be extensible and it should be easy for new types
of information to be added.
Nagios (NRPE?) service at LDR sites to let us monitor LDR.
Gauge for percentage of collections transferred
There should be a simple way for administrators to determine what fraction of
a collection (such as `S5 RDS L3 LHO') has been replicated. This information
should probably be displayed on the green/red-light web page mentioned above, but it
should probably also be available in other ways, either from the command line or ?
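
The gauge itself is a small calculation once the two file lists are in hand. A minimal sketch, assuming the collection's logical file names and the locally replicated names can each be obtained as a list (how LDR would supply them is not addressed here; the function name is hypothetical):

```python
def fraction_replicated(collection_files, local_files):
    """Fraction of a collection's logical file names that are present locally."""
    wanted = set(collection_files)
    if not wanted:
        return 1.0  # an empty collection is trivially complete
    return len(wanted & set(local_files)) / len(wanted)
```

The same function could feed both the status web page and a command-line tool.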
Display transfer rates from sites / Archive transfer rate information
It would be helpful to be able to determine at a glance what transfer rates a
site is achieving when replicating data from other LDR sites. Most likely this
information would be displayed on the status web page.
A student at PSU has done some work on this...
Scheduling across multiple sites/collections
During S3 it happened occasionally that the network to LLO would go down while
the network to LHO was still up. Because of the way LDR currently schedules files
for replication, data would stop flowing from both LLO and LHO while the LLO path
was down.
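
One possible way to avoid this, sketched below, is to keep a separate queue per source site and skip any site that is currently unreachable. This is an illustration only, not how LDR schedules today; `site_is_up` is a hypothetical health check supplied by the caller.

```python
from collections import deque


def next_transfer(queues, site_is_up):
    """Pick the next file from the first reachable source site with pending work.

    queues: dict mapping site name -> deque of pending file names.
    site_is_up: callable(site) -> bool (hypothetical health check).
    Returns (site, filename), or None if nothing can be transferred right now.
    """
    for site, pending in queues.items():
        if pending and site_is_up(site):
            return site, pending.popleft()
    return None
```

With this arrangement a dead LLO path leaves the LLO queue untouched while LHO files keep flowing.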
Measure aggregate throughput for file transfers
For some sites (in particular MIT) it would be helpful to measure the
aggregate transfer rate rather than just the rates to individual sites. This may or may not be
best left to some other tool and so we gave it a difficulty of 6 because we will have to explore
what tools are available and whether or not they can be/should be integrated into LDR. It
was given a priority of B since it is not necessary for S4-S5.
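
If this did end up inside LDR, the arithmetic could be as simple as the sketch below, which assumes each completed transfer is recorded as (site, bytes moved, start time, end time). The record layout and function name are assumptions made for illustration.

```python
def aggregate_rate(transfers):
    """Aggregate MB/s over completed transfers.

    transfers: iterable of (site, bytes_moved, start_time_s, end_time_s).
    Divides total bytes by the wall-clock window from the first start to the
    last end, so concurrent transfers from different sites add up.
    """
    transfers = list(transfers)
    if not transfers:
        return 0.0
    total_bytes = sum(t[1] for t in transfers)
    window = max(t[3] for t in transfers) - min(t[2] for t in transfers)
    return total_bytes / window / 1e6 if window > 0 else 0.0
```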
Scheduling based on archived transfer performance
Ideally we would like to schedule transfers in a way that takes into account
the current network weather so as to be able to move as much data as fast as is possible.
To do this we can archive (for some time, perhaps in MySQL or just in memory) transfer
performance, but also perhaps use information from other sources.
Possibility: Queue them all up and let each site transfer take the next one it can do.
Create a stub callout that can prioritize based on whatever and fill it in later.
([site], state_hist_info) -> site
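
A minimal sketch of such a stub callout, following the signature above. It assumes `state_hist_info` maps each site to a list of recent transfer rates in MB/s; that layout, and the rule "prefer the best recent mean, try unmeasured sites first", are placeholders to be replaced later.

```python
def choose_site(sites, state_hist_info):
    """Stub callout: ([site], state_hist_info) -> site.

    state_hist_info is assumed to map site -> list of recent rates in MB/s.
    Sites with no history rank highest so that they get measured.
    """
    def recent_mean(site):
        rates = state_hist_info.get(site, [])
        return sum(rates) / len(rates) if rates else float("inf")

    return max(sites, key=recent_mean)
```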
"On-the-fly" configuration
Administrators should be able to make configuration changes to LDR and then have the
changes go into effect by sending LDR daemons a HUP (or the like) signal.
There are still problems with signal handling & latest (py)Globus.
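
The usual pattern in plain Python is sketched below: the HUP handler only sets a flag, and the daemon's main loop reloads the configuration at a safe point. The class and function names are hypothetical, and the sketch does not address the (py)Globus signal-handling problems noted above.

```python
import signal


class ReloadFlag:
    """Records that a HUP arrived; the main loop does the actual reload."""

    def __init__(self):
        self.requested = False

    def handler(self, signum, frame):
        # Keep the handler trivial: only set a flag.
        self.requested = True


def install_hup_handler(flag):
    """Register the flag's handler for SIGHUP (must be called from the main thread)."""
    signal.signal(signal.SIGHUP, flag.handler)


def maybe_reload(flag, load_config, current):
    """Return a freshly loaded config if a reload was requested, else current."""
    if flag.requested:
        flag.requested = False
        return load_config()
    return current
```

A daemon would call `maybe_reload` once per pass through its main loop, between transfers.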
Separate log file for critical error messages
A number of admins would like to have CRITICAL error messages logged to an
additional log file along with the standard log file for the daemon in order to more easily
monitor LDR. It is almost trivial to do this.
This didn't get implemented. People don't seem to care much, probably because things got
stable and the issue became unimportant.
Perhaps refine error messages / put in keyword instead? Maybe send CRITICAL messages
to syslog
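
With the standard Python logging module, the extra file amounts to a second handler whose level is set to CRITICAL. A minimal sketch; the logger name, file paths, and format string are placeholders.

```python
import logging


def make_logger(name, main_log, critical_log):
    """Logger that writes everything to main_log and CRITICAL records to critical_log too."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    fmt = logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")

    everything = logging.FileHandler(main_log)
    everything.setFormatter(fmt)
    logger.addHandler(everything)

    critical_only = logging.FileHandler(critical_log)
    critical_only.setLevel(logging.CRITICAL)  # handler-level threshold
    critical_only.setFormatter(fmt)
    logger.addHandler(critical_only)
    return logger
```

Sending CRITICAL messages to syslog instead would mean swapping the second handler for `logging.handlers.SysLogHandler` with the same level.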
Validity checking at server and client end for each transfer
File size should be 'officially' added as a validity check, as the md5 checksum was.
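
A minimal sketch of a combined check, assuming the expected size and md5 are already known from publication metadata. The function names are hypothetical; only the standard `hashlib` and `os` modules are used.

```python
import hashlib
import os


def file_md5(path, chunk_size=1 << 20):
    """MD5 hex digest, read in chunks so large files are not loaded into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_is_valid(path, expected_size, expected_md5):
    """Cheap size check first; only compute the checksum if the size matches."""
    if os.path.getsize(path) != expected_size:
        return False
    return file_md5(path) == expected_md5
```

Checking size first means a truncated transfer is rejected without reading the whole file.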
Proxy LDR: If an LSCdataFind request fails, up the priority for the failed file
It has been requested that whenever a request to LDRdataFindServer fails to find a file
locally, the priority for replication of the missing file be automatically increased. There is
concern that this feature could be easily abused, and so, although not too difficult to
implement, it was given a priority of C.
No! This idea is currently mostly frowned upon.
More generally, are there hooks in LDR that can guide xfer scheduling? Sort of. At
the level of Collection, this is possible.
Was thought of as being a way to inform LDRVerify to find missing data that should be there...
Data discovery for publishing into LDR
Stuart Anderson, Scott Koranda, and Patrick Brady suggested that the diskcacheAPI from
LDAS could be leveraged and used in cooperation with LDR (or become a part of LDR?)
for data discovery and automatic publishing.
The idea is that the diskcacheAPI, which is very good at watching over a set of NFS or
local directories and building a hash table in memory of frame-file locations and limited
metadata (the metadata available in the name of the file), could be used to "watch"
filesystems and somehow communicate with LDR. This would aid in the automated
publishing of data into LDR and also with unpublishing and republishing when data is
moved and it is necessary to update the URLs stored within LDR.
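
To make the idea concrete, the sketch below does a one-shot scan of a directory tree and builds an in-memory table from file names alone. It assumes frame files follow the `<site>-<type>-<GPS start>-<duration>.gwf` naming pattern; it is not the diskcacheAPI, which watches directories continuously, and the function name is hypothetical.

```python
import os
import re

# Assumed frame-file naming pattern: <site>-<type>-<GPS start>-<duration>.gwf
FRAME_NAME = re.compile(
    r"^(?P<site>[A-Z0-9]+)-(?P<type>[^-]+)-(?P<gps>\d+)-(?P<dur>\d+)\.gwf$"
)


def scan_frames(root):
    """Walk root and return {(site, type, gps_start): (duration, full_path)}."""
    table = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            m = FRAME_NAME.match(name)
            if m:
                key = (m.group("site"), m.group("type"), int(m.group("gps")))
                table[key] = (int(m.group("dur")), os.path.join(dirpath, name))
    return table
```

Comparing two successive scans would show which files are new (to publish) and which have moved or disappeared (to unpublish or republish).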
Hari in his presentation also discussed how queries to LDRdataFindServer or even
framequery could go directly to the diskcacheAPI with a little effort to bend the
diskcacheAPI to fit the LDRdataFindServer protocol.
Ben has done this at the sites (LLO, LHO). Concerns over portability...
LDRVerify should be easily extensible
Kevin Flasch has rewritten LDRVerify at UWM so that it can be used to continually
check the veracity of the data. Currently it checks to make sure that data exists, has the
correct filesize and the correct uid, gid, and permissions. Soon it will check computed
md5 checksums against those computed at the time of publication. Should this become part
of the diskcacheAPI-thing?
Basically not documented. RLS may have tools that help with this? High priority.
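
One way to keep such a verifier easy to extend is a small registry of check functions, sketched below with existence, size, and permission checks; uid, gid, and md5 checks would be registered the same way. This illustrates the pattern only and is not the LDRVerify code; every name in it is hypothetical.

```python
import os
import stat

CHECKS = []  # registry of (name, function) pairs; new checks are added with @check


def check(name):
    """Decorator that registers a check function under the given name."""
    def register(func):
        CHECKS.append((name, func))
        return func
    return register


@check("exists")
def _exists(path, expected):
    return os.path.isfile(path)


@check("size")
def _size(path, expected):
    return os.path.getsize(path) == expected["size"]


@check("mode")
def _mode(path, expected):
    return stat.S_IMODE(os.stat(path).st_mode) == expected["mode"]


def verify(path, expected):
    """Return the names of failed checks; stop at 'exists' if the file is missing."""
    failed = []
    for name, func in CHECKS:
        if not func(path, expected):
            failed.append(name)
            if name == "exists":
                break
    return failed
```

Adding a new check is then one decorated function, with no change to `verify` itself.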
Unpublishing files from LDR
The admins have all, many times over, written "one-off" scripts for unpublishing files
from LDR. It should not be difficult to collect these and formalize them into a useful tool.
Changes to RLS client, such as the client reading file arguments from a file? Should we create
our own RLS client tool?
We need to create an LDR admin toolkit, with things like LDRpc, LDRrm, LDRmv, etc...
$Id: plans_meeting_jan07.html,v 1.3 2007/02/28 21:23:05 kflasch Exp $