Updated LDR planning document from Jan 07 meeting

These notes build on the previous LDR planning document from the April 23, 2004 meeting. This revision largely reflects the discussion among the attendees of the January 2007 meeting at Caltech.

A Note about LDR's Core Mission and Requests for Enhancements and Features

The group agreed that LDR's core mission must remain replicating LIGO and GEO data sets from source sites to analysis sites quickly and robustly. The highest priority is to make LDR as robust and fault tolerant as possible, so that it requires as little intervention from administrators as possible, and no functionality or enhancement should be contemplated if it detracts from this goal.

Opening Up LDR?

There has been pressure in the past to "open up LDR" so that it is not used only by administrators to replicate bulk data such as interferometer data and SFTs, but also allows any LSC user to publish files into LDR and have them replicated (if so desired) to other LSC sites.

This use of LDR was discussed at a previous meeting. There is much concern that the requirements for replicating user data and bulk "collaboration" data are different enough that a second product, similar to LDR but specialized for users, should be developed and deployed. At the same time it is recognized that the LSC probably does not have the FTEs available to develop and deploy such a product, and it might have to rely on LDR for this task.

This remains an idea to be thought about...

Speeding up the transfer of small files

Due to the fundamental limitations of TCP, transferring small (less than 100 MB) files over the WANs we use is much less efficient than transferring large files. The difference in transfer rates can easily be a factor of ten.

The solution to this problem appears to be pipelining in GridFTP. Details here: http://www-unix.mcs.anl.gov/~bresnaha/LOSF/ This work should appear in pyGlobus soon and can then be easily harnessed by LDR.
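
Until pipelining is available through pyGlobus, the gain can be sketched with the command-line client: batch many small-file transfers into a single GridFTP session instead of paying the per-file setup cost every time. This is only a sketch, assuming the -pp (pipelining) and -f (URL-pair list) options of a sufficiently recent globus-url-copy; the real hook will be whatever pyGlobus ends up exposing.

    import subprocess
    import tempfile

    def pipelined_copy(pairs):
        """Transfer many small files in one pipelined GridFTP session.

        pairs is a list of (source_url, destination_url) tuples.
        """
        # Write "source destination" per line for globus-url-copy -f.
        listfile = tempfile.NamedTemporaryFile(mode="w", suffix=".urls", delete=False)
        for src, dst in pairs:
            listfile.write("%s %s\n" % (src, dst))
        listfile.close()
        # -pp enables pipelining: one control channel, many outstanding files.
        subprocess.check_call(["globus-url-copy", "-pp", "-f", listfile.name])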

Authorization for data access at the LDR level

At this time authentication within LDR is done with Globus GSI, but authorization is handled only at the UNIX filesystem level. A `datarobot' agent from a site such as UWM uses an X.509 certificate and GSI proxy to authenticate to a GridFTP server, and the certificate is then mapped to a local UNIX account such as `uwmrobot'. If the `uwmrobot' account has the correct UNIX permissions to access the data, then access is authorized.
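
For concreteness, that mapping is typically just a grid-mapfile entry on the GridFTP server; a hypothetical entry (the DN below is made up) for the UWM datarobot would look something like:

    "/DC=org/DC=doegrids/OU=Services/CN=ldr/datarobot.phys.uwm.edu" uwmrobot

Everything past that point is decided by ordinary UNIX permissions granted to the `uwmrobot' account.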

Ideally authorization to access various data sets should also be handled within the GSI security infrastructure. A `datarobot' presenting a certificate from UWM for example could be given access to only certain data sets and not to others without any dependence on the details of the underlying filesystem.

A large amount of research is being done within the Grid community to make authorization via GSI a reality and there are a number of products available for us to test. Integrating this into the LDR architecture and code base, however, would require a large effort. At this time we do not see this as necessary for S4-S5.

Smooth upgrades with migration of state in MySQL

This is mostly the case. Some scripts should be written to make this even easier.
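
One candidate for such a script is a pre-upgrade snapshot of the state kept in MySQL. A minimal sketch follows; the database name, credentials, and paths are placeholders, not LDR's actual schema.

    import subprocess
    import time

    def snapshot_ldr_state(database="LDR", user="ldr", outdir="/var/tmp"):
        """Dump the LDR MySQL database so state survives an upgrade."""
        stamp = time.strftime("%Y%m%d-%H%M%S")
        dumpfile = "%s/%s-%s.sql" % (outdir, database, stamp)
        out = open(dumpfile, "w")
        try:
            # --password with no value makes mysqldump prompt for it.
            subprocess.check_call(
                ["mysqldump", "--user=" + user, "--password", database],
                stdout=out)
        finally:
            out.close()
        return dumpfile

    # Restore after the upgrade with something like:
    #   mysql --user=ldr --password LDR < /var/tmp/LDR-<stamp>.sql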

"Green/Red light at-a-glance" LDR status web page

The green/red light pages used in the control rooms at LHO and LLO, and also at CIT, are popular, and something similar should be available for LDR so that admins can quickly determine whether LDR problems exist. We did not have time to draft a particular design, but we did discuss some of the information that might be incorporated on such a page. In general such a page should be extensible, and it should be easy to add new types of information.

Nagios (NRPE?) service at LDR sites to let us monitor LDR.
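
A minimal sketch of what an extensible page generator could look like: each check is a callable returning (label, ok, detail), and new kinds of information are added by appending to a list. How a check gathers its data (Nagios/NRPE, log scraping, MySQL queries) is deliberately left open; everything here is illustrative, not an actual LDR interface.

    import time

    def daemons_running():
        # Placeholder check: would really test that the LDR daemons are alive.
        return ("LDR daemons", True, "all running")

    CHECKS = [daemons_running]   # append new checks here

    def render_status_page(checks=CHECKS):
        rows = []
        for check in checks:
            label, ok, detail = check()
            color = "green" if ok else "red"
            rows.append('<tr><td>%s</td><td bgcolor="%s">%s</td></tr>'
                        % (label, color, detail))
        return ('<html><body><h2>LDR status at %s</h2>'
                '<table border="1">%s</table></body></html>'
                % (time.ctime(), "".join(rows)))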

Gauge for percentage of collections transferred

There should be a simple way for administrators to determine what fraction of a collection (such as `S5 RDS L3 LHO') has been replicated. This information should probably be displayed on the green/red-light web page mentioned above, but it should probably also be available in other ways, for example from a command-line tool or some other interface yet to be decided.
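
A sketch of the gauge itself, with hypothetical helpers standing in for the real lookups (the collection's LFN list from the metadata catalog, and the LFNs that already have a local copy):

    def percent_replicated(collection):
        """Return the percentage of a collection's files held locally."""
        wanted = set(lfns_in_collection(collection))   # hypothetical helper
        have = set(lfns_with_local_copy(collection))   # hypothetical helper
        if not wanted:
            return 100.0
        return 100.0 * len(wanted & have) / len(wanted)

    # e.g. percent_replicated("S5 RDS L3 LHO") -> 87.3
    # The same number could feed both the web page and a command-line tool.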

Display transfer rates from sites / Archive transfer rate information

It would be helpful to be able to determine at a glance what transfer rates a site is achieving when replicating data from other LDR sites. Most likely this information would be displayed on the status web page.

A student at PSU has done some work on this...

Scheduling across multiple sites/collections

During S3 it happened occasionally that the network to LLO would go down while the network to LHO was still up. Due to the current way that LDR schedules files for replication, data would stop flowing from both LLO and LHO while the LLO path was down.

Measure aggregate throughput for file transfers

For some sites (in particular MIT) it would be helpful to measure the aggregate transfer rate rather than just the rates to individual sites. This may or may not be best left to some other tool; we gave it a difficulty of 6 because we will have to explore what tools are available and whether they can or should be integrated into LDR. It was given a priority of B since it is not necessary for S4-S5.

Scheduling based on archived transfer performance

Ideally we would like to schedule transfers in a way that takes into account the current network weather, so as to move as much data as fast as possible. To do this we could archive transfer performance data (for some time, perhaps in MySQL or just in memory), and perhaps also use information from other sources.

Possibility: Queue them all up and let each site transfer take the next one it can do. Create a stub callout that can prioritize based on whatever and fill it in later. ([site], state_hist_info) -> site
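
A minimal sketch of that idea: per-site queues so that a dead path to one site does not block the others, plus the stub callout with the ([site], state_hist_info) -> site shape to be filled in later. None of this is LDR's current code; it only illustrates the shape.

    def choose_next_site(sites, state_hist_info):
        # ([site], state_hist_info) -> site
        # Stub: fill in later with ranking by archived transfer rates,
        # recent failures, collection priorities, etc.
        return sites[0]

    def next_transfer(queues, state_hist_info, is_reachable):
        """queues maps site -> list of pending files (one queue per site)."""
        candidates = [site for site, queue in queues.items()
                      if queue and is_reachable(site)]
        if not candidates:
            return None
        site = choose_next_site(candidates, state_hist_info)
        return site, queues[site].pop(0)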

"On-the-fly" configuration

Administrators should be able to make configuration changes to LDR and then have the changes go into effect by sending LDR daemons a HUP (or the like) signal.

There are still problems with signal handling & latest (py)Globus.
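
The usual shape of this is to keep the signal handler tiny and do the actual reload in the main loop, which also helps stay clear of the signal-handling problems seen with recent (py)Globus. A sketch, with the config path and helper functions as placeholders:

    import signal

    reload_requested = False

    def handle_hup(signum, frame):
        # Do as little as possible in the handler; just note the request.
        global reload_requested
        reload_requested = True

    signal.signal(signal.SIGHUP, handle_hup)

    def main_loop(configfile="/opt/ldr/etc/ldr.conf"):   # placeholder path
        global reload_requested
        config = read_config(configfile)                 # hypothetical helper
        while True:
            if reload_requested:
                config = read_config(configfile)         # re-read at a safe point
                reload_requested = False
            do_one_unit_of_work(config)                  # hypothetical helper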

Separate log file for critical error messages

A number of admins would like CRITICAL error messages to be logged to an additional log file, alongside the daemon's standard log file, to make LDR easier to monitor. This is almost trivial to do.

This didn't get implemented. People don't seem to care much, probably because things got stable and the issue became unimportant.

Perhaps refine error messages / put in keyword instead? Maybe send CRITICAL messages to syslog
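
Assuming the daemons use (or could use) Python's standard logging module, both options amount to a couple of extra handlers; the logger name and file paths below are placeholders.

    import logging
    import logging.handlers

    log = logging.getLogger("LDR")       # placeholder logger name
    log.setLevel(logging.DEBUG)

    # The normal daemon log, as today.
    log.addHandler(logging.FileHandler("/opt/ldr/var/log/LDRMaster.log"))

    # Additional CRITICAL-only file for easy monitoring.
    crit = logging.FileHandler("/opt/ldr/var/log/LDRMaster.critical.log")
    crit.setLevel(logging.CRITICAL)
    log.addHandler(crit)

    # ... or route CRITICAL messages to syslog instead.
    to_syslog = logging.handlers.SysLogHandler(address="/dev/log")
    to_syslog.setLevel(logging.CRITICAL)
    log.addHandler(to_syslog)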

Validity checking at server and client end for each transfer

File size checking should be 'officially' added, as the md5 checksum was.
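
A sketch of the check on the receiving side; the metadata lookup is a hypothetical stand-in for however LDR exposes the published size and checksum.

    import os
    import hashlib

    def verify_transfer(lfn, localpath):
        """Compare a transferred file's size and md5 with the published values."""
        size, md5 = get_published_size_md5(lfn)        # hypothetical helper
        if os.path.getsize(localpath) != size:
            return False
        digest = hashlib.md5()
        f = open(localpath, "rb")
        try:
            chunk = f.read(1024 * 1024)
            while chunk:
                digest.update(chunk)
                chunk = f.read(1024 * 1024)
        finally:
            f.close()
        return digest.hexdigest() == md5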

Proxy LDR: If an LSCdataFind request fails, up the priority for the failed file

It has been requested that whenever a request to LDRdataFindServer fails to find a file locally, the priority for replication of the missing file be automatically increased. There is concern that this feature could be easily abused, so although it is not too difficult to implement, it was given a priority of C.

No! This idea is currently mostly frowned upon.

More generally, are there hooks in LDR that can guide xfer scheduling? Sort of. At the level of Collection, this is possible.

This was thought of as a way to inform LDRVerify so that it can find missing data that should be there...

Data discovery for publishing into LDR

Stuart Anderson, Scott Koranda, and Patrick Brady suggested that the diskcacheAPI from LDAS could be leveraged and used in cooperation with LDR (or become a part of LDR?) for data discovery and automatic publishing.

The idea is that the diskcacheAPI, which is very good at watching over a set of NFS or local directories and building a hash table in memory of frame-file locations and limited metadata (the metadata available in the name of the file), could be used to "watch" filesystems and somehow communicate with LDR. This would aid in the automated publishing of data into LDR and also with unpublishing and republishing when data is moved and it is necessary to update the URLs stored within LDR.
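
To make the "metadata in the name of the file" point concrete: frame files follow the SITE-FRAMETYPE-GPSSTART-DURATION.gwf convention, so a watcher can build its in-memory table without opening a single file. The sketch below only illustrates that parsing and indexing step; the diskcacheAPI itself does considerably more.

    import os

    def parse_frame_name(path):
        """Split e.g. H-H1_RDS_C03_L2-815155213-128.gwf into its fields."""
        stem = os.path.splitext(os.path.basename(path))[0]
        site, rest = stem.split("-", 1)
        frametype, start, duration = rest.rsplit("-", 2)
        return site, frametype, int(start), int(duration)

    def scan(directory, table=None):
        """Build {(site, frametype): [(gps_start, duration, path), ...]}."""
        if table is None:
            table = {}
        for root, dirs, files in os.walk(directory):
            for name in files:
                if not name.endswith(".gwf"):
                    continue
                path = os.path.join(root, name)
                site, frametype, start, dur = parse_frame_name(path)
                table.setdefault((site, frametype), []).append((start, dur, path))
        return table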

In his presentation Hari also discussed how queries to LDRdataFindServer, or even framequery, could go directly to the diskcacheAPI, with a little effort to bend the diskcacheAPI to fit the LDRdataFindServer protocol.

Ben has done this at the sites (LLO, LHO). Concerns over portability...

LDRVerify should be easily extensible

Kevin Flasch has rewritten LDRVerify at UWM so that it can be used to continually check the integrity of the data. Currently it checks that the data exists and has the correct file size, uid, gid, and permissions. Soon it will check computed md5 checksums against those computed at the time of publication. Should this become part of the diskcacheAPI-thing?
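
One way to keep it easily extensible is to make each check a small function and keep them in a list, so new checks (md5, or anything diskcacheAPI-related) become a one-line registration. This is a sketch of the shape, not LDRVerify's actual interface; the expected-metadata fields are placeholders.

    import os

    def check_exists(path, expected):
        return os.path.exists(path), "missing"

    def check_size(path, expected):
        return os.path.getsize(path) == expected["size"], "wrong size"

    def check_ownership(path, expected):
        st = os.stat(path)
        ok = (st.st_uid, st.st_gid) == (expected["uid"], expected["gid"])
        return ok, "wrong uid/gid"

    CHECKS = [check_exists, check_size, check_ownership]   # register new checks here

    def verify(path, expected, checks=CHECKS):
        for check in checks:
            ok, message = check(path, expected)
            if not ok:
                return False, message   # stop at the first failure (e.g. missing file)
        return True, "ok"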

Basically not documented. RLS may have tools that help with this? High priority.

Unpublishing files from LDR

The admins have all written "one-off" scripts for unpublishing files from LDR many times. It should not be difficult to collect these and formalize them into a useful tool.

Changes to RLS client, such as the client reading file arguments from a file? Should we create our own RLS client tool?

We need to create an LDR admin toolkit, with things like LDRpc, LDRrm, LDRmv, etc...
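
A sketch of what the LDRrm piece of such a toolkit might look like, reading LFNs from a file (the "file arguments from a file" point above); the two unpublish calls are hypothetical stand-ins for the real RLS and metadata operations.

    import sys

    def ldr_rm(lfn_file, delete_metadata=False):
        """Unpublish every LFN listed (one per line) in lfn_file."""
        f = open(lfn_file)
        lfns = [line.strip() for line in f if line.strip()]
        f.close()
        for lfn in lfns:
            rls_delete_mappings(lfn)    # hypothetical: remove LFN->PFN mappings
            if delete_metadata:
                metadata_delete(lfn)    # hypothetical: drop the metadata entries

    if __name__ == "__main__":
        ldr_rm(sys.argv[1])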


Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.