Recovery from failed disks


These instructions guide an LDR administrator through recovery when a disk or set of disks storing data replicated by LDR has failed, so that the data is no longer available and must be replicated again.

These instructions assume that the LDR administrator is proficient in the use of command shell tools like awk, sed, and xargs.

Since LDR depends on the mappings from files to URLs that are held in the RLS catalog to determine which files a site does or does not have, when files go missing on disk the corresponding mappings must be removed from the RLS catalog.

Note that it is common for a file to be mapped to more than one URL, and that all mappings for a file must be removed from RLS before LDR will recognize that the site no longer has the file and that it must be replicated again.

To determine which mappings must be removed, and then to remove them, follow these steps (note that it is not necessary to shut down LDR):

  1. Determine mappings to failed disk:

    Use the globus-rls-cli command to search and find all the mappings corresponding to the failed disk(s). The form of the command is

    globus-rls-cli query wildcard lrc pfn <pattern> rls://localhost
    

    Here <pattern> is the pattern to match against the URL or physical file name (PFN). Patterns use the standard Unix wildcard characters: an asterisk (*) matches 0 or more characters, and a question mark (?) matches any single character. You probably want to use double quotes ("") around your pattern to protect it from the bash/csh shell.

    We recommend piping the output to a file since you probably have a lot of URLs corresponding to any single disk. Note that if your RLS contains a "large" number of mappings it may take a while for this command to return since the underlying relational database must do a complete table scan through all the URLs listed.

    Here is an example command that one might use to find all files that have the text "nfsdata13" in their saved URLs/PFNs/paths:

    globus-rls-cli query wildcard lrc pfn "*nfsdata13*" rls://localhost > disk13mappings
    

    Here are the first 10 lines of output from the above command:

    [datarobot@nemo-dataserver var]$ head disk13mappings
    GHLTV-GA2_S5_A-815273413-600.gwf: file://localhost/nfsdata/nfsdata13/S5/GA2_S5_A/GHLTV/815273000-815282999/GHLTV-GA2_S5_A-815273413-600.gwf
    GHLTV-GA2_S5_A-815273413-600.gwf: file://nfsdata13.nemo.phys.uwm.edu/export1/S5/GA2_S5_A/GHLTV/815273000-815282999/GHLTV-GA2_S5_A-815273413-600.gwf
    GHLTV-GA2_S5_A-815273413-600.gwf: gsiftp://nemo-dataserver.phys.uwm.edu:15000/data/nemo/storage/data/nfsdata13/S5/GA2_S5_A/GHLTV/815273000-815282999/GHLTV-GA2_S5_A-815273413-600.gwf
    GHLTV-GA2_S5_A-815274013-600.gwf: file://localhost/nfsdata/nfsdata13/S5/GA2_S5_A/GHLTV/815273000-815282999/GHLTV-GA2_S5_A-815274013-600.gwf
    GHLTV-GA2_S5_A-815274013-600.gwf: file://nfsdata13.nemo.phys.uwm.edu/export1/S5/GA2_S5_A/GHLTV/815273000-815282999/GHLTV-GA2_S5_A-815274013-600.gwf
    GHLTV-GA2_S5_A-815274013-600.gwf: gsiftp://nemo-dataserver.phys.uwm.edu:15000/data/nemo/storage/data/nfsdata13/S5/GA2_S5_A/GHLTV/815273000-815282999/GHLTV-GA2_S5_A-815274013-600.gwf
    GHLTV-GA2_S5_A-815274613-600.gwf: file://localhost/nfsdata/nfsdata13/S5/GA2_S5_A/GHLTV/815273000-815282999/GHLTV-GA2_S5_A-815274613-600.gwf
    GHLTV-GA2_S5_A-815274613-600.gwf: file://nfsdata13.nemo.phys.uwm.edu/export1/S5/GA2_S5_A/GHLTV/815273000-815282999/GHLTV-GA2_S5_A-815274613-600.gwf
    GHLTV-GA2_S5_A-815274613-600.gwf: gsiftp://nemo-dataserver.phys.uwm.edu:15000/data/nemo/storage/data/nfsdata13/S5/GA2_S5_A/GHLTV/815273000-815282999/GHLTV-GA2_S5_A-815274613-600.gwf
    GHLTV-GA2_S5_A-815275213-600.gwf: file://localhost/nfsdata/nfsdata13/S5/GA2_S5_A/GHLTV/815273000-815282999/GHLTV-GA2_S5_A-815275213-600.gwf
    

    Repeat the query as necessary for each disk or partition that has failed or for which you suspect data files have gone missing.
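
The repetition described above can be scripted. The following is a minimal sketch, assuming each failed disk can be identified by a name fragment that appears in its URLs. The function name query_failed_disks and the variables RLS_CLI and RLS_URL are illustrative and not part of LDR or Globus, and the output files are named <disk>mappings rather than disk13mappings.

```shell
# Sketch: run the wildcard query once per failed disk and save each result
# to its own file. RLS_CLI, RLS_URL and query_failed_disks are illustrative
# names, not part of LDR or Globus.
RLS_CLI="${RLS_CLI:-globus-rls-cli}"
RLS_URL="${RLS_URL:-rls://localhost}"

query_failed_disks() {
    local disk
    for disk in "$@"; do
        # Quote the pattern so the shell does not expand the asterisks.
        "$RLS_CLI" query wildcard lrc pfn "*${disk}*" "$RLS_URL" > "${disk}mappings" \
            || echo "query for ${disk} failed" >&2
        echo "${disk}: $(wc -l < "${disk}mappings") mappings found"
    done
}

# Example (the second disk name is hypothetical):
# query_failed_disks nfsdata13 nfsdata14
```

Each query still forces a full table scan, so running the disks one after another in a single script mainly saves typing rather than time.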

  2. Verify mappings are bad and files are missing:

    If you are confident you lost an entire disk or partition then you should skip this step and go on to the next step.

    If an entire disk or partition did not fail and you are not sure if the data really is missing, you need to check and see if the mappings you just found in RLS really are no longer valid.

    The easiest thing to do is to use the "ls" shell command for each file and record a list of the failures. You will need to filter the list of URLs appropriately so that you only test those which make sense for the command "ls". For example:

    grep file://localhost disk13mappings | sed -e 's/file:\/\/localhost//' | awk '{print $2}' | xargs -I {} ls {} 1> goodFiles 2> badFiles
    

    Since 'ls' reports each file it cannot list on stderr, the file badFiles should contain one error message for each file that really has gone missing; the mappings for those files must be removed from RLS.
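
Because ls writes error messages rather than bare path names to stderr, it can be convenient to test each path directly instead. The following is a minimal sketch, assuming the saved query output has the "LFN: PFN" form shown in step 1 and that file://localhost URLs correspond directly to local paths. The function name check_local_mappings is illustrative.

```shell
# Sketch: split the file://localhost mappings into paths that still exist
# (goodFiles) and paths that are missing (badFiles), one path per line.
# The function name is illustrative.
check_local_mappings() {
    local mappings="$1" lfn pfn path
    : > goodFiles
    : > badFiles
    grep ' file://localhost/' "$mappings" | while read -r lfn pfn; do
        path="${pfn#file://localhost}"   # strip the URL prefix, leaving the local path
        if [ -e "$path" ]; then
            echo "$path" >> goodFiles
        else
            echo "$path" >> badFiles
        fi
    done
}

# Example:
# check_local_mappings disk13mappings
```

With this variant badFiles holds plain paths, which makes it easier to feed into later filtering.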

  3. Prepare list of mappings to delete from RLS:

    Unfortunately, the output from the globus-rls-cli command you ran above cannot be used directly as input to RLS to remove the mappings. If you also had to filter the output to verify precisely which files are missing, you need to create an input file of the correct form containing only the mappings to remove.

    The form of the input file is simple: each line must contain one file name (LFN) and a single URL (PFN), separated by whitespace. For example:

    GHLTV-GA2_S5_A-815273413-600.gwf file://localhost/nfsdata/nfsdata13/S5/GA2_S5_A/GHLTV/815273000-815282999/GHLTV-GA2_S5_A-815273413-600.gwf
    GHLTV-GA2_S5_A-815273413-600.gwf file://nfsdata13.nemo.phys.uwm.edu/export1/S5/GA2_S5_A/GHLTV/815273000-815282999/GHLTV-GA2_S5_A-815273413-600.gwf
    GHLTV-GA2_S5_A-815273413-600.gwf gsiftp://nemo-dataserver.phys.uwm.edu:15000/data/nemo/storage/data/nfsdata13/S5/GA2_S5_A/GHLTV/815273000-815282999/GHLTV-GA2_S5_A-815273413-600.gwf
    GHLTV-GA2_S5_A-815274013-600.gwf file://localhost/nfsdata/nfsdata13/S5/GA2_S5_A/GHLTV/815273000-815282999/GHLTV-GA2_S5_A-815274013-600.gwf
    GHLTV-GA2_S5_A-815274013-600.gwf file://nfsdata13.nemo.phys.uwm.edu/export1/S5/GA2_S5_A/GHLTV/815273000-815282999/GHLTV-GA2_S5_A-815274013-600.gwf
    GHLTV-GA2_S5_A-815274013-600.gwf gsiftp://nemo-dataserver.phys.uwm.edu:15000/data/nemo/storage/data/nfsdata13/S5/GA2_S5_A/GHLTV/815273000-815282999/GHLTV-GA2_S5_A-815274013-600.gwf
    GHLTV-GA2_S5_A-815274613-600.gwf file://localhost/nfsdata/nfsdata13/S5/GA2_S5_A/GHLTV/815273000-815282999/GHLTV-GA2_S5_A-815274613-600.gwf
    GHLTV-GA2_S5_A-815274613-600.gwf file://nfsdata13.nemo.phys.uwm.edu/export1/S5/GA2_S5_A/GHLTV/815273000-815282999/GHLTV-GA2_S5_A-815274613-600.gwf
    GHLTV-GA2_S5_A-815274613-600.gwf gsiftp://nemo-dataserver.phys.uwm.edu:15000/data/nemo/storage/data/nfsdata13/S5/GA2_S5_A/GHLTV/815273000-815282999/GHLTV-GA2_S5_A-815274613-600.gwf
    GHLTV-GA2_S5_A-815275213-600.gwf file://localhost/nfsdata/nfsdata13/S5/GA2_S5_A/GHLTV/815273000-815282999/GHLTV-GA2_S5_A-815275213-600.gwf
    

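    The output of the previous globus-rls-cli command nearly has this form; you simply need to remove the ":" between the LFN and PFN. An easy way to do this is with sed, which without the g flag replaces only the first match on each line and so leaves the colons inside the URL untouched (this assumes the LFN itself contains no colon):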

    sed -e 's/://' disk13mappings > disk13mappingRemovalInput
    

    Remember that all mappings for a file need to be removed from RLS before that file will be scheduled for replication. Just removing the file:// URLs is not enough.
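
When only some files were verified as missing in step 2, the removal input should cover every mapping of those files and nothing else. One way to build it is sketched below, assuming you have a file with one missing local path per line and that the LFN is the final component of that path, as in the example output above. The function name build_removal_input is illustrative.

```shell
# Sketch: keep every mapping (file://, gsiftp://, ...) whose LFN belongs to
# a file verified as missing, and print it in "LFN PFN" form.
# Arguments: $1 = list of missing local paths, one per line
#            $2 = saved query output in "LFN: PFN" form
# Assumes the LFN equals the last component of the local path.
build_removal_input() {
    awk '
        NR == FNR { n = split($0, parts, "/"); missing[parts[n]] = 1; next }
        {
            lfn = $1
            sub(/:$/, "", lfn)            # drop the trailing colon from the LFN
            if (lfn in missing) {
                pfn = $0
                sub(/^[^ ]+ +/, "", pfn)  # everything after the first field
                print lfn, pfn
            }
        }
    ' "$1" "$2"
}

# Example (missingPaths is a hypothetical file name):
# build_removal_input missingPaths disk13mappings > disk13mappingRemovalInput
```

Matching on the LFN rather than on the URL is what pulls in the gsiftp:// and remote file:// mappings alongside the file://localhost one.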

  4. Remove bad mappings from RLS:

    With a file containing all of the bad mappings you can use the globus-rls-cli command with the -i option to easily remove all of the mappings. If your globus-rls-cli command does not accept the -i flag then you missed an update sent out by email. Send email to the LDR list and ask for the update again.

    The form of the command is

    globus-rls-cli -i <file with mappings to remove> bulk delete rls://localhost
    

    Note the use of the modifier bulk in the command above. It is necessary if you use the -i option. Without it you can only delete one mapping per invocation of the command.
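
Since a bulk delete is hard to undo, it may be worth checking the input file before running the command. The following is a minimal sketch of such a check, assuming PFNs contain no whitespace. The function name check_removal_input is illustrative and not part of LDR or Globus.

```shell
# Sketch: sanity-check a removal input file before handing it to
# globus-rls-cli. Succeeds only if every line is "LFN PFN": exactly two
# whitespace-separated fields, with no colon left on the end of the LFN.
check_removal_input() {
    awk '
        BEGIN { bad = 0 }
        NF != 2 || $1 ~ /:$/ { printf "line %d is malformed: %s\n", NR, $0; bad = 1 }
        END {
            if (!bad) printf "%d mappings ready for removal\n", NR
            exit bad
        }
    ' "$1"
}

# Example use (the file name comes from step 3):
# check_removal_input disk13mappingRemovalInput &&
#     globus-rls-cli -i disk13mappingRemovalInput bulk delete rls://localhost
```

The check catches the most likely mistake, which is passing the raw query output with the colons still in place.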

With the RLS catalog cleared of mappings for files that no longer exist, your LDR is now ready to replicate those files.

Be sure to configure your LDR collection.ini file appropriately so that the files you need to be replicated again are defined within a collection. See the LDR administrator's manual for details.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.