Recovering from an NFS storage node crash

Take these steps after an NFS storage node crashes or locks up and must be hard booted, especially if it may have been reading or writing data at the time.

  1. Get the node into a bootable state and enter single user mode.
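     Assuming a SysV-init system, one way is to append the word single to the kernel line at the boot prompt, or, if the node is already up, to run from a root shell:
          telinit 1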

  2. Edit /etc/exports and comment out the line that allows the slaves and masters to mount (it contains a reference to 192.168.0.0/255.255.248.0). This prevents users from accessing the data while it is being verified and cleaned up.
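     The entry to comment out should look something like this (the export path and mount options here are illustrative; only the network reference is certain):
          # /export1  192.168.0.0/255.255.248.0(rw,sync,no_root_squash)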

  3. Change to run level 3.
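     For example, on a SysV-init system:
          telinit 3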

  4. Verify the integrity of the data and clean up any corrupt data.

    1. Log in to nemo-dataserver as datarobot and run source /opt/LDR/setup.sh
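      For example (assuming SSH access to the data server):
           ssh datarobot@nemo-dataserver
           source /opt/LDR/setup.sh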
    2. Regenerate the latest checksums on the specific nfsdata machine by running /people/datarobot/bin/runNemoStorageChecksumScript.sh [nfsdata node number], unless you are very skeptical of the machine checksumming its own data. This can take a long time depending on the amount of data (hours on the fuller nodes). Monitor /export1/checksums on the nfsdata node to watch progress.
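      For example, for node 12 (the node number and the nfsdata12 host name are illustrative):
           /people/datarobot/bin/runNemoStorageChecksumScript.sh 12
           ssh -t nfsdata12 watch ls -l /export1/checksums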
    3. Run /people/datarobot/bin/verifyNemoNode.py [nfsdata node number]. This is a preliminary, read-only pass and will not delete anything (so that, if a huge amount of corrupt data is found, it can first be looked into in more detail).
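      For example, using the same illustrative node number:
           /people/datarobot/bin/verifyNemoNode.py 12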
    4. If any corrupt data is found, compare the file's change time (ctime, effectively when LDR put the file on disk) with its access time (atime) to see whether the data has ever been accessed (for example, read by a scientific analysis code). You can check the change time by doing
           ls -l --time=ctime H-RDS_R_L1-843996224-64.gwf
           
      You can check the access time by doing
           ls -l --time=atime H-RDS_R_L1-843996224-64.gwf
           
      If the data has been accessed since it was put on the disk by LDR, record the results for use in the next steps.
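      Alternatively, stat prints the access, modify, and change times together in one report:
           stat H-RDS_R_L1-843996224-64.gwf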
    5. If there is corrupt data, run /people/datarobot/bin/verifyNemoNode.py -d [nfsdata node number] to actually delete and unregister it.
    6. If corrupt files are found AND the data has an access time different from its change time, send an announcement to daswg (use the email draft here).

  5. After the data on the node has been verified and any corrupt data has been removed and unregistered, uncomment the slaves and masters entry in /etc/exports (commented out in step 2) and run exportfs -rav. This allows users to once again access the data.
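     For example, after uncommenting the entry in /etc/exports:
          exportfs -rav
          showmount -e localhost
     The showmount -e output confirms the export is being served again.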