Recovering from an NFS storage node crash
Follow these steps after an NFS storage node crashes or locks up and has to be hard booted,
especially if it may have been reading or writing data at the time.
- Get the node into a bootable state and enter single user mode.
- Edit /etc/exports and comment out the line that allows the slaves and masters to mount the
exports (the line that references 192.168.0.0/255.255.248.0). This keeps users from accessing
data while it is being verified and cleaned up.
- Change to run level 3.
- Verify the integrity of the data and clean up any corrupt data.
- Log in to nemo-dataserver as datarobot and run
source /opt/LDR/setup.sh
- Regenerate the latest checksums on the specific nfsdata machine by running
/people/datarobot/bin/runNemoStorageChecksumScript.sh [nfsdata node number]
unless you have reason to distrust the machine checksumming its own data. This takes
a long time, depending on the amount of data (hours on the fuller nodes). Monitor
/export1/checksums on the nfsdata node to watch progress.
- Run /people/datarobot/bin/verifyNemoNode.py [nfsdata node
number]. This is a preliminary pass and does not delete anything, so that if
a large amount of corrupt data is found it can be looked into in more detail
before anything is removed.
- If any corrupt data is found, compare the creation/modification time
with the access time to see whether the data has ever been accessed (for example,
read by a scientific analysis code). You can check the creation time by running
ls -l --time=ctime H-RDS_R_L1-843996224-64.gwf
(Strictly, ctime is the inode change time; it is used here as a stand-in for the
time the file was put on the disk.) You can check the access time by running
ls -l --time=atime H-RDS_R_L1-843996224-64.gwf
If the data has been accessed since it was created (put on the disk by LDR),
record the results for use in the next steps.
- If there is corrupt data, run /people/datarobot/bin/verifyNemoNode.py -d [nfsdata node
number] to actually delete and unregister the corrupt data.
- If corrupt files are found AND the data has an access time that is different
from the creation time, send an announcement to daswg (use email draft here).
- After the data on the node has been verified and corrupt data has been removed and unregistered,
uncomment the slaves and masters export entry that was commented out earlier, and run
exportfs -rav. This allows users to access the data once again.
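
The /etc/exports edits in the steps above are normally done by hand in an editor. As a rough sketch only, they could be scripted as shown here. The function names disable_export and enable_export are hypothetical, and matching the line by its network string is an assumption about how the entry is written. Both functions leave a .bak copy of the file next to it.

```shell
#!/bin/sh
# Sketch only: comment out or restore lines in an exports file that mention
# a given network. Function names are hypothetical, not part of the
# documented procedure.

# disable_export NETWORK FILE
# Prefix every not-yet-commented line that mentions NETWORK with '#'.
disable_export() {
    # Escape the dots so they match literally in the sed pattern.
    net_re=$(printf '%s' "$1" | sed 's/[.]/\\./g')
    sed -i.bak -e "\|$net_re|s|^\([^#]\)|#\1|" "$2"
}

# enable_export NETWORK FILE
# Strip a leading '#' (and any spaces after it) from lines that mention
# NETWORK. Caution: this also uncomments ordinary comment lines that happen
# to contain the same network string.
enable_export() {
    net_re=$(printf '%s' "$1" | sed 's/[.]/\\./g')
    sed -i.bak -e "\|$net_re|s|^#[[:space:]]*||" "$2"
}

# Intended use (not executed here):
#   disable_export '192.168.0.0/255.255.248.0' /etc/exports
#   ... verify and clean up the data ...
#   enable_export '192.168.0.0/255.255.248.0' /etc/exports
#   exportfs -rav
```

Running disable_export twice leaves the line with a single '#', so it is safe to repeat. It is still worth inspecting /etc/exports by eye before changing run level or running exportfs -rav.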
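
The comparison of access time against creation time described in the verification steps can also be done numerically instead of by reading two ls listings. The sketch below assumes GNU coreutils stat (the -c option with %X for access time and %Z for inode change time); the function name accessed_since_change is hypothetical. It compares at one-second resolution, and on file systems mounted with options such as noatime the access time may not reflect real reads.

```shell
#!/bin/sh
# Sketch only: succeed (exit status 0) if FILE's access time is later than
# its inode change time, i.e. the file appears to have been read after it
# was put on disk. Assumes GNU stat; the function name is hypothetical.
accessed_since_change() {
    atime=$(stat -c %X "$1") || return 2   # access time, seconds since epoch
    ctime=$(stat -c %Z "$1") || return 2   # inode change time, seconds since epoch
    [ "$atime" -gt "$ctime" ]
}

# Intended use (not executed here):
#   if accessed_since_change H-RDS_R_L1-843996224-64.gwf; then
#       echo "accessed since it was put on disk; record it"
#   fi
```

An exit status of 1 means no later access was detected, and 2 means stat could not read the file.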