
Data Recovery Procedure

Follow these steps if a node begins to exhibit hardware problems:

If a node (or nodes) is going to be replaced for any reason:
  1. Please make these people aware: Scott (to stop LDR), Duncan (to make LDAS aware), and Paul.
  2. Please try to coordinate with Scott first so that he can stop LDR. In the future, when LDR is a more robust system, instructions will be available explaining how to do this yourself.
  3. If the node is still functioning and data can be accessed, all data should be copied off of the node to another storage area, before the node is taken offline. The copy area must be secure and not available to users.
  4. If the node is not functioning and data cannot be accessed then the node can be taken down (provided no other constraints), but its disk should immediately be put into another machine and if possible the data should be copied to secure storage.
  5. All available data should be copied onto the replacement node or disk before the node is made available again. If the cluster is down for maintenance, it should not be made publicly available again until the data transfer is complete, since users assume access to data when the system comes back up.
  6. If no data is recovered then a node can be put back into place, but please notify Scott of the details asap.
  7. As soon as work is done Scott should be notified so that the LDR catalogs can be immediately reconciled to take account of any missing data.
  8. All data copied onto a node should be checked for correct ownership and permissions before the data is made publicly available again. The owner should always be 'datarobot' for all files and directories. The group should be 'datarobot' for all directories; for files, the group should be one of the data groups (currently e7, s1, s1rds). If in doubt, set the group to 'datarobot' and let Scott know. All permissions on data MUST be 640 -- no exceptions.
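The checks in step 8 can be scripted. The helpers below are an illustrative sketch, not an official tool; they assume the 'datarobot' owner and the 640 rule described above:

```shell
#!/bin/sh
# Illustrative sketch of the step-8 checks -- not an official tool.
# Each function prints offending paths; a clean tree prints nothing.

bad_perms() {
    # Data files must be mode 640, no exceptions.
    find "$1" -type f ! -perm 640
}

bad_owner() {
    # Every file and directory must be owned by 'datarobot'.
    find "$1" ! -user datarobot
}
```

For example, `bad_perms /data` should print nothing before the node is made public again.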
  • If the machine is limping, but not dead (this requires one spare node):
    1. Copy the bulk of the data as described in the next step. If it is determined that the node does not need replacement until the next maintenance period, then at that time compare the data on the limping node with that on the backup node. If there is a difference, copy the new files to your backups and shut down the limping node immediately.
    2. Copy the data to a safe location:
      1. There are two data directories that need to be recovered: /data (data on /hda) and /datc (data on /hdc). The data can be copied to a spare node or, in the event that none are available (this should never happen), see Paul about putting it on storage1. Permissions should be kept intact, so use the -a flag with cp, and be sure to check them when you are done.
      2. To begin the copying process, log into medusa as a superuser, then rsh into the spare node you are copying data to.
      3. Issue the command cp -a /netdata/sXXX /data, where XXX is the 3-digit number of the limping node.
      4. This command transfers the dying node's data (over NFS) to the new node's local data partition. This can take up to an hour, depending on the size of the data.
      5. Next, copy the data on /datc using cp -a /netdatc/sXXX /datc
      6. Verify that the data was indeed copied over and that the permissions remain intact. Run find /data -uid 0 to list files owned by root; there should be none. Also run find /data -uid 812 | wc -l on both the cluster node and the spare node and compare the two counts; this gives the number of files owned by datarobot.
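The comparison in the last step can be wrapped in a small helper. This is a sketch only; uid 812 is assumed to be datarobot's uid, as implied above:

```shell
#!/bin/sh
# Sketch of the count comparison; uid 812 is assumed to be datarobot's uid.
count_uid() {
    # Count filesystem entries under directory $2 owned by uid $1.
    find "$2" -uid "$1" | wc -l
}

# Run the same count on the limping node and on the spare, then compare:
#   count_uid 812 /data     (on each node, e.g. via rsh)
# The two numbers should match before the limping node is retired.
```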
    3. Note the MAC address on the new (spare) node.
    4. Swap MAC address entries in dhcpd.conf for the two nodes to be swapped.
    5. Restart dhcpd on the dhcp server (medusa).
    6. Power down both nodes and swap their node-number labels on both front and back. NOTE: not the labels with the MAC addresses, which appear only on the back.
    7. Physically swap the two nodes on the shelf.
    8. Power on the new node, attach a console, and verify a clean start.
    9. Verify new node is accessible from the rest of the cluster.
    10. Notify the appropriate people, and begin repairs on the original limping node.
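Steps 3-5 amount to exchanging the two hardware ethernet lines in dhcpd.conf. The fragment below is illustrative only -- the hostnames, MAC addresses, and IP addresses are made up, not taken from the actual config:

```
# dhcpd.conf (illustrative; the real entries live on medusa)
# Before the swap:
host s042  { hardware ethernet 00:A0:CC:00:00:01; fixed-address 10.0.0.42; }
host s099  { hardware ethernet 00:A0:CC:00:00:02; fixed-address 10.0.0.99; }
# After the swap only the MAC addresses change places, so the spare
# hardware now boots and answers as s042:
host s042  { hardware ethernet 00:A0:CC:00:00:02; fixed-address 10.0.0.42; }
host s099  { hardware ethernet 00:A0:CC:00:00:01; fixed-address 10.0.0.99; }
```

After editing, restart dhcpd on medusa so the change takes effect before the nodes are powered back on.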
  • OR, if the machine is dead (this requires two spare nodes: one to swap in, and one to put the disk in to try to retrieve the data):
    1. If the node is not functioning and data cannot be accessed then the node can be taken down (provided no other constraints).
    2. Power down the new node and take it over to the dying node.
    3. Swap the entries in dhcpd.conf for the two nodes.
    4. Power down the spare node in which the recovered disk will go.
    5. Restart dhcpd on the dhcp server.
    6. Power on the new node with a console attached and verify a clean bootup.
    7. Remove the disk from the dead node and place in the spare node from which data recovery will be attempted.
    8. The data should be copied to secure storage, and repairs should be started on the dead node.
  • A data copy script can be found in the tools directory in CVS. You can also download it here: data_copy.sh

    $Id: data_recovery_procedure.html,v 1.12 2004/08/12 20:32:10 amiller Exp $
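For reference, a minimal sketch of what such a copy script might do (the version in CVS is authoritative; the function name here is hypothetical):

```shell
#!/bin/sh
# Hypothetical sketch -- the real data_copy.sh in CVS is authoritative.
# copy_node SRC DST: archive-copy a node's data directory (cp -a keeps
# ownership, permissions, and timestamps), then list any root-owned
# files in the destination, which would indicate a bad copy.
copy_node() {
    src="$1"
    dst="$2"
    cp -a "$src" "$dst" || return 1
    find "$dst" -uid 0
}

# On the cluster this would be run for both data partitions, e.g.:
#   copy_node /netdata/sXXX /data
#   copy_node /netdatc/sXXX /datc
```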
    Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.