UPS Testing

Power Failure Results

On Tuesday, May 4, a large portion of the campus lost power for about an hour. The details of what happened to the cluster are below.

  1. Power failed at around 15:27
  2. Nodes Shut Down
    Between 15:29:24 and 15:29:48 the nodes started their shutdown process.
    Between 15:29:55 and 15:30:02 upsd on the UPS masters reported "disconnected"
  3. UWMLSC Powered Down
    15:27:21 UWMLSC reported being on battery
    15:47:21 User requested FSD!
    15:47:37 System is being shutdown by UPS
    15:47:52 127.0.0.1 disconnected
    15:48:03 UWMLSC exited on signal 15
  4. Powerware UPS failed on switch at about 16:06:32
  5. The following machines were shut down by hand:
    15:34:42 contra told to shut down - 15:35:22 contra went down
    15:34:55 hydra told to shut down - 15:35:37 hydra went down
    15:37:02 condor told to shut down - 15:37:14 condor went down
    15:39:40 hades
    15:42:54 tigger
    15:43:42 watchtower
    15:45:26 nest
    15:46:26 kanga
    15:47:24 dataserver started shutdown (nuts) - 15:47:52 down
    15:51:56 gravity FSD -> 15:53:52 shutdown (nuts) - 15:56:12 gravity (powered back on for email) - 16:03:41 down again
    15:55:40 medusa rebooted - 16:03:51 medusa shut down
  6. The following machines shut down hard after power was lost:
    15:43:11 storage2 last log message
    16:04:13 storage1 started to go down - 16:09:10 last log entry, still not down
  7. Power was restored at around 16:25
  8. 17:10:56 slave nodes started coming up - a few had problems (3 batteries and 1 cabling)
  9. 16:30:56 uwmlsc partly up - 16:39:44 uwmlsc partly up - 16:47:48 uwmlsc partly up - 16:54:07 uwmlsc up, yay!
    ** medusa set as primary DNS; "Starting NFS Services" hung but worked once medusa was turned on.
    When other machines came up:
    17:07:09 hydra
    17:08:41 contra
    17:05:54 condor
    17:38:46 kanga
    17:30:45 hades
    16:26:56 nest
    17:11:42 watchtower
    17:38:49 tigger
    16:55:51 gravity
    16:51:16 medusa
    16:35:51 storage1 partly up - 16:35:51 partly up - 16:52:42 up
    16:35:23 storage2 started coming up - 16:51:36 storage2 finished coming up (after medusa and thus DNS came up)
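
The intervals implied by the timeline above can be worked out with a short script. This is only a sketch: the timestamps are copied from the log, the variable and function names are made up for illustration, and it assumes that all events fall on the same day and that the Powerware UPS went on battery at the same moment UWMLSC reported being on battery (the log does not say so explicitly).

```python
from datetime import datetime, timedelta

def elapsed(start: str, end: str) -> timedelta:
    """Return the time between two HH:MM:SS stamps on the same day."""
    fmt = "%H:%M:%S"
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)

# Timestamps copied from the log above.
on_battery = "15:27:21"        # UWMLSC reported being on battery
fsd_requested = "15:47:21"     # "User requested FSD!"
uwmlsc_exited = "15:48:03"     # UWMLSC exited on signal 15
powerware_failed = "16:06:32"  # Powerware UPS failed on switch
uwmlsc_up = "16:54:07"         # uwmlsc fully up again

# Time UWMLSC ran on battery before the forced shutdown was requested.
battery_before_fsd = elapsed(on_battery, fsd_requested)

# Assumption: the Powerware UPS went on battery at the same time as UWMLSC.
powerware_runtime = elapsed(on_battery, powerware_failed)

# Time between UWMLSC exiting and being fully up again.
uwmlsc_downtime = elapsed(uwmlsc_exited, uwmlsc_up)

print("UWMLSC on battery before FSD:", battery_before_fsd)
print("Powerware runtime (assumed start):", powerware_runtime)
print("UWMLSC downtime:", uwmlsc_downtime)
```

With the times as logged, this gives 20 minutes on battery before the FSD request, roughly 39 minutes of Powerware runtime under the stated assumption, and a little over an hour of UWMLSC downtime.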