New todo list
ganglia needs to be built for the cluster, accounted for in KS and CVS, and deployed
LIGO software (matlab, ligotools) needs to be configured, accounted for in KS and CVS, and deployed
yum needs to have lscsoft configuration added, accounted for in KS and CVS, and deployed. Deployed and added to CVS
nfs automounts for LIGO data (exports, autofs entries and mountpoints; for nfs storage nodes and for slave nodes) need to be accounted for, added to CVS and deployed on compute nodes. Done, tested by having each slave look at another slave's data ($slave+1)
smartmontools configured and running on cluster nodes. Periodic testing running. Done, but it's currently emailing to Bruce
condor needs to be configured, accounted for in KS and CVS, and deployed. Done, tested by recloning a node and watching the node join the pool automatically after reboot, cloned on 6/28
nfs automounts for LIGO data (exports, autofs entries and mountpoints; for nfs storage nodes and for slave nodes) need to be accounted for, added to CVS and deployed on compute nodes. Done, checked by looking at several slaves' data
LIGO software installation (matlab, ligotools) needs to be accounted for in KS and CVS.
yum needs to have lscsoft configuration added, accounted for in KS and CVS, and deployed. Deployed and added to CVS.
yum needs to be configured for local repos (base and updates) and accounted for in KS and CVS. Deployed and added to CVS
nut client configured. Deployed and added to CVS
User profiles auto set up (from Caltech's profile.d files). Initial pass done. No Matlab or ligotools yet
nfs automounts for LIGO data need to be accounted for, added to CVS and deployed. Done, tested by recloning marlin
condor configured for a submit machine. Done, install completely automated, tested by recloning marlin on 6/29
yum needs to be configured for local repos (base and updates) and accounted for in KS and CVS. Done, deployed and checked into CVS
nut client configured. Done, deployed and checked into CVS
condor configured for central manager. Done, install completely automated, cloned condorcm on 6/29 to verify
room temperature monitoring machine
a machine with multiple serial ports to monitor room temp has been requested by Alan. We've picked out and minimally tested hardware.
we need to find an xmgrace pkg to install
slides are needed to mount the remaining machines
machines need to be mounted in cabinets.
nfs0001 needs to be set up for user homes. Done, tested by creating user parmor's home directory and accessing from nodes
cabinets are needed for the remaining machines. They arrived, awaiting deployment
nut documentation needed (explaining ups on battery, and building alarm1 = on battery, what order will machines shut down).
UPS maint. documentation and training needed (different maint. scenarios)
user account documentation needed. Added to maintenance section of nemo pages
compute node maint. documentation needed:
HDD: Done. Added to nemo maint. section.
nfs node maint. documentation needed
masternode maint. documentation needed
yum fedora and fedora-updates mirror
yum lscsoft mirror
Tweak kernel parameters for better network performance
run 'rpm --verify -a' to generate list of modified files; check by hand that all needed ones are already in CVS.
Add some type of 'indicator' to the BIOS update/set and IPMI firmware update PXE image, which provides visual or tactile feedback that the process has completed successfully. Examples: eject CD drawer, or power off node.
Complete benchmark testing of the NFS server box prototype. Bruce will do this
smartmontools configured and running on cluster nodes. Periodic testing running. Done, but currently emailing to Bruce
Go through /var/log/messages for signs of errors and/or trouble
Be sure that we can kickstart nemo and get a reasonable build
UPS monitoring working from one head machine (nominally nemo), then also from other machines. Done, BUT a document is needed that describes how ups-slaves and master are configured
Understand and fix the check_link_status script called from /etc/sysconfig/network-scripts/ifup-eth
conserver running on a separate box
SNMP and other monitoring of the switch itself and the LAGS if any.
populating yellow pages with /etc/passwd, /etc/shadow, /etc/auto.home, /etc/auto.mnt, /etc/group
Also monitor switch power supply (long term)
Network testing to choose edge switches. Done. We're going with SMC's
May want to test the bcm5700 driver. Postponed. We think we'll use the tg3 driver
Get kernel RPM built for 126.96.36.199, including Areca driver. Done. Installed on s0033
ipmitool working locally on slaves. Done. Doc describing basic usage under "Maintenance"
ipmitool working on nemo head node so that we can connect to slaves. Done. Doc describing basic usage under "Maintenance"
serial port hardware flow control w/ipmi working so we don't lose characters. See http://sourceforge.net/mailarchive/message.php?msg_id=14617434
Also see http://sourceforge.net/search/?type_of_search=mlists&exact=1&forum_id=36436&group_id=95200&atid=0&words=hardware+flow+control&Search=Search
Done and tested, BIOS, syslinux/anaconda, kernel, agetty are all configured for hardware flow control
mcelog installed and working (monitor ECC errors). Added to comps.xml for Nemo Common. Not configured
sensorcheck script running on slaves (requires IPMItool). Done. ipmisensorscheck.rpm in cvs. Needs testing.
lm_sensors on slaves. Added to comps.xml for Nemo Common. Not configured
Script to check nemo files against CVS. Done. Checked into CVS.
Yum configuration on slave nodes. Should permit 'yum install' of a standard FC4 package. Done
Convert to MTU=4500 EVERYWHERE. PXE-scripts, cloning, dhcp tables, etc. Done and tested.
Move script that conditionally sets IPMI card ip address and netmask to /etc/rc.local. Done and tested.
If correct user/password does not exist on IPMI card, then set it. Make rc.local readable only by root. A script that simply deletes all users and creates desired users is found at /root/root_tools/BMC-config.sh on newly cloned nodes
Install squid on nemo head node (m0001) as part of kickstart. Add config file now in /etc/squid/squid.conf to set of configfiles. Done, checked into cvs, and squid starts on a reboot of nemo
Get outgoing mail working from the slaves. Done 2006-03-08
scp on and off Force10 switch and saving config file in CVS. Done, nemo:/root/Force10-config/startup-config and in CVS
Get web cam installed into cluster room. Done
Know how to build /dev/md software RAID device on each node (and perhaps modified kickstart). Done, Paul has a ks.cfg that will do this - PAUL-ks-slave_FC4x86_64.cfg
Put areca CLI interface stuff into slave kickstart (CVS tarball). Done, put on storage nodes under /root/root_tools.
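For the 'rpm --verify -a' item above, the by-hand check against CVS might be easier with the output reduced to a short list of candidate files. The following is only a rough sketch in Python, not a script that exists in CVS. The function names are hypothetical, and it assumes the usual verify output layout: an attribute string (8 or 9 characters depending on the rpm version) or the word "missing", an optional file-type marker such as "c", then the path.

```python
import re
import subprocess

# Attribute string printed by rpm verify; length depends on the rpm version.
FLAGS_RE = re.compile(r"^[SM5DLUGTP.?]{8,9}$")


def parse_rpm_verify(output):
    """Return (flags, path) pairs from 'rpm --verify -a' style output."""
    pairs = []
    for line in output.splitlines():
        line = line.rstrip()
        slash = line.find(" /")
        if slash == -1:
            continue  # blank line or no absolute path on it
        flags = line.split()[0]
        if flags != "missing" and not FLAGS_RE.match(flags):
            continue  # e.g. dependency messages, not a file line
        pairs.append((flags, line[slash + 1:]))
    return pairs


def changed_or_missing(pairs):
    """Keep files whose contents differ (digest flag '5') or that are missing."""
    return sorted(p for flags, p in pairs if flags == "missing" or "5" in flags)


def run_rpm_verify():
    """Run rpm and return its stdout; a nonzero exit status is expected when files differ."""
    proc = subprocess.run(["rpm", "--verify", "-a"],
                          capture_output=True, text=True)
    return proc.stdout
```

On a node, changed_or_missing(parse_rpm_verify(run_rpm_verify())) would give the list of paths to compare by hand with what is already in CVS. Files that differ only in timestamp or permissions are deliberately left out of that list; whether that is the right cut is a judgment call.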