New todo list

Slaves


ganglia needs to be built for the cluster, accounted for in KS and CVS, and deployed

LIGO software (matlab, ligotools) needs to be configured, accounted for in KS and CVS, and deployed

yum needs to have lscsoft configuration added, accounted for in KS and CVS, and deployed. Done: deployed and added to CVS
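For reference, a minimal sketch of what the lscsoft repo file might look like (the baseurl below is a placeholder, not the real lscsoft mirror address):

    # /etc/yum.repos.d/lscsoft.repo -- sketch only; baseurl is a placeholder
    [lscsoft]
    name=LSCsoft packages
    baseurl=http://lscsoft.example.org/fedora/4/x86_64/
    enabled=1
    gpgcheck=0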

nfs automounts for LIGO data (exports, autofs entries and mountpoints; for nfs storage nodes and for slave nodes) need to be accounted for, added to CVS, and deployed on compute nodes. Done, tested by having each slave look at another slave's data ($slave+1)
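A sketch of the three pieces involved, with hypothetical host names, paths, and network range (s0001, /export/data, /netdata):

    # /etc/exports on each node exporting its data
    /export/data    192.168.0.0/255.255.0.0(ro,async)

    # /etc/auto.master entry on the compute nodes
    /netdata    /etc/auto.data    --timeout=300

    # /etc/auto.data map: key is the exporting host, location is its export
    s0001    -ro,soft,intr    s0001:/export/data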

smartmontools configured and running on cluster nodes.  Periodic testing running. Done, but it's currently emailing to Bruce
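Redirecting the mail is a one-line change in /etc/smartd.conf; a sketch, with a placeholder admin alias in place of Bruce's address:

    # /etc/smartd.conf -- monitor all attributes, mail an admin alias,
    # short self-test daily at 02:00, long self-test Saturdays at 03:00
    /dev/sda -a -m cluster-admin@example.edu -s (S/../.././02|L/../../6/03)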

condor needs to be configured, accounted for in KS and CVS, and deployed. Done, tested by recloning a node and watching the node join the pool automatically after reboot, cloned on 6/28
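The automatic-join behavior comes down to a few lines of condor_config.local pointing the node at the pool; a sketch, with a placeholder central-manager hostname:

    # condor_config.local on a compute node (hostname is a placeholder)
    CONDOR_HOST = condorcm.example.edu
    DAEMON_LIST = MASTER, STARTD
    START = TRUE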

Masters

master0001

nut-cgi needed. nfs automounts for LIGO data (exports, autofs entries and mountpoints) need to be accounted for, added to CVS, and deployed. Done, tested by looking at several slaves' data

master0002

LIGO software installation (matlab, ligotools) needs to be accounted for in KS and CVS.

ganglia

yum needs to have lscsoft configuration added, accounted for in KS and CVS, and deployed. Done: deployed and added to CVS.

yum needs to be configured for local repos (base and updates) and be accounted for in KS and CVS. Done: deployed and added to CVS.

nut client configured. Done: deployed and added to CVS.

User profiles auto set up (from Caltech's profile.d files). Initial pass done; no Matlab or ligotools yet.

nfs automounts for LIGO data need to be accounted for, added to CVS, and deployed. Done, tested by recloning marlin.

condor configured for a submit machine (see the sketch below). Done: install completely automated, tested by recloning marlin on 6/29.
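The submit-machine role differs from a compute node only in which daemons run; a sketch of the relevant condor_config.local lines (hostname is a placeholder):

    # condor_config.local on master0002 as a submit machine
    CONDOR_HOST = condorcm.example.edu
    DAEMON_LIST = MASTER, SCHEDD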

condorcm

ganglia

yum needs to be configured for local repos (base and updates) and be accounted for in KS and CVS. Done: deployed and checked into CVS.

nut client configured. Done: deployed and checked into CVS.

condor configured for central manager (see the sketch below). Done: install completely automated; cloned condorcm on 6/29 to verify.
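For the central manager, the daemon list adds the matchmaking daemons; a sketch:

    # condor_config.local on condorcm as central manager
    CONDOR_HOST = $(FULL_HOSTNAME)
    DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR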

room temperature monitoring machine

a machine with multiple serial ports to monitor room temperature has been requested by Alan. We've picked out and minimally tested hardware. We still need to find an xmgrace package to install

Storage-nodes

rack slides (rails) are needed to mount the remaining machines

machines need to be mounted in cabinets.

nfs0001 needs to be set up for user homes. Done, tested by creating user parmor's home directory and accessing it from the nodes
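A wildcard automount map keeps this to two lines; a sketch, assuming homes live under /export/home on nfs0001:

    # /etc/auto.master entry on the nodes
    /home    /etc/auto.home

    # /etc/auto.home: '*' matches the user name, '&' substitutes it back in
    *    -rw,soft,intr    nfs0001:/export/home/&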

cabinets are needed for the remaining machines. Done: they arrived, awaiting deployment

General issues

nut documentation needed (explaining UPS on battery, the "building alarm1 = on battery" condition, and what order machines will shut down in).
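The shutdown ordering is driven by upsmon.conf on each box; a sketch with placeholder UPS name and password, not our actual settings:

    # upsmon.conf on the machine cabled to the UPS (the nut master)
    MONITOR myups@localhost 1 upsmon secret master
    SHUTDOWNCMD "/sbin/shutdown -h +0"

    # upsmon.conf on every other machine (nut slaves); slaves shut down
    # first, then the master powers off the UPS
    MONITOR myups@nemo 1 upsmon secret slave
    SHUTDOWNCMD "/sbin/shutdown -h +0"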

UPS maint. documentation and training needed (different maint. scenarios)

A/C 

user account documentation needed. Done: added to maintenance section of nemo pages

compute node maint. documentation needed:
	HDD: Done. Added to nemo maint. section.
	Other components

nfs node maint. documentation needed

masternode maint. documentation needed

yum fedora and fedora-updates mirror

yum lscsoft mirror


Original list

SHORT TERM

Tweak kernel parameters for better network performance
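Likely candidates are the TCP buffer limits; a sketch of /etc/sysctl.conf additions with example values that would need benchmarking, not settings we've validated:

    # /etc/sysctl.conf -- example values only; apply with 'sysctl -p'
    net.core.rmem_max = 16777216
    net.core.wmem_max = 16777216
    net.ipv4.tcp_rmem = 4096 87380 16777216
    net.ipv4.tcp_wmem = 4096 65536 16777216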

run 'rpm --verify -a' to generate list of modified files; check by hand that all needed ones are already in CVS.
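One way to generate the list (the CVS-tracked file list used for the comparison is hypothetical):

    # the filename is the last field of each 'rpm --verify' line
    rpm --verify --all 2>/dev/null | awk '{ print $NF }' | sort > /tmp/rpm-modified.txt
    # then compare by hand against the files already checked into CVS, e.g.:
    #   diff /tmp/rpm-modified.txt /tmp/cvs-tracked.txt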

Add some type of 'indicator' to the BIOS update/set and IPMI firmware update PXE image, which provides visual
or tactile feedback that the process has completed successfully.  Examples: eject CD drawer, or power off node.


MEDIUM TERM


Complete benchmark testing of the NFS server box prototype. Bruce will do this

smartmontools configured and running on cluster nodes.  Periodic testing running. Done, but currently emailing to Bruce

Go through /var/log/messages for signs of errors and/or trouble
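A quick first pass could be something like:

    # crude scan for common trouble signatures
    grep -iE 'error|fail|panic|oops|segfault' /var/log/messages | less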

Be sure that we can kickstart nemo and get a reasonable build

LONG TERM

UPS monitoring working from one head machine (nominally nemo), then also from other machines. Done, BUT a document is needed that describes how ups-slaves and the master are configured

Understand and fix the check_link_status script called from /etc/sysconfig/network-scripts/ifup-eth

conserver running on a separate box

SNMP and other monitoring of the switch itself and the LAGs, if any.

populating yellow pages with /etc/passwd, /etc/shadow, /etc/auto.home, /etc/auto.mnt, /etc/group
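A sketch of the NIS master setup on FC4 x86_64 (the domain name is a placeholder; the maps are enabled in the 'all:' target of /var/yp/Makefile):

    # on the NIS master
    domainname nemo-cluster
    # edit /var/yp/Makefile so the 'all:' target includes:
    #   passwd shadow group auto.home auto.mnt
    /usr/lib64/yp/ypinit -m
    # after changing a source file, rebuild and push the maps
    cd /var/yp && make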

Also monitor switch power supply (long term)

COMPLETED

Network testing to choose edge switches Done. We're going with SMCs

May want to test the bcm5700 driver Postponed. We think we'll use the tg3 driver

Get kernel RPM built for 2.6.15.4, including Areca driver Done. Installed on s0033

ipmitool working locally on slaves Done.  Doc describing basic usage under "Maintenance"

ipmitool working on nemo head node so that we can connect to slaves Done. Doc describing basic usage under "Maintenance"

serial port hardware flow control w/ipmi working so we don't lose characters. See http://sourceforge.net/mailarchive/message.php?msg_id=14617434
Also see http://sourceforge.net/search/?type_of_search=mlists&exact=1&forum_id=36436&group_id=95200&atid=0&words=hardware+flow+control&Search=Search
Done and tested, BIOS, syslinux/anaconda, kernel, agetty are all configured for hardware flow control

mcelog installed and working (monitor ECC errors). Added to comps.xml for Nemo Common. Not configured

sensorcheck script running on slaves (requires IPMItool) Done.  ipmisensorscheck.rpm in cvs.  Needs testing.

lm_sensors on slaves Added to comps.xml for Nemo Common. Not configured

Script to check nemo files against CVS Done. Checked into CVS.

Yum configuration on slave nodes.  Should permit 'yum install' of a standard FC4 package. Done

Convert to MTU=4500 EVERYWHERE.  PXE-scripts, cloning, dhcp tables, etc. Done and tested.
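For reference, the two places the MTU shows up (the dhcpd option is the standard interface-mtu option; the interface file is per-node):

    # /etc/sysconfig/network-scripts/ifcfg-eth0 on each node
    MTU=4500

    # dhcpd.conf, so PXE/anaconda environments pick it up too
    option interface-mtu 4500;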

Move script that conditionally sets IPMI card ip address and netmask to /etc/rc.local. Done and tested.

If correct user/password does not exist on IPMI card, then set it.  Make rc.local readable only by root. A script that simply deletes all users and creates the desired users is found at /root/root_tools/BMC-config.sh on newly cloned nodes
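The ipmitool calls a script like BMC-config.sh boils down to look roughly like this (user ID, name, password, and channel are examples, not what the script actually uses):

    # set name/password for user ID 2 and grant ADMINISTRATOR on channel 1
    ipmitool user set name 2 admin
    ipmitool user set password 2 secret
    ipmitool user enable 2
    ipmitool channel setaccess 1 2 link=on ipmi=on callin=on privilege=4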

Install squid on nemo head node (m0001) as part of kickstart.  Add config file now in /etc/squid/squid.conf to set of config files. Done: checked into CVS, and squid starts on a reboot of nemo

Get outgoing mail working from the slaves Done 2006-03-08

scp on and off Force10 switch and saving config file in CVS. Done: nemo:/root/Force10-config/startup-config and in CVS

Get web cam installed into cluster room Done

Know how to build /dev/md software RAID device on each node (and perhaps modified kickstart). Done, Paul has a ks.cfg that will do this - PAUL-ks-slave_FC4x86_64.cfg
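The kickstart side is a handful of directives; a sketch of a two-disk RAID1 root, not the actual contents of PAUL-ks-slave_FC4x86_64.cfg:

    # ks.cfg fragment (example layout only)
    clearpart --all --initlabel
    part /boot --size=100 --ondisk=sda
    part raid.01 --size=1 --grow --ondisk=sda
    part raid.02 --size=1 --grow --ondisk=sdb
    part swap --size=2048 --ondisk=sdb
    raid / --level=1 --device=md0 raid.01 raid.02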

Put areca CLI interface stuff into slave kickstart (CVS tarball). Done, put on storage nodes under /root/root_tools.