Braindump of CIT trip

  1. OS installation notes
  2. Network configuration
  3. zpool design considerations
  4. zpool setup
  5. Setting up multiple zfs's in a pool
  6. NFS exporting zfs's
  7. zfs pool and fs properties
  8. Administering zpools and fs's

zpool design considerations

  • Talk about different zpool configs. Here's the config I think I want to use on at least one of ours (16TB of storage in ((7+2) * 5)+1):
  •           raidz2    ONLINE       0     0     0
                c0t1d0  ONLINE       0     0     0
                c1t1d0  ONLINE       0     0     0
                c4t1d0  ONLINE       0     0     0
                c5t1d0  ONLINE       0     0     0
                c6t1d0  ONLINE       0     0     0
                c7t1d0  ONLINE       0     0     0
                c0t2d0  ONLINE       0     0     0
                c1t2d0  ONLINE       0     0     0
                c4t2d0  ONLINE       0     0     0
              raidz2    ONLINE       0     0     0
                c5t2d0  ONLINE       0     0     0
                c6t2d0  ONLINE       0     0     0
                c7t2d0  ONLINE       0     0     0
                c0t3d0  ONLINE       0     0     0
                c1t3d0  ONLINE       0     0     0
                c4t3d0  ONLINE       0     0     0
                c5t3d0  ONLINE       0     0     0
                c6t3d0  ONLINE       0     0     0
                c7t3d0  ONLINE       0     0     0
              raidz2    ONLINE       0     0     0
                c0t4d0  ONLINE       0     0     0
                c1t4d0  ONLINE       0     0     0
                c4t4d0  ONLINE       0     0     0
                c6t4d0  ONLINE       0     0     0
                c7t4d0  ONLINE       0     0     0
                c0t5d0  ONLINE       0     0     0
                c1t5d0  ONLINE       0     0     0
                c4t5d0  ONLINE       0     0     0
                c5t5d0  ONLINE       0     0     0
              raidz2    ONLINE       0     0     0
                c6t5d0  ONLINE       0     0     0
                c7t5d0  ONLINE       0     0     0
                c0t6d0  ONLINE       0     0     0
                c1t6d0  ONLINE       0     0     0
                c4t6d0  ONLINE       0     0     0
                c5t6d0  ONLINE       0     0     0
                c6t6d0  ONLINE       0     0     0
                c7t6d0  ONLINE       0     0     0
                c0t7d0  ONLINE       0     0     0
              raidz2    ONLINE       0     0     0
                c1t7d0  ONLINE       0     0     0
                c4t7d0  ONLINE       0     0     0
                c5t7d0  ONLINE       0     0     0
                c6t7d0  ONLINE       0     0     0
                c7t7d0  ONLINE       0     0     0
                c0t0d0  ONLINE       0     0     0
                c1t0d0  ONLINE       0     0     0
                c4t0d0  ONLINE       0     0     0
                c6t0d0  ONLINE       0     0     0
              c7t0d0    AVAIL
  • Talk about size/redundancy/performance tradeoffs: the sample config provides 5 raidz2 groups (ZFS calls these top-level units "vdevs"), each with 7*500GB of data capacity plus double parity (RAID-6 style), and one hot spare shared by all vdevs. Each vdev can survive two simultaneous disk failures and keep running, which also means we'd have two recoverable copies of data for potentially bad sectors; the hot spare buys one additional failure before a vdev loses redundancy. This layout attempts to balance performance across the 5 x 8132 "PCI-X tunnels"... the jury's out as to whether my assumptions are good.
  • Talk about performance characteristics: See above, for now.

Setting up a zpool with multiple logical blobs?

  • device naming: in Solaris, disks are referred to using the naming scheme cXtXdXsX: c = controller, t = target, d = disk (LUN), s = slice
  • zpool create/add: this creates a pool called test with two raidz2 vdevs (in my example referred to above, I performed more zpool add's to flesh the pool out to 5 vdevs):
  • zpool create -f test raidz2 c0t1d0 c1t1d0 c4t1d0 c5t1d0 c6t1d0 c7t1d0 c0t2d0 c1t2d0 c4t2d0
    zpool add test raidz2 c5t2d0 c6t2d0 c7t2d0 c0t3d0 c1t3d0 c4t3d0 c5t3d0 c6t3d0 c7t3d0
  • zpool add: see above for multiple blobs/devices. For a "spare" do something like:
  • zpool add test spare c7t0d0
  • for our work, it makes the most sense to use entire disks/devices and NOT individual slices, though slices are possible (something like "zpool ... c0t1d0s0 c0t1d0s1")
  • zfs unmount here? For creating individual zfs's in a zpool for the sake of individual snapshots, etc.
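The create/add commands above can be continued to build out the full five-vdev layout shown in the status listing, plus the hot spare. A sketch, with device names taken straight from that listing:

```shell
# Add the remaining three raidz2 vdevs (the first two were created by the
# zpool create/add commands above), then the hot spare.
zpool add test raidz2 c0t4d0 c1t4d0 c4t4d0 c6t4d0 c7t4d0 c0t5d0 c1t5d0 c4t5d0 c5t5d0
zpool add test raidz2 c6t5d0 c7t5d0 c0t6d0 c1t6d0 c4t6d0 c5t6d0 c6t6d0 c7t6d0 c0t7d0
zpool add test raidz2 c1t7d0 c4t7d0 c5t7d0 c6t7d0 c7t7d0 c0t0d0 c1t0d0 c4t0d0 c6t0d0
zpool add test spare c7t0d0
zpool status test   # should match the layout listed earlier
```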

Setting up multiple zfs in a zpool

  • OR zfs unmount here?
  • zfs creating multiples (for homes, one per user) for various reasons (incl snapshots, but what else was there?)
  • root@x4500-1 # zfs unmount /test
    root@x4500-1 # zfs create test/parmor
    root@x4500-1 # zfs mount test/parmor
    root@x4500-1 # zfs list
    NAME                   USED  AVAIL  REFER  MOUNTPOINT
    test                   235K  15.5T  59.3K  /test
    test/parmor           57.0K  15.5T  57.0K  /test/parmor
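For homes, the create/mount steps above could be looped over users. A sketch (the usernames and the 100G quota are made-up examples, not from these notes):

```shell
# One zfs per user under the test pool, each with its own quota,
# so each home gets independent snapshots and space accounting.
for user in alice bob carol; do
    zfs create test/$user
    zfs set quota=100G test/$user
done
zfs list -r test    # verify the new filesystems and their mountpoints
```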

NFS exporting zfs's

  • can be exported by modifying /etc/dfs/dfstab, the Solaris equivalent of /etc/exports
  • can be exported via "zfs set sharenfs=on $options"; can be set per $home or globally ("inherited from" the parent fs), and is persistent across reinstalls!
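A sketch of both approaches to sharenfs (the rw=@192.168.1.0/24 host list is a made-up example; real option strings follow share_nfs(1M)):

```shell
zfs set sharenfs=on test                           # children inherit this
zfs set sharenfs='rw=@192.168.1.0/24' test/parmor  # per-fs override
zfs get -r sharenfs test   # SOURCE column shows "local" vs "inherited from test"
```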

zfs pool and fs properties

  • global properties
  • per zfs properties
  • properties we may be interested in setting: quota, mountpoint, sharenfs, checksum, compression, snapdir (visible or not)
  • properties we may be interested in viewing: type, available, compressratio, mounted, origin
  • here's a valid get command viewing test/parmor
  • root@x4500-1 # zfs get all test/parmor
    NAME             PROPERTY       VALUE                      SOURCE
    test/parmor      type           filesystem                 -
    test/parmor      creation       Fri Jul 13  1:20 2007      -
    test/parmor      used           57.0K                      -
    test/parmor      available      15.5T                      -
    test/parmor      referenced     57.0K                      -
    test/parmor      compressratio  1.00x                      -
    test/parmor      mounted        yes                        -
    test/parmor      quota          none                       default
    test/parmor      reservation    none                       default
    test/parmor      recordsize     128K                       default
    test/parmor      mountpoint     /test/parmor               default
    test/parmor      sharenfs       off                        default
    test/parmor      checksum       on                         default
    test/parmor      compression    off                        default
    test/parmor      atime          on                         default
    test/parmor      devices        on                         default
    test/parmor      exec           on                         default
    test/parmor      setuid         on                         default
    test/parmor      readonly       off                        default
    test/parmor      zoned          off                        default
    test/parmor      snapdir        hidden                     default
    test/parmor      aclmode        groupmask                  default
    test/parmor      aclinherit     secure                     default
  • here's an invalid command, but the resulting "help" output shows which properties are editable and inheritable; how does one view similar info in a valid way?
  • root@x4500-1 # zfs get test
    invalid property 'test'
            get [-rHp] [-o field[,field]...] [-s source[,source]...]
    The following properties are supported:
            type             NO       NO   filesystem | volume | snapshot
            creation         NO       NO   <date>
            used             NO       NO   <size>
            available        NO       NO   <size>
            referenced       NO       NO   <size>
            compressratio    NO       NO   <1.00x or higher if compressed>
            mounted          NO       NO   yes | no | -
            origin           NO       NO   <snapshot>
            quota           YES       NO   <size> | none
            reservation     YES       NO   <size> | none
            volsize         YES       NO   <size>
            volblocksize     NO       NO   512 to 128k, power of 2
            recordsize      YES      YES   512 to 128k, power of 2
            mountpoint      YES      YES   <path> | legacy | none
            sharenfs        YES      YES   on | off | share(1M) options
            checksum        YES      YES   on | off | fletcher2 | fletcher4 | sha256
            compression     YES      YES   on | off | lzjb
            atime           YES      YES   on | off
            devices         YES      YES   on | off
            exec            YES      YES   on | off
            setuid          YES      YES   on | off
            readonly        YES      YES   on | off
            zoned           YES      YES   on | off
            snapdir         YES      YES   hidden | visible
            aclmode         YES      YES   discard | groupmask | passthrough
            aclinherit      YES      YES   discard | noallow | secure | passthrough
    Sizes are specified in bytes with standard units such as K, M, G, etc.
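One valid way to see the same inheritance information (an attempt to answer the question above):

```shell
# Show a single property recursively; the SOURCE column distinguishes
# default, local, and "inherited from <fs>".
zfs get -r sharenfs test
# Or restrict the listing to properties that were set locally or inherited:
zfs get -r -s local,inherited all test
```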

Administering zpools and fs's

  • example fun commands:
    	zpool iostat test
    	zpool iostat test 5
    	zpool offline test c5t5d0
    	iostat -nxz 5
    	vi /test/hello
    	zfs snapshot test@wednear4pm
    	zfs get all test
    	zfs set snapdir=visible test
    	cd /test/.zfs/snapshot/
    	cd wednear4pm/
    	ls -l
    	cd /test
    	rm /test/hello
    	cp .zfs/snapshot/wednear4pm/hello .
    	zfs list
    	zfs snapshot test@second
    	zfs list
    	zfs destroy test@wednear4pm
    	zfs list
    	cat /test/.zfs/snapshot/second/hello	
  • disabling and removing then replacing and enabling disks (in zfs, then in OS, then back again)
  • moving disk sets from one machine to another (gracefully and not)
  • growing zpools
  • snapshots, local and remote!
  • recovering from snapshots
  • manipulating snapshots
  • ???
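A sketch of the disk-replacement and disk-set-moving bullets above (c5t5d0 is just the example disk from the fun commands; the exact sequence should be verified against zpool(1M)):

```shell
# Replace a failing disk
zpool offline test c5t5d0   # take the disk out of service
# ...physically swap the disk, then:
zpool replace test c5t5d0   # resilver onto the replacement
zpool status test           # watch resilver progress

# Move the disk set to another machine, the graceful way
zpool export test           # on the old host
zpool import test           # on the new host
```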

Network configuration

  • The network ports are located on the back of the systems, arranged in a 2x2 block, and are labeled ports 0 1 2 3. These will correspond to the device names in Solaris (ours will be referred to as e1000gX, where X=0-3, since our systems are using Intel e1000's)
  • To use MTUs larger than 1500, you must edit /kernel/drv/e1000g.conf and change the line to "MaxFrameSize=3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3;" (the docs there explain the bitmasking). I think the system then needs to be rebooted; I have not found the proper order of downing the device and unplumbing it to make the change take effect without one. NOTE, THIS WILL MOST LIKELY GET BLOWN AWAY BY A KERNEL UPDATE!
  • Edit /etc/hostname.e1000g* to look something like "x4500-0 mtu 4500"
  • edit /etc/inet/[hosts ipnodes netmasks ntp.conf] and /etc/[defaultrouter resolv.conf nsswitch.conf hostname.e1000g* defaultdomain] appropriately
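A sketch of what those files might end up containing (all addresses below are made-up placeholders, not our real ones):

```shell
# /etc/hostname.e1000g0 -- interface hostname plus MTU, per the note above
echo 'x4500-0 mtu 4500' > /etc/hostname.e1000g0
# /etc/inet/hosts and /etc/inet/netmasks -- example entries only
echo '192.168.1.10   x4500-0'       >> /etc/inet/hosts
echo '192.168.1.0    255.255.255.0' >> /etc/inet/netmasks
# default router and DNS
echo '192.168.1.1' > /etc/defaultrouter
printf 'nameserver 192.168.1.2\n' > /etc/resolv.conf
```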

Simple OS installation notes

  • DVD ISO for Solaris 10, x86_64, u4 or 08/07 can be found at nemo:/export1/Solaris10/sol-10-GA-x86-dvd.iso (Note, NOT in CVS)
  • The patch cluster I've been installing can be found at nemo:/export1/Solaris10/10u4/updates-10-26/10_x86_Recommended.zip (Note, NOT in CVS)
  • BMC/SP, I set the IP addr via dhcp (MAC on back of unit, as per Sun docs), I added it to dhcpd.conf on nemo, restarted DHCP on nemo, plugged the SP into the nemo network, and Voila!
  • Connected USB DVD drive with Sol 10 11/06 DVD
  • ssh'ed into the SP/BMC with "ssh root@bx4500-1" (the passwd is the Sun default)
  • I reset the system power with "reset SYS"
  • Connected to the system's serial console via "start SP/console"
  • Watched the system POST, and it started to boot off of the DVD, after which I did the following:
  • from DVD's grub menu, selected:
    	Solaris Serial Console ttya
    ** Be patient, very few visual pacifiers early on; don't be worried that it says it's a 32-bit 
       version of the installer **
    from Solaris install method screen, chose:
    	 1.     Solaris Interactive (default)
    it probed and tried to configure each of the NICs
    and babbled about setting up Java
    selected a language:
    	0. English
    selected a term type:
    	3) DEC VT100
    ** From here on, one must press Esc-2 when prompted for F2 or Esc-2, the prompts are 
       inconsistent in this regard **
    	[X] Yes
    Network Interfaces:
    	[X] e1000g0
    DHCP for e1000g0:
    	[X] No
    Hostname for e1000g0:  ** Not FQDN, reconfiguration after install of the next number of 
                              settings will be explained later **
    IP address for e1000g0:
    System part of a subnet:
    	[X] Yes
    Netmask for e1000g0
    IPv6 for e1000g0
    	[X] No
    Default Route for e1000g0
    	[X] Specify one
    Router IP address for e1000g0
    A chance to review
    Configure Kerberos
    	[X] No
    A chance to review
    Name Service?
    	[X] None
    A chance to review
    Time Zone:
    	[X] Americas
    Country or Region:
    	[X] United States
    Time Zone:
    	[X] Central Time
    	set the clock, we'll configure NTP later.
    Root Passwd:
    	set appropriately
    Enabling Remote Services:
    	[X] No   ** This only leaves ssh open, we'll open things up later as we see fit, we'll 
                        also have to enable root ssh login in sshd.conf **
    enters into jumpstart preconfig and other stuff, we may be able to automate this?
    Selected "Standard" install with Esc-2
    Eject a CD/DVD
    	[X] Manually
    	[X] Manual
    Accept License
    Locales  ** Hit enter to see options under a choice, e.g. hitting enter here > [ ] North 
                 America causes other options for North America to expand out **
    	I installed USA-UTF-8 AND ISO-8859-1
    Select System Locale
    	left default set to POSIX C
    Select Products ** Again, Hit enter to see options **
    	 > [ ] Solaris 10 Extra Value Software.................    0.00 MB
    	and select
    	   [X]     Sun Validation Test Suite
    No media for addl and select products
    Select Software
    	[X] Entire Distro
    Select Disk   ** As of 8/21/07, the machine can only boot off of either-of/both-of-if-mirrored 
                     the disks, with UFS, you must scroll down to find disks that can be booted 
                     off of, these device numbers seem to change between versions of kernel(?), 
                     but with this version, the selectable devices are c6t0d0 and c6t4d0 **
    Preserve Data?
    	no, however this is answered...
    Partition Disk?
    	set up at least 11GB for / on c6t0d0s0 11264 MB
    	set up at least 6GB for /var on c6t0d0s3 6000 MB
    	set up at least 2GB for swap on c6t0d0s1 2000 MB (more can be added later) 
    	leave overlap of whatever on c6t0d0s2
    Watch the pretty progress bar!
    After the install completes, detach the drive, then let the system reboot
    Installed Xorg server (as opposed to Sun's xserver)
    Override domainname, for NFS4, I've mixed yes and no to measure impact:
    	For more information about how the NFS version 4 default domain
    	name is derived and its impact, refer to the man pages for nfs(4)
    	and nfsmapid(1m), and the System Administration Guide: Network
  • update patches. I'm using a patch cluster I downloaded on 10/29; it can be found at nemo:/export1/Solaris10/10u4/updates-10-26/10_x86_Recommended.zip. It should be copied onto a machine and uncompressed. A readme called CLUSTER_README can be found in the uncompressed directory; "grep PASS CLUSTER_README" will return a key to unlock the install script. Then the install_cluster script is run to apply the patches.
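Those patch steps, sketched end-to-end (the /var/tmp working directory is an assumption):

```shell
scp nemo:/export1/Solaris10/10u4/updates-10-26/10_x86_Recommended.zip /var/tmp/
cd /var/tmp && unzip 10_x86_Recommended.zip
cd 10_x86_Recommended
grep PASS CLUSTER_README   # returns the key that unlocks the install script
./install_cluster          # apply the patches, supplying the key when asked
```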
  • To get the OS installed in a mirrored way, required a bit of work. This seemingly can be done in an automated way via jump start, but there isn't an intuitive way going through the manual install. This replicates what we did on my visit to CIT:
    • cfgadm -- tells what disks are installed on the system
    • metastat -- shows the status of metadevices, which we're setting up. None should exist after a clean install as shown above.
    • format c6t0d0 -- this is the first bootable disk in our systems, the disk we installed onto earlier. A partition must be created for the metadata for our metadevice. Enter the format command; at the format prompt, type partition; at the partition prompt, type print (this shows the slices laid out during our initial install). Look at the ending cylinder of the last partition you created earlier (var is on the range 1692 - 2456 in my example). Now enter 7 to modify partition 7 (this is what CIT had done); don't enter a partition tag, don't enter permission flags, enter a starting cylinder of 2457 (the ending cylinder above + 1), and enter a size of 32130b. Verify with print that you have something like "7 unassigned wm 2457 - 2458 15.69MB (2/0/0) 32130". If so, enter label to write out the new table; if not, try again. Enter quit to get out of the partition menu, then quit again to get out of the format menu.
    • format c6t4d0 -- repeat exactly as above.
    • prtvtoc /dev/dsk/c6t0d0s2 -- verifies your partitioning. Note, s2 is a special slice in Solaris that describes the disk's partitions.
    • prtvtoc /dev/dsk/c6t0d0s2 | fmthard -i -s - /dev/rdsk/c6t4d0s2 -- fmthard updates the Volume Table of Contents. This command shows what you're about to copy from c6t0 -> c6t4: the -i dumps to stdout without writing to c6t4, and "-s -" reads from stdin (the prtvtoc output). Also note that we're reading from the dsk device and writing to the rdsk device.
    • prtvtoc /dev/dsk/c6t0d0s2 | fmthard -s - /dev/rdsk/c6t4d0s2 -- actually does the copy/write
    • metadb -a -f c6t0d0s7 c6t4d0s7 -- initializes the metadata db on s7 of each of the disks
    • metadb -- should verify we've set something up.
    • swap -l -- will list swap info, we're going to start with our mirrored swap partition.
    • swap -d /dev/dsk/c6t0d0s1 -- disables our installed swap.
    • swap -l -- see, it's gone.
    • metainit d20 1 1 c6t0d0s1 -- creates a meta device d20 on s1 of our install disk, this will be a submirror in our mirrored device created below.
    • metainit d21 1 1 c6t4d0s1 -- creates d20's friend, our other submirror.
    • metastat -- will show we have two meta devices.
    • metainit d2 -m d20 d21 -- will create a metadevice d2, which is mirroring d20 and d21 created above (a mirror of the two submirrors).
    • metastat -- verifies what I just said.
    • swap -a /dev/md/dsk/d2 -- adds our new device as a swap device.
    • swap -l -- see.
    • cat /etc/vfstab -- shows what we mount on boot.
    • vi /etc/vfstab -- copy and edit the swap line such that it now refers to /dev/md/dsk/d2
    • metainit -f d10 1 1 c6t0d0s0 -- this may seem creepy, but you're setting up / so that it can be mirrored from t0d0s0 onto t4d0s0, our first submirror.
    • metainit -f d11 1 1 c6t4d0s0 -- setting up the other half, or second submirror.
    • metainit d1 -m d10 -- add the first half of the mirror.
    • metaroot d1 -- sets up vfstab for us, so / is now mounted off of /dev/md/dsk/d1; not sure why we can't yet attach d11 to d1.
    • cat /etc/vfstab -- verifies.
    • metainit -f d30 1 1 c6t0d0s3 -- sets up the first submirror.
    • metainit -f d31 1 1 c6t4d0s3 -- sets up the second.
    • metainit d3 -m d30 -- add the first half of the mirror.
    • metattach d3 d31 -- attaches the second half.
    • from here I manually edited /etc/vfstab, setting it to mount /dev/md/dsk/d3 on var, and telling it how to fsck d3 under the rdsk device. Copy the d1 on / line, and edit appropriately.
    • Reboot (init 6) and then do one more step
    • login and run metattach d1 d11 -- add the second half of /'s mirror.
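The whole mirroring walkthrough above, condensed into one script-shaped sketch (same device and metadevice names; the partition-7 work in format must already be done, and the final metattach runs only after the reboot):

```shell
# Metadata DBs on slice 7 of both boot disks
metadb -a -f c6t0d0s7 c6t4d0s7

# Swap: d2 mirrors submirrors d20/d21
metainit d20 1 1 c6t0d0s1
metainit d21 1 1 c6t4d0s1
metainit d2 -m d20 d21
swap -d /dev/dsk/c6t0d0s1
swap -a /dev/md/dsk/d2       # also point vfstab's swap line at /dev/md/dsk/d2

# Root: d1 gets only one submirror before the reboot
metainit -f d10 1 1 c6t0d0s0
metainit -f d11 1 1 c6t4d0s0
metainit d1 -m d10
metaroot d1                  # rewrites the / line in vfstab for us

# /var: d3
metainit -f d30 1 1 c6t0d0s3
metainit -f d31 1 1 c6t4d0s3
metainit d3 -m d30
metattach d3 d31             # edit vfstab's /var line to /dev/md/dsk/d3 by hand

init 6                       # reboot
# after the reboot:
metattach d1 d11             # attach the second half of /'s mirror
```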
  • Enable root ssh access by editing /etc/ssh/sshd_config and changing PermitRootLogin to yes, then "svcadm restart ssh" to restart the service.
  • Enable nfs client -- "svcadm enable nfs/client", then run "svcs nfs/client" to see if it is running
  • Disable print server -- "svcadm disable print/server" and "svcadm disable rfc1179"
  • Enable ntp -- first update /etc/inet/ntp.conf appropriately, and then "svcadm enable ntp"
  • Make sure routing is disabled if using multiple interfaces; "/usr/sbin/routeadm" should tell the state?
  • If there's a preexisting zpool on the other 46 HDDs and you want to import it, type "zpool import", which scans the disks for anything that might be recovered and returns info about it (in my example it found a pool that had been called "test"). I then did "zpool import -f test" and it magically found, imported, and mounted my preexisting zpool and its 2 zfs's (gskelton and parmor).
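The import dance from that bullet, as two commands:

```shell
zpool import          # scans disks, lists importable pools (e.g. "test")
zpool import -f test  # force-import; mounts the pool and its zfs's
```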

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.