Braindump of CIT trip
- OS installation notes
- Network configuration
- zpool design considerations
- zpool setup
- Setting up multiple zfs's in a pool
- NFS exporting zfs's
- zfs pool and fs properties
- Administering zpools and fs's
zpool design considerations
- Talk about different zpool configs. Here's the config I think I want to use on at least one of ours (16TB of storage in ((7+2) * 5)+1):
    NAME        STATE     READ WRITE CKSUM
    raidz2      ONLINE       0     0     0
      c0t1d0    ONLINE       0     0     0
      c1t1d0    ONLINE       0     0     0
      c4t1d0    ONLINE       0     0     0
      c5t1d0    ONLINE       0     0     0
      c6t1d0    ONLINE       0     0     0
      c7t1d0    ONLINE       0     0     0
      c0t2d0    ONLINE       0     0     0
      c1t2d0    ONLINE       0     0     0
      c4t2d0    ONLINE       0     0     0
    raidz2      ONLINE       0     0     0
      c5t2d0    ONLINE       0     0     0
      c6t2d0    ONLINE       0     0     0
      c7t2d0    ONLINE       0     0     0
      c0t3d0    ONLINE       0     0     0
      c1t3d0    ONLINE       0     0     0
      c4t3d0    ONLINE       0     0     0
      c5t3d0    ONLINE       0     0     0
      c6t3d0    ONLINE       0     0     0
      c7t3d0    ONLINE       0     0     0
    raidz2      ONLINE       0     0     0
      c0t4d0    ONLINE       0     0     0
      c1t4d0    ONLINE       0     0     0
      c4t4d0    ONLINE       0     0     0
      c6t4d0    ONLINE       0     0     0
      c7t4d0    ONLINE       0     0     0
      c0t5d0    ONLINE       0     0     0
      c1t5d0    ONLINE       0     0     0
      c4t5d0    ONLINE       0     0     0
      c5t5d0    ONLINE       0     0     0
    raidz2      ONLINE       0     0     0
      c6t5d0    ONLINE       0     0     0
      c7t5d0    ONLINE       0     0     0
      c0t6d0    ONLINE       0     0     0
      c1t6d0    ONLINE       0     0     0
      c4t6d0    ONLINE       0     0     0
      c5t6d0    ONLINE       0     0     0
      c6t6d0    ONLINE       0     0     0
      c7t6d0    ONLINE       0     0     0
      c0t7d0    ONLINE       0     0     0
    raidz2      ONLINE       0     0     0
      c1t7d0    ONLINE       0     0     0
      c4t7d0    ONLINE       0     0     0
      c5t7d0    ONLINE       0     0     0
      c6t7d0    ONLINE       0     0     0
      c7t7d0    ONLINE       0     0     0
      c0t0d0    ONLINE       0     0     0
      c1t0d0    ONLINE       0     0     0
      c4t0d0    ONLINE       0     0     0
      c6t0d0    ONLINE       0     0     0
Talk about size/redundancy/performance tradeoffs: the sample config provides 5 logical blobs/devices (ZFS calls these top-level "vdevs"), each having 7*500GB of data capacity plus double parity (a la RAID 6), plus one hot spare shared by all vdevs. This means each vdev could survive two disk failures and still run, and that we'd have two parity blocks available for reconstructing data from potentially bad sectors; the single hot spare means one additional failed disk anywhere in the pool can be resilvered automatically. This layout attempts to balance performance across the 5 x 8132 "PCI-X tunnels"... the jury's out as to whether my assumptions are good.
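As a back-of-envelope sanity check on the capacity claim (a sketch; the 500GB disk size is from the config above, and real usable space will be somewhat lower after metadata overhead):

```shell
# Rough usable capacity for the ((7+2) * 5) + 1 layout.
# Each 9-disk raidz2 vdev stores 7 disks' worth of data; the spare stores none.
DISK_GB=500
DATA_DISKS_PER_VDEV=7
VDEVS=5
USABLE_GB=$((DISK_GB * DATA_DISKS_PER_VDEV * VDEVS))
echo "usable: ${USABLE_GB} GB"   # 17500 decimal GB, i.e. roughly the 16TB figure above
```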
Talk about performance characteristics: See above, for now.
Setting up a zpool with multiple logical blobs (vdevs)?
- device naming: in Solaris disks are referred to using the naming scheme of cXtXdXsX - c=controller, t=target, d=LUN, s=slice
- zpool create/add: these create a pool called test with two raidz2 vdevs (for the config referred to above, I performed additional zpool add's to flesh out to 5 blobs/vdevs):
zpool create -f test raidz2 c0t1d0 c1t1d0 c4t1d0 c5t1d0 c6t1d0 c7t1d0 c0t2d0 c1t2d0 c4t2d0
zpool add test raidz2 c5t2d0 c6t2d0 c7t2d0 c0t3d0 c1t3d0 c4t3d0 c5t3d0 c6t3d0 c7t3d0
zpool add: see above for multiple blobs/devices. For a "spare" do something like:
zpool add test spare c7t0d0
for our work, it makes the most sense to use entire disks/devices and NOT individual slices, though slices are possible (something like "zpool ... c0t1d0s0 c0t1d0s1")
zfs unmount here? For creating individual zfs's in a zpool for the sake of individual snapshots, etc.
Setting up multiple zfs's in a zpool
- zfs create multiple filesystems (for homes, one per user) for various reasons (incl snapshots, but what else was there?):
root@x4500-1 # zfs unmount /test
root@x4500-1 # zfs create test/parmor
root@x4500-1 # zfs mount test/parmor
root@x4500-1 # zfs list
NAME          USED  AVAIL  REFER  MOUNTPOINT
test          235K  15.5T  59.3K  /test
test/parmor  57.0K  15.5T  57.0K  /test/parmor
NFS exporting zfs's
- can be exported by adding share commands to /etc/dfs/dfstab (the Solaris equivalent of /etc/exports)
- can be exported via "zfs set sharenfs=on" (or sharenfs=$options); this can be set per $home fs or globally on a parent ("inherited from" it), and is persistent across reinstalls!
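A sketch of what that might look like (the rw/ro options here are hypothetical examples, not settings we've decided on):

```shell
# Set the property on the parent; child filesystems inherit it unless overridden:
zfs set sharenfs=rw test
# Override per-home, e.g. export one user's fs read-only:
zfs set sharenfs=ro test/parmor
# Verify the whole tree, including where each value comes from (local vs inherited):
zfs get -r sharenfs test
```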
zfs pool and fs properties
- global properties
- per zfs properties
- properties we may be interested in setting: quota, mountpoint, sharenfs, checksum, compression, snapdir (visible or not)
- properties we may be interested in viewing: type, available, compressratio, mounted, origin
- here's a valid get command viewing test/parmor
root@x4500-1 # zfs get all test/parmor
NAME         PROPERTY       VALUE                  SOURCE
test/parmor  type           filesystem             -
test/parmor  creation       Fri Jul 13  1:20 2007  -
test/parmor  used           57.0K                  -
test/parmor  available      15.5T                  -
test/parmor  referenced     57.0K                  -
test/parmor  compressratio  1.00x                  -
test/parmor  mounted        yes                    -
test/parmor  quota          none                   default
test/parmor  reservation    none                   default
test/parmor  recordsize     128K                   default
test/parmor  mountpoint     /test/parmor           default
test/parmor  sharenfs       off                    default
test/parmor  checksum       on                     default
test/parmor  compression    off                    default
test/parmor  atime          on                     default
test/parmor  devices        on                     default
test/parmor  exec           on                     default
test/parmor  setuid         on                     default
test/parmor  readonly       off                    default
test/parmor  zoned          off                    default
test/parmor  snapdir        hidden                 default
test/parmor  aclmode        groupmask              default
test/parmor  aclinherit     secure                 default
here's an invalid command, but the usage "help" it triggers shows which properties are inheritable; how do I view similar info in a valid way?
root@x4500-1 # zfs get test
invalid property 'test'
get [-rHp] [-o field[,field]...] [-s source[,source]...]
The following properties are supported:
    PROPERTY       EDIT  INHERIT   VALUES
    type            NO      NO    filesystem | volume | snapshot
    creation        NO      NO    <date>
    used            NO      NO    <size>
    available       NO      NO    <size>
    referenced      NO      NO    <size>
    compressratio   NO      NO    <1.00x or higher if compressed>
    mounted         NO      NO    yes | no | -
    origin          NO      NO    <snapshot>
    quota          YES      NO    <size> | none
    reservation    YES      NO    <size> | none
    volsize        YES      NO    <size>
    volblocksize    NO      NO    512 to 128k, power of 2
    recordsize     YES     YES    512 to 128k, power of 2
    mountpoint     YES     YES    <path> | legacy | none
    sharenfs       YES     YES    on | off | share(1M) options
    checksum       YES     YES    on | off | fletcher2 | fletcher4 | sha256
    compression    YES     YES    on | off | lzjb
    atime          YES     YES    on | off
    devices        YES     YES    on | off
    exec           YES     YES    on | off
    setuid         YES     YES    on | off
    readonly       YES     YES    on | off
    zoned          YES     YES    on | off
    snapdir        YES     YES    hidden | visible
    aclmode        YES     YES    discard | groupmask | passthrough
    aclinherit     YES     YES    discard | noallow | secure | passthrough
Sizes are specified in bytes with standard units such as K, M, G, etc.
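To the question above: the usage line's -s and -o flags suggest some valid invocations for inspecting inheritance (a guess at useful forms, not yet verified on this box):

```shell
# One property across the whole tree, with its SOURCE column:
zfs get -r sharenfs test
# Only values that were inherited from a parent:
zfs get -r -s inherited all test
# Trim the output down to specific fields:
zfs get -r -o name,property,value,source mountpoint,sharenfs test
```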
Administering zpools and fs's
- The network ports are located on the back of the systems, arranged in a 2x2 block, and are labeled ports 0 1 2 3. These will correspond to the device names in Solaris (ours will be referred to as e1000gX, where X=0-3, since our systems are using Intel e1000's)
- To use MTUs larger than 1500, you must edit /kernel/drv/e1000g.conf and change the line to "MaxFrameSize=3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3;" (the docs in that file explain the per-instance encoding). I think the system then needs to be rebooted; I have not found the proper order of downing the device and unplumbing it to make the change take effect without one. NOTE, THIS WILL MOST LIKELY GET BLOWN AWAY WITH A KERNEL UPDATE!
- Edit /etc/hostname.e1000g* to look something like "x4500-0 mtu 4500"
- edit /etc/inet/[hosts ipnodes netmasks ntp.conf] and /etc/[defaultrouter resolv.conf nsswitch.conf hostname.e1000g* defaultdomain] appropriately
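Putting the network pieces together, the per-interface files might end up looking like this (hostnames and addresses are made-up examples, not our final settings):

```
# /kernel/drv/e1000g.conf -- larger frames on all 16 instances (see the
# comments in that file for what the values encode):
MaxFrameSize=3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3;

# /etc/hostname.e1000g0 -- hostname (resolved via /etc/inet/hosts) plus ifconfig options:
x4500-0 mtu 4500

# /etc/inet/hosts -- example entry for the above:
192.168.1.10    x4500-0

# /etc/defaultrouter -- example:
192.168.1.1
```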
Simple OS installation notes
- DVD ISO for Solaris 10, x86_64, u4 or 08/07 can be found at nemo:/export1/Solaris10/sol-10-GA-x86-dvd.iso (Note, NOT in CVS)
- The patch cluster I've been installing can be found at nemo:/export1/Solaris10/10u4/updates-10-26/10_x86_Recommended.zip (Note, NOT in CVS)
- BMC/SP, I set the IP addr via dhcp (MAC on back of unit, as per Sun docs), I added it to dhcpd.conf on nemo, restarted DHCP on nemo, plugged the SP into the nemo network, and Voila!
- Connected USB DVD drive with Sol 10 11/06 DVD
- ssh'ed into the SP/BMC with "ssh root@bx4500-1" (the passwd is the Sun default)
- I reset the system power with "reset SYS"
- Connected to the system's serial console via "start SP/console"
- Watched the system POST, and it started to boot off of the DVD, after which I did the following:
from DVD's grub menu, selected:
Solaris Serial Console ttya
** Be patient, very few visual pacifiers early on; don't be worried that it says it's a 32-bit
version of the installer **
from Solaris install method screen, chose:
1. Solaris Interactive (default)
it probed and tried to configure each of the NICs
and babbled about setting up Java
selected a language:
selected a term type:
3) DEC VT100
** From here on, one must press Esc-2 when prompted for F2 or Esc-2, the prompts are
inconsistent in this regard **
DHCP for e1000g0:
Hostname for e1000g0: ** Not the FQDN; reconfiguration of the next number of
settings after install will be explained later **
IP address for e1000g0:
System part of a subnet:
Netmask for e1000g0
IPv6 for e1000g0
Default Route for e1000g0
[X] Specify one
Router IP address for e1000g0
A chance to review
A chance to review
A chance to review
Country or Region:
[X] United States
[X] Central Time
set the clock, we'll configure NTP later.
Enabling Remote Services:
[X] No ** This only leaves ssh open; we'll open things up later as we see fit. We'll
also have to enable root ssh login in sshd_config **
enters into jumpstart preconfig and other stuff, we may be able to automate this?
Selected "Standard" install with Esc-2
Eject a CD/DVD
Locales ** Hit enter to see options under a choice, e.g. hitting enter here > [ ] North
America causes other options for North America to expand out **
I installed USA-UTF-8 AND ISO-8859-1
Select System Locale
left default set to POSIX C
Select Products ** Again, Hit enter to see options **
> [ ] Solaris 10 Extra Value Software................. 0.00 MB
[X] Sun Validation Test Suite
No media for addl and select products
[X] Entire Distro
Select Disk ** As of 8/21/07, the machine can only boot off of either (or both, if mirrored)
of two disks, with UFS; you must scroll down to find the disks that can be booted
off of. These device numbers seem to change between kernel versions(?),
but with this version the selectable devices are c6t0d0 and c6t4d0 **
no, however this is answered...
set up at least 11GB for / on c6t0d0s0 11264 MB
set up at least 6GB for /var on c6t0d0s3 6000 MB
set up at least 2GB for swap on c6t0d0s1 2000 MB (more can be added later)
leave overlap of whatever on c6t0d0s2
Watch the pretty progress bar!
After the install completes, detach the drive, then let the system reboot
Installed Xorg server (as opposed to Sun's xserver)
Override domainname, for NFSv4; I've mixed yes and no across installs to measure the impact:
For more information about how the NFS version 4 default domain
name is derived and its impact, refer to the man pages for nfs(4)
and nfsmapid(1m), and the System Administration Guide: Network
update patches: I'm using a patch cluster I downloaded on 10/29. It can be found at nemo:/export1/Solaris10/10u4/updates-10-26/10_x86_Recommended.zip. It should be copied onto the machine and uncompressed. A readme called CLUSTER_README can be found in the uncompressed directory; "grep PASS CLUSTER_README" will return a key to unlock the install script. Then run the install_cluster script to apply the patches.
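The patch procedure above as a command sketch (the unpacked directory name is an assumption from the zip name, and I believe install_cluster prompts for the passcode; check the README):

```shell
# Unpack the cluster copied over from nemo:
unzip 10_x86_Recommended.zip
cd 10_x86_Recommended
# The README holds the passcode needed to unlock the installer:
grep PASS CLUSTER_README
# Apply the patches:
./install_cluster
```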
Getting the OS installed in a mirrored way required a bit of work. This can seemingly be done in an automated way via JumpStart, but there isn't an intuitive way to do it through the manual install. The following replicates what we did on my visit to CIT:
Enable root ssh access by editing /etc/ssh/sshd_config and changing PermitRootLogin to yes, then "svcadm restart ssh" to restart the service.
Enable nfs client -- "svcadm enable nfs/client", then run "svcs nfs/client" to see if it is running
Disable print server -- "svcadm disable print/server" and "svcadm disable rfc1179"
Enable ntp -- first update /etc/inet/ntp.conf appropriately, then "svcadm enable ntp"
Make sure routing is disabled if using multiple interfaces; "/usr/sbin/routeadm" should tell the state?
If there's a preexisting zpool on the other 46 HDDs and you want to import it, type "zpool import", which will scan the disks for anything that might be recovered and return info about it (in my case it found a pool that had been called "test"); I then did "zpool import -f test" and it magically found, imported and mounted my preexisting zpool and 2 zfs's (gskelton and parmor).
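The import sequence above boils down to (pool name "test" per the example):

```shell
zpool import           # scan the disks and list any pools that could be imported
zpool import -f test   # force-import the pool (it was last in use on another host)
zfs list               # its filesystems should now be mounted again
```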
- cfgadm -- tells what disks are installed on the system
- metastat -- shows the status of metadevices, which we're setting up. None should exist after a clean install as shown above.
- format c6t0d0 -- this is the first bootable disk in our systems, the disk we installed onto earlier. A partition must be created for the metadata for our metadevice. Enter the format command, then at the format prompt, type partition, at the partition prompt type print (this shows the slices laid out during our initial install), look at the cylinder of last partition you created earlier (var is on the range of 1692 - 2456 in my example), now enter 7 to modify partition 7 (this is what CIT had done), don't enter a partition tag, don't enter permission flags, enter 2457 (ending cylinder above + 1), and enter a size of 32130b. Verify with a print that you have something like "7 unassigned wm 2457 - 2458 15.69MB (2/0/0) 32130". If so, then enter label to write out the new table; if not, try again. Enter quit to get out of the partition menu, then quit again to get out of the format menu.
- format c6t4d0 -- repeat exactly as above.
- prtvtoc /dev/dsk/c6t0d0s2 -- verifies your partitioning. Note: s2 is a special slice in Solaris that describes the whole disk's partitions.
- prtvtoc /dev/dsk/c6t0d0s2 | fmthard -i -s - /dev/rdsk/c6t4d0s2 -- fmthard updates the Volume Table of Contents; this command shows what you're about to copy from c6t0 -> c6t4. The -i dumps to stdout w/o writing to c6t4, and the "-s -" reads from stdin (the prtvtoc output). Also note that we're reading from the dsk device and writing to the rdsk device.
- prtvtoc /dev/dsk/c6t0d0s2 | fmthard -s - /dev/rdsk/c6t4d0s2 -- actually does the copy/write
- metadb -a -f c6t0d0s7 c6t4d0s7 -- initializes the metadata db on s7 of each of the disks
- metadb -- should verify we've set something up.
- swap -l -- will list swap info, we're going to start with our mirrored swap partition.
- swap -d /dev/dsk/c6t0d0s1 -- disables our installed swap.
- swap -l -- see, it's gone.
- metainit d20 1 1 c6t0d0s1 -- creates a meta device d20 on s1 of our install disk, this will be a submirror in our mirrored device created below.
- metainit d21 1 1 c6t4d0s1 -- creates d20's friend, our other submirror.
- metastat -- will show we have two meta devices.
- metainit d2 -m d20 d21 -- will create a metadevice d2, which is mirroring d20 and d21 created above (a mirror of the two submirrors).
- metastat -- verifies what I just said.
- swap -a /dev/md/dsk/d2 -- adds our new device as a swap device.
- swap -l -- see.
- cat /etc/vfstab -- shows what we mount on boot.
- vi /etc/vfstab -- copy and edit the swap line such that it now refers to /dev/md/dsk/d2
- metainit -f d10 1 1 c6t0d0s0 -- this may seem creepy, but you're setting up / so that it can be mirrored from t0d0s0 onto t4d0s0, our first submirror.
- metainit -f d11 1 1 c6t4d0s0 -- setting up the other half, or second submirror.
- metainit d1 -m d10 -- add the first half of the mirror.
- metaroot d1 -- sets up vfstab for us, so / is now mounted off of /dev/md/dsk/d1; not sure why we can't yet attach d11 to d1 (presumably the system has to reboot and actually run / off the metadevice before the second half can sync).
- cat /etc/vfstab -- verifies.
- metainit -f d30 1 1 c6t0d0s3 -- sets up the first submirror.
- metainit -f d31 1 1 c6t4d0s3 -- sets up the second.
- metainit d3 -m d30 -- add the first half of the mirror.
- metattach d3 d31 -- attaches the second half.
- from here I manually edited /etc/vfstab, setting it to mount /dev/md/dsk/d3 on var, and telling it how to fsck d3 under the rdsk device. Copy the d1 on / line, and edit appropriately.
- Reboot (init 6) and then do one more step
- login and run metattach d1 d11 -- add the second half of /'s mirror.
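For reference, after all of the edits above, the metadevice lines in /etc/vfstab should look something like this (a sketch reconstructed from the steps; check field spacing against vfstab(4)):

```
# /etc/vfstab excerpts -- swap, / and /var now live on the SVM mirrors:
# device to mount  device to fsck   mount point  FS type  fsck pass  mount at boot  options
/dev/md/dsk/d2     -                -            swap     -          no             -
/dev/md/dsk/d1     /dev/md/rdsk/d1  /            ufs      1          no             -
/dev/md/dsk/d3     /dev/md/rdsk/d3  /var         ufs      1          no             -
```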