ZFS root with Debian Stretch and FAI
SET UP ZFS ROOT USING FAI
Task:
Setup a ZFS root pool (rpool) and an additional data pool (export)
- using FAI, Debian Stretch
- on a machine with 3 SSDs (sd[amn]) and 11 HDDs (sd[b-l])
- with LEGACY BIOS (i.e., no UEFI)
- using a class ZFS_SERVER; another class script sets DISKS_14 (only if there are exactly 14 disks), which is used as a safeguard (a sketch follows below)
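Such a safeguard class script is not part of the steps below, so here is a minimal sketch of what it could look like (the file name and the disk counting are only an illustration and have to match your hardware):
# cat config/class/38-disks
#!/bin/bash
# emit the DISKS_14 class only if exactly 14 whole disks are visible
n=$(ls /dev/disk/by-id/ata-* /dev/disk/by-id/scsi-* 2>/dev/null | grep -v -- -part | wc -l)
[ "$n" -eq 14 ] && echo DISKS_14
exit 0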
Used documentation:
- https://github.com/zfsonlinux/zfs/wiki/Debian-Stretch-Root-on-ZFS
- http://www.thecrosseroads.net/2016/02/booting-a-zfs-root-via-uefi-on-debian
- https://www.funtoo.org/ZFS_Install_Guide
(and some more which I discarded in favour of the three above)
Considerations:
Since there is always more than one way to reach a certain goal, the sequence of steps and the file contents presented here are not the only way to achieve the intended setup. Nevertheless they worked for me, and I'm not in a position to add more tests and modifications since the machines entered production weeks ago.
Due to the way FAI works, installing spl-dkms and zfs-dkms may result in partially broken dependencies (usually because linux-headers haven't been installed yet). I addressed this by adding a few hooks; of course the very same hooks could also be used to fully install those (and related) packages instead of adding them to the package list.
My setup may be somewhat atypical, due to its particular history: I got two SSDs and a spare for the operating system, and 11 HDDs for 8+2+1 RAID-6 with spare for data storage, but somehow the hardware RAID controller was dropped from the specs at some point. Instead of falling back to mdraid, this time I wanted to get rid of it and make use of the extra features ZFS provides. Thus, all three SSDs were combined into a single mirror (consider this as RAID-1 with the spare already constantly resilvering), and the 11 spinning disks were used to form a single RAIDZ3 (there's no RAID-7 level that might compare, but we can now stand two disk failures and still have "RAID-5 type" redundancy - without any resilvering/rebuilding delays). Basically, this is similar to a "hot spares" setup, with the extra disks indeed "hot" but not exactly "spares" waiting to be rebuilt. I consider this an extra gain in robustness (and the ZFS checksum consistency checks are another plus).
I left the option open to set sharenfs=on or to let the kernel NFS server (nfs-kernel-server) handle NFS requests.
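For illustration, both routes would look roughly like this once the export/data dataset (created further below) exists; on Linux both end up using the kernel nfsd anyway, the difference is merely whether ZFS or /etc/exports manages the export list (the network range below is just a placeholder):
# ZFS-managed export:
zfs set sharenfs=on export/data
zfs get sharenfs export/data
# ... or the classic nfs-kernel-server way via /etc/exports:
echo '/export/data 192.168.1.0/24(rw,no_subtree_check)' >> /etc/exports
exportfs -ra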
I did not succeed in enabling UEFI. One has to be extra careful when assigning EFI partitions, and making sure they (all!) get the necessary software installed. (Could this be done by faking a /boot/efi tree, not necessarily a partition, and copying its contents to multiple filesystems? How would a running system with multiple disks, e.g. a rootfs RAID, handle multiple EFI partitions?)
Caveat: The following instructions had to be recovered from a printout using OCR. Some recognition errors may have survived.
Steps:
(1) Prepare your NFSROOT.
I've found that the dkms packages would be installed at a random time, and their configure step would fail. Fix this in a (post-create) hook.
# diff /etc/fai/NFSROOT.orig /etc/fai/NFSROOT
> linux-headers-amd64   # or whatever your arch is
> spl-dkms zfs-dkms zfsutils-linux zfs-dracut
# cat /etc/fai/nfsroot-hooks/90-zfs
#!/bin/bash
export DEBIAN_FRONTEND=noninteractive
$ROOTCMD dpkg-reconfigure -fnoninteractive spl-dkms
$ROOTCMD dpkg-reconfigure -fnoninteractive zfs-dkms
Note: One could probably leave the NFSROOT package list unchanged and instead perform the additions with apt-get install, in the right order, inside the hook; a sketch of that variant follows.
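An (untested) sketch of that variant, with the same packages as above and the headers installed first:
# cat /etc/fai/nfsroot-hooks/90-zfs
#!/bin/bash
# install instead of just reconfiguring - order matters: headers before the dkms packages
export DEBIAN_FRONTEND=noninteractive
$ROOTCMD apt-get -y install linux-headers-amd64
$ROOTCMD apt-get -y install spl-dkms
$ROOTCMD apt-get -y install zfs-dkms zfsutils-linux zfs-dracut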
(2) Build your NFSROOT (as usual).
Check in the verbose output that the modules have actually been built.
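A quick cross-check, assuming the default NFSROOT location /srv/fai/nfsroot (adjust to yours):
# chroot /srv/fai/nfsroot dkms status
# chroot /srv/fai/nfsroot find /lib/modules \( -name spl\* -o -name zfs\* \) -ls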
(3) Make some additions to your FAI config, as below.
DO NOT COPY AND PASTE. THERE MAY BE DRAGONS. READ, UNDERSTAND, AND ADJUST TO YOUR PLAN
For sysinfo, first add a class/* script to, e.g., find existing zpools. This may also prove useful if you break something later.
During install,
- disk zapping is done in the "partition" hook,
- zpool and zfs creation take place in a "mountdisks" hook,
- modules get fixed in a "configure" hook
- and final zpool export is done in "savelog".
In the scripts/GRUB_PC directory,
- I modified the 10-setup script of GRUB_PC (unnecessarily?)
- and added a 09-zfs one to get the initramdisk refreshed
The /target tree may already be populated, and importing existing zpools may cause confusion:
# cat config/class/39-zfs
#!/bin/bash
# redirect output so classes aren't clobbered
(
modprobe spl
modprobe zfs
# install target may already exist, get it out of the way & recreate
if [ -d $target ]
then
  ls -l $target
  mv $target $target.000
fi
mkdir -p $target
# are there any existing targets? (-f option may be useful but dangerous)
zpool import -a -d /dev/disk/by-id -R $target
zpool list
zfs list -t all -o name,type,mountpoint,compress,exec,setuid,atime,relatime
# properly export all zpools again
zpool export -a
zpool list
# restore original state of install target
if [ -d $target.000 ]
then
  mv $target $target.001
  mv $target.000 $target
fi
) >&2
Before "partitioning", zap disk MBRs:
# cat config/hooks/partition.ZFS_SERVER
#!/bin/bash
ifclass ZFS_SERVER && {
  ifclass DISKS_14 && {
    # clear partition tables - disk selection to be adjusted!
    for disk in \
      /dev/disk/by-id/ata-* \
      /dev/disk/by-id/scsi-*
    do
      case $disk in
        *-part*) continue ;;
      esac
      echo zapping disk $disk
      sgdisk --zap-all $disk 2>/dev/null || echo problem zapping $disk
    done
    # DO NOT prepare boot partitions (yet)
  }
}
Before "mounting", set up the pools and filesystems:
# cat config/hooks/mountdisks.ZFS_SERVER
#!/bin/bash
ifclass ZFS_SERVER && {
  ifclass DISKS_14 && {
    # we've got 14 disks: 3*128GB SSDs, 11*10TB HDDs for export/data
    modprobe spl
    modprobe zfs
    # extract disk IDs, sort by current "sd" name - this is for my device tree, yours may differ!
    ssds=`ls -l /dev/disk/by-id/ata-*  | grep -v -- -part | awk '{print $NF, $(NF-2)}' | sort | cut -d" " -f2`
    hdds=`ls -l /dev/disk/by-id/scsi-* | grep -v -- -part | awk '{print $NF, $(NF-2)}' | sort | cut -d" " -f2`
    # create root pool, set altroot to /target but don't mount yet
    {
      zpool create \
        -f \
        -o ashift=12 \
        -o autoreplace=on \
        -R $target \
        -O mountpoint=none \
        -O atime=off -O relatime=on \
        -O compression=lz4 \
        -O normalization=formD \
        -O xattr=sa -O acltype=posixacl \
        rpool \
        mirror \
        $ssds
      # install GRUB on all disks of the pool
      echo "BOOT_DEVICE=\"$ssds\"" >> $LOGDIR/disk_var.sh
      # main bootenv dataset
      zfs create \
        -o mountpoint=none \
        rpool/ROOT
      # this one we're going to use
      zfs create \
        -o mountpoint=/ \
        rpool/ROOT/debian
      zfs set \
        mountpoint=/rpool \
        rpool
      zpool set \
        bootfs=rpool/ROOT/debian \
        rpool
      # current state
      zfs mount
      zpool get all rpool
      zfs list -t all -o name,type,mountpoint,compress,exec,setuid,atime,relatime
      zpool export rpool
    }
    # create data pool
    {
      zpool create \
        -f \
        -o ashift=12 \
        -o autoreplace=on \
        -R $target \
        -O mountpoint=/export \
        -O atime=off -O relatime=on \
        -O compression=lz4 \
        -O normalization=formD \
        -O xattr=sa -O acltype=posixacl \
        -O recordsize=1M \
        export \
        raidz3 \
        $hdds
      zfs create \
        -o setuid=off \
        -o mountpoint=/export/data \
        -o sharenfs=off \
        export/data
      # [...]
      # current state
      zfs mount
      zpool get all export
      zfs list -t all -o name,type,mountpoint,compress,exec,setuid,atime,relatime
      zpool export export
    }
    zpool list
    # /target *should* be empty now. But you never know...
    echo check $target:
    ls -lR $target
    echo clean $target:
    mv $target $target.000
    mkdir $target
    # now re-import the stuff
    zpool import -d /dev/disk/by-id -R $target rpool
    zpool import -d /dev/disk/by-id -R $target export
    # and save the state
    mkdir -p $target/etc/zfs
    zpool set cachefile=$target/etc/zfs/zpool.cache rpool
    zpool set cachefile=$target/etc/zfs/zpool.cache export
    # prepare for grub
    ifclass GRUB_PC && \
    {
      for ssd in $ssds
      do
        echo Preparing $ssd for GRUB_PC:
        partx --show $ssd
        /sbin/sgdisk -a1 -n2:48:2047 -t2:EF02 -c2:"BIOS boot partition" \
          $ssd
        partx --update $ssd
        partx --show $ssd
      done
    }
  }
}
Before configuring, make sure the modules are there. The same reasoning applies as for the NFSROOT, i.e. this might be the place to install the SPL and ZFS packages in the right order:
# cat config/hooks/configure.ZFS_SERVER
#!/bin/bash
export DEBIAN_FRONTEND=noninteractive
$ROOTCMD dpkg-reconfigure -fnoninteractive spl-dkms
$ROOTCMD dpkg-reconfigure -fnoninteractive zfs-dkms
# of course, we have to load those modules inside the chroot
$ROOTCMD modprobe spl
$ROOTCMD modprobe zfs
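If one goes that route (and drops the openzfs entries from the package list below), the configure hook might instead look roughly like this - untested in this exact form, package names as above:
#!/bin/bash
# alternative: install, not just reconfigure, inside the target - order matters
export DEBIAN_FRONTEND=noninteractive
$ROOTCMD apt-get -y install linux-headers-amd64
$ROOTCMD apt-get -y install spl-dkms zfs-dkms
$ROOTCMD apt-get -y install zfsutils-linux zfs-initramfs zfs-zed
$ROOTCMD modprobe spl
$ROOTCMD modprobe zfs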
Before ending the install, properly export all pools:
# cat config/hooks/savelog.ZFS_SERVER
#!/bin/bash
ifclass ZFS_SERVER && \
{
  zpool export -a
}
Config files. We don't need any disk structure:
# cat config/disk_config/ZFS_SERVER
# no disk config at all
and the following might be (at least in part) replaced by apt-get install in a hook:
# cat config/package_config/ZFS_SERVER
PACKAGES install
# [...]
# openzfs (if not installed by hook)
spl-dkms zfs-dkms zfsutils-linux
libzfslinux-dev libzfs2linux libzpool2linux
zfs-zed zfs-initramfs
The following script ensures and confirms that everything (including the initrd) is ready for GRUB installation:
# cat config/scripts/GRUB_PC/09-zfs
#!/bin/bash
ifclass ZFS_SERVER && {
  echo reinstalling grub-pc ...
  $ROOTCMD apt-get -y install --reinstall grub-pc \
    && echo ... reinstalled grub-pc \
    || echo ... grub-pc reinstall problem
  echo searching for zfs modules ...
  $ROOTCMD find /lib/modules \( -name spl\* -o -name zfs\* \) -ls
  echo searching for kernel and initrd
  $ROOTCMD find /boot \( -name initrd\* -o -name vmlinuz\* \) -ls
  echo check whether grub recognizes the rootfs
  $ROOTCMD grub-probe /
  echo rebuilding initramfs \(may not be necessary any longer\) ...
  $ROOTCMD update-initramfs -u -v -k all \
    && echo ... initramfs rebuilt \
    || echo ... initramfs rebuild problem
  echo changing grub defaults
  sed -i \
    -e 's~.*\(GRUB_TERMINAL=console.*\)~\1~' \
    -e 's~\(^GRUB_CMDLINE_LINUX_DEFAULT=.*\)quiet\(.*$\)~\1\2~' \
    $target/etc/default/grub
  grep GRUB_ $target/etc/default/grub
}
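For clarity: the two sed expressions uncomment GRUB_TERMINAL=console and drop "quiet" from GRUB_CMDLINE_LINUX_DEFAULT, so on a stock Debian /etc/default/grub the final grep should show, amongst the other GRUB_ lines, something like:
GRUB_CMDLINE_LINUX_DEFAULT=""
GRUB_TERMINAL=console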
I'm not really sure the changes to GRUB_PC/10-setup are necessary, but they were the result of a long trial-and-error phase which I didn't want to jeopardize:
# diff -u config/scripts/GRUB_PC/10-setup{.orig,}
--- 10-setup.orig	2018-05-30 10:50:42.000000000 +0200
+++ 10-setup	2018-12-07 14:17:14.564558118 +0100
@@ -26,5 +26,7 @@
 fi
 GROOT=$($ROOTCMD grub-probe -tdrive -d $BOOT_DEVICE)
+echo Using GROOT=\"${GROOT}\" for BOOT_DEVICE=\"${BOOT_DEVICE}\"
 # handle /boot in lvm-on-md
 _bdev=$(readlink -f $BOOT_DEVICE)
@@ -42,10 +43,25 @@
         $ROOTCMD grub-install --no-floppy "/dev/$device"
     done
 else
-    $ROOTCMD grub-install --no-floppy "$GROOT"
+#    $ROOTCMD grub-install --no-floppy "$GROOT"
+#    if [ $? -eq 0 ]; then
+#        echo "Grub installed on $BOOT_DEVICE = $GROOT"
+#    fi
+    for GROOTITEM in $GROOT
+    do
+        # strip parentheses
+        GROOTITEM=${GROOTITEM#(}
+        GROOTITEM=${GROOTITEM%)}
+        echo Now GROOTITEM=\"${GROOTITEM}\"
+        # strip hostdisk/ prefix
+        GROOTITEM=$(echo $GROOTITEM | sed 's~hostdisk/~~')
+        echo Using GROOTITEM=\"${GROOTITEM}\"
+        echo Install grub on $GROOTITEM:
+        $ROOTCMD grub-install --no-floppy "$GROOTITEM"
     if [ $? -eq 0 ]; then
-        echo "Grub installed on $BOOT_DEVICE = $GROOT"
+        echo "Grub installed on $BOOT_DEVICE = $GROOTITEM"
     fi
+    done
 fi
 $ROOTCMD update-grub
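In case the loop looks odd: with several boot devices, grub-probe -tdrive returns one parenthesized entry per disk, and here each entry carried a hostdisk/ prefix (that's what the sed strips). What happens to a single entry - the value below is made up for illustration:
GROOTITEM='(hostdisk//dev/sda)'                       # one made-up grub-probe entry
GROOTITEM=${GROOTITEM#(}                              # -> hostdisk//dev/sda)
GROOTITEM=${GROOTITEM%)}                              # -> hostdisk//dev/sda
GROOTITEM=$(echo $GROOTITEM | sed 's~hostdisk/~~')    # -> /dev/sda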
(4) Run sysinfo first
to get to know your hardware, and to check whether the classes are OK
(5) If you feel like it, run your first FAI install ;)
That's it...
Steffen Grunewald <steffen.grunewald@aei.mpg.de> 2018-2019