ZFS root with Debian Stretch and FAI


SET UP ZFS ROOT USING FAI

Task:

Set up a ZFS root pool (rpool) and an additional data pool (export)

  • using FAI, Debian Stretch
  • on a machine with 3 SSDs (sd[amn]) and 11 HDDs (sd[b-l])
  • with LEGACY BIOS (i.e., no UEFI)
  • using a class ZFS_SERVER; another class script sets DISKS_14 (only if there are exactly 14 disks), which is used as a safeguard (a sketch of that script follows this list)
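
The DISKS_14 class script itself is not reproduced on this page. A minimal sketch of what it could look like (the file name and the disk-counting method are assumptions of mine, not the original script):

# cat config/class/38-disks
#!/bin/bash

# emit the safeguard class only if exactly 14 whole disks (sda..sdn) are present;
# class scripts define classes via stdout, so keep any other output away from it
ndisks=$(ls /dev/sd? 2>/dev/null | wc -l)
[ "$ndisks" -eq 14 ] && echo DISKS_14
exit 0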


Used documentation:

(and some more which I discarded in favour of the three above)


Considerations:

Since there is always more than one way to reach a goal, the sequence of steps and the file contents presented here are not the only way to achieve the intended setup. Nevertheless, they worked for me, and I am not in a position to run further tests and modifications, since the machines entered their production life weeks ago.

Due to the way FAI works, installing spl-dkms and zfs-dkms may result in partially broken dependencies (usually because the linux-headers package hasn't been installed yet). I addressed this by adding a few hooks; of course, the very same hooks could also be used to fully install those (and related) packages instead of adding them to the package list.

My setup may be somewhat atypical, due to its particular history: I got two SSDs plus a spare for the operating system, and 11 HDDs for an 8+2+1 RAID-6-with-spare data storage, but somehow the hardware RAID controller was dropped from the specs at some point. Rather than falling back to mdraid, this time I wanted to get rid of it and make use of the extra features ZFS provides. Thus, all three SSDs were combined into a single mirror (consider this RAID-1 with the spare already and constantly resilvered), and the 11 spinning disks were used to form a single RAIDZ3 (there's no RAID-7 level that might compare, but we can now stand two disk failures and still have "RAID-5 type" redundancy, without any resilvering/rebuilding delays). Basically, this is similar to a "hot spares" setup, with the extra disks indeed "hot" but not exactly "spares" waiting to be rebuilt. I consider this an extra gain in robustness (and the ZFS checksum consistency checks are another plus).

I left the option open to either set sharenfs=on on the data filesystems or to let the kernel NFS server (nfs-kernel-server) handle NFS requests.
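
For illustration only (this is not part of the install hooks): switching between the two approaches later is a single property change on the data filesystem created further below.

# let ZFS announce the NFS share itself
zfs set sharenfs=on export/data
# or keep it off and maintain /etc/exports for the kernel NFS server instead
zfs set sharenfs=off export/data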

I did not succeed in enabling UEFI. One has to be extra careful when assigning EFI partitions, and when making sure they (all!) get the necessary software installed. (Could this be done by faking a /boot/efi tree, not necessarily a partition, and copying its contents to multiple filesystems? How would a running system with multiple disks, e.g. a rootfs RAID, handle multiple EFI partitions?)

Caveat: The following instructions had to be recovered from a printout using OCR. Some recognition errors may have survived.


Steps:

(1) Prepare your NFSROOT.

I found that the DKMS packages would be installed at a seemingly random point of the NFSROOT build, and that their configure step would fail. This is fixed in a (post-create) hook.


# diff /etc/fai/NFSROOT.orig /etc/fai/NFSROOT
linux-headers-amd64 # or whatever your arch is
spl-dkms
zfs-dkms
zfsutils-linux
zfs-dracut
# cat /etc/fai/nfsroot-hooks/90-zfs
#!/bin/bash

export DEBIAN_FRONTEND=noninteractive
$ROOTCMD dpkg-reconfigure -fnoninteractive spl-dkms
$ROOTCMD dpkg-reconfigure -fnoninteractive zfs-dkms

Note: One could probably leave the NFSROOT package list unchanged and instead perform the additions with apt-get install, in the right order, in the hook (see the sketch below).
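
A sketch of that variant (untested here): the same 90-zfs hook, with the additions done via apt-get install in dependency order instead of listing the packages in /etc/fai/NFSROOT. The hook body would then be:

#!/bin/bash

export DEBIAN_FRONTEND=noninteractive
# headers first, so the DKMS modules can actually be built ...
$ROOTCMD apt-get -y install linux-headers-amd64
# ... then the kernel module sources ...
$ROOTCMD apt-get -y install spl-dkms zfs-dkms
# ... and finally the userland tools
$ROOTCMD apt-get -y install zfsutils-linux zfs-dracut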


(2) Build your NFSROOT (as usual).

Check in the verbose output that the kernel modules have actually been built.
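
One way to double-check, assuming the default NFSROOT location /srv/fai/nfsroot (adjust to your setup):

# chroot /srv/fai/nfsroot dkms status

Both spl and zfs should be reported as "installed" for the kernel version shipped in the NFSROOT.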


(3) Make some additions to your FAI config, as below.

DO NOT COPY AND PASTE. THERE MAY BE DRAGONS. READ, UNDERSTAND, AND ADJUST TO YOUR PLAN

For sysinfo, first add a class/* script to, e.g., find existing zpools. This may also prove useful if you break something later.

During install,

  • disk zapping is done in the "partition" hook,
  • zpool and zfs creation take place in a "mountdisks" hook,
  • modules get fixed in a "configure" hook,
  • and the final zpool export is done in a "savelog" hook.

In the scripts/GRUB_PC directory,

  • I modified the 10-setup script of GRUB_PC (perhaps unnecessarily),
  • and added a 09-zfs one to get the initial ramdisk refreshed.


The /target tree may already be populated, and importing existing zpools may cause confusion:

# cat config/class/39-zfs
#!/bin/bash

# redirect all output to stderr so the emitted class list isn't clobbered
(
modprobe spl
modprobe zfs

# install target may already exist, get it out of the way & recreate
if [ -d $target ]
then
    ls -l $target
    mv $target $target.000
fi
mkdir -p $target

# are there any existing zpools? (the -f option may be useful, but dangerous)
zpool import -a -d /dev/disk/by-id -R $target
zpool list
zfs list -t all -o name,type,mountpoint,compress,exec,setuid,atime,relatime
# properly export all zpools again
zpool export -a
zpool list

# restore original state of install target
if [ -d $target.000 ]
then
    mv $target $target.001
    mv $target.000 $target
fi
) >&2

Before "partitioning", zap disk MBRs:

# cat config/hooks/partition.ZFS_SERVER
#!/bin/bash

ifclass ZFS_SERVER && {
    ifclass DISKS_14 && {
# clear partition tables - disk selection to be adjusted!
for disk in \
    /dev/disk/by-id/ata-*  \
    /dev/disk/by-id/scsi-*
do
    case $disk in
	*-part*)
	    continue
	    ;;
    esac
    echo zapping disk $disk
    sgdisk --zap-all $disk 2>/dev/null || echo problem zapping $disk
done
# DO NOT prepare boot partitions (yet)
    }
}

Before "mounting", set up the pools and filesystems:

# cat config/hooks/mountdisks.ZFS_SERVER
#!/bin/bash

ifclass ZFS_SERVER && {
    ifclass DISKS_14 && {
# we've got 14 disks: 3*128GB SSDs for the system, 11*10TB HDDs for export/data
modprobe spl
modprobe zfs
# extract disk IDs, sort by current "sd" name - this is for my device tree, yours may differ!
ssds=`ls -l /dev/disk/by-id/ata-*  | grep -v -- -part | awk '{print $NF, $(NF-2)}' | sort | cut -d" " -f2`
hdds=`ls -l /dev/disk/by-id/scsi-* | grep -v -- -part | awk '{print $NF, $(NF-2)}' | sort | cut -d" " -f2`

# create root pool, set altroot to /target but don't mount yet
{
zpool create \
    -f \
    -o ashift=12 \
    -o autoreplace=on \
    -R $target \
    -O mountpoint=none \
    -O atime=off -O relatime=on \
    -O compression=lz4 \
    -O normalization=formD \
    -O xattr=sa -O acltype=posixacl \
    rpool \
        mirror \
            $ssds
# install GRUB on all disks of the pool
echo "BOOT_DEVICE=\"$ssds\"" >> $LOGDIR/disk_var.sh
# main bootenv dataset
zfs create \
    -o mountpoint=none \
    rpool/ROOT
# this one we're going to use
zfs create \
    -o mountpoint=/ \
    rpool/ROOT/debian
zfs set \
    mountpoint=/rpool \
    rpool
zpool set \
    bootfs=rpool/ROOT/debian \
    rpool
# current state
zfs mount
zpool get all rpool
zfs list -t all -o name,type,mountpoint,compress,exec,setuid,atime,relatime
zpool export rpool
}

# create data pool
{
zpool create \
    -f \
    -o ashift=12 \
    -o autoreplace=on \
    -R $target \
    -O mountpoint=/export \
    -O atime=off -O relatime=on \
    -O compression=lz4 \
    -O normalization=formD \
    -O xattr=sa -O acltype=posixacl \
    -O recordsize=1M \
    export \
        raidz3 \
            $hdds
zfs create \
    -o setuid=off \
    -o mountpoint=/export/data \
    -o sharenfs=off \
    export/data
# [...]
# current state
zfs mount
zpool get all export
zfs list -t all -o name,type,mountpoint,compress,exec,setuid,atime,relatime
zpool export export
}

zpool list
# /target *should* be empty now. But you never know...
echo check $target:
ls -lR $target
echo clean $target:
mv $target $target.000
mkdir $target
# now re-import the stuff
zpool import -d /dev/disk/by-id -R $target rpool
zpool import -d /dev/disk/by-id -R $target export
# and save the state
mkdir -p $target/etc/zfs
zpool set cachefile=$target/etc/zfs/zpool.cache rpool
zpool set cachefile=$target/etc/zfs/zpool.cache export

# prepare for grub
ifclass GRUB_PC && \
{
    for ssd in $ssds
    do
	echo Preparing $ssd for GRUB_PC:
	partx --show   $ssd
	/sbin/sgdisk -a1 -n2:48:2047 -t2:EF02 -c2:"BIOS boot partition" \
	    $ssd
	partx --update $ssd
	partx --show   $ssd
    done
}

    }
}

Before configuring, make sure the modules are there. The same reasoning applies as for the NFSROOT, i.e. this might be the place to install the SPL and ZFS packages in the right order (a sketch of that variant follows the hook):

# cat config/hooks/configure.ZFS_SERVER
#!/bin/bash

export DEBIAN_FRONTEND=noninteractive
$ROOTCMD dpkg-reconfigure -fnoninteractive spl-dkms
$ROOTCMD dpkg-reconfigure -fnoninteractive zfs-dkms

# of course, we have to load those modules inside the chroot
$ROOTCMD modprobe spl
$ROOTCMD modprobe zfs
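
A sketch of that alternative (untested here, analogous to the NFSROOT variant in step (1)): it would replace both the dpkg-reconfigure calls above and the ZFS entries in the package list below, and it assumes linux-headers-amd64 is already present on the target (or gets installed here first).

export DEBIAN_FRONTEND=noninteractive
# headers first, then the DKMS modules, then the userland tools
$ROOTCMD apt-get -y install linux-headers-amd64
$ROOTCMD apt-get -y install spl-dkms zfs-dkms
$ROOTCMD apt-get -y install zfsutils-linux zfs-initramfs zfs-zed
# the modules still have to be loaded inside the chroot
$ROOTCMD modprobe spl
$ROOTCMD modprobe zfs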

Before ending the install, properly export all pools:

# cat config/hooks/savelog.ZFS_SERVER
#!/bin/bash

ifclass ZFS_SERVER && \
{
    zpool export -a
}

Config files. We don't need any disk layout from FAI's setup-storage, since the hooks take care of the disks:

# cat config/disk_config/ZFS_SERVER
# no disk config at all

and the following might be (at least in part) replaced by apt-get install in a hook:

# cat config/package_config/ZFS_SERVER
PACKAGES install

# [...]

# openzfs (if not installed by hook)
spl-dkms
zfs-dkms
zfsutils-linux
 libzfslinux-dev
 libzfs2linux
 libzpool2linux
 zfs-zed
 zfs-initramfs

The following script ensures and confirms that everything (including the initrd) is ready for the GRUB installation:

# cat config/scripts/GRUB_PC/09-zfs
#!/bin/bash

ifclass ZFS_SERVER && {
    echo reinstalling grub-pc ...
    $ROOTCMD apt-get -y install --reinstall grub-pc \
    && echo ... reinstalled grub-pc \
    || echo ... grub-pc reinstall problem
    echo searching for zfs modules ...
    $ROOTCMD find /lib/modules \( -name spl\* -o -name zfs\* \) -ls
    echo searching for kernel and initrd
    $ROOTCMD find /boot \( -name initrd\* -o -name vmlinuz\* \) -ls

    echo check whether grub recognizes the rootfs
    $ROOTCMD grub-probe /
    echo rebuilding initramfs \(may not be necessary any longer\) ...
    $ROOTCMD update-initramfs -u -v -k all \
    && echo ... initramfs rebuilt \
    || echo ... initramfs rebuild problem
    echo changing grub defaults
    sed -i \
	-e 's~.*\(GRUB_TERMINAL=console.*\)~\1~' \
	-e 's~\(^GRUB_CMDLINE_LINUX_DEFAULT=.*\)quiet\(.*$\)~\1\2~' \
	    $target/etc/default/grub
    grep GRUB_ $target/etc/default/grub
}

I'm not really sure the changes to GRUB_PC/10-setup are necessary, but they were the result of a long trial-and-error phase which I didn't want to jeopardize:

# diff -u config/scripts/GRUB_PC/10-setup{.orig,}
--- 10-setup.orig 2018-05-30 10:50:42.000000000 +0200
+++ 10-setup 2018-12-07 14:17:14.564558118 +0100
@@ -26,5 +26,7 @@
 fi
 
 GROOT=$($ROOTCMD grub-probe -tdrive -d $BOOT_DEVICE)
+echo Using GROOT=\"${GROOT}\" for BOOT_DEVICE=\"${BOOT_DEVICE}\"
 
 # handle /boot in lvm-on-md
 _bdev=$(readlink -f $BOOT_DEVICE)
@@ -42,10 +43,25 @@
	$ROOTCMD grub-install --no-floppy "/dev/$device"
     done
 else
-    $ROOTCMD grub-install --no-floppy "$GROOT"
+#    $ROOTCMD grub-install --no-floppy "$GROOT"
+#    if [ $? -eq 0 ]; then
+#        echo "Grub installed on $BOOT_DEVICE = $GROOT"
+#    fi
+  for GROOTITEM in $GROOT
+  do
+    # strip parentheses
+    GROOTITEM=${GROOTITEM#(}
+    GROOTITEM=${GROOTITEM%)}
+    echo Now GROOTITEM=\"${GROOTITEM}\"
+    # strip hostdisk/ prefix
+    GROOTITEM=$(echo $GROOTITEM | sed 's~hostdisk/~~')
+    echo Using GROOTITEM=\"${GROOTITEM}\"
+    echo Install grub on $GROOTITEM:
+    $ROOTCMD grub-install --no-floppy "$GROOTITEM"
     if [ $? -eq 0 ]; then
-        echo "Grub installed on $BOOT_DEVICE = $GROOT"
+        echo "Grub installed on $BOOT_DEVICE = $GROOTITEM"
     fi
+  done
 fi
 $ROOTCMD update-grub


(4) Run sysinfo first

to get to know your hardware, and to check whether the classes are OK.


(5) If you feel like it, run your first FAI install ;)

That's it...


Steffen Grunewald <steffen.grunewald@aei.mpg.de> 2018-2019