Gravity Wave Group, University of Birmingham, UK
A 106-node Beowulf cluster devoted to gravitational wave data analysis.
Each node is a dual-Xeon 2.33 GHz machine with gigabit Ethernet.
Two frontend servers, one acting as the head node and the other unused.
It was running Red Hat 7.2 with Condor as the cluster manager.
Reasons for upgrade
The operating system on the machines was becoming rapidly obsolete.
The cluster needed reorganising and updating: for example, we weren't using the second frontend server, and only one of the five 2TB RAID arrays was in use.
Why Use FAI?
We wanted to use Debian for our cluster because of its ease of administration through the apt package management system.
FAI is the automatic installer for Debian, so we planned to use FAI in the same way we had used Kickstart for Red Hat.
There were an awful lot of machines. We didn't want to do them all individually.
Disaster recovery: the full-time sysadmin left, leaving the cluster in the hands of us PhD students. For the sake of my PhD, I did not want to have to reinstall the cluster manually if anything went wrong in the future.
Plan of Attack
Initially: to use FAI to install our nodes and reinstall any that failed during the course of the cluster's operation. Later: to use FAI to install the servers that would use FAI to install the nodes, to make sure that the servers were disaster recoverable as well as the nodes (installers installing installers...).
Who Did The Upgrade?
One first-year PhD student with no previous sysadmin experience did the script writing and testing. I had to learn how to write bash, perl and cfengine scripts (ok, knowing some php helped with the perl no end). I also had to learn about DHCP, NIS, Condor and NFS pretty much from scratch. Fortunately, there was the existing cluster configuration to use as a base for the new one, so it isn't as big a deal as it might sound. But, yes, it's possible to set up FAI knowing very little. It just takes a long time...
Otherwise, I had another PhD student, again with no sysadmin experience, to help out with odd jobs. If I say "we", I'm referring to the both of us.
How long did it take to prepare?
Three weeks to write the node setup (the worst of it was figuring out how to install the kernel and grub, which I ballsed up nicely). Two months to write the server setup, which created the node setup. I seriously underestimated how hard this would be and the amount of testing it would take.
Testing was done on our second frontend server and six of the nodes, leaving the rest of the cluster operational.
How long did the actual upgrade take?
The install took a little over an hour on the head server (including formatting a 2TB disk, which accounted for most of the delay). We went in cold on the head server and, of course, there was a typo somewhere and the install failed near the end. It was fixable, but we took the lazy option: we fixed the installer and ran it again.
The test head server took 18 minutes to reinstall if the RAID arrays were not formatted.
The nodes took about 10-12 minutes in total when reinstalled en masse. Individually, a node takes about 4 minutes to go down and come back up fully reinstalled, and FAI lists the install time as two minutes (the rest is rebooting).
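To put those node timings in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes all 106 nodes were reinstalled and that doing them one at a time would cost the full 4 minutes each; both are assumptions for illustration, and the function names are made up here, not part of FAI.

```python
# Back-of-the-envelope comparison using the figures quoted in the text.
NODES = 106              # cluster size (assumes every node is reinstalled)
PER_NODE_MIN = 4         # one node: down, reinstalled, and back up
EN_MASSE_MIN = (10, 12)  # observed range when all nodes reinstall together


def serial_minutes(nodes: int = NODES, per_node: int = PER_NODE_MIN) -> int:
    """Hypothetical total if nodes were reinstalled strictly one after another."""
    return nodes * per_node


def speedup_range(nodes: int = NODES,
                  per_node: int = PER_NODE_MIN,
                  en_masse: tuple = EN_MASSE_MIN) -> tuple:
    """Return (lowest, highest) speedup of en-masse over one-at-a-time."""
    serial = serial_minutes(nodes, per_node)
    fastest, slowest = en_masse
    return serial / slowest, serial / fastest


if __name__ == "__main__":
    total = serial_minutes()
    low, high = speedup_range()
    print(f"one at a time: {total} min (about {total / 60:.1f} h)")
    print(f"en masse: {EN_MASSE_MIN[0]}-{EN_MASSE_MIN[1]} min, "
          f"roughly {low:.0f}x to {high:.0f}x quicker")
```

In other words, reinstalling everything at once only costs about three times as long as a single node, rather than a hundred times.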
Yes, we missed things and left some things out, but otherwise the day of the reinstall went quite smoothly, and every mistake has since been rectified in the FAI installer. Admittedly, it's no longer possible to test the installer for the servers, but I do sleep better at night knowing I don't have to go through that again. And, yes, it took a long time.