- 1 Gravity Wave Group, University of Birmingham, UK
- 1.1 Pre Upgrade
- 1.2 The Upgrade
- 1.3 Post Upgrade
- 1.4 TODO
Gravity Wave Group, University of Birmingham, UK
A 106-node Beowulf cluster devoted to gravitational wave data analysis.
Each node is a dual-Xeon 2.33 GHz machine with Gigabit Ethernet.
There are two frontend servers, one acting as the head node and the other unused.
The cluster was running Red Hat 7.2, with Condor as the cluster manager.
Reasons for upgrade
The current operating system on the machines was becoming rapidly obsolete.
The cluster needed reorganising and updating (for example, we weren't using the second frontend server, and we were using only one of the five 2 TB RAID arrays).
Why Use FAI?
We wanted to use Debian for our cluster because of Debian's ease of admin through its apt package management system.
FAI is the automatic installer for Debian, so we planned to use it in the same way we had used Kickstart for Red Hat.
There were an awful lot of machines. We didn't want to do them all individually.
Disaster recovery: the full-time sysadmin had left, leaving the cluster in the hands of us PhD students. For the sake of my PhD, I did not want to have to reinstall the cluster in the future if anything went wrong.
Plan of Attack
- use FAI to install our nodes
- use FAI to reinstall any nodes that failed during the course of the cluster's operation
- use FAI to install the servers (note that the head node would itself have FAI installed to handle the node setup)
- make sure that the servers were disaster recoverable as well as the nodes
Components of the Install:
- DHCP (for booting the nodes with correct name and ID)
- DNS (node names)
- Condor (the job manager we chose to use for the cluster)
- NIS (unify logins across the cluster)
- NFS (make disks available around the cluster)
- FAI (have the head node reinstall the client nodes, allow control of the node config based on which frontend server was to be the head node)
- ssh passwordless login (extremely useful across the cluster for fixing any mistake at a later date)
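As an illustration of the DHCP and DNS items above, here is a minimal sketch of how per-node DHCP entries could be generated rather than typed by hand. It is not the script we used. The node naming scheme (node001, node002, ...), the starting IP address, and the MAC addresses are all assumptions made up for the example; the output uses the standard ISC dhcpd `host` declaration with `hardware ethernet`, `fixed-address` and `option host-name`.

```python
import ipaddress


def dhcp_host_entries(macs, base_ip="192.168.1.1", prefix="node"):
    """Return ISC dhcpd 'host' declarations, one per MAC address, in order.

    The naming scheme and the address range are hypothetical; adjust them
    to whatever the real cluster uses.
    """
    first = ipaddress.IPv4Address(base_ip)
    entries = []
    for index, mac in enumerate(macs):
        name = f"{prefix}{index + 1:03d}"   # node001, node002, ...
        address = first + index             # consecutive addresses
        entries.append(
            f"host {name} {{\n"
            f"    hardware ethernet {mac};\n"
            f"    fixed-address {address};\n"
            f"    option host-name \"{name}\";\n"
            f"}}"
        )
    return "\n".join(entries)


if __name__ == "__main__":
    # Made-up MAC addresses, for illustration only.
    example_macs = ["00:16:3e:00:00:01", "00:16:3e:00:00:02"]
    print(dhcp_host_entries(example_macs))
```

The generated text would be pasted or included into the DHCP server configuration on the head node; matching DNS records would need the same name-to-address mapping.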
Who Did The Upgrade?
One first-year PhD student with no previous sys-admin experience did the script writing and testing. I had to learn how to write bash, perl and cfengine scripts (knowing php helped immensely with the perl, so I wasn't scripting from cold). I also had to learn about DHCP, NIS, Condor and NFS pretty much from scratch. Fortunately, there was the existing cluster configuration to use as a base for the new one, which made the job doable. But, yes, it's possible to set up FAI knowing very little, it just takes a long time...
Otherwise, I had another PhD student, again with no sys-admin experience, to do odd bits of helping out, and if I say we, I'm referring to the both of us.
How long did it take to prepare?
It took me three weeks to write the node setup (the worst of it was figuring out how to install the kernel and grub, which I ballsed up nicely).
It took me two months to write the server setup which created the node setup. I seriously underestimated how hard this would be and the amount of testing it took. Testing was done on our second frontend server and six of the nodes, leaving the rest of the cluster operational.
How long did the upgrade itself take?
The install took a little over an hour on the head server (including a format of a 2 TB disk, which was where most of the delay was). We went in cold on the head server, and of course there was a typo somewhere and the install failed near the end. It was fixable, but we decided not to bother; as the lazy option, we fixed the installer and ran it again.
The test head server took 18 minutes to reinstall if the RAID arrays were not formatted.
The nodes took about 10-12 minutes in total when reinstalled en masse. Individually, a node takes about 4 minutes to go down and come back up fully reinstalled, and FAI lists the install time as two minutes (the rest is rebooting).
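To put those numbers in perspective, here is a small back-of-the-envelope calculation. It assumes, purely for comparison, that all 106 nodes were reinstalled strictly one after another at about 4 minutes each; nobody actually did this, and the function name is made up for the example.

```python
def sequential_minutes(nodes, minutes_per_node):
    """Total time if nodes were reinstalled one at a time (hypothetical case)."""
    return nodes * minutes_per_node


if __name__ == "__main__":
    total = sequential_minutes(106, 4)      # 106 nodes, about 4 minutes each
    hours, minutes = divmod(total, 60)
    print(f"one at a time: {total} minutes ({hours} h {minutes} min)")
    print("en masse (measured): about 10-12 minutes")
```

Even though each node is slower when the whole cluster reinstalls at once, the mass reinstall is still far quicker than working through the nodes in sequence.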
We got the FAI install to do everything we wanted the cluster to do, except for a couple of points:
- we couldn't get FAI's fai-setup to work automatically from an init.d level 2 post install script
- installing the NIS master requires user input and by that point we didn't care
Yes, we missed things and left some things out, and the penalty for this was about three days of chaos as users found things we hadn't foreseen or had just plain forgotten. Pretty much every problem could be solved by using the ssh passwordless login to run simple scripts to sort things out (mainly cfengine find-and-replace scripts).
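The kind of fix-up run described above can be sketched roughly as follows. This is an illustration, not the scripts we used: the node names (node001 to node106), the function names, and the remote command are placeholders. It relies only on the standard `ssh` client with `-o BatchMode=yes`, which makes ssh fail rather than prompt if passwordless login is not working for a host.

```python
import subprocess


def node_names(count, prefix="node"):
    """Hypothetical naming scheme: node001 ... node<count>."""
    return [f"{prefix}{i:03d}" for i in range(1, count + 1)]


def ssh_command(host, remote_command):
    """Build the ssh argument list for running one command on one host."""
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=10",
            host, remote_command]


def run_on_nodes(hosts, remote_command, runner=subprocess.run):
    """Run remote_command on each host in turn; return {host: exit status}.

    'runner' defaults to subprocess.run and can be swapped for a stub
    when trying the function out without a cluster.
    """
    results = {}
    for host in hosts:
        completed = runner(ssh_command(host, remote_command),
                           capture_output=True, text=True)
        results[host] = completed.returncode
    return results


if __name__ == "__main__":
    # Only print the commands here; call run_on_nodes() on a real cluster.
    for host in node_names(3):
        print(" ".join(ssh_command(host, "uptime")))
```

Any host with a non-zero exit status in the returned dictionary is one to look at by hand, or simply to reinstall.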
Long-Term Time Saving
Since the reinstall, we have had:
- 2 node disk failures
- one random apt-get error which made no sense and with FAI it was simpler to reinstall the node than to spend time fixing
The current estimate is that we have not saved time overall by using FAI, although this may change.
It is worth remembering that I had to learn much more than the average sys-admin would in order to install and configure FAI. I think that, for a sys-admin who already knew how to script and understood the basics of common Linux services, FAI would already have given a net saving on our cluster.
My personal impression of FAI is that it's a very nice system. The documentation is very good, and the system works very smoothly and is well thought out, with the powerful and flexible class system as its basis. The only quibble I had was how buggy some of the install scripts were, and how poorly documented the install scripts in FAI 2.7 were. I think this would be less of a problem for a more experienced sys-admin with an actual idea of how to get a Linux system working from scratch, beyond sticking in a Debian CD and answering the right queries. However, as FAI matures, I'd love to see official configs that are current, work with current packages in Debian, and are well documented and easy to configure and use.
To do, in this section only:
Reread unchanged parts for sense
Reread whole for consistency and spelling
Add picture of the cluster.