Success Stories
Latest revision as of 12:09, 12 March 2011
FAI for the OLPC project
Not yet a success story, but in the works: Holger Levsen is working on a system to install the Fedora/RedHat-based One Laptop Per Child servers with FAI softupdates.
See it documented here: http://wiki.laptop.org/go/User:Holger/FAI
Gravity Wave Group, University of Birmingham, UK
Pre Upgrade
Facility
A 106-node Beowulf cluster devoted to gravitational wave data analysis.
Each node is a dual-Xeon 2.33 GHz machine with Gigabit Ethernet.
Two frontend servers, one acting as the head node and the other unused.
The cluster was running Red Hat 7.2, with Condor as the cluster manager.
Reasons for upgrade
The current operating system on the machines was rapidly becoming obsolete.
The cluster needed reorganising and updating (for example, we weren't using the second frontend server, and we were using only one of the five 2TB RAID arrays).
Why Use FAI?
We wanted to use Debian for our cluster because its apt package management system makes administration easy.
FAI (Fully Automatic Installation) is an automatic installer for Debian, so we were going to use FAI as we had used Kickstart for Red Hat.
There were an awful lot of machines. We didn't want to do them all individually.
Disaster recovery: the full-time sysadmin left, leaving the cluster in the hands of us PhD students. For the sake of my PhD, I did not want to have to reinstall the cluster manually if anything went wrong in the future.
The Upgrade
Plan of Attack
Initially:
- use FAI to install our nodes
- reinstall any failed nodes during the course of the cluster's operation
Later:
- use FAI to install the servers (note that the head node would have FAI installed to handle the node setup)
- make sure that the servers were disaster recoverable as well as the nodes
Components of the Install:
- DHCP (for booting the nodes with correct name and ID)
- DNS (node names)
- Condor (the job manager we chose to use for the cluster)
- NIS (unify logins across the cluster)
- NFS (make disks available around the cluster)
- FAI (have the head node reinstall the client nodes, allow control of the node config based on which frontend server was to be the head node)
- ssh passwordless login (extremely useful across the cluster for fixing any mistake at a later date)
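As an illustration of the DHCP component, here is a minimal sketch of how per-node entries for an ISC dhcpd configuration could be generated rather than typed by hand. It is not the configuration used on this cluster: the node naming scheme (node001, node002, ...), the MAC addresses and the 192.168.1.x addresses are all assumptions made up for the example.

```python
# Sketch: generate ISC dhcpd "host" declarations for a batch of cluster nodes.
# The node names, MAC addresses and IP range are hypothetical examples.
from ipaddress import IPv4Address


def dhcp_host_stanza(name: str, mac: str, ip: str) -> str:
    """Return one dhcpd.conf host declaration tying a name and IP to a MAC."""
    return (
        f"host {name} {{\n"
        f"  hardware ethernet {mac};\n"
        f"  fixed-address {ip};\n"
        f"  option host-name \"{name}\";\n"
        f"}}\n"
    )


def stanzas_for_nodes(macs: list[str], first_ip: str = "192.168.1.1") -> str:
    """Name nodes node001, node002, ... and give them consecutive addresses."""
    base = IPv4Address(first_ip)
    return "\n".join(
        dhcp_host_stanza(f"node{i + 1:03d}", mac, str(base + i))
        for i, mac in enumerate(macs)
    )


if __name__ == "__main__":
    # Two made-up MAC addresses; a real list would come from the hardware.
    print(stanzas_for_nodes(["00:16:3e:00:00:01", "00:16:3e:00:00:02"]))
```

Generating the entries from a single list of MAC addresses keeps node names and addresses consistent between DHCP and DNS, which matters when the same names are later used to decide how each node is installed.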
Who Did The Upgrade?
One first-year PhD student with no previous sys-admin experience did the script writing and testing. I had to learn how to write bash, Perl and cfengine scripts (knowing PHP helped immensely with Perl, so I wasn't scripting from cold). I also had to learn about DHCP, NIS, Condor and NFS pretty much from scratch. Fortunately, there was the existing cluster configuration to use as a base for the new one, which made the job doable. But, yes, it's possible to set up FAI knowing very little; it just takes a long time...
Otherwise, I had another PhD student, again with no sys-admin experience, to help out now and then, and if I say "we", I'm referring to the two of us.
How long did it take to prepare?
It took me three weeks to write the node setup (the worst of it was figuring out how to install the kernel and GRUB, which I ballsed up nicely).
It took me two months to write the server setup which created the node setup. I seriously underestimated how hard this would be and the amount of testing it took. Testing was done on our second frontend server and six of the nodes, leaving the rest of the cluster operational.
How long did the upgrade itself take?
The install took a little over an hour on the head server (including formatting a 2TB disk, which accounted for most of the delay). We went in cold on the head server and, of course, there was a typo somewhere and the install failed near the end. It was fixable by hand, but as the lazy option we fixed the installer and ran it again instead.
The test head server took 18 minutes to reinstall if the RAID arrays were not formatted.
The nodes took about 10-12 minutes in total when reinstalled en masse. Individually, a node takes about 4 minutes to go down and come back up fully reinstalled; FAI lists the install time as two minutes (the rest is rebooting).
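To put those figures in context, the short calculation below compares the measured en masse time with a purely hypothetical one-node-at-a-time reinstall. Only the 106 nodes, the 4 minutes per node and the 10-12 minute total come from the text above; the serial scenario is an assumption for comparison and was never actually run.

```python
# Sketch: compare a hypothetical serial reinstall with the en masse figures.
NODES = 106               # cluster size, from the facility description
MINUTES_PER_NODE = 4      # single-node reinstall time, reboot included
EN_MASSE_MINUTES = (10, 12)  # observed range for reinstalling all nodes at once


def serial_minutes(nodes: int, per_node: int) -> int:
    """Total time if nodes were reinstalled strictly one after another."""
    return nodes * per_node


if __name__ == "__main__":
    total = serial_minutes(NODES, MINUTES_PER_NODE)
    hours, minutes = divmod(total, 60)
    print(f"serial: {total} min ({hours} h {minutes} min)")
    for observed in EN_MASSE_MINUTES:
        print(f"en masse at {observed} min: about {total / observed:.0f}x faster")
```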
Post Upgrade
Success
We got the FAI install to do everything we wanted the cluster to do, except for a couple of points:
- we couldn't get FAI's fai-setup to work automatically from an init.d level 2 post install script
- installing the NIS master requires user input and by that point we didn't care
Missed Things
Yes, we missed things and left some things out, and the penalty for this was about three days of chaos as users found things we hadn't foreseen or had just plain forgotten. Pretty much every problem could be solved by using the ssh passwordless login to run simple scripts to sort things out (mainly cfengine find-and-replace scripts).
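The kind of fix-up run described here can be sketched as a small loop over the nodes. This is an illustration only, not the scripts that were actually used (those were mainly cfengine): the function names and the node names in the usage comment are hypothetical, and it assumes passwordless ssh keys are already in place.

```python
# Sketch: run one command on every node over passwordless ssh.
# Function names and host names are hypothetical, chosen for this example.
import subprocess
from typing import Callable, Iterable


def ssh_argv(host: str, remote_command: str) -> list[str]:
    """Build the ssh command line for one host.

    BatchMode=yes makes ssh fail instead of prompting for a password, and
    ConnectTimeout stops a dead node from stalling the whole run.
    """
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5",
            host, remote_command]


def run_on_nodes(hosts: Iterable[str], remote_command: str,
                 runner: Callable = subprocess.run) -> dict[str, int]:
    """Run remote_command on each host in turn; return exit codes by host."""
    results = {}
    for host in hosts:
        completed = runner(ssh_argv(host, remote_command),
                           capture_output=True, text=True)
        results[host] = completed.returncode
    return results


# Example usage (not executed here):
#   nodes = [f"node{i:03d}" for i in range(1, 107)]
#   failed = {h: rc for h, rc in run_on_nodes(nodes, "uptime").items() if rc}
```

Collecting the exit codes makes it easy to see which nodes the fix did not reach, so they can be retried or simply reinstalled with FAI.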
Long-Term Time Saving
Since the reinstall, we have had:
- 2 node disk failures
- one random apt-get error which made no sense; with FAI it was simpler to reinstall the node than to spend time fixing it
Our current estimate is that we have not saved time overall by using FAI, although this may change.
It is worth remembering that I had to learn much more than the average sys-admin would to install and configure FAI. For a sys-admin who already knew how to script and understood the basics of common Linux services, I think FAI would already have produced a net saving for our cluster.
Personal Impression
My personal impression of FAI is that it's a very nice system. The documentation is very good, and the system works very smoothly and is well thought out, with a powerful and flexible class system as its basis. The only quibble I had was how buggy and how poorly documented some of the install scripts in FAI 2.7 were. I think this wouldn't be so much of a problem for a more experienced sys-admin with an actual idea of how to get a Linux system working from scratch, beyond sticking in a Debian CD and answering the right queries. However, as FAI matures, I'd love to see official configs that are current, work with current packages in Debian, and are well documented and easy to configure and use.
TODO
Still to do, in this section only:
Reread unchanged parts for sense
Reread whole for consistency and spelling
Add picture of the cluster.
Would you mind sharing your experiences?
If you would like to share your experiences, please contact me.