Is what I said to my manager yesterday.
As part of a move at work away from the traditional approach of many (over specced) servers in racks for services, we are moving to fewer (very well specced) blade servers in racks, with storage from a SAN (storage area network), running VMware infrastructure. The theory is that we can run lots of little virtual servers specced to do the job that they need to do and nothing else. This means that we can claw back some space in the data centers and also play the green card by having fewer physical machines.
One of the tasks that I am involved in is getting VMware installed onto the blade servers. Although there are shiny GUI tools that the point n click boys like, there is also a service console based around Red Hat Linux – which to be honest is a darn sight easier to drive when configuring a virtual network infrastructure, mainly on the grounds that it can be scripted. Since in the long term we want it to be as little effort as possible, we are looking at Altiris Rapid Deployment Pack (RDP). The idea is that our asset database can feed enough info to RDP so that when a brand new blade is inserted into a chassis, on first boot it will automatically be installed with VMware, patched, have its networking configured and other bits done – with no sys admin work required.
A couple of days ago we pretty much had it working, then it did not. When I say working, we had got to the point where the VMware install worked and ran our customisation scripts. Admittedly the scripts were still being tweaked to auto setup the virtual network infrastructure, but that really is trivial stuff; the hard bit had been done. Two days ago we also started looking at making the SAN storage available to the VMware hosts to run the virtual machines.
Yesterday when testing the build scripts they failed every time; installation of the OS was just not happening. It was as if the storage had vanished. Then it occurred to us that at different points of the blade boot and installation process the hardware storage is not necessarily available at the same place consistently.
So although there are internal disks in the blade and they are enumerated 0 and 1 at the BIOS level within the blade, that numbering gets forgotten about when an OS is installed. Taking the boot process, the blades are configured to do a network (PXE) boot and, should that fail, to boot off the internal disks. Back to the PXE boot part: the RDP software knows which blades it has installed, and should it see a new blade, it sends the installation OS down the network, which then runs on the blade and tells it how to build itself with VMware. Once it has done this, RDP ignores that blade, since it knows that it has provided an installation and the blade should then boot off its own disks.
In our case the machines did a PXE boot and, as they were new, received the installation OS, which did its stuff; they then rebooted but failed, complaining of no system disk being available. After some (around a day of) head scratching it dawned on us that, with other testing going on, a LUN from the SAN was being presented to the blades. The machines PXE booted and received the install OS, which knows about SAN LUNs and so spotted the presented SAN and enumerated its disks as LUN, disk 1, disk 2. Consequently the install OS installed VMware onto the LUN rather than the internal disks, and so the blade boot failed as the OS was not actually on the blade.
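For what it is worth, the trap can be shown in a few lines. This is only a rough sketch of the idea, not anything RDP or the VMware installer actually does: the `Disk` class, the "local"/"san" labels and both functions are made up for illustration. It assumes an installer that simply takes the first disk it enumerates, and shows the sort of guard a build script could run beforehand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Disk:
    name: str       # label as the install OS enumerates it
    transport: str  # hypothetical labels: "local" = internal disk, "san" = LUN


def naive_target(disks):
    """Pick the first enumerated disk, as a simple installer might."""
    return disks[0]


def safe_target(disks):
    """Refuse to install while any SAN LUN is visible, else use the first internal disk."""
    san = [d.name for d in disks if d.transport == "san"]
    if san:
        raise RuntimeError(f"SAN LUN(s) presented before install: {san}")
    local = [d for d in disks if d.transport == "local"]
    if not local:
        raise RuntimeError("no internal disk found")
    return local[0]


# Enumeration order as we saw it: the LUN comes first, then the internal disks.
with_lun = [Disk("LUN", "san"), Disk("disk 1", "local"), Disk("disk 2", "local")]
# What the installer should have seen, with nothing presented from the SAN.
without_lun = [Disk("disk 1", "local"), Disk("disk 2", "local")]
```

With the LUN presented, `naive_target(with_lun)` lands on the LUN, which is exactly what bit us, whereas `safe_target(with_lun)` stops the build with an error instead of quietly installing in the wrong place.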
Where in the manual it says: “Do not present a LUN from the SAN to the VMware servers prior to the VMware installation”, it tends to do so for a reason.
This part of the manual was read by myself and a colleague some time ago, and we both noted that it was clearly rather important and that it must be a silly thing to do.
The stupid thing is I was burnt on something similar years ago with Sun kit when performing a Solaris 7 to 8 upgrade where the internal disk structure had been added to over the years. The documentation then said to ensure that you do an upgrade rather than a fresh install – so we did during testing, and then completely ignored it for the live system.
dmc: path_to_inst