Thursday Aug 21, 2008
But there are some critical factors which limit the consolidation of workloads into virtualized environments (VEs). One often-overlooked factor is that the technology which controls VEs is a single point of failure (SPOF) for all of the VEs it manages. If that component (a hypervisor for virtual machines, an OS kernel for operating system-level virtualization, etc.) has a bug which affects its guests, all of them may be impacted. In the worst case, all of the guests will stop working.
One example was the recent licensing bug in VMware. If the newest version of VMware ESX was in use, the hypervisor would not permit guests to start after August 12. EMC created a patch to fix the problem, but developing and testing the fix took long enough that some customers could not start some workloads for about one day. For details, see http://www.networkworld.com/news/2008/081208-vmware-bug.html and http://www.deploylinux.net/matt/2008/08/all-your-vms-belong-to-us.html.
The lesson is clear: design your consolidated environments with this factor in mind. For example, you should never configure both nodes of a high-availability (HA) cluster as guests of the same hypervisor, because a single hypervisor failure would then take down both nodes and defeat the purpose of the cluster. In general, don't assume that the hypervisor is perfect (it's not) or that it can't fail (it can).
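The rule about HA cluster nodes can be checked mechanically. Here is a minimal sketch in Python, assuming you can export a simple inventory that maps each guest to its cluster and its physical host. The function name, the inventory layout, and the host and node names are all made up for illustration; they do not come from any particular virtualization product.

```python
from collections import defaultdict


def find_shared_hosts(placements):
    """Return {cluster: {host: [nodes]}} for every host that runs
    more than one node of the same HA cluster.

    `placements` is a hypothetical inventory: node -> (cluster, host).
    """
    by_cluster_host = defaultdict(lambda: defaultdict(list))
    for node, (cluster, host) in placements.items():
        by_cluster_host[cluster][host].append(node)

    violations = {}
    for cluster, hosts in by_cluster_host.items():
        # A host is a problem only if it carries two or more nodes
        # of the same cluster.
        shared = {h: sorted(nodes) for h, nodes in hosts.items() if len(nodes) > 1}
        if shared:
            violations[cluster] = shared
    return violations


# Illustrative inventory (all names are assumptions).
placements = {
    "db-node-a": ("db-cluster", "esx-host-1"),
    "db-node-b": ("db-cluster", "esx-host-1"),   # same host: a SPOF
    "web-node-a": ("web-cluster", "esx-host-1"),
    "web-node-b": ("web-cluster", "esx-host-2"),
}

for cluster, hosts in find_shared_hosts(placements).items():
    for host, nodes in hosts.items():
        print(f"{cluster}: {', '.join(nodes)} all run on {host}")
```

Guests from different clusters can share a host without tripping the check; only two nodes of the same cluster on one host are flagged, since that is the case where the hypervisor becomes a single point of failure for the cluster.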