How VMotion made me cry.
So. There I was, happily bathing in the rays of my newly created W2K3 cluster. Using NLB, it was distributing incoming requests across its member servers, and life was good.
At one of my previous customers, we implemented an NLB cluster (Network Load Balancing) with two WWW servers in it. Hosted on those servers were the .ico and .osd files (used by Softricity for application streaming) and the configuration files that were served to the Program Neighbourhood Agents on our 80 or so thin clients.
At first, setting up the cluster was a tad tricky. When I started to configure the blasted thing, I needed to use Multicast mode, as the switches in front of it made it impossible to set up the primary cluster traffic in Unicast mode. However, once the cluster had been created, its IP address wasn't reachable until I set the cluster to Unicast mode.. so, there I was, cursing away at the switch.. but hey, it worked, and it worked well. (If anyone wants an in-depth technical analysis of why this happened, leave me a comment and I'll be happy to elaborate.. but beware, it can get messy.)
The first day of testing arrives. Sixty-or-so students log onto the new SBC-environment and things are looking good. The ESX servers are holding up, the Citrix environment is humming away (though I still don't like the performance, but that's another ESX discussion..) and things appear to be fine.
Around the same time, a colleague of mine suggests that we test the VMotion feature. For those of you who aren't familiar with this feature, allow me to explain.
A virtual machine runs on one specific ESX server at a time. However, the nice people at VMware understand the necessity of being able to move these virtual machines on the fly to a different machine, which is what the VMotion feature allows you to do. Using this feature, the running state of the VM is handed over to a different ESX server, which then hosts the virtual machine, while the VM itself keeps running.
Now, using a feature like this in a production environment might seem tricky, but the situation didn't look that tricky to me. After all, Citrix and its sessions can handle themselves just fine if the connection drops for a second or two, so on we went. For a minute or two, we moved machines back and forth like there was no tomorrow, and about 5 minutes later.. there no longer was a tomorrow..
Citrix servers were unavailable, sessions were lost.. all in all, the place had become a complete mess. It took us a few moments to realise what had happened, but here it is:
During this VMotion fun, we also moved one of the cluster nodes to another ESX server. Now, what happens is that the node goes offline, is moved and then comes back online. The first thing it does is attempt to initialise its cluster binding.. at which it fails! After all, the cluster was configured to use Unicast mode, which didn't work through the switch.. so we ended up with two different cluster nodes, both pretending to be the cluster.
It took us about 20 minutes to fix this mess. It wasn't pretty. So, beware when you decide to use those funky features, and think carefully about what you're doing.
PS: What happened was that the second cluster node was no longer reachable from the first cluster node and vice versa. As both nodes were convinced that they were 'the cluster', the cluster configuration completely broke, and both clusters (each node thought it was a cluster with one node) stopped functioning. Again, if anyone wants a technical breakdown of that, leave a comment.
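For those who like to see it in code: below is a tiny Python toy model of what the PS describes. It is only a sketch under my own assumptions, not how NLB actually implements convergence. The node names (www1, www2) and the helpers converge and cluster_views are made up for illustration. The idea is simply that each node rebuilds its view of the cluster from the peers it can still hear, so once the link between the two nodes is gone, each one ends up with a one-node "cluster" of its own.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    # Names of the hosts this node currently believes form the cluster.
    members: set = field(default_factory=set)


def converge(nodes, reachable):
    """Toy convergence: each node rebuilds its membership from the peers it can hear.

    `reachable` is a set of frozenset pairs naming the node-to-node links that work.
    This is an illustrative assumption, not the real NLB convergence protocol.
    """
    for node in nodes:
        node.members = {node.name} | {
            peer.name
            for peer in nodes
            if peer is not node
            and frozenset((node.name, peer.name)) in reachable
        }


def cluster_views(nodes):
    """Return the distinct membership sets; more than one means split-brain."""
    return {frozenset(node.members) for node in nodes}


# Hypothetical names for the two WWW servers in the cluster.
nodes = [Node("www1"), Node("www2")]
link = frozenset(("www1", "www2"))

# Before the move: the nodes can hear each other and agree on one cluster.
converge(nodes, {link})
healthy = cluster_views(nodes)

# After the move: the link between the nodes is gone, so each node
# converges on its own and believes it is the whole cluster.
converge(nodes, set())
split = cluster_views(nodes)

print("views before the move:", len(healthy))
print("views after the move: ", len(split))
```

One view of the cluster is healthy; two views for the same cluster IP is exactly the mess we spent 20 minutes cleaning up.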