VMware HA and IP Storage *Updated*

With IP storage (particularly NFS in my experience) becoming more popular over recent years, I have been designing more and more VMware solutions with IP Storage, both iSCSI and NFS.

The purpose of this post is not to debate the pros and cons of IP storage, or Block vs File, or even vendor vs vendor but to explore how to ensure a VMware environments (vSphere 4 and 5) using IP storage can be made as resilient as possible purely from a VMware HA perspective. (I will be writing another post on highly available vNetworking for IP Storage)

So what are some considerations when using IP storage and VMware HA?

In many solutions I’ve seen (and designed), the ESXi Management VMKernel is on “vSwitch0” and uses two (2) x 1GB NICs while the IP storage (and Data network) is on a dvSwitch/es and uses two or more 10Gb NICs which connect to different physical switches than the ESXi Management 1GB NICs.

So does this matter? Well, while it is a good idea, there are some things we need to consider.

What happens if the 1GB network is offline for whatever reason, but the 10GB network is still operational?

Do we want this event to trigger a HA isolation event? In my opinion, not always.

So lets investigate further.

1. Host Isolation Response.

Host Isolation response is important to any cluster, but for IP storage it is especially critical.

How does Host Isolation Response work? Well, in vSphere 5, it requires 3 conditions to be met

1. The host fails to receive heartbeats from the HA master

2. The host does not receive any HA election traffic

3. Failing conditions 1&2 , the host attempts to ping the “isolation address/es” and is unsuccessful.

4. The isolation response is triggered

So in the scenario I have provided, the goal is to ensure that if a host becomes isolated from the HA Primary nodes (or HA Master in vSphere 5)  via the 1GB Network that the host does not unnecessarily trigger the “host isolation response”.

Now why would you want to stop HA restarting the VM on another host? Don’t we want the VMs to be restarted in the event of a failure?

Yes & No. In this scenario its possible the ESXi host still has access to the IP Storage network, and the VM the data network/s via the 10Gb Network. The 1Gb Network may have suffered a failure, which may effect management, but it may be desirable to leave the VMs running to avoid outages.

If  both the 1GB and 10GB networks go down to the host, this would result in the host being isolated from the HA Primary nodes (or HA Master in vSphere 5), the host would not receive HA election traffic and the host would suffer an “APD” (All Paths Down) condition. HA isolation response will then rightly be triggered and VMs will be “Powered Off”. This is desirable as the VMs could then be restarted on the surviving hosts assuming the failure is not network wide.

Here is a screen grab (vSphere 5) of the “Host Isolation response” setting, which is located when you right click your cluster “Edit Settings”, “vSphere HA” and “Virtual Machine Options”.

The host isolation response setting for environments with IP Storage should always be configured to “Power Off” (and not Shutdown). Duncan Epping explained this well in his blog, so no need to cover this off again.

But wait, there’s more! 😉

How do I avoid false positives which may cause outages for my VMs?

If using vSphere 5, we can use Datastore Heartbeating (which I will discuss later), but in vSphere 4 some more thought needs to go into the design.

So lets recap step three in the isolation detection process we discussed earlier

“3. Failing conditions 1&2 , the host attempts to ping the “isolation address/es”

What is the “isolation address”? By default, its the ESXi Management VMKernel default gateway.

Is this the best address to check for isolation? In a environment without IP storage, normally in my experience it is suitable, although it is best to discuss this with your Network architect as the device you ping needs to be highly available. Note: It also needs to respond to ICMP!

When using IP storage, I recommend overriding the default by configuring the advanced setting “das.usedefaultisolationaddress” value to “false”. Then configure the “das.isolationaddress1” through “das.isolationaddress9” with the IP address/es of your IP storage (in this example, Netapp vFilers), the ESXi host will now ping your IP storage assuming the HA Primaries (or “Master” in vSphere 5) is unavailable and no election traffic is being received) to check if it is isolated or not.

If the host/s complete the isolation detection process and are unable to ping any of the isolation addresses (IP Storage), (and therefore the ESXi host will not be able to access the storage) it will declare itself isolated and trigger a HA isolation response. (Which should always be “Power Off” as we discussed earlier)

The below screen shot shows the Advanced options and the settings chosen.

In this case, the IP Storage (Netapp vFilers) are connected to the same physical 10Gb Switches and the ESXi hosts (one “hop”) so they are a perfect way to test network connectivity of the network and access to the storage.

In the event the IP Storage (Netapp vFilers) are inaccessible, this alone would not trigger HA isolation response as the connectivity to the HA Primary nodes (or HA Master in vSphere 5) may still be functional. If the Storage is in fact inaccessible for >125secs (if using default settings – NFS “HeartbeatFrequency” of 12 seconds & “HeartbeatMaxFailures” of 10) the datastore/s will be marked as Unavailable and a “APD” event may occur. See VMware KB 2004684 for details on APD events.

Below is a screen grab of a vSphere 5 host showing the advanced NFS settings discussed above.

Note: With Netapp Storage it is recommended to configure the VMs with a disk timeout of 190 seconds, to allow for intermittent network issues and/or total controller loss (which takes place in <180 seconds, usually much less), and therefore the VMs can continue running and no outage is caused.

My advice would be modifying the “das.usedefaultisolationaddress” and “das.isolationadressX” is an excellent way in vSphere 4 (and 5) of ensuring your host is isolated or not by checking the IP storage is available, after all, the storage is critical to the ESXi hosts functionality! 😀

If for any reason the IP Storage is not responding, assuming the HA isolation detection process step 1 & 2 have completed, an isolation event is triggered and HA will take swift action (Powering Off the VM) to ensure the VM can be restarted on another host (assuming the issue is not network wide).

Note: Powering Off the VM in the event of Isolation helps prevent a split brain scenario where the VM is live on two hosts at the same time.

While datastore heart-beating is an excellent feature, it is only used by the HA Master to verify if a host is “isolated” or “failed”, the “das.isolationaddressX” setting is a very good way to ensure your ESXi host can check if the IP storage is accessible or not, and in my experience (and testing) works well.

Now, this brings me onto the new feature in vSphere 5…..

2. Datastore Heart beating.

It provides that extra layer of protection from HA isolation “false positives”, but adds little value for IP Storage unless the Management and IP Storage run over different physical NICs (in the scenario we are discussing they do).

Note: If the “Network Heartbeat” is not received, and the “Datastore Heartbeat” is not received by the HA Master, the host is considered “Failed” and the VMs will be restarted. But, If the “Network Heartbeat” is not received & “Datastore Heartbeat” is received by the HA Master, The host is “Isolated” and HA will trigger the “Host isolation response”.

The benefit here, in the scenario I have described, the “das.usedefaultisolationaddress” setting is “false” preventing HA trying to ping the VMK default gateway & “das.isolationaddress1” & “das.isolationaddress2” have been configured so HA will ping the IP Storage (vFilers) to check for isolation.

Datastore heartbeats, was configured to “Select any of the cluster datastores taking into account my preferences”. This allows a VMware administrator to specify a number of datastores , and these should be datastore critical to the operation of the cluster (Yes, I know, almost every data store will be important).

In this case, being a Netapp environment, the best practice is to separate OS / Page-file / Data / vSwap etc.

Therefore I decided to select the Windows OS & the Swap File datastores, as without these, all the VMs would not function, so they are the logical choice.

The below screen grab shows where Datastore heart-beating is configured, under the Cluster settings.

So what has this achieved?

We have the ESXi host pinging the isolation addresses (Netapp Filers), and we have the HA Master checking Datastore Heartbeating to accurately identify if the host is failed , isolated or partitioned. In the event HA Master does not receive Network heartbeats or Datastore heartbeats, then it is extremely likely there has been a total failure of the network (at least for this host) and the storage is no longer accessible, which obviously means the VMs cannot run, and therefore the host will be considered “Failed” by the master. The host will then trigger the configured “host isolation response” which for IP storage is “Power off”.

QUOTE: Duncan Epping – Datastore Heartbeating “To summarize, the datastore heartbeat mechanism has been introduced to allow the master to identify the state of hosts and is not use by the “isolated host” to prevent isolation.”

I couldn’t have said it better myself.

If the failure is not effecting the entire cluster, then the VM will power off and be recovered by VMware HA shortly there after. If the network failure effects all hosts in the cluster, then the VM will not be restarted until the network problem is resolved.

VMware Clusters – Scale up or out?

I get asked this question all the time, is it better to Scale up or out?

The answer is of course, it depends. 🙂

First lets define the two terms. Put simply,

Scale Up is having larger hosts, and less of them.

Scale Out is having more smaller hosts.

What are the Pro’s and Con’s of each?

Scale Up 

* PRO – More RAM per host will likely achieve higher transparent memory sharing (higher consolidation ratio!)

* PRO – Greater CPU scheduling flexibility as more physical cores are available (less chance for CPU contention!)

* PRO – Ability to support larger VMs (ie: The 32vCPU monster VM w/ 1TB RAM)

* PRO – Larger NUMA node sizes for better memory performance. Note: For those of you not familiar with NUMA, i recommend you check out Sizing VMs and NUMA nodes | frankdenneman.nl

* PRO – Use less ports in the Data and Storage networks

* PRO – Less complex DRS simulations to take place (every 5 mins)

* CON – Potential for Network or I/O bottlenecks due to larger number of VMs per host

* CON – When a host fails, a larger number of VMs are impacted and have to be restarted on the surviving hosts

* CON – Less hosts per cluster leads to a higher HA overhead or “waste”

* CON – Less hosts for DRS to effectively load balance VMs across

Scale Out

* CON – Less RAM per host will likely achieve lower transparent memory sharing (thus reducing overcommitment)

* CON – Less physical cores may impact CPU scheduling (which may lead to contention – CPU ready)

* CON – Unable to support larger VMs (ie: 8vCPU VMs or the 32vCPU monster VM w/ 1TB RAM)

* CON – Use more ports in the Data and Storage networks – ie: Cost!

* PRO – Less likely for Data or I/O bottlenecks due to smaller number of VMs per host

* PRO – When a host fails, a smaller number of VMs are impacted and have to be restarted on the surviving hosts

* PRO – More hosts per cluster may lead to a lower HA overhead or “waste”

* PRO – Greater flexibility for DRS to load balance VMs

Overall, both Scale out and up have their advantages so how do you choose?

When doing your initial capacity planning exercise, determine how many VMs you will have day 1 (and their vCPU/RAM/Disk/Network/IOPS) and try and start with a cluster size which gives you the minimum HA overhead.

Example: If you have 2 large hosts with heaps of CPU / RAM your HA overhead is 50%, if you have 8 smaller hosts your overhead is 12.5% (both with N+1).

As a general rule, I believe the ideal cluster would be large 4 way hosts with a bucket load of ram and around 16-24 hosts. This would be in my opinion the best of both worlds. Sadly, few environments meet the requirements (or have the budget) for this type of cluster.

I believe a cluster should ideally start with enough hosts to ensure a sufficient number of hosts to minimize the initial HA overhead (say <25%) and ensure DRS can load balance effectively, then scale up (eg: RAM) to cater for additional VMs. If more compute power is required in future, scaling out and then scaling up (add RAM) further. I would generally suggest not to design to the maximum, so up to 24 node clusters.

From a HA perspective, I feel in a 32 node cluster, 4 hosts worth of compute should be reserved for HA, or 1 in 8 (12.5% HA Reservation). Similar to the RAID-DP concept from Netapp, of 14+2 disks in a RAID pack.

Tip: Choose Hardware which can be upgraded (Scaled up) . Avoid designing a cluster with hosts hardware specs maxed out day 1.

There are exceptions to this, such as Management clusters, which may only have (and need) 2 or 3 hosts over their life span, (eg: For environments where vCloud Director is used), or environments with static or predictable workloads.

To achieve the above, the chosen hardware needs to be upgradable, ie: If a Servers maximum RAM is 1TB, you may consider only half populating it (being careful to choose DIMMs that allow you to expand) to enable you to scale up as the environments compute requires grow.

Tip: Know your workloads! So use tools like Capacity Planner so you understand what your designing for.

It is very important to consider the larger VMs, and ensure the hardware you select has suitable number of physical cores.

Example: Don’t expect 2 x 8vCPU VMs (highly utilized) to run well together on a 2 way 4 core host.

When designing a new cluster or scaling an existing one, be sure to consider the CPU to RAM ratio, so that you don’t end up with a cluster with heaps of available CPU and maxed out memory or vice versa. This is a common mistake i see.

Note: Typically in environments I have seen over many years, Memory is almost always the bottleneck.

The Following is an example where a Scale Out and Up approach end up with very similar compute power in their respective clusters, but would likely have very different performance characteristics and consolidation ratios.

Example Scenario: A customer with 200 VMs day one , and lets say the average VM size is 1vCPU / 4GB RAM but they have 4 highly utilized 8vCPU / 64GB Ram VMs running database workloads.

The expected consolidation ratio is 2.5:1 vCPUs to physical cores and 1.5:1 vRAM to physical Ram.

The customer expects to increase the number of VMs by 25% per year, for the next 3 years.

So our total compute required is

Day one : 92.8 CPU cores and 704GB Ram.

End of Year 1 : 116 CPU cores and 880GB Ram.

End of Year 2 : 145 CPU cores and 1100GB Ram.

End of Year 3 : 181 CPU cores and 1375GB Ram.

The day 1 requirements could be achieved in a number of ways, see two examples below.

Option 1 (Scale Out) – Use 9 hosts with 2 Way / 6 core / 96GB Ram w/ HA reservation of 12% (~N+1)

Total Cluster Resources = 108 Cores & 864GB RAM

Usable assuming N+1 HA = 96 cores & 768GB RAM

Option 2 (Scale Up) – Use 4 hosts with 4 Way / 8 core / 256GB Ram w/ HA reservation of 25% (~N+1)

Total Cluster Resources = 128 Cores & 1024GB RAM

Usable assuming N+1 HA = 96 cores & 768GB RAM

Both Option 1 and Option 2 appear to meet the Day 1 compute requirements of the customer, right?

Well, yes, at the high level, both scale out and up appear to provide the required compute resources.

Now lets review how the clusters will scale to meet the End of Year 3 requirements, after all, we don’t design just for day 1 do we. 🙂

End of Year 3 Requirements : 181 CPU cores and 1375GB Ram.

Option 1 (Scale Out) would require ~15 hosts (2RU per host) based on CPU & ~15 hosts based on RAM plus HA capacity of ~12% (N+2 as the cluster is >8 hosts.) taking the total required hosts to 18 hosts.

Total Cluster Resources = 216 Cores & 1728GB RAM

Usable assuming N+2 HA = 190 cores & 1520GB RAM

Note: At between 16 and 24 hosts N+3 should be considered. (Equates to 1 spare host of compute per 8 hosts)

Option 2 (Scale Up) – would require Use 6 hosts (4RU per host) based on CPU &  5 hosts based on RAM plus HA capacity of ~15% (N+1 as the cluster is <8 hosts.) taking the total required hosts to 7 hosts.

Total Cluster Resources = 224 Cores & 1792GB RAM

Usable assuming N+1 HA = 190 cores & 1523GB RAM

So on the raw compute numbers, we have two viable options which scale from Day to end of Year 3 and meet the customers compute requirement.

Which option would I choose I hear you asking, good question.

I think I could easily defend either Option, but I believe Option 2 would be be more economically viable and result in better performance. The below are a few reasons for my conclusion.

* Option 2 Would give significant transparent page sharing, compared to Option 1 therefore getting a higher consolidation ratio.

* Option 2 would likely be much cheaper from a Network / Storage connectivity point of view (less connections)

* Option 2 is more suited to host the 4 x 8vCPU highly utilized VMs (as they fit within a NUMA node and will only use 1/4 of the hosts CPU as opposed to 3/4’s of the 2 Way host)

* The 4 way (32 core) host would provide better CPU scheduling due to the large number of cores

* From a data center perspective, Option 2 would only use 28RU compared to 36RU

Note: A cluster of 7 hosts is not really ideal, but in my opinion is large enough to get both HA and DRS efficiencies. The 18 node cluster (option 1) is really in the sweet spot for cluster sizing, but the CPUs did not suit the 8 vCPU workloads. Had Option 1 used 8 core processors that would have made Option 1 more attractive.

Happy to hear everyone’s thoughts on the topic.

Common Mistake: Inefficient cluster sizes

Link

In my day job, I regularly come across environments which are running poorly and have inefficient designs.

One of the most common issues I see is VMware environments which cannot power on VMs due to being out of compute resources, but not for the reasons you may expect.

While the environments may have less than optimal HA settings / policies, the most common issues I see is customers (for whatever reason) having multiple clusters with only a few nodes. (ie: 2/3/4 etc)

Some of the time, there are corporate policies which may require this type of setup, but alot of the time, you can comply with these policies while still optimizing the environment.

It seems that even with virtualisation having been common place for many years, the basics are still mis-understood by a significant percentage of industry professionals. I have heard comments event recently saying you need 2 node clusters for maximum HA efficiency, They couldn’t be more Wrong!

So, why are small clusters a potential problem?

Depending on what HA setting you choose (Host failures cluster tolerates , Percentage of cluster resources reserved for HA, or Failover Host/s), the clusters have a large amount of “waste”.

What is “Waste”?

“Waste”, is the amount of the compute power within the cluster, that cannot be used to ensure in a HA event, VMs can be restarted on the remaining hosts.

Now at this stage, let me point out, some “Waste” is a good thing. We need to have some spare capacity for HA events, but the challenge is to minimize the waste without compromising HA.

So, in a recent environment I reviewed, there was 4 clusters using similar IBM x3850 Servers.

Cluster 1 : 2 Nodes

Cluster 2 : 2 Nodes

Cluster 3: 3 Nodes

Cluster 4 : 2 Nodes

In all clusters, HA was enabled (as it should be) and the HA admission control setting was “Percentage of Cluster resources reserved for HA” (which I prefer).

The 2 node clusters HA reservation percentage was set to 50%, and the 3 node cluster was 33%, which would be the settings I would choose if I had to stick with the 4 cluster design.

Because the environment (in its current state) was unable to host any more VMs, the customer wanted to purchase another 2 new Hosts, and form a new cluster.

At this stage we have the equivalent of 4 hosts of “waste” within the environment, and with a new cluster we would have 5 hosts “wasted”.

Now after a quick check of the VMware EVC KB: 1003212 all CPUs are compatible with EVC and support the EVC mode “Intel® “Merom” Generation”.

So, we can form a single new cluster using the existing 9 hosts and maintain full cluster functionality by enabling EVC.

Lets assume the hosts are all in a cluster and we’re configuring HA, How do we ensure we have more available compute for the new virtual machines?

Simple, we Enable HA (as you always should), Enable admission control, and set the HA policy to “Percentage of Cluster resources reserved for HA, But what percentage should we choose?

Well, it depends of what level of redundancy you require.

Generally, I recommend for

<8 hosts = N+1 – Note: If you require N+1 during maintenance you need N+2

>8 hosts < 16 hosts = N+2

>16 hosts <24 hosts = N+3

>24 hosts = N+4

The reason for the above, is as you add more hosts, your chance of a host failure, and a subsequent host failure increases. Therefore the more hosts you have, the more redundancy you need, Similar concept to RAID.

So in this example, we’re right on the line in terms of N+1 or N+2.

Lets be conservative, and choose N+2, therefore setting “Percentage of Cluster resources reserved for HA” to 22% (N+2 is actually 22.5%, but we use round numbers).

So what have we achieved?

The previous setup had only N+1 and an average HA overhead of 45.75% (50%+50%+50%+33% divide 4).

The new 9 node cluster now with N+2 redundancy and only has an overhead of 22%. A NET gain of 23.75% of available compute resources without purchasing new hardware.

What else do we gain by having a single larger cluster:

1. Increased DRS flexibility

2. Increase redundancy (previously N+1, now N+2)

3. Less chance of contention

4. No need to purchase new hardware!!

The above is a simple example of how to increase efficiency within a VMware environment without purchasing new hardware.

Now for those of you wanting to know more about HA/DRS, this has been covered in great detail in other blogs, I would recommend you first have a read of the following blog and get a copy of “vSphere 5.0 Clustering technical deep dive” book.

Yellow Bricks (Duncan Epping) – HA Admission control Pros and Cons