Unlimited VMs per datastore? It's not a myth with Nutanix!

For many years, I have been asked countless questions about how many VMs can (or should) be placed in one datastore.

In fact, just this morning I was asked this same question, and I decided to whip up a quick post.

I have previously posted an Example Architectural Decision relating to datastore sizing for block based storage. That example was aimed at showing how things like RPO/RTO and performance should be taken into consideration when choosing a datastore size.

The above example is not a hard and fast rule, but an example of one deployment which I was involved in.

There is a great article on this topic by VCDX Jason Boche (@jasonboche), titled “VAAI and the Unlimited VMs per Datastore Urban Myth”, which covers it in great detail as it relates to block based storage (iSCSI, FC & FCoE).

But what about NFS, and what about with Hyper-converged solutions like Nutanix?

NFS has gained significant popularity in recent years, and in my opinion people who know what they are talking about no longer refer to NFS as “Tier 3 Storage”, as was once common.

With traditional storage solutions, generally only a small number of controllers can actively serve IO to any one NFS mount, so in my experience the limiting factor preventing more virtual machines per NFS mount was performance, although things like RPO/RTO were (and are) important considerations.

NFS does not suffer from SCSI reservations, which resulted in increased latency and which VAAI, specifically the Atomic Test & Set (ATS) primitive, helped to all but eliminate for block based datastores.

LUNs are limited by their queue depth, which in most cases is 32 (sometimes 64). This is also a limiting factor, as all the VMs in a datastore (LUN) share the same queue, which can lead to contention. SIOC helps manage the contention by ensuring fairness based on share values, but it does not solve the issue.
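To put some rough numbers around that, here is a trivial back-of-the-envelope sketch (the figures are purely illustrative) showing how a shared LUN queue gets divided up as the number of VMs per datastore grows:

```python
# Illustrative only: a LUN queue depth of 32 shared by every VM on the datastore.
lun_queue_depth = 32

for vms_per_datastore in (8, 16, 32, 64):
    per_vm_slots = lun_queue_depth / vms_per_datastore
    print(f"{vms_per_datastore} VMs -> ~{per_vm_slots:.1f} outstanding IOs per VM under contention")
```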

NFS, on the other hand, has a much larger queue depth; in fact it's basically unlimited, as shown below.

[Image: NFS maximum queue depth]
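If you want to check this for yourself, the ESXi advanced setting NFS.MaxQueueDepth defaults to 4294967295, which is effectively unlimited compared to a typical LUN queue depth of 32. Below is a minimal pyVmomi sketch (the vCenter/ESXi hostname and credentials are placeholders) that reads the setting from each host:

```python
# Minimal pyVmomi sketch: print the NFS.MaxQueueDepth advanced setting per host.
# Hostname and credentials are placeholders; SSL handling is omitted for brevity.
from pyVim.connect import SmartConnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local", pwd="***")
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(content.rootFolder, [vim.HostSystem], True)
for host in view.view:
    for opt in host.configManager.advancedOption.QueryOptions("NFS.MaxQueueDepth"):
        print(f"{host.name}: {opt.key} = {opt.value}")
```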

So, as NFS does not suffer from SCSI reservations or queue depth issues, what is stopping us from running hundreds or more VMs per datastore?

It comes down to how many active storage controllers are able to service the NFS mount and the performance of the storage controller(s), in addition to your business requirements around RPO/RTO. In other words, if an NFS mount is lost, how quickly can you recover?

Most traditional shared storage products:

1. Have only one or two active controllers – thus potentially limiting performance, which would lead to fewer VMs per NFS datastore.

2. Do snapshots at the NFS mount layer, so if you need to recover an entire NFS mount, the larger it is, the longer it may take.

With Nutanix, NFS is used by default to present the Nutanix Distributed File System (NDFS) to vSphere. However, the key difference between Nutanix and traditional shared storage is that every controller in the Nutanix cluster can, and does, actively serve IO to any datastore in the cluster concurrently.

So the limit from a performance perspective is gone thanks to the Nutanix scale out, shared nothing architecture, with one virtual storage controller (CVM) per Nutanix node. The number of nodes that can be scaled to is also unlimited. An example of Nutanix's ability to scale can be found here – Scaling to 1 million IOPS and beyond, Linearly!

Next, what about the RPO/RTO issue? Nutanix does not rely on LUNs or NFS mounts for data protection (or snapshots); this is all done at the VM layer, so your RPO/RTO is now per VM, which gives you much more flexibility.

With Nutanix, you can literally run hundreds or even thousands of VMs per NFS datastore, without performance or RPO/RTO problems, thanks to the scale out, shared nothing architecture and the Nutanix Distributed File System.

There are some reasons why you may choose to have multiple NFS datastores even in a Nutanix environment. These include wanting to enable Compression and/or De-duplication, which are enabled/disabled at a per container (or datastore) level. As some workloads don't compress or dedupe well, those workloads should be excluded to reduce the overhead on the cluster.

It is important to note that Nutanix uses a concept called a “Storage Pool”, which contains all the storage in the Nutanix cluster. On top of a “Storage Pool” you create “Containers” (or datastores). This means regardless of whether you have 1 or 100 datastores, they all still sit on top of the one “Storage Pool”, so you still have access to the same amount of storage capacity, with no silos, for maximum capacity utilization (and performance!).
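To illustrate the relationship, here is a conceptual toy model in Python (this is not Nutanix code or its API, just a sketch of the idea): containers are simply policy views, such as compression and de-duplication settings, layered over the one shared storage pool.

```python
from dataclasses import dataclass, field

@dataclass
class StoragePool:
    """One pool of all physical capacity in the cluster (toy model)."""
    capacity_tb: float
    containers: list = field(default_factory=list)

@dataclass
class Container:
    """A datastore presented to vSphere; only policy differs, capacity is shared."""
    name: str
    pool: StoragePool
    compression: bool = False
    dedup: bool = False

pool = StoragePool(capacity_tb=40)
general = Container("prod-general", pool, compression=True, dedup=True)
no_dedup = Container("prod-sql", pool, compression=True, dedup=False)  # workload that dedupes poorly
pool.containers += [general, no_dedup]

# Every container draws from the same shared capacity: no silos.
for c in pool.containers:
    print(f"{c.name}: shares the {c.pool.capacity_tb} TB pool, "
          f"compression={c.compression}, dedup={c.dedup}")
```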

Lastly, Nutanix does not suffer from the same availability concerns as traditional shared storage, where a single LUN could potentially be lost. This is due to the distributed architecture of the Nutanix solution. For more information on how Nutanix is more highly available than traditional shared storage, check out “Scale Out, Shared Nothing Architecture Resiliency by Nutanix”.

Check out a screenshot of one cluster with ~800 VMs on a single datastore. Note the sub-millisecond latency and 14K IOPS with ~900MBps throughput. Not bad!

[Image: ~800 VMs on a single datastore]

Scale Out Shared Nothing Architecture Resiliency by Nutanix

At VMware vForum Sydney this week I presented “Taking vSphere to the next level with converged infrastructure”.

Firstly, I wanted to thank everyone who attended the session, it was a great turnout and during the Q&A there were a ton of great questions.

I got a lot of feedback at the session and when meeting people at vForum about how the Nutanix scale out shared nothing architecture tolerates failures.

I thought I would summarize this capability, as I believe it's quite impressive and should put everyone's mind at ease when moving to this kind of architecture.

So let's take a look at a 5 node Nutanix cluster, and for this example, we have one running VM. The VM has all its data locally, represented by “A”, “B” and “C”, and this data is also distributed across the Nutanix cluster to provide data protection / resiliency.

[Image: 5 node Nutanix cluster with the VM's data A, B and C distributed across nodes]

So, what happens when an ESXi host fails, resulting in the Nutanix Controller VM (CVM) going offline and the storage locally connected to that CVM becoming unavailable?

Firstly, VMware HA restarts the VM onto another ESXi host in the vSphere Cluster and it runs as normal, accessing data both locally where it is available (in this case, the “A” data is local) and remotely (if required) to get data “B” and “C”.

[Image: 5 node Nutanix cluster with one failed node; the VM restarted by HA on another host]

Secondly, when data which is not local (in this example “B” and “C”) is accessed via other Nutanix CVMs in the cluster, it will be “localized” onto the host where the VM resides for faster future access.

It is important to note that if data which is not local is not accessed by the VM, it will remain remote, as there is no benefit in relocating it, and this reduces the workload on the network and cluster.

The end result is that the VM restarts just as it would with traditional storage; the Nutanix cluster “Curator” then detects whether any data has only one copy and replicates the required data throughout the cluster to ensure full resiliency.
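To make the “Curator” behaviour a little more concrete, here is a toy Python sketch (not Nutanix code; replication factor 2 and the node/extent names are simply assumptions for the example) showing how any data left with a single copy gets a second copy created on a healthy node:

```python
import random

RF = 2  # assumed replication factor (two copies of every piece of data)
NODES = {"node1", "node2", "node3", "node4", "node5"}

# extent -> nodes holding a replica (toy placement; "node1" also runs the VM locally)
placement = {
    "A": {"node1", "node3"},
    "B": {"node1", "node4"},
    "C": {"node1", "node5"},
}

def repair_after_failure(placement, failed_node):
    healthy = NODES - {failed_node}
    for extent, replicas in placement.items():
        replicas.discard(failed_node)
        while len(replicas) < RF:
            # Curator-style repair: copy the under-replicated extent to any healthy
            # node that does not already hold it, spreading work across the cluster.
            target = random.choice(sorted(healthy - replicas))
            replicas.add(target)
            print(f"re-replicating extent {extent} to {target}")

repair_after_failure(placement, "node1")
print(placement)
```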

The cluster will then look like a fully functioning 4 node cluster, as shown below.

[Image: the rebuilt, fully functioning 4 node cluster]

The process of repairing the cluster after a failure is commonly (and incorrectly) compared to a RAID pack rebuild. With a RAID rebuild, a small number of disks, say 8, are under heavy load re-striping data across a hot spare or a replacement drive. During this time the performance of everything on the RAID pack is significantly impacted.

With Nutanix, the data is distributed across the entire cluster, which even in a 5 node cluster means at least 20 SATA drives, with all data being written to SSD and then sequentially offloaded to SATA.

The impact of this process is much less than a RAID rebuild, as all Nutanix controllers in the cluster participate and take a portion of the workload. As a result, the impact per disk, per controller, per node and, importantly, on production VMs running in the cluster is greatly reduced.

Essentially, the larger the cluster, the faster the cluster can repair itself, and the lower the impact on production workloads.
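As a rough illustration of that scaling (the capacity and throughput figures below are assumptions, not benchmark data), the re-protection work is shared by all surviving nodes, so the time to re-protect shrinks as the cluster grows:

```python
# Back-of-the-envelope arithmetic, illustrative numbers only.
data_to_reprotect_gb = 2000      # assumed data held by the failed node
per_node_rebuild_mbps = 200      # assumed sustained rebuild rate per surviving node

for cluster_nodes in (4, 8, 16, 32):
    surviving = cluster_nodes - 1
    seconds = (data_to_reprotect_gb * 1024) / (per_node_rebuild_mbps * surviving)
    print(f"{cluster_nodes} nodes: ~{seconds / 60:.0f} minutes to re-protect")
```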

Now let's talk about a subsequent ESXi host failure: we now have two failed nodes, three surviving nodes, and only one copy of data “A”, “B” and “C”, as shown below.

[Image: 5 node cluster with two failed nodes and a single remaining copy of data A, B and C]

Now the Nutanix “Curator” detects that only one copy of data “A”, “B” and “C” exists and starts to replicate copies of that data across the cluster. The result, shown below, is a fully functional and redundant cluster, capable of surviving yet another failure.

[Image: the fully re-protected three node cluster after two failures]

Even in this scenario, where two ESXi hosts are lost, the environment still has 60% of its storage controllers (and performance). Compare this to a typical traditional storage product, where the loss of just two controllers can take your environment completely offline, and even the loss of a single controller leaves only 50% of the storage controllers (and performance) available.

I think this really highlights what VMware and players like Google, Facebook & Twitter have been saying for a long time: scaling out, not up, with a shared nothing architecture is the way of the future. The only question is who will be dominant in bringing this technology to the mass market, and I think you know who I have my money on.

Example Architectural Decision – Host Isolation Response for a Nutanix Environment

Problem Statement

What is the most suitable HA / host isolation response when using Nutanix?

Assumptions

1. vSphere 5.0 or greater
2. Two x 10Gb network interfaces are shared for Nutanix storage traffic and virtual machine traffic

Motivation

1. Minimize the chance of a false positive isolation response
2. Ensure that in the event the storage is unavailable, virtual machines are promptly shut down to enable HA to recover the VMs in a timely manner (where other hosts are unaffected by isolation) and to prevent a “split brain” scenario
3. Ensure maximum availability

Architectural Decision

Turn off the default isolation address and configure the isolation address specified below, which checks connectivity to the Nutanix Controller VMs (CVMs) via the NDFS cluster IP on the IP Storage VLAN.

Configure the following isolation address:

das.isolationaddress1 : NDFS Cluster IP Address

Configure Host Isolation Response to: Power Off

For Nutanix Controller VMs override the cluster setting and configure Host Isolation Response to “Leave Powered On”
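For reference, the decision above could be applied programmatically. The following is a minimal pyVmomi sketch, assuming a cluster named “NutanixCluster” and placeholder vCenter credentials and isolation IP; treat it as an illustration of the decision, not a production script (the per-CVM “Leave Powered On” override would be applied separately, for example via a dasVmConfigSpec entry):

```python
# Sketch only: disable the default isolation address, set das.isolationaddress1 to
# the NDFS cluster IP (placeholder), and set the cluster-wide isolation response.
from pyVim.connect import SmartConnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local", pwd="***")
content = si.RetrieveContent()

# Locate the cluster object (assumes a cluster named "NutanixCluster")
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == "NutanixCluster")

das_config = vim.cluster.DasConfigInfo(
    option=[
        # Turn off the default (gateway) isolation address
        vim.option.OptionValue(key="das.usedefaultisolationaddress", value="false"),
        # Use the NDFS cluster IP (placeholder address) as the isolation address
        vim.option.OptionValue(key="das.isolationaddress1", value="192.168.1.100"),
    ],
    # Cluster-wide Host Isolation Response: Power Off
    defaultVmSettings=vim.cluster.DasVmSettings(isolationResponse="powerOff"),
)

spec = vim.cluster.ConfigSpecEx(dasConfig=das_config)
task = cluster.ReconfigureComputeResource_Task(spec, modify=True)
```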

Justification

1. The ESXi management traffic, along with the virtual machine traffic and inter-Nutanix node storage traffic, runs over 2 x 10Gb connections. Using the ESXi management gateway (the default isolation address) to check for isolation is not suitable, as the management network can be offline without impacting the IP storage or data networks. This situation could lead to false positive isolation responses.
2. The isolation address chosen tests IP storage connectivity over the converged 10Gb network, and in the event this is unavailable there is no point testing further connectivity, as virtual machines cannot function without their storage
3. In the event the Nutanix cluster IP address cannot be reached by ICMP, the node will not be able to function properly. As such, triggering the isolation response and powering off the VMs based on this criterion is logical, as the VMs will not be able to function under these conditions.
4. In the event the NDFS cluster IP address does not respond to ICMP on the management interfaces, it is likely there has been an isolation event OR a catastrophic failure in the environment, either to the network or to the storage controllers themselves, in which case the safest option is to power off the VMs.
5. In the event the isolation response is triggered and the isolation does not impact all hosts within the cluster, the VMs can be restarted by HA onto a surviving host and resume functioning
6. Using the Nutanix Controller VM (CVM) IP address (192.168.5.2) as the isolation address is not suitable, as this address exists on each ESXi host. It is therefore not a good candidate for isolation detection, because the host will always be able to reach this address, even when the network is offline, due to the CVM being local to the host
7. The Nutanix Controller VM accesses local storage and can continue to run locally even during an isolation event. When the isolation event is over, the CVM will then regain connectivity to the other CVMs in the Nutanix cluster.
8. Shutting down the CVM would only increase the recovery time once the isolation event is over and has no added benefit.

Implications

1. In the event the host cannot reach any of the isolation addresses, virtual machines will be powered off.
2. Initial cluster setup would require the vSphere administrator to override the Cluster settings for each Controller VM. Note: This is a one time task (Set & Forget)

Alternatives

1. Set Host isolation response to “Leave Powered On”
2. Do not use Datastore heartbeating
3. Use the default isolation address
4. Leave the CVM on the default cluster setting and “Shutdown” on isolation

Related Articles

1. VMware Host Isolation Response in a Nutanix Environment #NoSAN

2. Storage DRS and Nutanix – To use, or not to use, that is the question?

3. VMware HA and IP Storage