Nutanix Scalability – Part 2 – Compute (CPU/RAM)

Following on from Part 1 of the Scalability series, where we discussed how Nutanix can scale storage capacity separately from compute, the next obvious topic is scaling CPU and memory resources at both the workload and cluster level.

Let’s first recap the problems with scaling compute with traditional shared storage.

[Image: HCI vs. not HCI]

Yuk! That looks like old school 3-tier stuff to me!

Non-HCI workloads on compute-only nodes would therefore:

  • Be running in the same setup as traditional 3-tier infrastructure
  • Have different performance characteristics than HCI-based workloads
  • Lose the advantage of having compute and storage close together
  • Increase dependency on the network
  • Impact the network utilization of HCI node/s
  • Reduce the benefits of HCI for the native HCI workloads, and much more.

The industry has accepted HCI as the way of the future, and while adding compute-only nodes might sound nice at a high level, it's just reintroducing the classic 3-tier complexity and problems of the past. When we review the actual requirements, it's very rare to see a Nutanix node with insufficient resources when sized/configured correctly.

Customers are often surprised when they show me their workloads and I don't seem surprised by the CPU/RAM, storage IO or capacity requirements. I can't tell you how many times I've made statements like "Your application's requirements are not that high, I've seen much worse!".

Examples of scaling compute with Nutanix

Example 1: Scaling up a Virtual Machine's compute resources:

SQL/Oracle DBA: Our application is growing/running slowly; we need more CPU/RAM!

Nutanix: You have several options:

a) Scale up the virtual machine’s vCPUs and vRAM to match the size of the NUMA node.
b) Scale up the virtual machine's vCPUs and vRAM to the same number of pCores as the host minus the Nutanix CVM vCPUs, and do the same with the RAM.

The first option is optimal as it ensures maximum memory performance, since the CPU will be accessing memory within the NUMA boundary. However, the second option is still viable; for applications such as SQL, the impact of insufficient memory can be higher than the penalty of crossing a NUMA boundary.
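To make the two options concrete, here's a minimal sizing sketch in Python. The host specification and CVM reservation below (dual-socket 28-core host, 768GB RAM, 8 vCPU / 32GB CVM) are illustrative assumptions only, not a sizing recommendation for any particular environment.

```python
# Illustrative sizing for options (a) and (b) above.
# Host specs and the CVM reservation are hypothetical examples only.

host_sockets = 2
cores_per_socket = 28        # e.g. a dual-socket Intel Platinum 8180 node
host_ram_gb = 768
cvm_vcpus = 8                # example CVM reservation
cvm_ram_gb = 32

# Option (a): keep the VM within a single NUMA node
numa_vcpus = cores_per_socket
numa_ram_gb = host_ram_gb // host_sockets
print(f"Option (a) NUMA-aligned:   {numa_vcpus} vCPUs, {numa_ram_gb} GB RAM")

# Option (b): use the whole host minus the CVM reservation
max_vcpus = host_sockets * cores_per_socket - cvm_vcpus
max_ram_gb = host_ram_gb - cvm_ram_gb
print(f"Option (b) host minus CVM: {max_vcpus} vCPUs, {max_ram_gb} GB RAM")
```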

BUT MY WORKLOAD IS UNIQUE, IT NEEDS A PHYSICAL SERVER!!

Despite hearing these types of statements from prospective and existing customers, very few workloads actually need more CPU/RAM than a modern Nutanix (or OEM/software-only) node can provide, even after removing resources for the Controller VM (CVM). I find the requirement for physical servers is usually only a perceived one, and in reality a reasonably sized VM on a standard node will deliver the desired business outcome/s comfortably.

Currently Nutanix NX nodes support Intel Platinum 8180 processors, which have 28 physical cores @ 2.5 GHz per socket, for a total of 56 physical cores (112 threads) in a dual-socket node.

If you had, say, an existing physical server using fairly modern Intel Broadwell E5-2699 v4 processors with dual 22-core sockets (44 physical cores), you have a total SPECint_rate of 1760, or 40 per core.

Compare that to the Intel Platinum 8180 processor and you have a total SPECint_rate of 2720 or 48.5 per core.

This is an increase per core of 21.25%.

So if you move that workload off the physical server with Intel Broadwell E5-2699 v4 CPUs (44 cores) and onto Nutanix with ZERO CPU overcommitment (vCPU:pCore ratio of 1:1) using the Intel Platinum 8180 processor, and assuming we reserve 8 pCores for the CVM, we still have 48 pCores for the SQL VM.

That's a SPECint_rate of 2328, which is higher than the physical server using all of its cores.

That’s over 32% more CPU performance for the Virtual Machine compared to the dedicated physical server.
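The arithmetic behind those figures is easy to verify. Here's a quick sketch using the SPECint_rate numbers quoted above; the only assumption is the 8 pCore CVM reservation already mentioned.

```python
# Verifying the SPECint_rate comparison above.

broadwell_total, broadwell_cores = 1760, 44   # dual E5-2699 v4
platinum_total, platinum_cores = 2720, 56     # dual Intel Platinum 8180

broadwell_per_core = broadwell_total / broadwell_cores   # 40.0
platinum_per_core = platinum_total / platinum_cores      # ~48.6 (the article rounds to 48.5)

per_core_gain = (platinum_per_core / broadwell_per_core - 1) * 100
print(f"Per-core gain: {per_core_gain:.1f}%")            # ~21% (21.25% with the rounded figure)

vm_cores = platinum_cores - 8                            # reserve 8 pCores for the CVM
vm_specint = vm_cores * 48.5                             # using the article's rounded per-core figure
print(f"VM SPECint_rate: {vm_specint:.0f}")              # 2328
print(f"Gain vs physical server: {(vm_specint / broadwell_total - 1) * 100:.0f}%")  # ~32%
```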

The reality is the Nutanix CVM and Acropolis Distributed Storage Fabric (ADSF) provide high-performance, low-latency storage, which also drives further CPU efficiency by eliminating CPU WAIT (CPU cycles wasted waiting for I/O to complete).

As you can see from this simple example, a Virtual Machine on Nutanix can easily replace even a modern physical server, and can even provide better performance with a CPU only one generation newer. Think about how your 3-5 year old physical servers will feel when they jump multiple generations of CPU and get scale-out, flash-based storage.

Example 2: A VM (genuinely) needs more CPU/RAM than Nutanix nodes have.

SQL/Oracle DBA: Our application needs more CPU/RAM than our biggest node/s can provide.

Nutanix: You have several options:

a) Purchase one or more larger nodes (e.g.: NX-8035-G6 w/ Intel Platinum or Gold processors), add them to the existing cluster and live migrate your VM/s to that/those node/s. Use affinity rules to keep critical VMs on the highest-performance nodes.

Nutanix supports mixing different hardware types/generations in the same cluster, and this can be preferable to creating a dedicated cluster for several reasons:

  • Larger clusters provide more targets for replication traffic (i.e.: RF2 or RF3) meaning lower average write latency
  • Larger clusters provide higher resiliency as they can potentially tolerate more failures and rebuild faster following the failure of a drive, a node or multiple nodes.
  • Larger clusters help ensure the impact of a failure is lower as a lower percentage of cluster resources are lost

b) Purchase one or more larger nodes (e.g.: NX-8035-G6 w/ Intel Platinum or Gold processors), create a new cluster and migrate your VM/s to that cluster.

A dedicated cluster may sound attractive, but in most cases I recommend mixed-workload clusters as they ultimately provide higher performance, resiliency and flexibility.

c) Scale out your workloads

Applications like MS Exchange, MS SQL and Oracle RAC can (and arguably should) be scaled out rather than scaled up, as doing so provides increased performance and resiliency and reduces overall infrastructure costs (e.g.: more, cheaper/smaller processors can be used as opposed to premium processors like the Intel Platinum series).

One large VM hosting dozens of databases is rarely a good idea, so scale out and run more VMs, distributed across your Nutanix cluster and spread the workload across all the VMs.
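As a purely illustrative sketch (the database and VM names and counts below are hypothetical), spreading the databases evenly across several smaller SQL VMs might look like this:

```python
# Illustrative only: spreading databases across several SQL VMs instead of
# one monster VM. Names and counts are hypothetical.

databases = [f"db{i:02d}" for i in range(1, 25)]         # 24 databases
sql_vms = ["sql-vm-1", "sql-vm-2", "sql-vm-3", "sql-vm-4"]

placement = {vm: [] for vm in sql_vms}
for i, db in enumerate(databases):
    placement[sql_vms[i % len(sql_vms)]].append(db)      # simple round-robin

for vm, dbs in placement.items():
    print(f"{vm}: {len(dbs)} databases ({', '.join(dbs)})")
```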

For 99% of workloads, I do not see the real-world value of compute-only nodes. But there are always exceptions to every rule.

Potential Exceptions:

Example 3: Re-using existing hardware

SQL DBA: I love my Nutanix gear (duh!), but I have some physical servers which won't be end of life for 12 months. Can I continue using them with Nutanix?

Nutanix: We have several options:

a) If the hardware is on our Software-only hardware compatibility list (HCL), it’s possible you can purchase SW-only licenses and deploy Nutanix on your existing hardware.

b) Use Nutanix Acropolis Block Services (ABS) to provide highly available scale out storage to your physical server via iSCSI.

ABS was released in 2015 and supports SCSI-3 persistent reservations for shared storage-based Windows clusters, which are commonly used with Microsoft SQL Server and clustered file servers.

ABS supports several use cases, including:

  • iSCSI for Microsoft Exchange Server
  • Shared storage for Linux-based clusters
  • Windows Server Failover Clustering (WSFC)
  • SCSI-3 persistent reservations for shared storage-based Windows clusters
  • Shared storage for Oracle RAC environments
  • Bare-metal environments

Therefore ABS allows you to re-use your existing hardware to maximise your return on investment (ROI) while getting the benefits of ADSF. Once the hardware is end of life, the storage already on the Nutanix cluster can be quickly presented to a VM so the workload will benefit from the full Nutanix HCI experience.

Future Capabilities:

In late 2017, Nutanix announced Nutanix Acropolis Compute Cloud (AC2) which will provide the ability to have true compute-only nodes in a Nutanix cluster as shown below.

I reluctantly mention this upcoming capability because I do not want to see customers go back to a 3-tier model or think that HCI isn't the way forward, because it is. That's not what compute-only is about.

This capability is specifically designed to work around the niche circumstances where a software vendor, such as Oracle, is extorting customers from a licensing perspective and it's desirable to maximise the number of CPU cores the application can use.

Let me have a quick rant and put an end to the nonsense before it gets out of hand:

IT IS NOT FOR GENERAL VM USE!

NO, IT'S NOT FOR PERFORMANCE REASONS.

NO NUTANIX IS NOT MOVING BACK TO A 3-TIER COMPUTE+STORAGE MODEL.

HCI WITH NUTANIX IS STILL THE WAY FORWARD

Summary:

Nutanix provides excellent scalability at the CPU/RAM level for both virtual and physical workloads. In the rare circumstances where physical servers are a real (or, more likely, just a perceived) requirement, ABS can be used, and Nutanix will soon also provide compute-only nodes for AHV customers to ensure licensing value is maximised in those rare cases.

Back to the Scalability, Resiliency and Performance Index.

Nutanix Scalability – Part 1 – Storage Capacity

It never ceases to amaze me that analysts as well as prospective/existing customers frequently are not aware of the storage scalability capabilities of the Nutanix platform.

When I joined back in 2013, a common complaint was that Nutanix had to scale in fixed building blocks of NX-3050 nodes with compute and storage regardless of what the actual requirement was.

Not long after that, Nutanix introduced the NX-1000 and NX-6000 series, which offered lower and higher CPU/RAM and storage capacity options and gave more flexibility, but there were still some use cases where Nutanix had significant gaps.

In October 2013 I wrote a post titled “Scaling problems with traditional shared storage” which covers why simply adding shelves of SSD/HDD to a dual controller storage array does not scale an environment linearly, can significantly impact performance and add complexity.

At .NEXT 2015, Nutanix announced the ability to scale storage separately to compute, which allowed customers to scale capacity by adding the equivalent of a shelf of drives, much like they could with their legacy SAN/NAS, but with the added advantage of a storage controller (the Nutanix CVM) providing additional data services, performance and resiliency.

Storage-only nodes are supported with any hypervisor, but the good news is they run Nutanix' Acropolis Hypervisor (AHV), which means no additional hypervisor licensing even if you run VMware ESXi. Storage-only nodes still support all the 1-click rolling upgrades, so they add no additional management overhead.

Advantages of Storage Only Nodes:

  1. Ability to scale capacity separately from CPU/RAM, like a traditional disk shelf on a storage array
  2. Ability to start small and scale capacity if/when required, i.e.: No oversizing day 1
  3. No hypervisor licensing or additional management when scaling capacity
  4. Increased data services/resiliency/performance thanks to the Nutanix Controller VM (CVM)
  5. Ability to increase capacity for hot and cold data (i.e.: All Flash and Hybrid/Storage heavy)
  6. True Storage only nodes & the way data is distributed to them is unique to Nutanix

Example use cases for Storage Only Nodes

Example 1: Increasing capacity requirement:

MS Exchange Administrator: I’ve been told by the CEO to increase our mailbox limits from 1GB to 2GB but we don’t have enough capacity.

Nutanix: Let’s start small and add storage only nodes as the Nutanix cluster (storage pool) reaches 80% utilisation.
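As a back-of-the-envelope example (the mailbox count, pool size and RF2 overhead below are hypothetical, and data reduction savings are ignored), the capacity conversation might look like this:

```python
# Hypothetical capacity check for the mailbox-limit increase above.
# All inputs are made-up examples; data reduction (compression/dedupe) is ignored.

mailboxes = 20000
new_limit_gb = 2              # raised from 1GB to 2GB
replication_factor = 2        # RF2 keeps two copies of each extent
pool_raw_tb = 80              # current storage pool size (raw)
threshold = 0.80              # add storage-only nodes as the pool approaches 80%

logical_tb = mailboxes * new_limit_gb / 1024
raw_needed_tb = logical_tb * replication_factor
utilisation = raw_needed_tb / pool_raw_tb

print(f"Logical capacity required: {logical_tb:.1f} TB")
print(f"Raw capacity at RF{replication_factor}: {raw_needed_tb:.1f} TB")
action = "add storage-only nodes" if utilisation >= threshold else "OK for now"
print(f"Projected pool utilisation: {utilisation:.0%} -> {action}")
```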

Example 2: Increasing flash capacity:

MS SQL DBA: We’re growing our mission critical database and now we’re hitting SATA for some day to day operations, we need more flash!

Nutanix: Let’s add some all flash storage only nodes.

Example 3: Increasing resiliency

CEO/CIO: We need to be able to tolerate failures and have the infrastructure self heal, but we have a secure facility which is difficult and time consuming to get access to. What can we do?

Nutanix: Let's add some storage only nodes (all-flash and/or hybrid) to ensure you have sufficient capacity to tolerate "n" failures and rebuild the environment back to a fully resilient and performant state.

Example 4: Implementing Backup / Long Term Retention

CEO/CIO: We need to be able to keep 7 years of data for regulatory requirements and we need to be able to access it within 1hr.

Nutanix: We can either add storage only nodes to one or more existing clusters OR create a dedicated Backup/Retention cluster. Let’s start with enough capacity for Year 1, and then as capacity is required, add more storage only nodes as the cost per GB drops over time. Nutanix allows mixing of hardware generations so you’ll never be in a situation where you need to rip & replace.

Example 5: Supporting one or more Monster VMs

Server Administrator: We have one or more VMs with storage capacity requirements of 100TB each, but the largest Nutanix node we have only supports 20TB. What do we do?

Nutanix: The Distributed Storage Fabric (ADSF) allows a VM's data set to be distributed throughout a Nutanix cluster, ensuring any storage requirement can be met. Adding storage only nodes will ensure sufficient capacity while adding resiliency/performance for all other VMs in the cluster. Cold data will be distributed throughout the cluster while frequently accessed data will remain local where possible, within the local storage capacity of the node where the VM runs.

For more information on this use case see: What if my VMs storage exceeds the capacity of a Nutanix node?
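For a rough sense of the numbers (the node size, RF2 and absence of data reduction below are assumptions for illustration), the sizing maths is simple:

```python
import math

# Rough, hypothetical sizing for the 100TB VM scenario above
# (RF2, no data reduction, 20TB raw per node).

vm_logical_tb = 100
node_raw_tb = 20
replication_factor = 2        # RF2 keeps two copies of every extent

raw_required_tb = vm_logical_tb * replication_factor
nodes_for_capacity = math.ceil(raw_required_tb / node_raw_tb)

print(f"Raw capacity required: {raw_required_tb} TB")
print(f"Minimum nodes for capacity alone: {nodes_for_capacity}")
# In practice you would also keep free space for self-healing (see the
# Resiliency series), so the cluster would be sized a node or two larger.
```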

Example 6: Performance for infrequently accessed data (cold data).

Server Administrator: We have always stored our cold data on SATA drives attached to our SAN because we have a lot of data and flash is expensive. Once or twice a year we need to do a bulk read of our data for auditing/accounting purposes, but it's always been so slow. How can we solve this problem and give good performance while keeping costs down?

Nutanix: Hybrid Storage only nodes are a cost effective way to store cold data and combined with ADSF, Nutanix is able to deliver optimum read performance from SATA by reading from the replica (copy of data) with the lowest latency.

This means if a HDD or even a node is experiencing heavy load, ADSF will dynamically redirect read I/O throughout the cluster to deliver increased read performance from SATA. This capability was released in 2015, and storage only nodes, by adding more spindles to the cluster, are very complementary to it.
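Conceptually, the replica selection works like the sketch below. This is an illustration of the idea only, not Nutanix's actual implementation; the node names and latency figures are invented.

```python
# Conceptual illustration only - not the actual ADSF implementation.
# The idea: serve a read from whichever replica currently has the lowest latency.

replicas = [
    {"node": "node-1", "tier": "SATA", "recent_latency_ms": 38.0},  # heavily loaded disk
    {"node": "node-4", "tier": "SATA", "recent_latency_ms": 9.5},
    {"node": "node-7", "tier": "SATA", "recent_latency_ms": 12.1},
]

def choose_replica(candidates):
    """Redirect the read to the replica with the lowest recently observed latency."""
    return min(candidates, key=lambda r: r["recent_latency_ms"])

best = choose_replica(replicas)
print(f"Read served from {best['node']} ({best['recent_latency_ms']} ms)")
```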

Frequently asked questions (FAQ):

  1. How many storage only nodes can a single cluster support?
    1. There is no hard limit; typically cluster sizes are less than 64 nodes, as it's important to consider limiting the size of a single failure domain.
  2. How many Compute+Storage nodes are required to use Storage Only nodes?
    1. Two. This also allows N+1 failover for the nodes running VMs, so in the event a compute+storage node fails, VMs can be restarted. Technically, you can create a cluster with only storage only nodes.
  3. How does adding storage only node increase capacity for my monster VM?
    1. By distributing replicas of data throughout the cluster, thus freeing up local capacity for the running VM/s on the local node. Where a VM's storage requirement exceeds the local node's capacity, storage only nodes add capacity and performance to the storage pool. Note: One VM, even with only one monster vDisk, can use the entire capacity of a Nutanix cluster without any special configuration.

Summary:

For many years Nutanix has supported and recommended the use of Storage only nodes to add capacity, performance and resiliency to Nutanix clusters.

Back to the Scalability, Resiliency and Performance Index.

Nutanix Resiliency – Part 9 – Self healing

Nutanix has a number of critically important & unique self healing capabilities which differentiate the platform from not only traditional SAN/NAS arrays but other HCI products.

Nutanix can fully automatically self heal not only from the loss of SSDs/HDDs/NVMe devices and node failure/s but also fully recover the management stack (PRISM) without user intervention.

First let’s go through the self healing of the data from device/node failure/s.

Let’s take a simple comparison between a traditional dual controller SAN and the average* size Nutanix cluster of eight nodes.

*Average is calculated as total nodes sold divided by the number of customers globally.

In the event of a single storage controller failure, the SAN/NAS is left with no resiliency and is at the mercy of the service level agreement (SLA) with the vendor to replace the component before resiliency (and in many cases performance) can be restored.

Compare that to Nutanix, and only one of the eight storage controllers (or 12.5%) is offline, leaving seven to continue serving the workloads and automatically restore resiliency, typically in just minutes, as Part 1 demonstrated.

I've previously written a blog titled Hardware support contracts & why 24×7 4 hour onsite should no longer be required which covers this concept in more detail. Long story short: if restoring resiliency of a platform is dependent on the delivery of new parts, or worse, human intervention, the risk of downtime or data loss is exponentially higher than on a platform which can self heal back to a fully resilient state without hardware replacement or human intervention.

Some people (or competitors) might argue, “What about a smaller (Nutanix) cluster?”.

I’m glad you asked, even a four node cluster can suffer a node failure and FULLY SELF HEAL into a resilient three node cluster without HW replacement or human intervention.

The only scenario where a Nutanix environment cannot fully self heal to a state where another node failure can be tolerated without downtime is a three node cluster. BUT, in a three node cluster, one node failure can be tolerated, data will be re-protected and the cluster will continue to function with just two nodes. A subsequent node failure would result in downtime, but critically, no data loss would occur.

Critically, drive failures can still be tolerated in the degraded state where only two nodes are running.

Note: In the event of a node failure in a three node vSAN cluster, data is not re-protected and remains at risk until the node is replaced AND the rebuild is complete.

The only prerequisite for Nutanix to be able to perform a complete self heal of the data (and even the management stack, PRISM) is that sufficient capacity exists within the cluster. How much capacity, you ask? I recommend N+1 for RF2 configurations, or N+2 for RF3 configurations, assuming two concurrent failures or one failure followed by a subsequent failure.

So the worst case scenario for the minimum size cluster would be 33% free space for a three node RF2 cluster and 40% for a five node RF3 cluster. However, before the competitors break out the Fear, Uncertainty and Doubt (FUD), let's look at how much capacity is required for self healing as cluster sizes increase.

The following table shows the percentage of capacity required to fully self heal based on N+1 and N+2 for cluster sizes up to 32 nodes.

Note: These values assume the worst case scenario that all nodes are at 100% capacity, so in the real world the overhead will be lower than the table indicates.

[Table: Capacity reserved for rebuild (N+1 and N+2) by cluster size]

As we can see, for an average size (eight node) cluster, the free space required is just 13% (rounded up from 12.5%).

If we take N+2 for an eight node cluster, the MAXIMUM free space required to tolerate two node failures and a full rebuild to a resilient state is still just 25%.
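The percentages follow directly from the cluster size: reserving the equivalent of one node (N+1) is 1/N of the cluster's capacity, and two nodes (N+2) is 2/N. A quick sketch reproducing the figures quoted above:

```python
# Reproducing the "capacity reserved for rebuild" figures discussed above.
# N+1 reserves the equivalent of one node's capacity (1/N of the cluster),
# N+2 reserves two nodes' worth (2/N).

def reserve_pct(cluster_size: int, node_failures: int) -> float:
    return node_failures / cluster_size * 100

for nodes in (3, 4, 5, 8, 16, 32):
    n1 = reserve_pct(nodes, 1)
    n2 = f"{reserve_pct(nodes, 2):.1f}%" if nodes >= 5 else "n/a (RF3 needs 5+ nodes)"
    print(f"{nodes:>2} nodes: N+1 = {n1:.1f}%, N+2 = {n2}")

# 3 nodes -> 33.3% (RF2 minimum), 5 nodes -> 40% for N+2 (RF3 minimum),
# 8 nodes -> 12.5% for N+1 and 25% for N+2, matching the figures above.
```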

It is important to note that thanks to Nutanix Distributed Storage Fabric (ADSF), the free space does not need to account for large objects (e.g.: 256GB) as Nutanix uses 1MB extents which are evenly distributed throughout the cluster, so there is no wasted space due to fragmentation unlike less advanced platforms.

Note: The size of nodes in the cluster does not impact the capacity required for a rebuild.

A couple of advantages ADSF has over other platforms are that Nutanix does not have the concept of a "cache drive" or the construct of "disk groups".

Using disk groups is a high risk to resiliency as a single "cache" drive failure can take an entire disk group (made up of several drives) offline, forcing a much more intensive rebuild operation than is required. A single drive failure in ADSF is just that, a single drive failure, and only the data on that drive needs to be rebuilt, which is of course done in an efficient, distributed manner (i.e.: a "many to many" operation as opposed to the "one to one" operation of other products).
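To see why a distributed ("many to many") rebuild matters, here's a toy model; the data size and per-node throughput figures are invented purely for illustration:

```python
# Toy model of rebuild time: one-to-one vs many-to-many.
# The throughput and data figures are invented purely for illustration.

failed_data_tb = 8                 # data to re-protect after a failure
per_node_rebuild_mbps = 400        # sustained rebuild throughput per node (MB/s)
cluster_nodes = 8

def rebuild_hours(data_tb: float, participating_nodes: int, mbps_per_node: float) -> float:
    total_mbps = participating_nodes * mbps_per_node
    return data_tb * 1024 * 1024 / total_mbps / 3600

print(f"One-to-one rebuild:   {rebuild_hours(failed_data_tb, 1, per_node_rebuild_mbps):.1f} hours")
print(f"Many-to-many rebuild: {rebuild_hours(failed_data_tb, cluster_nodes - 1, per_node_rebuild_mbps):.1f} hours")
```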

The only time when a single drive failure causes an issue on Nutanix is with single SSD systems in which it’s the equivalent of a node failure, but to be clear this is not a limitation of ADSF, just that of the hardware specification chosen.

For production environments I don't recommend single-SSD systems, as the resiliency advantages of a dual-SSD system outweigh its minimal additional cost.

Interesting point: vSAN is arguably always a single SSD system since a “Disk group” has just one “cache drive” making it a single point of failure.

I'm frequently asked what happens after a cluster self heals and another failure occurs. Back in 2013 when I started with Nutanix I presented a session at vForum Sydney where I covered this topic in depth. The session was standing room only and, as a result of its popularity, I wrote the following blog post which shows how a five node cluster can self heal from a failure into a fully resilient four node cluster and then tolerate another failure and self heal to a three node cluster.

This capability is nothing new and is far and away the most resilient architecture in the market even compared to newer platforms.

Scale Out Shared Nothing Architecture Resiliency by Nutanix

When you need to allow for failures of constructs such as "disk groups", the amount of free space you need to reserve for failures is much higher, as we can learn from a recent VMware vSAN article titled "vSAN degraded device handling".

Two key quotes to consider from the article are:

"we strongly recommend keeping 25-30% free 'slack space' capacity in the cluster."

"If the drive is a cache device, this forces the entire disk group offline."

When you consider the flaws in the underlying vSAN architecture it becomes logical why VMware recommend 25-30% free space in addition to FTT2 (three copies of data).

Next let’s go through the self healing of the Management stack from node failures.

All components which are required to configure, manage, monitor, scale and automate are fully distributed across all nodes within the cluster. There is no requirement for customers to deploy management components for core functionality (e.g.: unlike vSAN/ESXi, which requires vCenter).

There is also no need for users to make the management stack highly available, again unlike vSAN/ESXi.

As a result, there is no single point of failure with the Nutanix/Acropolis management layer.

Let's take a look at a typical four node cluster:

Below we see four Controller VMs (CVMs) which service one node each. In the cluster we have an Acropolis Master along with multiple Acropolis Slave instances.

[Image: Four node cluster with an Acropolis Master and Acropolis Slaves]

In the event the Acropolis Master becomes unavailable for any reason, an election will take place and one of the Acropolis Slaves will be promoted to Master.

This can be achieved because Acropolis data is stored in a fully distributed Cassandra database which is protected by the Acropolis Distributed Storage Fabric.

When an additional Nutanix node is added to the cluster, an Acropolis Slave is also added, which allows the workload of managing the cluster to be distributed, therefore ensuring management never becomes a point of contention.

[Image: Five node cluster with an additional Acropolis Slave]

Things like performance monitoring, stats collection and Virtual Machine console proxy connections are just a few of the management tasks which are serviced by the Master and Slave instances.

Another advantage of Nutanix is that the management layer never needs to be sized or scaled manually. There are no vApp/s, database server/s or Windows instances to deploy, install, configure, manage or license, therefore reducing cost and simplifying management of the environment.

Key point:

  1. The Nutanix Acropolis Management stack is automatically scaled as nodes are added to the cluster, therefore increasing consistency, resiliency and performance, and eliminating the potential for architectural (sizing) errors which may impact manageability.

The reason I'm highlighting a competitor's product is because it's important for customers to understand the underlying differences, especially when it comes to critical factors such as resiliency for both the data and management layers.

Summary:

Nutanix ADSF provides excellent self healing capabilities without the requirement for hardware replacement for both the data and management planes and only requires the bare minimum capacity overheads to do so.

If a vendor led with any of the below statements (all true of vSAN), I bet the conversation would come to an abrupt halt.

  1. A single SSD is a single point of failure and causes multiple drives to concurrently go offline and we need to rebuild all that data
  2. We strongly recommend keeping 25-30% free “slack space” capacity in the cluster
  3. Rebuilds are a slow, one-to-one operation and in some cases do not start for 60 minutes.
  4. In the event of a node failure in a three node vSAN cluster, data is not re-protected and remains at risk until the node is replaced AND the rebuild is complete.

When choosing a HCI product, consider its self-healing capabilities for both the data and management layers, as both are critical to the resiliency of your infrastructure. Don't put yourself at risk of downtime by being dependent on hardware replacements being delivered in a timely manner. We've all experienced, or at least heard of, horror stories where vendor HW replacement SLAs have not been met due to parts not being available, so be smart and choose a platform which minimises risk by fully self healing.

Index:
Part 1 – Node failure rebuild performance
Part 2 – Converting from RF2 to RF3
Part 3 – Node failure rebuild performance with RF3
Part 4 – Converting RF3 to Erasure Coding (EC-X)
Part 5 – Read I/O during CVM maintenance or failures
Part 6 – Write I/O during CVM maintenance or failures
Part 7 – Read & Write I/O during Hypervisor upgrades
Part 8 – Node failure rebuild performance with RF3 & Erasure Coding (EC-X)
Part 9 – Self healing
Part 10 – Disk Scrubbing / Checksums