Nutanix Resiliency – Part 9 – Self healing

Nutanix has a number of critically important & unique self healing capabilities which differentiate the platform from not only traditional SAN/NAS arrays but other HCI products.

Nutanix can automatically and fully self heal not only from the loss of SSD/HDD/NVMe devices and node failures, but can also fully recover the management stack (PRISM) without user intervention.

First let’s go through the self healing of the data from device/node failure/s.

Let’s take a simple comparison between a traditional dual controller SAN and the average* size Nutanix cluster of eight nodes.

*Average is calculated as the total number of nodes sold divided by the number of customers globally.

In the event of a single storage controller failure, the SAN/NAS is left with no resiliency and is at the mercy of the service level agreement (SLA) with the vendor to replace the component before resiliency (and in many cases performance) can be restored.

Compare that to Nutanix, where only one of the eight storage controllers (or 12.5%) is offline, leaving seven to continue serving the workloads and automatically restore resiliency, typically in just minutes, as Part 1 demonstrated.

I’ve previously written a blog titled Hardware support contracts & why 24×7 4 hour onsite should no longer be required, which covers this concept in more detail. Long story short: if restoring the resiliency of a platform is dependent on the delivery of new parts, or worse, human intervention, the risk of downtime or data loss is dramatically higher than on a platform which can self heal back to a fully resilient state without hardware replacement or human intervention.

Some people (or competitors) might argue, “What about a smaller (Nutanix) cluster?”.

I’m glad you asked: even a four node cluster can suffer a node failure and FULLY SELF HEAL into a resilient three node cluster without hardware replacement or human intervention.

The only scenario where a Nutanix environment cannot fully self heal to a state where another node failure can be tolerated without downtime is a three node cluster. BUT even in a three node cluster, one node failure can be tolerated: data will be re-protected and the cluster will continue to function with just two nodes. A subsequent node failure would result in downtime, but critically, no data loss would occur.

Critically, drive failures can still be tolerated in the degraded state where only two nodes are running.

Note: In the event of a node failure in a three node vSAN cluster, data is not re-protected and remains at risk until the node is replaced AND the rebuild is complete.

The only prerequisite for Nutanix to perform a complete self heal of the data (and even the management stack, PRISM) is that sufficient capacity exists within the cluster. How much capacity, you ask? I recommend N+1 for RF2 configurations, or N+2 for RF3 configurations (assuming two concurrent failures, or one failure followed by a subsequent failure).

So the worst case scenario for the minimum cluster size would be 33% free space for a three node RF2 cluster, or 40% for a five node RF3 cluster. However, before the competitors break out the Fear, Uncertainty and Doubt (FUD), let’s look at how much capacity is required for self healing as cluster sizes increase.

The following table shows the percentage of capacity required to fully self heal based on N+1 and N+2 for cluster sizes up to 32 nodes.

Note: These values assume the worst case scenario that all nodes are at 100% capacity, so in the real world the overhead will be lower than the table indicates.

[Table: Capacity reserved for rebuild by cluster size, N+1 and N+2]

As we can see, for an average size (eight node) cluster, the free space required is just 13% (rounded up from 12.5%).

If we take N+2 for an eight node cluster, the MAXIMUM free space required to tolerate two node failures and a full rebuild to a resilient state is still just 25%.
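The percentages above can be sanity checked with a quick sketch. This is purely illustrative (not Nutanix code) and assumes identically sized nodes at 100% utilisation, the same worst case the table uses:

```python
# Worst-case share of cluster capacity to keep free so an n node cluster
# can fully self heal after `failures` node losses (identical nodes assumed).

def rebuild_capacity_pct(nodes: int, failures: int) -> float:
    if failures >= nodes:
        raise ValueError("cannot lose every node and still rebuild")
    return failures / nodes * 100

for n in (4, 8, 16, 32):
    print(f"{n:>2} nodes: N+1 = {rebuild_capacity_pct(n, 1):6.2f}%  "
          f"N+2 = {rebuild_capacity_pct(n, 2):6.2f}%")
```

For the average eight node cluster this gives 12.5% (N+1) and 25% (N+2), matching the figures quoted above.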

It is important to note that thanks to the Acropolis Distributed Storage Fabric (ADSF), the free space does not need to account for large objects (e.g.: 256GB), as Nutanix uses 1MB extents which are evenly distributed throughout the cluster, so there is no wasted space due to fragmentation, unlike less advanced platforms.

Note: The size of nodes in the cluster does not impact the capacity required for a rebuild.

A couple of advantages ADSF has over other platforms are that Nutanix does not have the concept of a “cache drive” or the construct of “disk groups”.

Using disk groups is a high risk to resiliency, as a single “cache” drive failure can take an entire disk group (made up of several drives) offline, forcing a much more intensive rebuild operation than is required. A single drive failure in ADSF is just that, a single drive failure, and only the data on that drive needs to be rebuilt, which is of course done in an efficient distributed manner (i.e.: a “Many to Many” operation as opposed to a “One to One” like other products).
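A back-of-envelope model shows why the “Many to Many” approach matters. The throughput figure below is an assumption for illustration, not a vendor benchmark; the point is simply that rebuild time shrinks linearly with the number of participating nodes:

```python
# Illustrative model only: rebuild time for re-replicating lost data,
# comparing a single source/destination pair with a distributed rebuild.

def rebuild_hours(data_tb: float, per_node_gb_s: float, participants: int) -> float:
    """Hours to re-replicate data_tb TB at per_node_gb_s GB/s per participant."""
    return (data_tb * 1024) / (per_node_gb_s * participants) / 3600

# 8 TB to re-protect, assuming 0.5 GB/s sustained rebuild throughput per node
one_to_one = rebuild_hours(8, 0.5, 1)    # single pair: roughly 4.6 hours
many_to_many = rebuild_hours(8, 0.5, 7)  # seven survivors: roughly 0.65 hours
```

With seven surviving nodes sharing the work, the rebuild completes seven times faster than a one-to-one copy of the same data.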

The only time when a single drive failure causes an issue on Nutanix is with single SSD systems in which it’s the equivalent of a node failure, but to be clear this is not a limitation of ADSF, just that of the hardware specification chosen.

For production environments, I don’t recommend the use of single SSD systems, as the resiliency advantages of a dual SSD system outweigh its minimal additional cost.

Interesting point: vSAN is arguably always a single SSD system since a “Disk group” has just one “cache drive” making it a single point of failure.

I’m frequently asked what happens after a cluster self heals and another failure occurs. Back in 2013 when I started with Nutanix I presented a session at vForum Sydney where I covered this topic in depth. The session was standing room only, and as a result of its popularity I wrote the following blog post, which shows how a five node cluster can self heal from a failure into a fully resilient four node cluster and then tolerate another failure and self heal to a three node cluster.

This capability is nothing new and is far and away the most resilient architecture in the market even compared to newer platforms.

Scale Out Shared Nothing Architecture Resiliency by Nutanix

When you need to allow for failures of constructs such as “Disk Groups”, the amount of free space you need to reserve for failures is much higher, as we can learn from a recent VMware vSAN article titled “vSAN degraded device handling“.

Two key quotes to consider from the article are:

 we strongly recommend keeping 25-30% free “slack space” capacity in the cluster.

 

If the drive is a cache device, this forces the entire disk group offline

When you consider the flaws in the underlying vSAN architecture, it becomes logical why VMware recommends 25-30% free space in addition to FTT=2 (three copies of data).

Next let’s go through the self healing of the Management stack from node failures.

All components required to configure, manage, monitor, scale and automate are fully distributed across all nodes within the cluster. There is no requirement for customers to deploy additional management components for core functionality (e.g.: unlike vSAN/ESXi which requires vCenter).

There is also no need for users to make the management stack highly available, again unlike vSAN/ESXi.

As a result, there is no single point of failure with the Nutanix/Acropolis management layer.

Let’s take a look at a typical four node cluster:

Below we see four Controller VMs (CVMs) which service one node each. In the cluster we have an Acropolis Master along with multiple Acropolis Slave instances.

[Image: Four node cluster with one Acropolis Master and three Acropolis Slave instances]

In the event the Acropolis Master becomes unavailable for any reason, an election will take place and one of the Acropolis Slaves will be promoted to Master.

This can be achieved because Acropolis data is stored in a fully distributed Cassandra database which is protected by the Acropolis Distributed Storage Fabric.
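The election itself can be illustrated with a trivial sketch. This is purely illustrative and does not reflect Acropolis internals (which coordinate via the distributed database); it just shows the general pattern of promoting a surviving peer using a deterministic rule every node can evaluate independently:

```python
# Purely illustrative (not Acropolis internals): promote a surviving CVM to
# Master when the current Master fails, using a rule all nodes agree on.

def elect_master(live_cvms: set) -> str:
    if not live_cvms:
        raise RuntimeError("no CVMs available for election")
    return min(live_cvms)  # deterministic: e.g. lowest CVM ID wins

cvms = {"cvm-1", "cvm-2", "cvm-3", "cvm-4"}
master = elect_master(cvms)   # cvm-1 becomes Master
cvms.discard(master)          # the Master's node fails
master = elect_master(cvms)   # a Slave (cvm-2) is promoted
```

Because the rule is deterministic and the membership view is shared, every surviving node reaches the same conclusion without a coordinator.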

When an additional Nutanix node is added to the cluster, an Acropolis Slave is also added, which allows the workload of managing the cluster to be distributed, ensuring management never becomes a point of contention.

[Image: Five node cluster with an Acropolis Slave added on the new node]

Performance monitoring, stats collection and Virtual Machine console proxy connections are just a few of the management tasks serviced by the Master and Slave instances.

Another advantage of Nutanix is that the management layer never needs to be sized or scaled manually. There are no vApps, database servers or Windows instances to deploy, install, configure, manage or license, therefore reducing cost and simplifying management of the environment.

Key point:

  1. The Nutanix Acropolis management stack is automatically scaled as nodes are added to the cluster, therefore increasing consistency, resiliency and performance while eliminating the potential for architectural (sizing) errors which may impact manageability.

The reason I’m highlighting a competitor’s product is because it’s important for customers to understand the underlying differences, especially when it comes to critical factors such as resiliency for both the data and management layers.

Summary:

Nutanix ADSF provides excellent self healing capabilities without the requirement for hardware replacement for both the data and management planes and only requires the bare minimum capacity overheads to do so.

If a vendor led with any of the below statements (all true of vSAN), I bet the conversation would come to an abrupt halt.

  1. A single SSD is a single point of failure and causes multiple drives to concurrently go offline and we need to rebuild all that data
  2. We strongly recommend keeping 25-30% free “slack space” capacity in the cluster
  3. Rebuilds are a slow, One to One operation and in some cases do not start for 60 mins.
  4. In the event of a node failure in a three node vSAN cluster, data is not re-protected and remains at risk until the node is replaced AND the rebuild is complete.

When choosing an HCI product, consider its self healing capabilities for both the data and management layers, as both are critical to the resiliency of your infrastructure. Don’t put yourself at risk of downtime by being dependent on hardware replacements being delivered in a timely manner. We’ve all experienced, or at least heard of, horror stories where vendor hardware replacement SLAs have not been met due to parts not being available, so be smart: choose a platform which minimises risk by fully self healing.

Index:
Part 1 – Node failure rebuild performance
Part 2 – Converting from RF2 to RF3
Part 3 – Node failure rebuild performance with RF3
Part 4 – Converting RF3 to Erasure Coding (EC-X)
Part 5 – Read I/O during CVM maintenance or failures
Part 6 – Write I/O during CVM maintenance or failures
Part 7 – Read & Write I/O during Hypervisor upgrades
Part 8 – Node failure rebuild performance with RF3 & Erasure Coding (EC-X)
Part 9 – Self healing
Part 10 – Disk Scrubbing / Checksums

Hardware support contracts & why 24×7 4 hour onsite should no longer be required.

In recent weeks, I have seen numerous RFQs which have the requirement for 24×7 2 or 4hr onsite HW replacement, and while this is not uncommon I’ve been thinking why is this the case?

Over my IT career, now coming up on 15 years, I have in the majority of cases strongly recommended in my designs and Bills of Materials (BoMs) that customers buy 24×7 4 hour onsite hardware maintenance contracts for equipment such as compute, storage arrays, storage area networking and IP network devices.

I have never found it difficult to justify this recommendation, because traditionally, if a component in the datacenter fails, such as a storage controller, it generally has a high impact on the customer’s business and could cost tens or hundreds of thousands of dollars, or even millions, in revenue depending on the size of the customer.

Not only is losing a storage controller generally high impact, it is also high risk, as the environment may no longer have redundancy and a subsequent failure could (and likely would) result in a full outage.

So in this example, where a typical storage solution suffers a storage controller failure resulting in degraded performance (due to losing 50% of the controllers) and high impact/risk to the customer, purchasing a 24×7 4 hour, or even 24×7 2 hour, support contract makes perfect sense! The question is: why choose hardware (or a solution) which puts you at high risk after a single component failure in the first place?

With technology fast changing, over the last year or so I’ve been involved in many customer meetings where I am asked what I recommend in terms of hardware maintenance contracts (for Nutanix customers).

Normally this question/conversation happens after the discussion about the technology, where I explain various failure scenarios and how resilient a Nutanix cluster is.

My recommendation goes something like this.

If you architect your solution for your desired level of availability (e.g.: N+2), there is no need to buy a 24×7 4hr hardware maintenance contract; the default Next Business Day option is perfectly fine.

Justification:

1. In the event of even an entire node failure, the Nutanix cluster will have automatically self healed back to the configured resiliency factor (2 or 3) well before even a 2hr support contract can get a technician onsite to diagnose the issue and replace the hardware.

2. Assuming the hardware is replaced at the 2hr mark (not typical in my experience), AND assuming Nutanix was not already self healing prior to the drive/node replacement, the replacement drive or node would only then START the process of self healing, so the actual time to recovery would be greater than 2hrs. In reality, Nutanix begins self healing almost immediately.

3. If a cluster is sized for the desired level of availability based on business requirements, say N+2, a node can fail, Nutanix will automatically self heal, and the cluster can then tolerate a subsequent failure with the ability to fully self heal back to the configured resiliency factor (2 or 3) again.

4. If a cluster is sized to a customer requirement of only N+1, a node can fail and Nutanix will automatically and fully self heal. Then, in the unlikely (but possible) event of a subsequent failure (i.e.: a 2nd node failure before the next business day warranty replaces the failed hardware), the Nutanix cluster will still continue to operate.

5. The performance impact of a node failure in a Nutanix environment is N-1, so in the worst case scenario (a 3 node cluster) the impact is 33%, compared to a 2 controller SAN/NAS where the impact would be 50%. In a 4 node cluster the impact is only 25%, and for a customer with, say, 8 nodes, only 12.5%. The bigger the cluster, the lower the impact. Nutanix recommends N+1 up to 16 nodes and N+2 up to 32 nodes; beyond 32 nodes, higher levels of availability may be desired based on customer requirements.
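The impact figures in point 5 follow directly from dividing by the controller count (a simplification that assumes evenly loaded, identical controllers):

```python
# Performance lost when one of n identical, evenly loaded controllers fails.

def failure_impact_pct(controllers: int) -> float:
    return 100.0 / controllers

print(f"dual controller SAN: {failure_impact_pct(2):.1f}%")  # 50.0%
print(f"3 node cluster:      {failure_impact_pct(3):.1f}%")  # 33.3%
print(f"4 node cluster:      {failure_impact_pct(4):.1f}%")  # 25.0%
print(f"8 node cluster:      {failure_impact_pct(8):.1f}%")  # 12.5%
```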

The risk and impact of the failure scenario/s is key. In the case of Nutanix, because of the self healing capability, and the fact that all controllers and SSDs/HDDs in the cluster participate in the self heal, recovery is very quick and low impact. So the impact of the failure is low (N-1) and the recovery is fast, meaning the risk to the business is low, therefore dramatically reducing (and in my opinion potentially removing) the requirement for a 24×7 2 or 4hr support contract for Nutanix customers.

In Summary:

1. The decision on what hardware maintenance contract is appropriate is a business level decision which should be based in part on a comprehensive risk assessment done by an experienced enterprise architect, intimately familiar with all the technology being used.

2. If the recommendation from the trusted, experienced enterprise architect is that the risk of hardware failure causing high impact or an outage to the business is so high that purchasing 4hr or 2hr onsite hardware replacement is required, my advice would be to reconsider whether the proposed “solution” meets the business requirements. Only if you are constrained to that solution should you purchase a 24×7 2 or 4hr support contract.

3. Being heavily dependent on hardware being replaced to restore the resiliency / performance of a solution is in itself a high risk to the business.

AND

4. In my experience, it is not uncommon to have problems getting onsite support or hardware replacement regardless of the support contract / SLA. Sometimes this is outside a vendor’s control, but most vendors will experience one or more of the following issues, which I have personally experienced on numerous occasions in previous roles:

a) Vendors failing to meet SLA for onsite support.
b) Vendors failing to have the required parts available within the SLA.
c) Replacement HW being refurbished (common practice) and being faulty.
d) The more proprietary the hardware, the more likely replacement parts will not be available in a timely manner.

Note: Support contracts don’t promise resolution within the 2hr / 4hr window; they simply promise somebody will be onsite, and in some cases only after you have gone through troubleshooting with the vendor on the phone, sent logs for analysis and so on. So in reality, the 2hr or 4hr figure doesn’t hold much value.

If you have accepted the solution being sold to you, OR you’re an architect recommending a solution which is enterprise grade and highly resilient with self healing capabilities, then consider why you need a 24×7 2hr or 4hr hardware maintenance contract if the solution is architected for the required availability level (i.e.: N+1 / N+2 etc.).

So with your next infrastructure purchase (or when making your recommendations if you’re an architect), carefully consider what solution you’re investing in (or proposing), and if you feel an aggressive 2hr/4hr hardware support contract is required, I would recommend revisiting the requirements, as you may well be buying (or recommending) something that isn’t resilient enough to meet them.

Food for thought.

Example Architectural Decision – Default Virtual Machine Compatibility Configuration

Problem Statement

In a VMware vSphere 5.5 environment, what is the most suitable configuration for Virtual Machine Compatibility setting at the Datacenter and Cluster layers?

Assumptions

1. vSphere Flash Read Cache is not required.
2. VMDKs of greater than 2TB minus 512b are not required.

Motivation

1. Reduce complexity where possible.
2. Maximize supportability.

Architectural Decision

Configure the vSphere Datacenter level “Default VM Compatibility” as “ESXi 5.1 or later” and leave the vSphere Cluster level “Default VM Compatibility” as “Use datacenter setting and host version” (default).

Justification

1. Avoid limiting management of the environment to the vSphere Web Client.
2. The Default VM Compatibility only needs to be set once at the datacenter layer and then all clusters within the datacenter will inherit the desired setting.
3. Reduce the dependency of the Web Client in the event of a disaster recovery.
4. As vFRC, >2TB VMDKs and vGPU are not required, there is no significant advantage to HW Version 10.
5. Ensures a standard virtual machine compatibility level is maintained throughout the environment, reducing the chance of mismatched VM versions.
6. Simplicity.

Implications

1. Virtual Machine Hardware Compatibility automatic update must be DISABLED to prevent the VM hardware being automatically upgraded following a shutdown.
2. vSphere Flash Read Cache (vFRC) cannot be used.
3. VMDKs will remain limited at 2TB minus 512b.

Alternatives

1. Virtual Machine HW Version 10 (vSphere 5.5 onwards).
2. Virtual Machine HW Version 8 (vSphere 5.0 onwards).
3. Virtual Machine HW Version 7 (vSphere 4.1 onwards).
4. Older Virtual machine HW versions.