Nutanix Resiliency – Part 5 – Read I/O during CVM maintenance or failures?

In the earlier parts of this series we’ve talked about how ADSF can recovery quickly from a node failure by re-protecting data in a distributed manner across the cluster. We also covered how resiliency can be increased from Resiliency Factor 2 (RF2) to RF3 and even changed to a more space efficient Erasure Coding (EC-X) configuration all without interruption.

Now let’s cover the critically important topic of how VMs are impacted during Nutanix Controller VM (CVM) maintenance such as AOS upgrades OR during failures such as the CVM crashing or even being accidentally or maliciously turned off.

Let’s quickly cover the basics of how Nutanix ADSF writes and protects data.

Looking an the following diagram we see a three node cluster with a single Virtual Machine. The VM has written some data represented by a,b,c & d & under normal circumstances all writes will have one replica written to the host running the VM (in this case Node 1) and the other replica (or replicas in the case of RF3) distributed throughput the cluster based on disk fitness values. The disk fitness values (or what I call “Intelligent replica placement”) ensure data is placed in the most optimal place the first time.

RF2Overview

If one or more nodes are added to the cluster, the Intelligent replica placement will send proportionally more replicas to those nodes until the cluster is in a balanced state. In the very unlikely even no new writes are occurring, ADSF has a background disk balancing process which will balance the cluster as a low priority.

Now that we know the basics of how Nutanix protects data using multiple replicas (called “Resiliency Factor”) let’s talk about what happens during a Nutanix ADSF storage layer upgrade.

Upgrades are initiated by a one-click process and performed in rolling style one controller VM (CVM) at a time regardless of the configured Resiliency Factor and if Erasure Coding (EC-X) is used or not. The rolling upgrade put simply takes one CVM offline at a time, performs the upgrade, performs and self check and then rejoins the cluster and then repeats the process on the next CVM.

One of the many advantages of Nutanix decoupling the storage from the hypervisor (i.e.: not embedding storage into the kernel) is that upgrades and even storage layer failures do not impact the running Virtual machines.

VMs do not need to be restarted (i.e.: Like a HA event) nor do they need to migrate (e.g.: vMotion) to another node. VMs continue without interruption to storage traffic even when the local controller is offline for any reason.

If the local CVM is down for maintenance or due to failure, the I/O is dynamically re-directed throughout the cluster.

Let’s look at a Read I/O when the CVM local to a VM is offline (for any reason).

The local CVM being offline means the physical drives (NVMe, SSD, HDD etc) are not available meaning the local data (replicas) is unavailable.

All read I/O will be redirected and continue to function as it will now be served by all CVMs in the cluster.

ReadIOServedRemotelyWhenCVMifOffline

This maintenance/failure scenario could be compared to a 3 Tier architecture in that the node running the VM is not currently providing storage and is connecting to the storage over a network. But as Nutanix is a distributed architecture all nodes within the cluster service the reads meaning in the worst case scenario of a three node cluster, during a failure or maintenance Nutanix has an equivalent architecture to an optimally performing dual controller storage array.

Let’s cover that one more time, in the WORST case scenario where the smallest cluster has suffered a failure (or maintenance) causing the read IO to be served remotely, Nutanix in a degraded state is at worst equivalent to a compute node accessing a dual controller storage array in it’s OPTIMAL state.

If the Nutanix cluster was for example eight nodes and one node was performing maintenance or the CVM was down for any reason, seven nodes would be serving IO to the VMs on that node. This process is actually nothing new and something Nutanix has done for a long time. It’s described in more detail in Acropolis Hypervisor (AHV) I/O Failover & Load Balancing which was published in July 2015.

Once the local CVM is back online, Read I/O is once again serviced by the local CVM and the only remote reads which occur will be in the case where a copy of data does not exist on the local node. When remote read/s occur, the 1MB extent which holds the data being read will be localised to allow subsequent reads to be local. It’s critical to understand the process of localising the extent (replica) adds no additional overhead on the network compared to a remote read so localising benefits performance without additional overheads.

Summary:

  1. ADSF writes data on the node where the VM resides to ensure subsequent reads are local.
  2. Read I/O is serviced by the local CVM and when the Local CVM is unavailable for any reason the read I/O is serviced by all CVMs in the cluster in a distributed manner
  3. Virtual machines do not need to be failed over or evacuated from a node when the local CVM is offline due to maintenance or failure
  4. In the worst case scenario of a 3 node cluster and a CVM down, a virtual machine running on Nutanix has it’s traffic serviced by at least two storage controllers which is the best case scenario for a Server + Dual Controller Storage Array (3 Tier) architecture.
  5. In clusters larger than three, Virtual machines on Nutanix enjoy more storage controllers serving their read I/O than an optimal scenario for a Server + Dual Controller Storage Array (3 Tier) architecture.

Index:
Part 1 – Node failure rebuild performance
Part 2 – Converting from RF2 to RF3
Part 3 – Node failure rebuild performance with RF3
Part 4 – Converting RF3 to Erasure Coding (EC-X)
Part 5 – Read I/O during CVM maintenance or failures
Part 6 – Write I/O during CVM maintenance or failures
Part 7 – Read & Write I/O during Hypervisor upgrades
Part 8 – Node failure rebuild performance with RF3 & Erasure Coding (EC-X)
Part 9 – Self healing
Part 10: Nutanix Resiliency – Part 10 – Disk Scrubbing / Checksums

Nutanix AHV/AOS Functionality – Removing nodes

A Nutanix ADSF (Acropolis Distributed Storage Fabric) is designed to live forever, meaning as new nodes are added and older nodes removed, the cluster remains online and critically, in a fully resilient state at all times.

While this might not sound that critical, it avoids problems which have plagued legacy (and even many modern) datacenter products where forklift upgrades/replacements are not only complex, high risk and time consuming, they typically also reduce the resiliency of the platform throughout the process.

A common example of reduced resiliency is where one (of two) SAN/NAS controllers is taken offline during a fork lift storage controller upgrade, meaning a single failure can cause the storage to be offline.

Nutanix has now been shipping product for around 5 years so we have had many customers go through hardware refresh cycles, and many more who are about to embark on a HW refresh.

I thought I would quickly demonstrate how easy it is to remove an old node from a cluster and ensure existing and prospective Nutanix customers have the facts about the node removal process.

Firstly lets look at the environment the demonstration is performed on.

We have an AHV environment with 8 nodes with a mix of NX3050 and NX6050 spread over 3 blocks as shown in Nutanix PRISM UI (below).

EnvironmentSummary

To remove a host, all we need to do is go to the hardware tab in PRISM, click the host we want to remove and select Remove Host as shown below.

RemoveHost

No preparation tasks are required at all which also means less planning and change control is required. Once you select Remove Host, the host enters maintenance mode and starts performing the required tasks to remove the node as shown below.

RemoveHost2

As you can see, Acropolis OS (AOS) is removing each individual disk from the cluster before taking the node out of the cluster. This means the configured Resiliency Factor (RF) is always in compliance, ensuring that data is still available even in the event of a drive or node failure. This can be observed on the PRISM Home screen in the data resiliency view shown below.

DataResiliencyStatus

This process is handled by the curator function of AOS and because data is distributed throughout all nodes within the cluster, the process is both lower impact than traditional RAID based solutions or solutions using RAID+Replication, as well as faster because all nodes and therefore CVMs, SSDs and HDDs participate in the process. Nutanix ADSF does not mirror or replicate data from one node to another node, but to and from all nodes. This eliminates the potential bottleneck of a single node.

The following shows the speed at which Nutanix Distributed Storage Fabric (ADSF) performs the data migration even when the majority of data resides on the HDD tier (including in this example).

StoragePoolPerfNodeRemove

For a cluster with 20 x 1TB and 20 x 4TB SATA spindles for a total of 100TB of SATA and just 6.4TB SSD (or approx 6.5%) the node removal rate where it reached >830MBps quite impressive since most of the extents (data) which needed to be replicated throughout the cluster were retrieved from SATA tier.

The rate at which a node can be removed will vary depending on the front end I/O, node types and cluster size with larger cluster sizes able to remove nodes faster due to more available controllers (CMVs) and importantly more choice of source and destination of extents.

The process can be monitored via the Tasks view (shown earlier) or at a very granular level such as per disk (SSD or HDD).

The below shows us the status of the disk is Migrating Data and it also shows the drive had a significant amount of data on it as this was not an empty cluster demonstration. In fact this screen shot was taken about halfway through the node removal process.

DiskStatus

So many of you may be wondering what the CVM CPU utilisation is throughout this process During the process I took the following screenshot showing the eight Controller VMs, there vCPU configuration (8 vCPUs) and the CPU utilisation.

CVMCPUutilRemoveHost

As we can see, the utilisation ranges from just 6% through to 16% with an average of just under 10%. It should be noted these nodes are using Intel Ivy Bridge processors so with latest generation Intel Broadwell chipsets the process would use less percentage of CPU and perform faster (due to higher per core performance) than on this 3 year old equipment.

Note: The CVM is not just doing IO processing. It is providing the full AHV / AOS management stack which makes the fact the CVM is using under 10% CPU even more impressive.

The Remove host task also resets the configuration of the Controller VM (CVM) back to default which ensures the node can be quickly/easily added to a new or existing cluster.

The end result is a fully functional 7 node cluster as shown below.

EndResultNodeRemoval

Summary:

Node removal from a Nutanix cluster (regardless of hypervisor) is a 1-Click, Non disruptive operation which maintains cluster resiliency at all times while being a fast and low impact process.

Related Articles:

1. VMware you’re full of it (FUD) : Nutanix CVM/AHV & vSphere/VSAN overheads

2. Why Nutanix Acropolis hypervisor (AHV) is the next generation hypervisor

3. Think HCI is not an ideal way to run mission-critical x86 workloads? Think Again!

Enterprise Architecture & Avoiding tunnel vision.

Recently I have read a number of articles and had several conversations with architects and engineers across various specialities in the industry and I’m finding there is a growing trend of SMEs (Subject Matter Experts) having tunnel vision when it comes to architecting solutions for their customers.

What I mean by “Tunnel Vision” is that the architect only looks at what is right in front of him/her (e.g.: The current task/project) , and does not consider the implications of how the decisions being made for this task may impact the wider I.T infrastructure and customer from a commercial / operational perspective.

In my previous role I saw this all to often, and it was frustrating to know the solutions being designed and delivered to the customers were in some cases quite well designed when considered in isolation, but when taking into account the “Big Picture” (or what I would describe as the customers overall requirements) the solutions were adding unnecessary complexity, adding risk and increasing costs, when new solutions should be doing the exact opposite.

Lets start with an example;

Customer “ACME” need an enterprise messaging solution and have chosen Microsoft Exchange 2013 and have a requirement that there be no single points of failure in the environment.

Customer engages an Exchange SME who looks at the requirements for Exchange, he then points to a vendor best practice or reference architecture document and says “We’ll deploy Exchange on physical hardware, with JBOD & no shared storage and use Exchange Database Availability Groups for HA.”

The SME then attempts to justify his recommendation with “because its Microsoft’s Best practice” which most people still seem to blindly accept, but this is a story for another post.

In fairness to the SME, in isolation the decision/recommendation meets the customers messaging requirements, so what’s the problem?

If the customers had no existing I.T and the messaging system was going to be the only I.T infrastructure and they had no plans to run any other workloads, I would say the solution proposed could be a excellent solution, but how many customers only run messaging? In my experience, none.

So lets consider the customer has an existing Virtual environment, running Test/Dev, Production and Business Critical applications and adheres to a “Virtual First” policy.

The customer has already invested in virtualization & some form of shared storage (SAN/NAS/Web Scale) and has operational procedures and expertises in supporting and maintaining this environment.

If we were to add a new “silo” of physical servers, there are many disadvantages to the customer including but not limited too;

1. Additional operational documentation for new Physical environment.

2. New Backup & Disaster Recovery strategy / documentation.

3. Additional complexity managing / supporting a new Silo of infrastructure.

4. Reduced flexibility / scalability with physical servers vs virtual machines.

5. Increased downtime and/or impact in the event hardware failures.

6. Increased CAPEX due to having to size for future requirements due to scaling challenges with physical servers.

So what am I getting at?

The cost of deploying the MS Exchange solution on physical hardware could potentially be cheaper (CAPEX) Day 1 than virtualizing the new workload on the existing infrastructure (which likely needs to be scaled e.g.: Disk Shelves / Nodes) BUT would likely result overall higher TCO (Total Cost of Ownership) due to increased complexity & operational costs due to the creation of a new silo of resources.

Both a physical or virtual solution would likely meet/exceed the customers basic requirement to serve MS Exchange, but may have vastly different results in terms of the big picture.

Another example would be a customer has a legacy SAN which needs to be replaced and is causing issues for a large portion of the customers workloads, but the project being proposed is only to address the new Enterprise messaging requirements. In my opinion a good architect should consider the big picture and try to identify where projects can be combined (or a projects scope increased) to ensure a more cost effective yet better overall result for the customer.

If the architect only looked at Exchange and went Physical Servers w/ JBOD, there is zero chance of improvement for the rest of the infrastructure and the physical equipment for Exchange would likely be oversized and underutilized.

It will in many cases be much more economical to combine two or more projects, to enable the purchase of a new technology or infrastructure components and consolidate the workloads onto shared infrastructure rather than building two or more silo’s which add complexity to the environment, and will likely result in underutilized infrastructure and a solution which is inferior to what could have been achieved by combining the projects.

In conclusion, I hope that after reading this article, the next time you or your customers embark on a new project, that you as the Architect, Project Manager, or Engineer consider the big picture and not just the new requirement and ensure your customer/s get the best technical and business outcomes and avoid where possible the use of silos.