Why Nutanix Acropolis hypervisor (AHV) is the next generation hypervisor – Part 6 – Performance

When talking about performance, it’s easy to get caught up in comparing unrealistic speeds and feeds such as 4k I/O benchmarks. But, as any real datacenter technology expert knows, IOPS are just a small piece of the puzzle which, in my opinion, gets far too much attention, as I discussed in my article Peak Performance vs Real World Performance.

When I talk about performance, I am referring to all the components within the datacenter including the Management components, Applications/VMs, Analytics, Data Resiliency and everything in between.

Let’s look at a few examples of how Nutanix XCP running Acropolis Hypervisor (AHV) ensures consistent high performance for all components:

Management Performance:

The Acropolis management layer includes the Acropolis Operating System (formerly NOS), Prism (HTML 5 GUI) and the Acropolis Hypervisor (AHV) management stack made up of “Master” and “Slave” instances.

This architecture ensures all CVMs actively and equally contribute to keeping all areas of the platform running smoothly. This means there is no central application, database or component which can cause a bottleneck. Being fully distributed is key to delivering a web-scale platform.

AcropolisCluster1

Each Controller VM (CVM) runs the components required to manage the local node and contribute to the distributed storage fabric and management tasks.

For example: While there is a single Acropolis “Master” it is not a single point of failure nor is it a performance bottleneck.

The Acropolis Master is responsible for the following tasks:

  1. Scheduler for HA
  2. Network Controller
  3. Task Executors
  4. Collector/Publisher of local stats from Hypervisor
  5. VNC Proxy for VM Console connections
  6. IP address management

Each Acropolis Slave is responsible for the following tasks:

  1. Collector/Publisher of local stats from Hypervisor
  2. VNC Proxy for VM Console connections

Regardless of being a Master or Slave, each CVM performs the two heaviest tasks: The Collection & Publishing of Hypervisor stats and, when in use, the VM console connections.

The distributed nature of the XCP platform allows it to achieve consistently high performance. Sending stats to a central location, such as a central management VM and its associated database server, can not only become a bottleneck but, without some form of application level HA (e.g.: SQL Always On Availability Group), it can also be a single point of failure, which is unacceptable for most customers.

The roles which are performed by the Acropolis Master are all lightweight tasks such as the HA scheduler, Network Controller, IP address management and Task Executor.

The HA scheduler task is only active in the event of a node failure which makes it a very low overhead for the Master. The Network Controller task is only active when tasks such as new VLANs are being configured and Task Execution is simply keeping track of all tasks and distributing them for execution across all CVMs. IP address management is essentially a DHCP service, which is also an extremely low overhead.

In part 8, we will discuss more about Acropolis Analytics.

Data Locality

Data locality is a unique feature of XCP where new Write I/O is written to the local node where the VM is running and is also replicated to other node/s within the cluster. Data locality eliminates the need to service subsequent Read I/O by traversing the network and utilizing a remote controller.

As VMs migrate around a cluster, Write I/O is always written locally and remote reads will only occur if remote data is accessed. If data is remote and never accessed, no remote I/O will occur. As a result, it is typical for >90% of I/O to be serviced locally.

Currently bandwidth and latency across a well designed 10Gb network may not be an issue for some customers, however as flash performance exponentially increases the network could quite easily become a major bottleneck without moving to expensive 40Gb (or higher) networking. Data locality helps minimize the dependency on the network by servicing the majority of Read I/O locally and by writing one copy locally it reduces the overheads on the network for Write I/O. Therefore Data Locality allows customers to run lower cost networking without compromising performance.
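To make the network argument concrete, here is a rough back-of-the-envelope sketch. It is not a Nutanix tool, and the function name, the 70/30 read/write mix and the worst-case assumption that a design without locality places every copy on a remote node are all illustrative assumptions. The 90% local read figure is taken from the paragraph above.

```python
def network_bytes(read_bytes, write_bytes, rf=2,
                  local_read_fraction=0.9, local_write_copy=True):
    """Estimate how much of a VM's I/O has to cross the network.

    Hypothetical model: reads not served locally cross the network once;
    each write crosses the network once per remote copy.
    """
    remote_reads = read_bytes * (1.0 - local_read_fraction)
    # With data locality one copy stays on the local node, so only
    # (rf - 1) copies are sent to other nodes.
    remote_copies = rf - 1 if local_write_copy else rf
    remote_writes = write_bytes * remote_copies
    return remote_reads + remote_writes


# Illustrative workload: 700 MB of reads and 300 MB of writes at RF2.
with_locality = network_bytes(700, 300)
without_locality = network_bytes(700, 300,
                                 local_read_fraction=0.0,
                                 local_write_copy=False)
print(round(with_locality), round(without_locality))
```

Under these assumptions the same workload puts roughly 370 MB on the network with data locality versus 1300 MB without it, which is the reasoning behind being able to run lower cost networking.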

While data locality works across all supported hypervisors, AHV is unique as it supports data-aware virtual machine placement: Virtual Machines are powered on to the node with the highest percentage of local data for that VM, which minimizes the chance of remote I/O and reduces the overheads involved in servicing I/O for each VM following failures or maintenance.

In addition, Data Locality also applies to the collection of back end data for Analysis such as hypervisor and virtual machine statistics. As a result, statistics are written locally and a second copy (or third for environments configured with RF3) is written remotely. This means stats data, which can be a significant amount of data, has the lowest possible impact on the Distributed File System and the cluster as a whole.
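The placement rule described above can be sketched in a few lines. This is only an illustration of the idea, not the actual AHV scheduler: the function name and the input dictionary are hypothetical, and a real scheduler would also have to consider CPU and memory availability.

```python
def pick_node(local_bytes_by_node, vm_total_bytes):
    """Return (node, local_share) for the node holding most of the VM's data.

    local_bytes_by_node: hypothetical mapping of node name -> bytes of the
    VM's data stored on that node.
    """
    if not local_bytes_by_node:
        raise ValueError("no candidate nodes")
    best = max(local_bytes_by_node, key=local_bytes_by_node.get)
    share = local_bytes_by_node[best] / vm_total_bytes if vm_total_bytes else 0.0
    return best, share


# Example: 70% of the VM's data lives on node-b, so it is powered on there.
node, share = pick_node({"node-a": 10, "node-b": 70, "node-c": 20}, 100)
print(node, share)
```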

Summary:

  1. Management components scale with the cluster to ensure consistent performance
  2. Data locality ensures data is as close to the Compute (VM) as possible
  3. Intelligent VM placement based on Data location
  4. All Nutanix Controller VMs work as a team (not in pairs) to ensure optimal performance of all components and workloads (VMs) in the cluster

Why Nutanix Acropolis hypervisor (AHV) is the next generation hypervisor – Part 5 – Resiliency

When discussing resiliency, it is common to make the mistake of only looking at data resiliency and not considering resiliency of the storage controllers and the management components required to service the business applications.

Legacy technologies such as RAID and Hot Spare drives may in some cases provide high resiliency for data, however if they are backed by dual controller type setups which cannot scale out and self heal, the data may be unavailable or performance/functionality severely degraded following even a single component failure. Infrastructure that is dependent on HW replacement to restore resiliency following a failure is fundamentally flawed, as I have discussed in: Hardware support contracts & why 24×7 4 hour onsite should no longer be required.

In addition if the management application layer is not resilient, then data layer high-availability/resiliency may be irrelevant as the business applications may not be functioning properly (i.e.: At normal speeds) or at all.

The Acropolis platform provides high resiliency for both the data and management layers at a configurable N+1 or N+2 level (Resiliency Factor 2 or 3) which can tolerate up to two concurrent node failures without losing access to Management or data. In saying that, with “Block Awareness”, an entire block (up to four nodes) can fail and the cluster still maintains full functionality. This puts the resiliency of data and management components on XCP up to N+4.

In addition, the larger the XCP cluster, the lower the impact of a node/controller/component failure. For a four node environment, N-1 is a 25% impact, whereas for an 8 node cluster N-1 is just a 12.5% impact. In contrast, when a dual controller SAN suffers a single controller failure, the impact in many cases is a 50% degradation, and a subsequent failure would result in an outage. Nutanix XCP environments self heal, so even for an environment only configured for N+1, it is possible that, following a self heal, subsequent failures can be tolerated without causing high impact or outages.
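The percentages above are simply the failed share of the controllers. A minimal sketch, assuming every node/controller contributes equally (the function name is illustrative):

```python
def failure_impact(controllers, failed=1):
    """Percentage of cluster resources lost when `failed` controllers fail."""
    if not 0 <= failed <= controllers:
        raise ValueError("failed must be between 0 and controllers")
    return 100.0 * failed / controllers


for size in (2, 4, 8, 16, 32):
    print(f"{size:>2} controllers: {failure_impact(size):.2f}% impact")
```

Two controllers corresponds to the dual controller SAN case (50%), while four and eight nodes give the 25% and 12.5% figures quoted above.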

In the event the Acropolis Master instance fails, full functionality will return to the environment after an election which completes in under 30 seconds. This equates to management availability greater than “six nines” (99.9999%). Importantly, AHV has this management resiliency built-in; it requires zero configuration!
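The “six nines” figure can be sanity checked with simple arithmetic. The sketch below assumes a 365 day year and, as an illustrative assumption that the article does not state, a single Master failover per year lasting the full 30 seconds.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds


def downtime_budget(target_availability):
    """Seconds of downtime per year allowed by an availability target."""
    return SECONDS_PER_YEAR * (1.0 - target_availability)


def availability(outage_seconds_per_year):
    """Availability achieved for a given total outage time per year."""
    return 1.0 - outage_seconds_per_year / SECONDS_PER_YEAR


budget = downtime_budget(0.999999)   # allowance for six nines
one_election = availability(30.0)    # assumed: one 30 second election per year
print(f"budget: {budget:.1f}s, availability: {one_election:.7%}")
```

Six nines allows about 31.5 seconds of downtime per year, so one election of under 30 seconds fits within that budget. More than one Master failure in the same year would exceed it under this simple model.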

For more information see: Acropolis: Scalability

As for data availability, regardless of hypervisor the Nutanix Distributed Storage Fabric (DSF) maintains two or three copies of data/parity and in the event of an SSD/HDD or node failure, the configured RF is restored by all nodes within the cluster.

Data Resiliency

While we have just covered why resiliency of data is not the only important factor, it is still key. After all, if a solution which provides shared storage loses data, it’s not fit for purpose in any datacenter.

As data resiliency is such a foundation to the Nutanix Distributed Storage Fabric, the Data resiliency status is displayed on the Prism Home Screen. In the below screenshot we can see that the ability to provide resiliency in steady state and the ability to rebuild in the event of a failure (Rebuild Capacity) are both tracked.

In this example, all data in the cluster is compliant with the configured Resiliency Factor (RF2 or 3) and the cluster has at least N+1 available capacity to rebuild after the loss of a node.

dataresiliency1

To dive deeper into the resiliency status, simply click on the above box and it will expand to show more granular detail of the failures which can be tolerated.

The below screenshot shows that things like Metadata, OpLog (Persistent Write Cache) and back end functions such as Zookeeper are also monitored, with alerts raised when required.

resiliency2

In the event any of these is not in a normal or “Green” state, Prism will alert the administrator. In the event the alert is caused by a node failure, Prism automatically notifies Nutanix support (via Pulse) and dispatches the required part/s, although typically an XCP cluster will self-heal long before delivery of hardware, even in the case of an aggressive Hardware Maintenance SLA such as 4hr Onsite.

This is yet another example of Nutanix not being dependent on Hardware (replacement) for resiliency.

Data Integrity

Acknowledging a Write I/O to a guest operating system should only occur once the data is written to persistent media because until this point, it is possible for data loss to occur even when storage is protected by battery backed cache and uninterruptible power supplies (UPS).

The only advantage to acknowledging writes before this has occurred is performance, but what good is performance when your data lacks integrity or is lost?

Another commonly overlooked requirement of any enterprise grade storage solution is the ability to detect and recover from Silent Data Corruption. Acropolis performs checksums in software for every write AND on every read. Importantly, Nutanix is in no way dependent on the underlying hardware or any 3rd party software to maintain data integrity. All checksumming and remediation (where required) is handled natively.

Pro tip: If a storage solution does not perform checksums on Write AND Read, DO NOT use it for production data.

In the event of Silent Data Corruption (which can impact any storage device from any vendor), the checksum will fail and the I/O will be serviced from another replica which is stored on a different node (and therefore physical SSD/HDD). If a checksum fails in an environment with Erasure Coding, EC-X recalculates the data the same way as if a HDD/SSD failed and services the I/O.

In the background, the Nutanix Distributed Storage Fabric will discard the corrupted data and restore the configured Resiliency Factor from the good replica or stripe where EC-X is used.
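The read path described above (verify the checksum, fall back to another replica, then repair the bad copy) can be sketched as follows. This is a toy in-memory model, not Nutanix code: the class and function names are hypothetical, and CRC32 from Python's zlib module stands in for whatever checksum algorithm the platform actually uses, which the article does not specify.

```python
import zlib


class ChecksumError(Exception):
    """Raised when stored data no longer matches its checksum."""


class Replica:
    """Toy replica that stores each block together with its checksum."""

    def __init__(self):
        self.blocks = {}  # block_id -> (data, checksum)

    def write(self, block_id, data):
        # Checksum is calculated on every write.
        self.blocks[block_id] = (data, zlib.crc32(data))

    def read(self, block_id):
        # Checksum is verified on every read.
        data, stored = self.blocks[block_id]
        if zlib.crc32(data) != stored:
            raise ChecksumError(block_id)
        return data


def read_with_repair(replicas, block_id):
    """Serve the read from the first good replica and repair any bad ones."""
    bad = []
    for replica in replicas:
        try:
            data = replica.read(block_id)
        except ChecksumError:
            bad.append(replica)
            continue
        for corrupted in bad:
            corrupted.write(block_id, data)  # restore the good copy
        return data
    raise ChecksumError(f"no valid replica for {block_id}")


# Two replicas (RF2); silently corrupt the first one, then read.
replicas = [Replica(), Replica()]
for r in replicas:
    r.write("b1", b"hello")
replicas[0].blocks["b1"] = (b"hellp", replicas[0].blocks["b1"][1])
result = read_with_repair(replicas, "b1")
print(result)
```

The caller still receives the correct data and the corrupted replica is rewritten from the good one, which mirrors the transparent behaviour described above.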

This process is completely transparent to the virtual machine and end user, but is a critical component of the XCP’s resiliency. The underlying Distributed Storage Fabric (DSF) also automatically protects all Acropolis management components. This is an example of one of the many advantages of the Acropolis architecture, where all components are built together, not bolted on afterwards.

An Acropolis environment with a container configured with RF3 (Replication Factor 3) provides N+2 management availability. As a result, it would take the extraordinarily unlikely event of three concurrent node failures before a management outage could potentially occur. Luckily XCP has an answer for this unlikely scenario as well: Block Awareness is a capability where, with 3 or more blocks, the cluster can tolerate the failure of an entire block (up to 4 nodes) without causing data or management to go offline.

Part of the Acropolis story around resiliency goes back to the lack of complexity. Acropolis enables rolling 1-click upgrades and includes all functionality. There is no single point of failure; in the worst-case scenario if the node with Acropolis master fails, within 30 seconds the Master role will restart on a surviving node and initiate VMs to power on. Again this is in-built functionality, not additional or 3rd party solutions which need to be designed/installed & maintained.

The above points are largely functions of the XCP rather than AHV itself, so I thought I would highlight AHV’s Load Balancing and failover capabilities.

Unlike traditional 3-tier infrastructure (i.e.: SAN/NAS) Nutanix solutions do not require multi-pathing as all I/O is serviced by the local controller. As a result, there is no multi-pathing policy to choose which removes another layer of complexity and potential point of failure.

However, in the event of the local CVM being unavailable for any reason, we need to service I/O for all the VMs on the node in the most efficient manner. AHV does this by redirecting I/O on a per vDisk level to a random remote Stargate instance as shown below.

pervmpathfailover

AHV can do this because every vdisk is presented via iSCSI and is its own target/LUN which means it has its own TCP connection. What this means is a business critical application such as MS SQL / Exchange or Oracle with multiple vDisks will be serviced by multiple controllers concurrently.

As a result all VM I/O is load balanced across the entire Acropolis cluster which ensures no single CVM becomes a bottleneck and VMs enjoy excellent performance even in a failure or maintenance scenario.
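As a rough illustration of per vDisk redirection, the sketch below assigns each vDisk independently to a randomly chosen remote Stargate. The function and the naming of vDisks and CVMs are hypothetical; the real failover logic lives inside AHV and is not shown in this article.

```python
import random


def redirect_vdisks(vdisks, remote_stargates, rng=random):
    """Map each vDisk to a randomly chosen remote Stargate instance.

    Because every vDisk is its own iSCSI target with its own TCP
    connection, each one can be redirected independently.
    """
    if not remote_stargates:
        raise ValueError("no remote Stargate instances available")
    return {vdisk: rng.choice(remote_stargates) for vdisk in vdisks}


# A SQL VM with four vDisks on a node whose local CVM is unavailable.
vdisks = ["sql-os", "sql-data1", "sql-data2", "sql-log"]
remotes = ["cvm-2", "cvm-3", "cvm-4"]
paths = redirect_vdisks(vdisks, remotes, rng=random.Random(0))
for vdisk, cvm in paths.items():
    print(f"{vdisk} -> {cvm}")
```

Since the choice is made per vDisk rather than per VM or per host, a VM with several vDisks tends to be spread over several controllers, which is where the load balancing effect comes from.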

For more information see: Acropolis Hypervisor (AHV) I/O Failover & Load Balancing

Summary:

  1. Out of the box self healing capabilities for:
    1. SSD/HDD/Node failure/s
    2. Acropolis and PRISM (Management layer)
  2. In-Built Data Integrity with software based checksums
  3. Ability to tolerate up to 4 concurrent node failures
  4. Management availability of >99.9999% (Six “Nines”)
  5. No dependency on Hardware for data or management resiliency

For more information see: Ensuring Data Integrity with Nutanix – Part 2 – Forced Unit Access (FUA) & Write Through

How to Architect a VSA, Nutanix or VSAN solution for >=N+1 availability.

How to architect a VSA, Nutanix or VSAN solution for the desired level of availability (i.e.: N+1, N+2 etc.) is a question I am asked regularly by customers and contacts throughout the industry.

This needs to be addressed in two parts.

1. Compute
2. Storage

Firstly, Compute level resiliency. As a cluster grows, the chance of a failure increases, so the number of hosts reserved for HA should increase with the size of the cluster.

My rule of thumb (which is quite conservative) is as follows:

1. N+1 for clusters of up to 8 hosts
2. N+2 for clusters of >8 hosts but <=16
3. N+3 for clusters of >16 hosts but <=24
4. N+4 for clusters of >24 hosts but <=32
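The rule of thumb above reserves one host for every eight hosts (or part thereof), which can be expressed as a small helper. The function names are illustrative, and the percentages it produces are plain host ratios, so they may be rounded differently from a vendor or vSphere admission control setting.

```python
import math


def ha_reserved_hosts(cluster_size):
    """Hosts to reserve for HA: N+1 up to 8 hosts, N+2 up to 16, and so on."""
    if not 2 <= cluster_size <= 32:
        raise ValueError("rule of thumb covers clusters of 2 to 32 hosts")
    return math.ceil(cluster_size / 8)


def ha_reserved_percent(cluster_size):
    """Percentage of cluster resources reserved for HA."""
    return 100.0 * ha_reserved_hosts(cluster_size) / cluster_size


for size in (4, 8, 9, 16, 17, 24, 25, 32):
    print(f"{size:>2} hosts: N+{ha_reserved_hosts(size)} "
          f"({ha_reserved_percent(size):.1f}% reserved)")
```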

The above is discussed in more detail in: Example Architectural Decision – High Availability Admission Control Setting and Policy

The below table highlights in Green my recommended HA percentage configuration based on the cluster size, up to the current vSphere limit of 32 nodes.

HApercentages

Some of you may be thinking, if my Nutanix or VSAN cluster is only configured for RF2 or FT1 for VSAN, I can only tolerate one node failure, so why am I reserving more than N+1?

In the case of Nutanix, after a node failure, the cluster can restore itself to a fully resilient state and tolerate subsequent failures. In fact, with “Block Awareness” a full 4 node block can be lost (so an N-4 situation). If tolerating this is a requirement, it needs to be considered for HA admission control reservations to ensure compute level resources are available to restart VMs.

Next let’s talk about the issue perceived to be more complicated: Storage redundancy.

Storage redundancy for VSA, Nutanix or VSAN is actually not as complicated as most people think.

The following is my rule of thumb for sizing.

For N+1, ensure you have enough capacity remaining in the cluster to tolerate the largest node failing.

For N+2, ensure you have enough capacity remaining in the cluster to tolerate the largest TWO nodes failing.

The examples below discuss Nutanix nodes and their capacity, but the same is applicable to any VSA or VSAN solution where multiple copies of data are kept for data protection, as opposed to RAID.

Example 1: If you have 4 x Nutanix NX3060 nodes configured with RF2 (FT1 in VSAN terms) with 2TB usable per node (as shown below), in the event of a node failure, 2TB is no longer available. So the maximum storage utilization of the cluster should be <75% (6TB) to ensure that in the event of any node failure, the cluster can be restored to a fully resilient state.

4node3060

Example 2: If you have 2 x Nutanix NX3060 nodes configured with RF2 (FT1 in VSAN terms) with 2TB usable per node and 2 x Nutanix NX6060 nodes with 8TB usable per node (as shown below), in the event of a NX6060 node failure, 8TB is no longer available. So the maximum storage utilization of the cluster should be 12TB (60% of the 20TB total) to ensure that in the event of any node failure (including the 8TB nodes), the cluster can be restored to a fully resilient state.

4nodemixed
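Both examples follow the same arithmetic: total usable capacity minus the capacity of the largest node (or largest two nodes for N+2). A minimal sketch, with an illustrative function name and capacities expressed as usable TB per node:

```python
def max_safe_utilization(node_capacities_tb, failures_to_tolerate=1):
    """Maximum usable TB that still allows a full rebuild after failures.

    Assumes the worst case, i.e. that the largest nodes are the ones to fail.
    """
    if failures_to_tolerate >= len(node_capacities_tb):
        raise ValueError("cannot tolerate the loss of every node")
    largest = sorted(node_capacities_tb, reverse=True)[:failures_to_tolerate]
    return sum(node_capacities_tb) - sum(largest)


example_1 = max_safe_utilization([2, 2, 2, 2])   # 4 x 2TB nodes
example_2 = max_safe_utilization([2, 2, 8, 8])   # 2 x 2TB + 2 x 8TB nodes
print(example_1, example_2)
```

This gives 6TB for Example 1 and 12TB for Example 2, matching the figures above. Passing failures_to_tolerate=2 applies the N+2 rule.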

For environments using Nutanix RF3 (3 copies of data) or VSAN (FT2) the same rule of thumb applies but the usable capacity per node would be lower due to the increased capacity required for data protection.

Specifically for Nutanix environments, the PRISM UI shows if a cluster has sufficient capacity available to tolerate a node failure, and if not the following is displayed on the HOME screen and alerts can be sent if desired.

CapacityCritical

In this case, the cluster has suffered a node failure, and because it was sized suitably, it shows “Rebuild Capacity Available” as “Yes” and advises an “Auto Rebuild in progress” meaning the cluster is performing a fully automated self heal. Importantly no admin intervention is required!

If the cluster status is normal, the following will be shown in PRISM.

CapacityOK

In summary: The smaller the cluster, the higher the percentage of capacity that needs to remain unused to enable resiliency to be restored in the event of a node failure, the same as the percentage of resources reserved for HA in a traditional compute only cluster.

The larger the cluster from both a storage and compute perspective, the lower the percentage of unused capacity required for HA, so as has been a virtualization recommended practice for years... Scale-out!

Related Articles:

1. Scale Out Shared Nothing Architecture Resiliency by Nutanix

2. PART 1 – Problems with RAID and Object Based Storage for data protection

3. PART 2 – Problems with RAID and Object Based Storage for data protection