Nutanix Scalability – Part 1 – Storage Capacity

It never ceases to amaze me that analysts as well as prospective/existing customers frequently are not aware of the storage scalability capabilities of the Nutanix platform.

When I joined back in 2013, a common complaint was that Nutanix had to scale in fixed building blocks of NX-3050 nodes with compute and storage regardless of what the actual requirement was.

Not long after that, Nutanix introduced the NX-1000 and NX-6000 series which had lower and higher CPU/RAM and storage capacity options which gave more flexibility, but still there were some use cases where Nutanix still had significant gaps.

In October 2013 I wrote a post titled “Scaling problems with traditional shared storage” which covers why simply adding shelves of SSD/HDD to a dual controller storage array does not scale an environment linearly, can significantly impact performance and add complexity.

At .NEXT 2015, Nutanix announced the ability to Scale Storage separately to Compute which allowed customers to scale capacity by adding similar to a shelf of drives like they could with their legacy SAN/NAS, but with the added advantage of having a storage controller (the Nutanix CVM) to add additional data services, performance and resiliency.

Storage only nodes are supported with any Hypervisor but the good news in they run on Nutanix’ Acropolis Hypervisor (AHV) which means no additional hypervisor licensing if you run VMware ESXi, and storage only nodes still support all the 1-click rolling upgrades so they add no additional management overhead.

Advantages of Storage Only Nodes:

  1. Ability to scale capacity seperate to CPU/RAM like a traditional disk shelf on a storage array
  2. Ability to start small and scale capacity if/when required, i.e.: No oversizing day 1
  3. No hypervisor licensing or additional management when scaling capacity
  4. Increased data services/resiliency/performance thanks to the Nutanix Controller VM (CVM)
  5. Ability to increase capacity for hot and cold data (i.e.: All Flash and Hybrid/Storage heavy)
  6. True Storage only nodes & the way data is distributed to them is unique to Nutanix

Example use cases for Storage Only Nodes

Example 1: Increasing capacity requirement:

MS Exchange Administrator: I’ve been told by the CEO to increase our mailbox limits from 1GB to 2GB but we don’t have enough capacity.

Nutanix: Let’s start small and add storage only nodes as the Nutanix cluster (storage pool) reaches 80% utilisation.

Example 2: Increasing flash capacity:

MS SQL DBA: We’re growing our mission critical database and now we’re hitting SATA for some day to day operations, we need more flash!

Nutanix: Let’s add some all flash storage only nodes.

Example 3: Increasing resiliency

CEO/CIO: We need to be able to tolerate failures and the infrastructure self heal but we have a secure facility which is difficult and time consuming to get access too, what can we do?

Nutanix: Let’s add some storage only nodes to ensure you have enough capacity (All Flash and/or Hybrid) to ensure sufficient capacity to tolerate “n” number of failures and rebuild the environment back to a fully resilient and performant state.

Example 4: Implementing Backup / Long Term Retention

CEO/CIO: We need to be able to keep 7 years of data for regulatory requirements and we need to be able to access it within 1hr.

Nutanix: We can either add storage only nodes to one or more existing clusters OR create a dedicated Backup/Retention cluster. Let’s start with enough capacity for Year 1, and then as capacity is required, add more storage only nodes as the cost per GB drops over time. Nutanix allows mixing of hardware generations so you’ll never be in a situation where you need to rip & replace.

Example 5: Supporting one or more Monster VMs

Server Administrator: We have one or more VMs with storage capacity requirements of 100TB each, but the largest Nutanix node we have only supports 20TB. What do we do?

Nutanix: The Distributed Storage Fabric (ADSF) allows a VMs data set to be distributed throughout a Nutanix cluster ensuring any storage requirement can be met. Adding storage only nodes will ensure sufficient capacity while adding resiliency/performance to all other VMs in the cluster. Cold data will be distributed throughout the cluster while frequently accessed data will remain local where possible within the local storage capacity on the node where the VM runs.

For more information on this use case see: What if my VMs storage exceeds the capacity of a Nutanix node?

Example 6: Performance for infrequently accessed data (cold data).

Server Administrator: We have always stored our cold data on SATA drives attached to our SAN because we have a lot of data and flash is expensive. One or twice a year we need to do a bulk read of our data for auditing/accounting purposes but it’s always been so slow. How can we solve this problem and give good performance while keeping costs down?

Nutanix: Hybrid Storage only nodes are a cost effective way to store cold data and combined with ADSF, Nutanix is able to deliver optimum read performance from SATA by reading from the replica (copy of data) with the lowest latency.

This means if a HDD or even a node is experiencing heavy load, ADSF will dynamically redirect Read I/O throughout the cluster to Deliver Increased Read Performance from SATA. This capability was released in 2015 and storage only nodes adding more spindles to a cluster is very complimentary to this capability.

Frequently asked questions (FAQ):

  1. How many storage only nodes can a single cluster support?
    1. There is no hard limit, typically cluster sizes are less than 64 nodes as it’s important to consider limiting the size of a single failure domain.
  2. How many Compute+Storage nodes are required to use Storage Only nodes?
    1. Two. This also allows N+1 failover for the nodes running VMs in the event a compute+storage node failed so VMs can be restarted. Technically, you can create a cluster with only storage only nodes.
  3. How does adding storage only node increase capacity for my monster VM?
    1. By distributing replicas of data throughout the cluster, thus freeing up local capacity for the running VM/s on the local node. Where a VMs storage requirement exceeds the local nodes capacity, storage only nodes add capacity and performance to the storage pool. Note: One VM even with only one monster vDisk can use the entire capacity of a Nutanix cluster without any special configuration.

Summary:

For many years Nutanix has supported and recommended the use of Storage only nodes to add capacity, performance and resiliency to Nutanix clusters.

Back to the Scalability, Resiliency and Performance Index.

Why Nutanix Acropolis hypervisor (AHV) is the next generation hypervisor – Part 5 – Resiliency

When discussing resiliency, it is common to make the mistake of only looking at data resiliency and not considering resiliency of the storage controllers and the management components required to service the business applications.

Legacy technologies such as RAID and Hot Spare drives may in some cases provide high resiliency for data, however if they are backed by a dual controller type setups which cannot scale out and self heal, the data may be unavailable or performance/functionality severely degraded following even a single component failure. Infrastructure that is dependant on HW replacement to restore resiliency following a failure is fundamentally flawed as I have discussed in: Hardware support contracts & why 24×7 4 hour onsite should no longer be required.

In addition if the management application layer is not resilient, then data layer high-availability/resiliency may be irrelevant as the business applications may not be functioning properly (i.e.: At normal speeds) or at all.

The Acropolis platform provides high resiliency for both the data and management layers at a configurable N+1 or N+2 level (Resiliency Factor 2 or 3) which can tolerate up to two concurrent node failures without losing access to Management or data. In saying that, with “Block Awareness”, an entire block (up to four nodes) can fail and the cluster still maintains full functionality. This puts the resiliency of data and management components on XCP up to N+4.

In addition, the larger the XCP cluster, the lower the impact of a node/controller/component failure. For a four node environment, N-1 is 25% impact whereas for an 8 node cluster N-1 is just a 12.5% impact. The larger the cluster the lower the impact of a controller/node failure. In contrast a dual controller SAN has a single controller failure, and in many cases the impact is 50% degradation and a subsequent failure would result in an outage. Nutanix XCP environments self heal so that even for an environment only configured for N-1, it is possible following a self heal than subsequent failures can be tolerated without causing high impact or outages.

In the event the Acropolis Master instance fails, full functionality will return to the environment after an election which completes within <30 seconds. This equates to management availability greater than “six nines” (99.9999%). Importantly, AHV has this management resiliency built-in; it requires zero configuration!

For more information see: Acropolis: Scalability

As for data availability, regardless of hypervisor the Nutanix Distributed Storage Fabric (DSF) maintains two or three copies of data/parity and in the event of a SSD/HDD or node failure, the configured RF is restored by all nodes within the cluster.

Data Resiliency

While we have just covered why resiliency of data is not the only important factor, it is still key. After all, if a solution which provides shared storage looses data, its not fit for purpose in any datacenter.

As data resiliency is such a foundation to the Nutanix Distributed Storage Fabric, the Data resiliency status is displayed on the Prism Home Screen. In the below screenshot we can see is that the ability to provide resiliency in both steady state and in the event of a failure (Rebuild Capacity) are both tracked.

In this example, all data in the cluster is compliant with the configured Resiliency Factor (RF2 or 3) and the cluster has at least N+1 available capacity to rebuild after the loss of a node.

dataresiliency1

To dive deeper into the resiliency status, simply click on the above box and it will expand to show more granular detail of the failures which can be tolerated.

The below screen shot shows things like Metadata, OpLog (Persistent Write Cache) and back end functions such as Zookeeper are also monitored and alerted when required.

resiliency2

In the event either of these is not in a normal or “Green” state, PRISM will alert the administrator. In the event the alert is the cause of a node failure, Prism automatically notifies Nutanix support (via Pulse) and dispatches the required part/s, although typically an XCP cluster will self-heal long before delivery of hardware even in the case of an aggressive Hardware Maintenance SLA such as 4hr Onsite.

This is yet another example of Nutanix not being dependent on Hardware (replacement) for resiliency.

Data Integrity

Acknowledging a Write I/O to a guest operating system should only occur once the data is written to persistent media because until this point, it is possible for data loss to occur even when storage is protected by battery backed cache and uninterruptible power supplies (UPS).

The only advantage to acknowledging writes before this has occurred is performance, but what good is performance when your data lacks integrity or is lost?

Another commonly overlooked requirement of any enterprise grade storage solution is the ability to detect and recover from Silent Data Corruption. Acropolis performs checksums in software for every write AND on every read. Importantly Nutanix is in no way dependent on the underlying hardware or any 3rd party software to maintain data integrity, all check summing and remediation (where required) is handled natively.

Pro tip: If a storage solution does not perform checksums on Write AND Read, DO NOT use it for production data.

In the event of Silent Data Corruption (which can impact any storage device from any vendor), the checksum will fail and the I/O will be serviced from another replica which is stored on a different node (and therefore physical SSD/HDD). If a checksum fails in an environment with Erasure Coding, EC-X recalculates the data the same way as if a HDD/SSD failed and services the I/O.

In the background, the Nutanix Distributed Storage Fabric will discard the corrupted data and restore the configured Resiliency Factor from the good replica or stripe where EC-X is used.

This process is completely transparent to the virtual machine and end user, but is a critical component of the XCP’s resiliency. The underlying Distributed Storage Fabric (DFS) also automatically protects all Acropolis management components, this is an example of one of the many advantages of the Acropolis architecture where all components are built together, not bolted on afterwards.

An Acropolis environment with a container configured with RF3 (Replication Factor 3) provides N+2 management availability. As a result, it would take an extraordinarily unlikely failure of three concurrent node failures before a management outage could potentially occur. Luckily XCP has an answer for this albeit unlikely scenario as well, Block Awareness is a capability where with 3 or more blocks the cluster can tolerate the failure of an entire block (up to 4 nodes) without causing data or management to go offline.

Part of the Acropolis story around resiliency goes back to the lack of complexity. Acropolis enables rolling 1-click upgrades and includes all functionality. There is no single point of failure; in the worst-case scenario if the node with Acropolis master fails, within 30 seconds the Master role will restart on a surviving node and initiate VMs to power on. Again this is in-built functionality, not additional or 3rd party solutions which need to be designed/installed & maintained.

The above points are largely functions of the XCP rather than AHV itself, so I thought I would highlight a AHV’s Load Balancing and failover capabilities.

Unlike traditional 3-tier infrastructure (i.e.: SAN/NAS) Nutanix solutions do not require multi-pathing as all I/O is serviced by the local controller. As a result, there is no multi-pathing policy to choose which removes another layer of complexity and potential point of failure.

However in the event of the local CVM being unavailable for any reason we need to service I/O for all the VMs on the node in the most efficient manner. AHV does this by redirecting I/O on a per vDisk level to a random remote stargate instance as shown below.

pervmpathfailover

AHV can do this because every vdisk is presented via iSCSI and is its own target/LUN which means it has its own TCP connection. What this means is a business critical application such as MS SQL / Exchange or Oracle with multiple vDisks will be serviced by multiple controllers concurrently.

As a result all VM I/O is load balanced across the entire Acropolis cluster which ensures no single CVM becomes a bottleneck and VMs enjoy excellent performance even in a failure or maintenance scenario.

For more information see: Acropolis Hypervisor (AHV) I/O Failover & Load Balancing

Summary:

  1. Out of the box self healing capabilities for:
    1. SSD/HDD/Node failure/s
    2. Acropolis and PRISM (Management layer)
  2. In-Built Data Integrity with software based checksums
  3. Ability to tolerate up to 4 concurrent node failures
  4. Management availability of >99.9999 (Six “Nines”)
  5. No dependency on Hardware for data or management resiliency

For more information see: Ensuring Data Integrity with Nutanix – Part 2 – Forced Unit Access (FUA) & Write Through

Back to the Index

Acropolis: Scalability

One of the major focuses for Nutanix both for our Distributed Storage Fabric (part of the Nutanix Xtreme Computing Platform or XCP) has been scalability with consistent performance.

Predictable scalability is critical to any distributed platform as it predictable scalability for the management layer.

This is one of the many strengths of the Acropolis management layer.

All components which are required to Configure, Manage, Monitor, Scale and Automate are fully distributed across all nodes within the cluster.

As a result, there is no single point of failure with the Nutanix/Acropolis management layer.

Lets take a look at a typical four node cluster:

Below we see four Controller VMs (CVMs) which service one node each. In the cluster we have an Acropolis Master along with multiple Acropolis Slave instances.

Acropolis4nodecluster1

In the event the Acropolis Master becomes unavailable for any reason, an election will take place and one of the Acropolis Slaves will be promoted to Master.

This can be achieved because Acropolis data is stored in a fully distributed Cassandra database which is protected by the Distributed Storage Fabric.

When an additional Nutanix node is added to the cluster, an Acropolis Slave is also added which allows the workload of managing the cluster to be distributed, therefore ensuring management never becomes a point of contention.Acropolis5NodeCluster

Things like performance monitoring, stats collection, Virtual Machine console proxy connections are just a few of the management tasks which are serviced by Master and Slave instances.

Another advantage of Acropolis is that the management layer never needs to be sized or scaled manually. There is no vApp/s , Database Server/s, Windows instances to deploy, install, configure, manage or license, therefore reducing cost and simplifying management of the environment.

Summary:

Acropolis Management is automatically scaled as nodes are added to the cluster, therefore increasing consistency , resiliency, performance and eliminating potential for architectural (sizing) errors which may impact manageability.

Note: For non-Acropolis deployments, PRISM is also scaled in the same manner as described above, however the scalability of Hypervisor management layers such as vCenter or SCVMM will need to be considered separately when not using Acropolis.