Nutanix Resiliency – Part 8 – Node failure rebuild performance with RF3 & Erasure Coding (EC-X)

As discussed in Part 1 for RF2 and Part 3 for RF3, a critical factor when discussing the resiliency of ADSF is the speed at which compliance with the configured Resiliency Factor can be restored in the event of a drive or node failure.

Let’s do a quick recap of Parts 1 and 3, and then look at an example of the performance of ADSF for a node failure when RF3 with Erasure Coding (EC-X) is used.

Because the rebuild operation (regardless of the configured Resiliency Factor or data reduction such as EC-X) is fully distributed across all nodes and drives (i.e.: a many-to-many operation), it’s both very fast and the workload per node is minimised, avoiding bottlenecks and reducing the impact on running workloads.
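To illustrate the many-to-many idea, here is a minimal sketch in Python (a deliberate simplification, not actual Stargate/Curator logic) of how the re-replication work for a failed node’s data can be spread across every surviving node, so each node handles only a small fraction of the rebuild:

```python
import itertools

surviving_nodes = [f"node-{n:02d}" for n in range(1, 15)]  # 14 survivors of a 15-node cluster
lost_extents = [f"extent-{i}" for i in range(1000)]        # replicas lost with the failed node

# Spread the re-replication work round-robin across every surviving node.
assignments = {node: [] for node in surviving_nodes}
for extent, node in zip(lost_extents, itertools.cycle(surviving_nodes)):
    assignments[node].append(extent)

# Each node repairs only ~1/14th of the lost data, all in parallel.
print({node: len(work) for node, work in assignments.items()})
```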

Remember, rebuild performance depends on many factors including the size of the cluster, the number/type of drives (e.g.: NVMe, SATA-SSD, DAS-SATA) as well as the CPU generation and network connectivity. With this in mind, I thought I would give an example using the following hardware.

The test bed is a 15-node cluster with a mix of almost five-year-old hardware including NX-6050 and NX-3050 nodes using Ivy Bridge 2560 processors (launched Q3, 2013), each with 6 x SATA-SSDs ranging in size and 2 x 10Gb network connectivity.

Note: As Erasure Coding requires more computational overhead than RF2 or RF3, faster processors would make a significant difference to the rebuild rate, as they are used to calculate the parity, whereas Resiliency Factor simply copies replicas (i.e.: no parity calculation required).
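To illustrate the difference, below is a deliberately simplified sketch using single-parity XOR (not the actual EC-X encoding). It shows why erasure coding consumes CPU cycles computing parity while Resiliency Factor just copies bytes:

```python
def rf_replica(block: bytes) -> bytes:
    # Resiliency Factor: protection is just another verbatim copy,
    # written to one (RF2) or two (RF3) other nodes. No computation.
    return block

def xor_parity(stripe: list[bytes]) -> bytes:
    # Erasure coding: parity must be computed across the stripe members,
    # which is where the extra CPU cycles go.
    parity = bytearray(len(stripe[0]))
    for block in stripe:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

print(xor_parity([b"\x0f\x0f", b"\xf0\xf0"]).hex())  # 'ffff'
```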

For this test, the cluster was configured with RF3 and Erasure Coding.

 

As with previous tests, the node failure is simulated using the IPMI interface and the “Power off server – immediate” option as shown below. This is the equivalent of pulling the power out of the back of a physical server.

[Screenshot: IPMI “Power off server – immediate” option]
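For those who prefer the command line, the equivalent immediate power-off can typically be triggered with ipmitool (the BMC address and credentials below are placeholders): `ipmitool -H <bmc_ip> -U <user> -P <password> chassis power off`.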

 

Below is a screenshot from the Analysis tab in the Nutanix PRISM HTML5 GUI. It shows the storage pool throughput during the rebuild from the simulated node failure.

[Screenshot: Storage pool throughput during the RF3 + EC-X rebuild]

As we can see, the chart shows the rebuild peaks at 7.24GBps and sustains over 5GBps throughput until completion. The task itself took just 47 minutes, as shown below from the Chronos Master page which can be found at http://CVM_IP:2011.

[Screenshot: Chronos task duration for the node failure rebuild]
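To put those numbers in perspective, a rough back-of-envelope calculation (assuming the average held near the 5GBps sustained figure for the full 47 minutes; the 7.24GBps peak would push this higher) gives the amount of data re-protected:

```python
# Back-of-envelope only: assumes ~5GBps average for the whole task.
sustained_gbps = 5                  # GB/s sustained, from the PRISM chart
duration_s = 47 * 60                # 47 minutes, from the Chronos page
rebuilt_tb = sustained_gbps * duration_s / 1000
print(f"~{rebuilt_tb:.1f} TB re-protected")   # ~14.1 TB
```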

So in this example, we see that even with Erasure Coding (EC-X) enabled, Nutanix ADSF is able to rebuild at an extremely fast pace all while providing great capacity savings over RF3.

Summary:

  • Nutanix RF3 with or without Erasure Coding is vastly more resilient than RAID6 (or N+2) style architectures
  • ADSF performs continual disk scrubbing to detect and resolve underlying issues before they can cause data integrity issues
  • Rebuilds from drive or node failures are an efficient distributed operation using all drives and nodes in a cluster regardless of Resiliency Factor or data reduction configuration.
  • A recovery from a node failure (in this case, the equivalent of six concurrent SSD failures) with Erasure Coding can sustain over 5GBps even on five-year-old hardware.

Index:
Part 1 – Node failure rebuild performance
Part 2 – Converting from RF2 to RF3
Part 3 – Node failure rebuild performance with RF3
Part 4 – Converting RF3 to Erasure Coding (EC-X)
Part 5 – Read I/O during CVM maintenance or failures
Part 6 – Write I/O during CVM maintenance or failures
Part 7 – Read & Write I/O during Hypervisor upgrades
Part 8 – Node failure rebuild performance with RF3 & Erasure Coding (EC-X)
Part 9 – Self healing
Part 10 – Disk Scrubbing / Checksums

Nutanix | Scalability, Resiliency & Performance | Index

In my opinion there are three key areas for a solution such as the Nutanix platform to deliver consistent successful business outcomes.

They are of course Scalability, Resiliency and Performance.

If a product is strong in just one of these areas and weak in the other two, put simply it’s very unlikely to be a minimally viable product.

Even two out of three isn’t enough; you need a product which delivers all three to a high standard if you want to have a successful business outcome.

For example, if you have a product which performs well but doesn’t scale, then a customer with the potential for growth should not consider that product. A product which is very scalable but isn’t resilient is unlikely to be suitable for any production workloads, and if a solution is resilient but not scalable, then again we have major constraints which make it unattractive in most if not all scenarios.

I’ve previously written about “Things to consider when choosing infrastructure”, where I discussed avoiding silos; by combining workloads on a modern platform like Nutanix, the environment can be more performant, resilient and obviously scalable.

When you have a product or products which have all three of these qualities, you are very likely to be able to deliver a great business outcome.

I’ve previously compared “peak” performance with “real world” performance, where I concluded among other things that “peak performance is rarely a significant factor for a storage solution” and that, even though it’s very common when doing proof of concepts (POCs), you should “not waste time performing absurd testing of ‘peak performance’”.

The purpose of this series is to demonstrate how the Nutanix platform performs in all three areas and to show examples to help prospective customers make informed decisions.

These three areas are my primary focus day to day in my role as Principal Architect at Nutanix and I am working to ensure the core of the Nutanix platform continues to improve equally in all areas.

Readers of this series will notice that many of the posts refer back to each other and in many cases cover similar areas. This is deliberate, as Scalability, Resiliency and Performance are tied together and essential for an enterprise grade solution.

This series will continue to be updated as new capabilities, enhancements, etc. are released.

I hope this series is informative and helps existing/prospective customers as well as industry analysts learn more about the evolving Nutanix platform.

Scalability

Part 1 – Storage Capacity
Part 2 – Compute (CPU/RAM)
Part 3 – Storage Performance for a single Virtual Machine
Part 4 – Storage Performance for Monster VMs with AHV!
Part 5 – Scaling Storage Performance for Physical Machines

More coming soon.

Resiliency

Index:
Part 1 – Node failure rebuild performance
Part 2 – Converting from RF2 to RF3
Part 3 – Node failure rebuild performance with RF3
Part 4 – Converting RF3 to Erasure Coding (EC-X)
Part 5 – Read I/O during CVM maintenance or failures
Part 6 – Write I/O during CVM maintenance or failures
Part 7 – Read & Write I/O during Hypervisor upgrades
Part 8 – Node failure rebuild performance with RF3 & Erasure Coding (EC-X)
Part 9 – Self healing
Part 10 – Disk Scrubbing / Checksums

Bonus: Sizing Nutanix for Resiliency with Design Brews

More coming soon.

Performance

Part 1 – Coming soon.

Referenced Posts:

Things to consider when choosing infrastructure
Enterprise Architecture & Avoiding tunnel vision
Peak performance vs Real World Performance

 

Nutanix Resiliency – Part 4 – Converting RF3 to Erasure Coding (EC-X)

At this stage of the series the cluster has been converted from RF2 to RF3; now the customer wants to get some capacity efficiency without impacting resiliency. We will now look at the process and speed of enabling Erasure Coding (EC-X).

As discussed previously, Nutanix EC-X is performed as a background task and only on write-cold data, meaning the configured RF write is completed as normal and EC-X is then applied as a post process. This ensures the write path is not potentially slowed by requiring numerous nodes within the cluster to participate in the initial write I/O.
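As a rough illustration of that write path (a hypothetical simplification; the cluster API and the write-cold threshold below are invented for the example, not actual ADSF values):

```python
import time

WRITE_COLD_AFTER_S = 3600  # illustrative threshold only, not the real EC-X value

def write(block, cluster):
    # The configured RF is honoured synchronously, so write latency
    # is unaffected by EC-X.
    cluster.write_replicas(block, copies=3)   # RF3: three replicas
    block.last_written = time.time()
    return "ack"

def background_encode(cluster):
    # Post process (e.g. driven by a Curator scan): only data which
    # hasn't been written to recently is eligible for striping.
    for block in cluster.blocks:
        if time.time() - block.last_written > WRITE_COLD_AFTER_S:
            cluster.replace_replicas_with_stripe(block)  # EC-X savings realised
```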

For more detailed information on the Nutanix EC-X implementation, see my Nutanix – Erasure Coding (EC-X) Deep Dive.

What I/O does Nutanix Erasure Coding (EC-X) take effect on? Good question: only write-cold data. As such, EC-X savings are realised over time as data becomes cold and as Nutanix ADSF Curator scans perform the background conversion as a low priority task.

How fast does ADSF convert RF2/3 to EC-X stripes? In short, as fast as the underlying drives can handle, as shown in earlier parts of this series. But as EC-X is a space saving technology, it does not make sense to drive it as fast as, say, a node or drive failure operation, as no data is at risk.

So under the covers, the “ErasureCode:Encode” task is priority 3 as shown below, whereas a node/disk failure task is priority 1.

[Screenshot: “ErasureCode:Encode” tasks running at priority 3]
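A minimal sketch of the effect of those priorities (the scheduler below is hypothetical; only the task names and priority values come from the Chronos page):

```python
import heapq

# Lower number = higher priority, matching the Chronos page above.
tasks = []
heapq.heappush(tasks, (3, "ErasureCode:Encode"))             # background savings
heapq.heappush(tasks, (1, "Replicate (disk/node failure)"))  # re-protection

while tasks:
    priority, name = heapq.heappop(tasks)
    print(f"priority {priority}: {name}")
# Re-protection work is always dequeued ahead of background encoding.
```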

In cases where a customer wishes to encode their data as fast as possible, this can be achieved in many ways, including by enabling “curator maintenance mode”, which essentially removes soft limits and allows Curator to drive as much background I/O as the cluster can handle. This should only be used under the supervision of Nutanix support, so I won’t be sharing how to do it publicly.

Below is the throughput and storage pool usage after applying EC-X. The reason for the dip in the middle is that I enabled EC-X part way through a background Curator scan, which meant only some data was converted; I then manually triggered another Curator full scan, which completed the job at a very high and consistent rate of >5GBps.

[Screenshot: EC-X savings and storage pool capacity usage]

Below is the resulting capacity optimisation of 1.97:1, which is almost the theoretical maximum of 2:1. The reason EC-X was able to achieve near the maximum is that there was no active workload on the cluster during this testing, and as a result all the data was “write cold” and available for EC-X to stripe.

[Screenshot: EC-X savings after multiple completed scans]

So we have read the entire dataset of around 37.7TB and re-written the data in 4+2 EC-X stripes (4 data, 2 parity), which used around 18TB of space. This resulted in a saving of 18.54TB of physical storage.

If we do some simple math, we have 37.7TB of data read by ADSF and 18TB of EC-X stripes written, for a total of approximately 56TB of I/O serviced by ADSF in less than 90 minutes.
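Checking those figures against theory, using only the numbers above:

```python
rf3_overhead = 3          # RF3 stores 3 full copies of every byte
ecx_overhead = 6 / 4      # a 4+2 stripe stores 6 units per 4 units of data

print(rf3_overhead / ecx_overhead)   # 2.0 -> the theoretical 2:1 maximum

measured = 37.7 / (37.7 - 18.54)     # usage before EC-X / usage after (before minus savings)
print(round(measured, 2))            # ~1.97, matching the PRISM screenshot
```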

Summary:

  • Nutanix EC-X has no additional write penalty compared to RF3 ensuring optimal write performance while providing up to 2x capacity efficiency at the same resiliency level.
  • ADSF uses its distributed storage fabric to efficiently stripe data
  • Background EC-X striping is a low priority task to minimise the impact on front end virtual machine IO
  • ADSF can sustain very high throughput during EC-X (background) operations
  • Using RF3 and EC-X ensures maximum resiliency and capacity efficiency resulting in up to 66% usable capacity of RAW storage (up to 2:1 efficiency over RF3)

Index:
Part 1 – Node failure rebuild performance
Part 2 – Converting from RF2 to RF3
Part 3 – Node failure rebuild performance with RF3
Part 4 – Converting RF3 to Erasure Coding (EC-X)
Part 5 – Read I/O during CVM maintenance or failures
Part 6 – Write I/O during CVM maintenance or failures
Part 7 – Read & Write I/O during Hypervisor upgrades
Part 8 – Node failure rebuild performance with RF3 & Erasure Coding (EC-X)
Part 9 – Self healing
Part 10 – Disk Scrubbing / Checksums