Nutanix Resiliency – Part 10 – Disk Scrubbing / Checksums

In this series I’ve covered a wide range of topics showing how resilient the Nutanix platform is, including its ability to maintain data integrity for new writes during failures and to rebuild after a failure quickly enough to minimise the risk of a subsequent failure causing problems.

Despite all of this, competing vendors still try to discredit the data integrity that Nutanix provides with claims such as “rebuild performance doesn’t matter if both copies of data are lost”. This is an overly simplistic way to look at things, since the chance of both copies of data being lost is extremely low, and of course Nutanix supports RF3 for customers who wish to store three copies of data for maximum resiliency.

So let’s get into Part 10, where we cover two critical topics, Disk Scrubbing and Checksums, both of which help ensure RF2 and RF3 deployments are extremely resilient and highly unlikely to experience scenarios where data could be lost.

Let’s start with Checksums: what are they?

A checksum is a small amount of data created during a write operation which can later be read back to verify if the actual data is intact (i.e.: not corrupted).
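To illustrate the concept, here is a minimal sketch of the general technique, not the AOS implementation; the CRC32 function and dictionary structure are assumptions purely for the example. A checksum is computed and stored at write time, then recomputed and compared on read:

```python
import zlib

def write_extent(data: bytes) -> dict:
    # Compute a small checksum at write time and store it alongside the data.
    return {"data": data, "checksum": zlib.crc32(data)}

def read_extent(stored: dict) -> bytes:
    # Recompute the checksum on read; a mismatch means the data is no longer intact.
    if zlib.crc32(stored["data"]) != stored["checksum"]:
        raise IOError("Checksum mismatch: data is corrupted")
    return stored["data"]
```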

Disk scrubbing, on the other hand, is a background task which periodically checks the data for consistency; if any errors are detected, disk scrubbing initiates an error correction process to fix single correctable errors.

Nutanix performs checksums for every write operation (RF2 or RF3) and verifies the checksum for every read operation! This means data integrity is part of the IO path and cannot be skipped or turned off.

Data integrity is the number one priority for any storage platform, which is why Nutanix does not, and will never, provide an option to turn checksums off.

Since Nutanix performs a checksum on every read, the data being accessed is always being checked. If any form of corruption has occurred, Nutanix AOS automatically retrieves the data from the RF copy to service the IO and concurrently corrects the error/corruption to ensure subsequent failures do not cause data loss.
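A conceptual sketch of that read path is shown below. It is purely illustrative and reuses the hypothetical replica structure from the earlier example rather than anything resembling the actual AOS code: the IO is serviced from an intact copy and any corrupt copy is re-protected from it.

```python
import zlib

def verified_read(replicas: list) -> bytes:
    """Service a read from the first replica that passes checksum verification,
    repairing any corrupt replicas found along the way (illustrative only)."""
    good, corrupt = None, []
    for replica in replicas:
        if zlib.crc32(replica["data"]) == replica["checksum"]:
            good = good or replica
        else:
            corrupt.append(replica)
    if good is None:
        raise IOError("All replicas failed checksum verification")
    for replica in corrupt:
        # Re-protect the corrupt copy from the intact one so a later failure cannot cause data loss.
        replica["data"], replica["checksum"] = good["data"], good["checksum"]
    return good["data"]
```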

The speed at which Nutanix can rebuild from a node/drive or extent (1MB block of data) failure is critical to maintaining data integrity.

But what about cold data?

Many environments have huge amounts of cold data, meaning it’s not being accessed frequently, so the checksum-on-read operation won’t be checking that data as frequently, if at all. So how do we protect that data?

Simple, Disk Scrubbing.

For data which has not been accessed via front end read operations (i.e.: Reads from a VM/app), the Nutanix implementation of disk scrubbing checks cold data once per day.

The disk scrubbing task is performed concurrently across all drives in the cluster, so the chance of multiple concurrent failures occurring, such as a drive failure combined with a corrupted extent (1MB block of data) where both drives are storing the same data, is extremely low, and that’s assuming you’re using RF2 (two copies of data).

The failures would need to be timed so perfectly that no read operation had occurred on that extent in the last 24hrs, AND background disk scrubbing had not been performed on either copy of the data, AND Nutanix AOS predictive drive failure had not detected a degrading drive and already proactively re-protected the data.

Now assuming that scenario arose, the failed drive would also have to be storing the exact same extent as the corrupted data block, which even in a small four node cluster such as an NX-3460, with 24 drives, makes the probability extremely low. The larger the cluster, the lower the chance of this already unlikely scenario, and the faster the cluster can rebuild, as we’ve learned earlier in the series.
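As a purely illustrative back-of-envelope calculation (the even data distribution, the independence of failures and the input probabilities are my own assumptions, not Nutanix figures), the overlap of an undetected corrupt extent and a failed drive holding its only other copy can be estimated like this:

```python
# Illustrative estimate only; assumes replicas are evenly distributed across drives
# and that the corruption and the drive failure are independent events.
drives = 24                          # e.g. a 4 node cluster with 6 drives per node
p_corrupt_unrepaired = 1e-4          # assumed chance an extent is corrupt and not yet read or scrubbed in 24hrs
p_drive_failure_in_window = 1e-3     # assumed chance a given drive fails within the same 24hr window

# For RF2, the failed drive must be the one holding the only other copy of that extent.
p_replica_on_failed_drive = 1 / (drives - 1)

p_data_loss = p_corrupt_unrepaired * p_drive_failure_in_window * p_replica_on_failed_drive
print(f"{p_data_loss:.1e}")          # ~4.3e-09 with these assumed inputs
```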

If you still feel this is too high a risk and believe all those events will line up perfectly, then deploy RF3, and all of the above would now have to align in addition to three concurrent failures occurring for you to experience data loss.

For those of you who have deployed vSAN, disk scrubbing is only performed once a year, AND VMware frequently recommend turning checksums off, including in their SAP HANA documentation (which has subsequently been updated after I called them out), putting customers at a high and unnecessary risk of data loss.

Nutanix also has the ability to monitor background disk scrubbing activity. The screenshot below shows the scan stats for Disk 126, which in this environment is a 2TB SATA drive at around 75% utilisation.

[Screenshot: Disk scrubbing scan stats for Disk 126]

AOS ensures disk scrubbing occurs at a speed which guarantees the entire disk, regardless of size, is scrubbed within 24 hours. As per the above screenshot, this scan has been running for 48158724ms (roughly 13.3hrs) with 556459ms (0.15hrs) ETA to complete.
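The unit conversion, and the implied scrub rate (which is my own derived figure based on the 2TB drive at ~75% utilisation mentioned above), can be sanity checked in a couple of lines:

```python
elapsed_ms, eta_ms = 48158724, 556459
elapsed_hours = elapsed_ms / 1000 / 3600    # ~13.38 hours so far
eta_hours = eta_ms / 1000 / 3600            # ~0.15 hours remaining

# Roughly 1.5TB of used capacity scrubbed in ~13.5 hours implies ~32MB/s of background scrubbing.
scrub_rate_mb_per_s = 1.5 * 1024 * 1024 / ((elapsed_ms + eta_ms) / 1000)
print(round(elapsed_hours, 2), round(eta_hours, 2), round(scrub_rate_mb_per_s))
```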

[Screenshot: Scan duration showing elapsed time and ETA]

If you combine the distributed nature of the Acropolis Distributed Storage Fabric (ADSF), where data is dynamically spread evenly based on capacity and performance, a cluster’s ability to tolerate multiple concurrent drive failures per node, checksums being performed on every read/write operation, disk scrubbing being completed every day, proactive monitoring of HDD/SSD health which in many cases re-protects data before a drive fails, as well as the sheer speed at which ADSF can rebuild data following failures, it’s easy to see why even Resiliency Factor 2 (RF2) provides excellent resiliency.

Still not satisfied? Change the Resiliency Factor to 3 (RF3) and you add yet another layer of protection for the workloads you choose to enable RF3 for.

When considering your Resiliency Factor, or Failures to Tolerate in vSAN language, do not make the mistake of thinking two copies of data on Nutanix and vSAN are equivalent. Nutanix RF2 is vastly more resilient than FTT1 (2 copies) on vSAN, which is why VMware frequently recommend FTT2 (3 copies of data). This actually makes sense because of the following reasons:

  1. vSAN is not a distributed storage fabric
  2. vSAN rebuild performance is slow and high impact
  3. vSAN disk scrubbing is only performed once a year
  4. VMware frequently recommend turning checksums OFF (!!!)
  5. A single cache drive failure takes an entire disk group offline
  6. With all flash vSAN using compression and/or dedupe, a single drive failure brings down the entire disk group

Architecture matters, and anyone who takes the time to investigate beyond the marketing slides of HCI and storage products will see that Nutanix ADSF is the clear leader, especially when it comes to scalability, resiliency & data integrity.

Other companies/products are clear leaders in Marketecture (to be blunt, bullshit such as in-kernel being an advantage and 10:1 dedupe), but Nutanix leads where it matters, with a solid architecture which delivers real business outcomes.

Index:
Part 1 – Node failure rebuild performance
Part 2 – Converting from RF2 to RF3
Part 3 – Node failure rebuild performance with RF3
Part 4 – Converting RF3 to Erasure Coding (EC-X)
Part 5 – Read I/O during CVM maintenance or failures
Part 6 – Write I/O during CVM maintenance or failures
Part 7 – Read & Write I/O during Hypervisor upgrades
Part 8 – Node failure rebuild performance with RF3 & Erasure Coding (EC-X)
Part 9 – Self healing
Part 10 – Disk Scrubbing / Checksums

Nutanix Scalability – Part 1 – Storage Capacity

It never ceases to amaze me that analysts, as well as prospective and existing customers, are frequently unaware of the storage scalability capabilities of the Nutanix platform.

When I joined back in 2013, a common complaint was that Nutanix had to scale in fixed building blocks of NX-3050 nodes, with compute and storage added together regardless of the actual requirement.

Not long after that, Nutanix introduced the NX-1000 and NX-6000 series, which offered lower and higher CPU/RAM and storage capacity options and gave more flexibility, but there were still some use cases where Nutanix had significant gaps.

In October 2013 I wrote a post titled “Scaling problems with traditional shared storage” which covers why simply adding shelves of SSD/HDD to a dual controller storage array does not scale an environment linearly, can significantly impact performance and add complexity.

At .NEXT 2015, Nutanix announced the ability to scale storage separately to compute, which allowed customers to scale capacity by adding the equivalent of a shelf of drives, as they could with their legacy SAN/NAS, but with the added advantage of a storage controller (the Nutanix CVM) providing additional data services, performance and resiliency.

Storage only nodes are supported with any hypervisor, but the good news is they run Nutanix’s Acropolis Hypervisor (AHV), which means no additional hypervisor licensing if you run VMware ESXi, and storage only nodes still support all the 1-click rolling upgrades, so they add no additional management overhead.

Advantages of Storage Only Nodes:

  1. Ability to scale capacity separately from CPU/RAM like a traditional disk shelf on a storage array
  2. Ability to start small and scale capacity if/when required, i.e.: No oversizing day 1
  3. No hypervisor licensing or additional management when scaling capacity
  4. Increased data services/resiliency/performance thanks to the Nutanix Controller VM (CVM)
  5. Ability to increase capacity for hot and cold data (i.e.: All Flash and Hybrid/Storage heavy)
  6. True Storage only nodes & the way data is distributed to them is unique to Nutanix

Example use cases for Storage Only Nodes

Example 1: Increasing capacity requirement:

MS Exchange Administrator: I’ve been told by the CEO to increase our mailbox limits from 1GB to 2GB but we don’t have enough capacity.

Nutanix: Let’s start small and add storage only nodes as the Nutanix cluster (storage pool) reaches 80% utilisation.

Example 2: Increasing flash capacity:

MS SQL DBA: We’re growing our mission critical database and now we’re hitting SATA for some day to day operations, we need more flash!

Nutanix: Let’s add some all flash storage only nodes.

Example 3: Increasing resiliency

CEO/CIO: We need to be able to tolerate failures and have the infrastructure self heal, but we have a secure facility which is difficult and time consuming to get access to, what can we do?

Nutanix: Let’s add some storage only nodes (All Flash and/or Hybrid) to ensure you have sufficient capacity to tolerate “n” number of failures and rebuild the environment back to a fully resilient and performant state.

Example 4: Implementing Backup / Long Term Retention

CEO/CIO: We need to be able to keep 7 years of data for regulatory requirements and we need to be able to access it within 1hr.

Nutanix: We can either add storage only nodes to one or more existing clusters OR create a dedicated Backup/Retention cluster. Let’s start with enough capacity for Year 1, and then as capacity is required, add more storage only nodes as the cost per GB drops over time. Nutanix allows mixing of hardware generations so you’ll never be in a situation where you need to rip & replace.

Example 5: Supporting one or more Monster VMs

Server Administrator: We have one or more VMs with storage capacity requirements of 100TB each, but the largest Nutanix node we have only supports 20TB. What do we do?

Nutanix: The Distributed Storage Fabric (ADSF) allows a VM’s data set to be distributed throughout a Nutanix cluster, ensuring any storage requirement can be met. Adding storage only nodes will ensure sufficient capacity while adding resiliency/performance to all other VMs in the cluster. Cold data will be distributed throughout the cluster while frequently accessed data will remain local where possible within the local storage capacity on the node where the VM runs.
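The placement preference can be sketched conceptually as follows; this is not how AOS actually decides placement, just an illustration of keeping data local while the local node has capacity and spilling the remainder across the cluster (node names and sizes are made up):

```python
def place_extent(local_node: str, nodes_free_gb: dict, extent_gb: float = 0.001) -> str:
    """Prefer the local node while it has free capacity; otherwise place the extent
    on the node with the most free space in the cluster (illustrative only)."""
    chosen = local_node if nodes_free_gb[local_node] >= extent_gb else max(nodes_free_gb, key=nodes_free_gb.get)
    nodes_free_gb[chosen] -= extent_gb
    return chosen

# Example: a VM whose data set exceeds its node's local capacity spills the remainder
# of its extents across the other (including storage only) nodes in the cluster.
cluster = {"node1": 0.0, "node2": 15000.0, "storage-only-1": 60000.0}
print(place_extent("node1", cluster))   # -> storage-only-1
```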

For more information on this use case see: What if my VMs storage exceeds the capacity of a Nutanix node?

Example 6: Performance for infrequently accessed data (cold data).

Server Administrator: We have always stored our cold data on SATA drives attached to our SAN because we have a lot of data and flash is expensive. Once or twice a year we need to do a bulk read of our data for auditing/accounting purposes, but it’s always been so slow. How can we solve this problem and give good performance while keeping costs down?

Nutanix: Hybrid Storage only nodes are a cost effective way to store cold data and combined with ADSF, Nutanix is able to deliver optimum read performance from SATA by reading from the replica (copy of data) with the lowest latency.

This means that if an HDD or even a node is experiencing heavy load, ADSF will dynamically redirect read I/O throughout the cluster to Deliver Increased Read Performance from SATA. This capability was released in 2015, and storage only nodes, by adding more spindles to the cluster, are very complementary to it.
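Here is a minimal sketch of the idea of reading from the lowest latency replica; the structure and latency numbers are assumptions for illustration, not the ADSF implementation:

```python
def pick_replica(replica_latencies_ms: dict) -> str:
    """Return the replica location with the lowest recently observed read latency
    (illustrative only; keys map replica locations to latency in milliseconds)."""
    return min(replica_latencies_ms, key=replica_latencies_ms.get)

# Example: the local HDD is under heavy load, so the read is redirected to node3's replica.
observed = {"local-hdd": 42.0, "node3-hdd": 8.5}
print(pick_replica(observed))   # -> node3-hdd
```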

Frequently asked questions (FAQ):

  1. How many storage only nodes can a single cluster support?
    1. There is no hard limit; typically cluster sizes are less than 64 nodes, as it’s important to consider limiting the size of a single failure domain.
  2. How many Compute+Storage nodes are required to use Storage Only nodes?
    1. Two. This also allows N+1 failover for the nodes running VMs, so VMs can be restarted in the event a compute+storage node fails. Technically, you can create a cluster with only storage only nodes.
  3. How does adding storage only nodes increase capacity for my monster VM?
    1. By distributing replicas of data throughout the cluster, thus freeing up local capacity for the running VM/s on the local node. Where a VM’s storage requirement exceeds the local node’s capacity, storage only nodes add capacity and performance to the storage pool. Note: One VM, even with only one monster vDisk, can use the entire capacity of a Nutanix cluster without any special configuration.

Summary:

For many years Nutanix has supported and recommended the use of Storage only nodes to add capacity, performance and resiliency to Nutanix clusters.

Back to the Scalability, Resiliency and Performance Index.

Nutanix AOS 5.5 delivers 1M IOPS from a single VM, but what happens when you vMotion?

For many years Nutanix has been delivering excellent performance across multiple hypervisors as well as hardware platforms including the native NX series, OEMs (Dell XC & Lenovo HX) and more recently software only options with Cisco and HPE.

Recently I tweeted (below) showing how a single virtual machine can achieve 1 million 8k random read IOPS and >8GBps throughput on AHV, the next generation hypervisor.

While most of the response to this was positive, the usual negativity came from some competitors who tried to spread fear, uncertainty and doubt (FUD) about the performance, including claims that it was not sustainable during/after a live migration (vMotion) and that it does not demonstrate the performance of the IO path.

Let’s quickly cover the IO path discussion of in-kernel vs a Controller VM.

To test the IO path, which in the case of Nutanix is via the Controller VM, you want to eliminate as many variables and bottlenecks as possible. This means a read/write test is not valid, as writes are dependent on factors such as the network. As this was on a node using NVMe, the bottleneck would quickly become the network and not the path between the user VM and Controller VM.

I’ve previously tweeted (below) showing an example of the throughput capabilities of SATA SSD, NVMe and 3DxPoint which clearly shows the network is the bottleneck with next generation flash.

I’ve also responded to 3rd party FUD about Nutanix data locality with a post which goes in depth into Nutanix’s original & unique implementation of Data Locality, which is how Nutanix minimises its dependency on the network to deliver excellent performance.

So we are left with read IO to actually test and possibly stress the IO path between a user VM and the software defined storage, be that in-kernel or in user space, which is where the Nutanix CVM runs.

The tweet showing >1 million 8k random read IOPS and >8GBps throughput shows that the IO path of Nutanix is efficient enough to achieve this at just 110 microseconds (not milliseconds).
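Those numbers are internally consistent, as a quick sanity check shows (the outstanding IO count is inferred via Little’s law and is my own reasoning, not a published figure):

```python
iops = 1_000_000
io_size_bytes = 8 * 1024        # 8k random reads
latency_s = 110e-6              # 110 microseconds per IO

throughput_gb_per_s = iops * io_size_bytes / 1e9    # ~8.2 GB/s, matching the >8GBps figure
outstanding_ios = iops * latency_s                  # Little's law: ~110 IOs in flight
print(round(throughput_gb_per_s, 1), round(outstanding_ios))
```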

The next question from those who try to discredit Nutanix and HCI in general is what happens after a vMotion?

Let me start by saying this is a valid question, but even if performance dropped during/after a vMotion, is it even a major issue?

For business critical applications, it is common for vendors to recommend DRS should/must rules to prevent vMotion except in the event of maintenance or failure, regardless of the infrastructure being traditional/legacy NAS/SAN or HCI.

With a NAS/SAN, the best case scenario is 100% remote IO, whereas with Nutanix this is the worst case scenario. Let’s assume business as usual on Nutanix is 1M IOPS, and that during a vMotion, and for a few minutes after it, performance dropped by 20%.

That would still be 800k IOPS, which is higher than what most NAS/SAN solutions can deliver anyway.

But the fact is, Nutanix can sustain excellent performance during and after a vMotion, as demonstrated by the video below which was recorded in real time. Hint: Watch the values in the PuTTY session, as these show the performance as measured at the guest level, which is what ultimately matters.

Credit for the video goes to my friend and colleague Michael “Webscale” Webster (VCDX#66 & NPX#007).

The IO dropped below 1 million IOPS for approximately 3 seconds during the vMotion, with the lowest value recorded at 956k IOPS. I’d say an approximately 10% drop for 3 seconds is pretty reasonable, as the performance drop is caused by the migration stunning the VM and not by the underlying storage.

Over to our “friends” at the legacy storage vendors to repeat the same test on their biggest/baddest arrays.

Not impressed? Let’s see how a 70/30 read/write workload performs!