Nutanix Resiliency – Part 10 – Disk Scrubbing / Checksums

In this series I’ve covered a wide range of topics showing how resilient the Nutanix platform is including being able to maintain data integrity for new writes during failures and the ability to rebuild after a failure in a timely manner to minimise the risk of a subsequent failure causing problems.

Despite all of this information, competing vendors still try to discredit the data integrity that Nutanix provides with claims such as “rebuild performance doesn’t matter if both copies of data are lost” which is an overly simple way to look at things since the chance of both copies of data being lost are extremely low, and of course Nutanix supports RF3 for customers who wish to store three copies of data for maximum resiliency.

So let’s get into Part 10 where we cover two critical topics, Disk Scrubbing and Checksums both of which you will learn help ensure RF2 and RF3 deployments are extremely resilient and highly unlikely to experience scenarios where data could be lost.

Let’s start with Checksums, what are they?

A checksum is a small amount of data created during a write operation which can later be read back to verify if the actual data is intact (i.e.: not corrupted).

Disk scrubbing on the other hand is a background task which periodically checks the data for consistency and if any errors are detected, disk scrubbing initiates an error correction process to fix single correctable errors.

Nutanix performs checksums for every write operation (RF2 or RF3) and verifies the checksum for every read operation! This means that data integrity is part of the IO path and is not and cannot be skipped or turned off.

Data integrity is the number 1 priority for any storage platform which is why Nutanix does not and will never provide an option to turn checksum off.

Since Nutanix performs a checksum on read, it means that data being accessed is always being checked and if any form of corruption has occurred, Nutanix AOS automatically retrieves the data from the RF copy and services the IO and concurrently corrects the error/corruption to ensure subsequent failures do not cause data loss.

The speed at which Nutanix can rebuild from a node/drive or extent (1MB block of data) failure is critical to maintaining data integrity.

But what about cold data?

Many environments have huge amounts of cold data, meaning it’s not being accessed frequently, so the checksum on read operation wont be checking that data as frequently if at all if the data is not accessed so how do we protect that data?

Simple, Disk Scrubbing.

For data which has not been accessed via front end read operations (i.e.: Reads from a VM/app), the Nutanix implementation of disk scrubbing checks cold data once per day.

The disk scrubbing task is performed concurrently across all drives in the cluster so the chance of multiple concurrent failure occurring such as a drive failure and a corrupted extent (1MB block of data) and for those two drives to be storing the same data is extremely low and that’s assuming you’re using RF2 (two copies of data).

The failures would need to be timed so perfectly that no read operation had occurred on that extent in the last 24hrs AND background disk scrubbing had not been performed on both copies of data AND Nutanix AOS predictive drive failure had not detected a drive degrading and already proactively re-protected the data.

Now assuming that scenario arose, the drive failure would also have to be storing the exact same extent as the corrupted data block, which even in a small 4 node cluster such as a NX3460, you have 24 drives so the probability is extremely low. The larger the cluster the lower the chance of this already unlikely scenario and the faster the cluster can rebuild as we’ve learned earlier in the series.

If you still feel it’s too high a risk and feel strongly all those events will line up perfectly, then deploy RF3 and you would now have to have all the stars align in addition to three concurrent failures to experience data loss.

For those of you who have deployed VSAN, disk scrubbing is only performed once a year AND VMware frequently recommend turning checksums off, including in their SAP HANA documentation which has subsequently been updated after I called them out because this is putting customers at a high and unnecessary risk of data loss.

Nutanix also has the ability to monitor the background disk scrubbing activity, the below screen shot shows the scan stats for Disk 126 which in this environment is a 2TB SATA drive at around 75% utilisation.

DiskScrubbingStats

Disk126

AOS ensures disk scrubbing occurs at a speed which guarantees the scrubbing of the entire disk regardless of size is finished every 24 hours, as per the above screenshot this scan has been running for 48158724ms or according to google, 13.3hrs with 556459ms (0.15hrs) ETA to complete.

scanduration

If you combine the distributed nature of the Acropolis Distributed Storage Fabric (ADSF) where data is dynamically spread evenly based on capacity and performance, a clusters ability to tolerate multiple concurrent drives failures per node, checksums being performed on every read/write operation, disk scrubbing being completed every day, proactive monitoring of hard drive/SSD health to in many cases re-protect data before a drive fails as well at the sheer speed that ADSF can rebuild data following failures, it’s easy to see why even using Resiliency Factor 2 (RF2) provides excellent resiliency.

Still not satisfied, change the Resiliency Factor to 3 (RF3) and you have yet another layer of protection and you get even more protection for the workloads you choose to enable RF3 for.

When considering your Resiliency Factor, or Failures to Tolerate in vSAN language, do not make the mistake of thinking two copies of data on Nutanix and vSAN is equivalent, Nutanix RF2 is vastly more resilient than FTT1 (2 copies) on vSAN which is why VMware frequently recommend FTT2 (3 copies of data). This actually makes sense because of the following reasons:

  1. vSAN is not a distributed storage fabric
  2. vSAN rebuild performance is slow and high impact
  3. vSAN disk scrubbing is only performed once a year
  4. VMware frequently recommend to turn checksums OFF (!!!)
  5. A single cache drive failure takes an entire disk group offline
  6. With all flash vSAN using compression and/or dedupe, a single drive brings down the entire disk group

Architecture matters, and for anyone who takes the time to investigate beyond the marketing slides of HCI and storage products will see that Nutanix ADSF is the clear leader especially when it comes to scalability, resiliency & data integrity.

Other companies/products are clear leaders in Marketecture (to be blunt, Bullshit like in-kernel being an advantage and 10:1 dedupe) but Nutanix leads  where it matters with a solid architecture which delivers real business outcomes.

Index:
Part 1 – Node failure rebuild performance
Part 2 – Converting from RF2 to RF3
Part 3 – Node failure rebuild performance with RF3
Part 4 – Converting RF3 to Erasure Coding (EC-X)
Part 5 – Read I/O during CVM maintenance or failures
Part 6 – Write I/O during CVM maintenance or failures
Part 7 – Read & Write I/O during Hypervisor upgrades
Part 8 – Node failure rebuild performance with RF3 & Erasure Coding (EC-X)
Part 9 – Self healing
Part 10: Nutanix Resiliency – Part 10 – Disk Scrubbing / Checksums

Predictable & Scalable MS Exchange 2016 Performance on Nutanix with AHV

I’ve been doing some testing recently with Nutanix latest GA code (AOS 5.8) and I decided to do some quick MS Exchange Jetstress performance tests as part of a larger piece of work.

In short I wanted to check how well Exchange storage performance scaled so I performed three tests. I started with 4 threads, then increased to 8 and finally to 12 threads using Jetress with Exchange 2016 ESE database modules.

For this testing I disabled the Nutanix in memory read cache to ensure all read IO is serviced by the physical SSDs so the result is not artificially improved from cache.

I also disabled Compression, Erasure Coding and Deduplication as these also artificially improve performance due to Jetstress data being highly compressible & dedupable.

The hardware used was a NX-8150 with 6 x SSDs and Intel Broadwell processors. This is why the database size was only 1.7TB as that’s just below the total usable capacity of the node. The performance over larger database sizes remains the same when the metadata cache (in the Nutanix Controller VM) is sized for the desired working set size as shown by our ESRP certification.

The hypervisor is Acropolis Hypervisor (AHV) which is fully certified for Microsoft Windows under the MS SVVP programme as well as MS ESRP certified for MS Exchange.

So here is the result for 4 threads.

Jetstress2016_4Threads

5580 IOPS with just 4 threads is very good performance and is sufficient for at least five thousand mailboxes with hundreds of messages per day which is maximum recommended active users per Exchange MSR server.

The next question is: What’s the latency for the database reads and log writes? (These are two of the critical performance metrics for Jetstress Pass/Fail results)

Jetstress2016_4Threads_Latency

Here we can see log write latency average across all four log drives is below 1ms (0.99ms) and database read latency at 1.16ms.

Next up, here is the result for 8 threads.

Jetstress2016_8Threads

10147 IOPS with 8 threads is excellent performance and shows Nutanix easily has headroom for more than ten-thousand mailboxes with hundreds of messages per day which easily exceeds the requirements for the maximum recommended active users per Exchange MSR server.

Again let’s check out the latency, Here we can see log write latency average across all four log drives is still below 1ms (0.99ms) and database read latency at 1.29ms. That’s just 0.13ms higher latency for reads and exactly the same write latency while achieving almost DOUBLE the IOPS.

Jetstress2016_8Threads_Latency

Lastly here is the result for 12 threads.

Jetstress2016_12Threads

14351 IOPS with 12 threads proves how scalable the Nutanix platform is as this is almost a linear increase in IOPS.

Again let’s check out the latency, Here we can see log write latency average across all four log drives is still below 1ms (0.98ms) and database read latency at 1.42ms. That’s just 0.14ms higher latency for reads and slightly lower write latency while achieving almost linear improvement in IOPS.

Jetstress2016_12Threads_Latency

Summary:

Nutanix provides extremely high, predictable performance for even the most demanding MS Exchange environments.

 

Nutanix Scalability – Part 5 – Scaling Storage Performance for Physical Machines

Part 3 and Part 4 has taught us that Nutanix provides excellent scalability for Virtual Machines and provides ABS and Volume Group Load Balancer (VG LB) for niche workloads which may require more performance than a single node can provide.

Now that we’ve learned how to scale a Virtual machines performance, let’s see how the same rules apply to physical servers.

So you’ve got your physical server and a Nutanix cluster, now what?

As Part 3 and Part 4 explained, more virtual disks increase the storage performance for virtual machine, the same is true for physical servers using ABS.

Virtual disks will be presented to the physical server via iSCSI (ABS), for optimal performance you should have at least one virtual disk per node in your cluster. The reason for this is each vDisk is managed by a stargate (Nutanix IO engine) instance which runs in every Controller VM (CVM).

If you have a four node cluster, you need to use at least four virtual disks to utilise the four node cluster optimally. For an eight node cluster, eight or more virtual disks is required to ensure all CVMs (stargate instances) can actively provide a boost in performance.

The following tweet shows how the pathing increased from four on the four node cluster and when an additional fours node were added the pathing dynamically changed to use all eight nodes.

Therefore when using ABS for physical workloads, especially those high end database servers, I recommend using a minimum of 8 vDisks however if your cluster size is greater than 8, match the number of vDisks with the cluster size as your starting point.

If you have an 8 node cluster, you could for example use 32 vDisks and these will spread evenly across the nodes, resulting in four per stargate instance which is perfectly fine.

Using more vDisks than your current cluster size also means when additional nodes are added, ABS can dynamically load balance the vDisks across the new and existing nodes to automatically scale your performance.

Let’s cover the same MS Exchange and MS SQL examples covered for Virtual Machines in Parts 3 and 4 but now specifically for physical servers using ABS.

Let’s say we have an MS Exchange server with 20 databases, the performance requirements for each database is typically in the range of hundred of IOPS, in which case I would recommend one virtual disk (e.g.: VMDK) per database and another for the logs.

In the case of a large MS SQL server which may require tens or hundreds of thousands of IOPS to a single database, I recommend using multiple vDisks per database which involves Splitting SQL datafiles across multiple VMDKs to optimise VM physical server performance.

Sound familiar? The above two paragraphs are literally a copy/paste from Part 3 because the exact same rules apply to physical servers and virtual machines. Simple right!

Still need more performance?

Again, the exact same rules apply to physical servers with ABS as they do to virtual machines. In no particular order, as we’ve learned from Part 3 & 4:

  • Increase the vCPU of the Nutanix Controller VM (CVM)
  • Increase the vRAM of the Nutanix Controller VM (CVM)
  • Add storage only nodes

Can’t get much easier than that!

Summary:

From Parts 3, 4 and 5 we have learned that Nutanix provides the ability to scale the performance of individual servers, be it physical or virtual using the same simple strategies of adding virtual disks, storage only nodes or Controller VM (CVM) resources and how doing so increases performance to meet virtually (pun intended) any performance requirements.

Is there any reason you couldn’t confidently say Nutanix is doing 3 tier better than the SAN vendors? I’d love to hear if you have any corner cases.

Back to the Scalability, Resiliency and Performance Index.