Nutanix X-Ray Benchmarking tool – Extended Node Failure Scenario

In the first part of this series, I introduced Nutanix X-Ray benchmarking tool which has been designed very differently to traditional benchmarking tools as the performance of the app is the control and the variable is the platform,not the other way around.

In the second part, I showed how Nutanix AHV & AOS could maintain the performance while utilising snapshots to achieve the type of recovery point objective (RPO) that is expected in production environments, especially with business critical workloads whereas a leading hypervisor and SDS platform could not.

In this part, I will cover the Extended Node Failure Scenario in X-Ray and again compare Nutanix AOS/AHV and a leading hypervisor and SDS platform in another real world scenario.

Let’s start by reviewing what the description of the X-ray Extended node failure scenario.

XrayExtendedNodeFailureScenario

I really like that X-ray has a scenario which shows a simulated node failure as this is bound to happen regardless of the platform you choose, and with hyperconverged platforms the impact of a node failure is arguably higher than traditional 3-tier as the nodes contain some data which needs to be recovered.

As such, it is critical before choosing a HCI platform to understand how it behaves in a failure scenario which is exactly what this scenario demonstrates.

XrayNodeFailureComparison

Here we can see the impact on the performance of the surviving VMs following the power being disconnected via the out of band management interface.

The Nutanix AOS/AHV platform continues to run at a very steady rate, virtually without impact to the VMs. On the other hand we see that after 1 hour the other platform has a high impact with significant degradation.

This clearly shows the Acropolis Distributed Storage Fabric (ADSF) to be a superior platform from a resiliency perspective, which should be a primary consideration when choosing a platform for any production environment.

Back in 2014, I highlighted the Problems with RAID and Object Based Storage for data protection and in a follow up post I discussed how Nutanix Acropolis Distributed Storage Fabric (ADSF) compares with traditional SAN/NAS RAID and hyper-converged solutions using Object storage for data protection.

The above results clearly demonstrate the problems I discussed back in 2014 are still applicable to even the most recent versions of a leading hypervisor and SDS platform. This is because the problem is the underlying architecture and bolting on new features is at best masking the constraints of the original architectural decision which has proven to be significantly flawed.

This scenario clearly demonstrates the criticality of looking beyond peak performance numbers and conducting a thorough evaluation of a platform prior to purchase as well as comprehensive operational verification prior to moving any platform into production.

Related Articles:

Nutanix X-Ray Benchmarking tool Part 1 – Introduction

Nutanix X-Ray Benchmarking tool Part 2 -Snapshot Impact Scenario

It’s 2017, let’s review Thick vs Thin Provisioning

For a long time, it has been widely considered that thick provisioning is required to achieve maximum storage performance and for many years this was a good rule of thumb.

Before we get into details, what are Thick and Thin provisioning?

Thick provisioning is where storage allocated to a LUN, NFS mount or Virtual Disk (such as a VMDK in ESXi, VHDX in Hyper-V or vDisk in AHV) is zeroed out and/or fully reserved regardless of how much capacity is actually used.

Thick provisioning avoids a storage subsystem from having to zero out a block before writing new data which is one of the reasons higher performance could be achieved on many storage platforms.

Thin provisioning on the other hand is where storage allocated to a LUN or Virtual Disk is zeroed as data is written and allows physical capacity to be overcommitted.

The advantages of Thick provisioning included easier capacity management, or simply put a “What you see is what you get” as well as maximum performance on most platforms. But by maximum performance, even on older storage platforms the advantage was rarely significant as people would claim.

VMware conducted a Performance Study of VMware vStorage Thin Provisioning back in the ESXi 4.0 days (~2009) which I will briefly summarise.

On page 6 of the performance study the following graph shows the different in performance between Thin and Thick VMDKs during zeroing and post-zeroing.

As you can see the performance is almost identical.

The disadvantages though were and remain significant to this day which include an inability to overcommit storage, meaning physical free space has to be maintained at multiple layers such as RAID group, LUN, Virtual Disk layers, leading to inefficiency.

The advantages of Thin provisioning include the ability to overcommit storage which results in more flexibility when sizing LUNs & Virtual Disks and less wasted space. The only real downsides were potentially increased capacity management complexity and lower performance.

I have previously written two example architectural decisions regarding using “Thin on Thin“, meaning thin provisioned virtual disks on a thin provisioned LUN or NFS mount as well as “Thin on Thick” meaning thin provisioned virtual disks on a thick provisioned LUN or NFS mount. These two examples cover off many of the traditional pros and cons between thick and think, so I won’t repeat myself here.

I never wrote an example design decision for Thick on Thick, but this was common practice when provisioning storage was time consuming, difficult and involved lengthly delays to engage subject matter experts.

In early 2015, I wrote a two part blog series where I explained it’s not as simple as you might think to calculate usable capacity where I compared SAN/NAS verses Nutanix. In part 1, I highlight that the LUN Provisioning Type is one area which can greatly impact the usable capacity of a traditional storage platform.

But fast forward into the era of hyper-converged platforms like Nutanix and some modern storage arrays and the major downsides of thin provisioning, being complexity of capacity management and reduced performance have not only been reduced, but at least in the case of Nutanix, have been eliminated all together.

Let’s address Capacity management w/ Nutanix:

Storage utilisation only needs to be monitored in ONE place, the storage summary which lives on the home screen of the Nutanix HTML 5 UI.

NutanixStorageSummary

No matter how many nodes in your cluster, number of containers (which translate to datastores in a VMware environment), virtual machines & virtual disks or physical servers connecting via ABS, this is the only place you need to monitor capacity.

There are no RAID groups, Disk Groups, Aggregates, LUNs etc where capacity needs to be managed. All nodes in a cluster contributed to the capacity of the cluster and even when one or more virtual machines use more capacity than a the node they run on, Nutanix Acropolis Distributed Storage Fabric (ADSF) takes care of it.

So issue #1, Capacity management, is solved. Now it’s onto the issue of performance.

Thin Provisioning Performance w/ Nutanix:

When running ESXi, Nutanix runs NFS datastores and supports thick provisioning via the VAAI-NAS Space reservation primitive as discussed in this post. This allows the creation of thick provisioned (Eager Zero or Lazy Zero Thick) VMDKs when traditionally NFS datastores did not support it.

However this was only required for Oracle RAC and VMware Fault Tolerance and was not a performance requirement.

However from a performance perspective, Thin provisioning actually outperforms thick on intelligent storage such as Nutanix. In the specific case of Nutanix, random write I/O is serviced by the fastest tier available (e.g.: SSD) and via the operations log (OPLOG) which takes the random writes commits them to persistent media, and then coalesces them into sequential IO to then commit to SSD before tiering it off to lower cost storage in the case of hybrid nodes.

This means the write penalty for overwriting or zeroing blocks before writing new I/O is eliminated.

In fact if you configure thick provisioned virtual disks, as the zeros (or whitespace) is being written by the hypervisor, the Nutanix storage fabric acknowledges every I/O and discards the zeros in favour of storing metadata and simply reserving the capacity. In simple terms, this just means Nutanix has to acknowledge a whole bunch of nothing and the thick provisioning is achieve with a simple reservation as opposed to zeroing out many GBs or TBs of storage.

This means thick provisioning is actually lower performance than thin provisioning on Nutanix.

With modern, intelligent storage, there is limited if any benefits to using thick provisioning, the only example I can think of is to artificially inflate the deduplication ratio as thick provisioned virtual disks tend to have a lot of zeros all of which dedupe. I wrote an article titled: “Deduplication ratios – What should be included in the reported ratio?” which covers off this point in detail but in short, don’t create unnessasary data (in this case, zeros) just to inflate your dedupe ratio, it just wastes storage controller resources and achieves no additional benefits.

The following is a comprehensive list of the real world advantages of using thick provisioning on Nutanix.

This space is intentionally left blank

Summary:

For the best efficiency and performance when deploying virtual machines or storage for physical servers via ABS on Nutanix, use thin provisioning!

Dare2Compare Part 6 : Nutanix data efficiency stats can’t be found

If you’ve not read Parts 1 through 5, we have already proven several claims by HPE Simplivity regarding Nutanix to be false, as well as explored the misleading way in which HPE SVT promote data efficiency.

We continue with Part 6 where we will discuss HPE’s claim that “Nutanix data efficiency stats are stealthier than a ninja”. (below)

While HPE’s claim is an attempt to create Fear, Uncertainty and Doubt (FUD), HPE are partially correct in that we (Nutanix) have done a very poor job of promoting the arguably market leading data efficiency that Nutanix provides.

In fact, several colleagues and I created a feature request to properly report in a clear and detailed way, the ADSF data efficiencies and I am pleased to say these changes were included as part of the recent AOS 5.1 release.

Now what Nutanix users see in PRISM “Storage” view is (as shown below):

  1. A Capacity optimization overview
  2. Data reduction ratio which is made up of deduplication, compression and erasure coding savings*.
  3. Data reduction savings which is a total GB/TB/PB value from data reduction
  4. An Overall Efficiency ratio which is a combination of Data Reduction, Cloning and Thin Provisioning

*Metadata copies/snapshops/pointers etc are not included in the deduplication value as they are not deduplication.

The resulting summary is very clear and easy to understand so customers can see what efficiencies are from data reduction, and which savings (which typically form by far the largest “efficiency”) come from Cloning and thin provisioning.

DataReductionSummary2

One major item which will be included in an upcoming release is zero suppression. Zero suppression is a capability which has been in Nutanix Distributed Storage Fabric since Day 1 and it avoids unnecessarily storing zeros, instead storing metadata which achieves the same outcome but is much higher performance and uses much less capacity.

Nutanix snapshots or pointer based copies (depending on how you refer to them) are also not included in the overall efficiency number, however these will also be included as a seperate line item in a future release as we aim to be very clear regarding what data efficiencies a customer is achieving with Nutanix.

Some vendors recommend Eager Zero Thick (EZT) VMDKs on vSphere, and then deduplicate the zeros which artificially increases the deduplication ratio. Nutanix does not do this as it’s inefficient to create more data to deduplicate when you can simply avoid writing the data in the first place. However we do plan to report the savings from Zero suppression as a seperate line item as it is a value our platform provides.

For a more detailed view, Nutanix customers can dive down into the storage,Diagram view where admins can view of each containers data efficiency breakdown (as shown below).

DetailedContainerView

As we can see, Nutanix is very transparent showing what data reduction features are enabled, what ratio is being achieved, the total, used, reserved and even Thick Provisioned storage with an effective free based on physical multiplied by data reduction ratio and an overall efficiency value.

Now that we’ve covered off how Nutanix measures and reports on data reduction/efficiency, I’d like to highlight a critical factor when discussing data reduction/efficiency and that is that data efficiency is totally dependant on the individual customers data. For the same dataset, the difference between vendors with the same capabilities, e.g.: Deduplication, Compression and Erasure Coding (EC-X) are unlikely to be vastly different (or better put, change a business outcome one way or another) despite what each vendor will say about their implementation of such technologies.

In short: The biggest factor in the achieved data reduction is not the vendor, it’s the customer data.

With that said, if you’re comparing HPE SVT and Nutanix, then there is a pretty major delta between the two products in terms of capabilities and that is because Nutanix supports Erasure Coding (EC-X) and HPE SVT does not.

As a result, Nutanix has a major advantage as Erasure Coding in the Nutanix Acropolis Distributed Storage Fabric (ADSF) is complimentory to both deduplication and compression.

Unlike Compression and Deduplication, Erasure Coding can provide savings (or another way to look at it would be lower data redundancy overheads) regardless of the data type.

So where Deduplication and Compression will get minimal/no savings for data such as Video files, Erasure Coding still provides savings so the delta between Nutanix and HPE SVT will only increase in Nutanix favour the less the customer data will dedupe and/or compress.

HPE SVT on the other hand has a RAID (RAID 6 being N-2 usable or RAID 60 being N-4 usable) overhead and on top of that, use replication (2 copies / 50% usable) for an usable capacity (of raw) of well below 50% depending on the number of drives per node.

Nutanix, using RF2 and EC-X provides between 50% (minimum) and 80% (maximum) usable capacity of RAW and with RF3 (N+2) between 33% (minimum) and 66% (maximum) usable excluding the benefits of compression and deduplication.

The next major factor in data efficiency ratios is how they are measured!

In Part 1 I have already covered how misleading HPE SVT’s 10:1 efficiency guarantee is, and this is a great example of why it can be difficult to compare apples/apples between vendors. Nutanix on the other hand does not measure data efficiency in the same misleading manner.

In Summary:

  1. Nutanix AOS 5.1 has comprehensive data reduction/efficiency reporting within the PRISM HTML GUI
  2. Nutanix data reduction capabilities exceed that of HPE SVT as both products have Dedupe and Compression, but Erasure Coding (EC-X) is only supported on Nutanix
  3. All data reduction capabilities on Nutanix are complimentory, so Dedupe , Compression and Erasure Coding can all work together to maximise savings.
  4. Erasure Coding provides data reduction even for data which is not compressible or dedupeable
  5. Nutanix data efficiency stats are easily visible in the PRISM GUI and are much more detailed than HPE SVT

Return to the Dare2Compare Index:

But wait, there’s more!

As far as data reduction results are concerned, they are all over twitter and a simple search comes up with many examples. The first one being my favorite. Not because of the data reduction ratio itself but because it shows one of the major values of a 100% software solution where a simple software upgrade (which is one-click rolling, non-disruptive) provided the customer a significantly higher data reduction ratio. So basically, the customer got more capacity for free!

Note: None of the below show the latest data efficiency reporting capabilities from AOS 5.1.

Here are a few other examples which I found using this Twitter search: