Dare2Compare Part 6 : Nutanix data efficiency stats can’t be found

If you’ve not read Parts 1 through 5, we have already proven several claims by HPE Simplivity regarding Nutanix to be false, as well as explored the misleading way in which HPE SVT promote data efficiency.

We continue with Part 6 where we will discuss HPE’s claim that “Nutanix data efficiency stats are stealthier than a ninja”. (below)

While HPE’s claim is an attempt to create Fear, Uncertainty and Doubt (FUD), HPE are partially correct in that we (Nutanix) have done a very poor job of promoting the arguably market leading data efficiency that Nutanix provides.

In fact, several colleagues and I raised a feature request for the ADSF data efficiencies to be reported in a clear and detailed way, and I am pleased to say these changes were included as part of the recent AOS 5.1 release.

Now what Nutanix users see in PRISM “Storage” view is (as shown below):

  1. A Capacity optimization overview
  2. Data reduction ratio which is made up of deduplication, compression and erasure coding savings*.
  3. Data reduction savings which is a total GB/TB/PB value from data reduction
  4. An Overall Efficiency ratio which is a combination of Data Reduction, Cloning and Thin Provisioning

*Metadata copies/snapshots/pointers etc. are not included in the deduplication value as they are not deduplication.

The resulting summary is very clear and easy to understand so customers can see what efficiencies are from data reduction, and which savings (which typically form by far the largest “efficiency”) come from Cloning and thin provisioning.

DataReductionSummary2

One major item which will be included in an upcoming release is zero suppression. Zero suppression is a capability which has been in Nutanix Distributed Storage Fabric since Day 1 and it avoids unnecessarily storing zeros, instead storing metadata which achieves the same outcome but is much higher performance and uses much less capacity.

Nutanix snapshots or pointer based copies (depending on how you refer to them) are also not included in the overall efficiency number. However, these will also be included as a separate line item in a future release as we aim to be very clear regarding what data efficiencies a customer is achieving with Nutanix.

Some vendors recommend Eager Zero Thick (EZT) VMDKs on vSphere, and then deduplicate the zeros, which artificially increases the deduplication ratio. Nutanix does not do this as it’s inefficient to create more data to deduplicate when you can simply avoid writing the data in the first place. However, we do plan to report the savings from Zero suppression as a separate line item as it is a value our platform provides.

For a more detailed view, Nutanix customers can drill down into the Storage, Diagram view where admins can view each container’s data efficiency breakdown (as shown below).

DetailedContainerView

As we can see, Nutanix is very transparent, showing which data reduction features are enabled, what ratio is being achieved, the total, used, reserved and even Thick Provisioned storage, an effective free value based on physical capacity multiplied by the data reduction ratio, and an overall efficiency value.
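As a rough illustration of the “effective free” figure described above, the sketch below multiplies physical free capacity by the data reduction ratio. It assumes “physical” means physical free capacity; the function name and the sample numbers are hypothetical and this is not how PRISM is actually implemented.

```python
def effective_free_tb(physical_free_tb: float, data_reduction_ratio: float) -> float:
    """Estimate effective free capacity as physical free space times the data reduction ratio."""
    if data_reduction_ratio <= 0:
        raise ValueError("data reduction ratio must be positive")
    return physical_free_tb * data_reduction_ratio


# Hypothetical example: 10 TB physically free at a 1.8:1 data reduction ratio.
print(f"{effective_free_tb(10.0, 1.8):.1f} TB effective free")
```

The point is simply that the effective figure moves with the ratio, so it is only as reliable as the assumption that new data reduces as well as existing data.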

Now that we’ve covered off how Nutanix measures and reports on data reduction/efficiency, I’d like to highlight a critical factor when discussing data reduction/efficiency: data efficiency is totally dependent on the individual customer’s data. For the same dataset, the results from vendors with the same capabilities (e.g. Deduplication, Compression and Erasure Coding (EC-X)) are unlikely to be vastly different (or better put, to change a business outcome one way or another), despite what each vendor will say about their implementation of such technologies.

In short: The biggest factor in the achieved data reduction is not the vendor, it’s the customer data.

With that said, if you’re comparing HPE SVT and Nutanix, then there is a pretty major delta between the two products in terms of capabilities and that is because Nutanix supports Erasure Coding (EC-X) and HPE SVT does not.

As a result, Nutanix has a major advantage as Erasure Coding in the Nutanix Acropolis Distributed Storage Fabric (ADSF) is complementary to both deduplication and compression.

Unlike Compression and Deduplication, Erasure Coding can provide savings (or another way to look at it would be lower data redundancy overheads) regardless of the data type.

So where Deduplication and Compression will get minimal/no savings for data such as Video files, Erasure Coding still provides savings, so the delta between Nutanix and HPE SVT will only increase in Nutanix’s favour the less the customer data will dedupe and/or compress.

HPE SVT, on the other hand, has a RAID overhead (RAID 6 being N-2 usable or RAID 60 being N-4 usable) and, on top of that, uses replication (2 copies / 50% usable), giving a usable capacity well below 50% of raw, depending on the number of drives per node.
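To put rough numbers on “well below 50%”, here is a minimal sketch that combines the RAID parity overhead with two replicated copies. The drive counts per node are hypothetical examples, not HPE SVT specifications.

```python
def raid6_replicated_usable_fraction(drives_per_node: int, raid_groups: int = 1) -> float:
    """Usable fraction of raw for RAID 6 (2 parity drives per group) plus 2 replicated copies."""
    parity_drives = 2 * raid_groups  # RAID 6 is N-2; RAID 60 with two groups is N-4
    if drives_per_node <= parity_drives:
        raise ValueError("not enough drives for the chosen RAID layout")
    raid_fraction = (drives_per_node - parity_drives) / drives_per_node
    return raid_fraction * 0.5  # two full copies of the data halve what is left


# Hypothetical drive counts per node, single RAID 6 group.
for drives in (6, 8, 12):
    print(drives, f"{raid6_replicated_usable_fraction(drives):.1%}")
```

The fewer drives per node, the larger the share consumed by parity, which is why the result depends on node configuration.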

Nutanix, using RF2 and EC-X provides between 50% (minimum) and 80% (maximum) usable capacity of RAW and with RF3 (N+2) between 33% (minimum) and 66% (maximum) usable excluding the benefits of compression and deduplication.
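The minimum and maximum figures above can be sketched as follows. The strip layouts of 4 data blocks with 1 parity block (RF2) and 4 data blocks with 2 parity blocks (RF3) are assumptions chosen because they reproduce the 80% and 66% maximums quoted; actual strip sizes depend on cluster size and configuration.

```python
def replication_usable_fraction(copies: int) -> float:
    """Usable fraction when every block is stored `copies` times (RF2 = 2, RF3 = 3)."""
    return 1 / copies


def erasure_coded_usable_fraction(data_blocks: int, parity_blocks: int) -> float:
    """Usable fraction for a strip of data blocks protected by parity blocks."""
    return data_blocks / (data_blocks + parity_blocks)


# Minimum: nothing erasure coded yet. Maximum: everything in the assumed widest strip.
print(f"RF2: {replication_usable_fraction(2):.0%} to {erasure_coded_usable_fraction(4, 1):.0%}")
print(f"RF3: {replication_usable_fraction(3):.0%} to {erasure_coded_usable_fraction(4, 2):.0%}")
```

Real clusters land somewhere between the two bounds.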

The next major factor in data efficiency ratios is how they are measured!

In Part 1 I have already covered how misleading HPE SVT’s 10:1 efficiency guarantee is, and this is a great example of why it can be difficult to compare apples/apples between vendors. Nutanix on the other hand does not measure data efficiency in the same misleading manner.

In Summary:

  1. Nutanix AOS 5.1 has comprehensive data reduction/efficiency reporting within the PRISM HTML GUI
  2. Nutanix data reduction capabilities exceed those of HPE SVT as both products have Dedupe and Compression, but Erasure Coding (EC-X) is only supported on Nutanix
  3. All data reduction capabilities on Nutanix are complementary, so Dedupe, Compression and Erasure Coding can all work together to maximise savings.
  4. Erasure Coding provides data reduction even for data which is not compressible or dedupeable
  5. Nutanix data efficiency stats are easily visible in the PRISM GUI and are much more detailed than HPE SVT


But wait, there’s more!

As far as data reduction results are concerned, they are all over twitter and a simple search comes up with many examples. The first one being my favorite. Not because of the data reduction ratio itself but because it shows one of the major values of a 100% software solution where a simple software upgrade (which is one-click rolling, non-disruptive) provided the customer a significantly higher data reduction ratio. So basically, the customer got more capacity for free!

Note: None of the below show the latest data efficiency reporting capabilities from AOS 5.1.

Here are a few other examples which I found using this Twitter search:

Calculating Actual Usable capacity? It’s not as simple as you might think! – Part 2 Nutanix

In Part 1, the example provided showed usable capacity for a SAN/NAS using a combination of RAID 10, RAID 5 and RAID 6 along with the various sizing considerations resulted in 35.68TB usable capacity or approx 1/3rd of the RAW 100TB.

In Part 2 we will discuss the misconception that Nutanix (a Hyper-converged platform) provides lower effective usable capacity compared to SAN or NAS solutions.

At a high level, Nutanix uses Replication Factor 2 (RF2) which has the same overhead as RAID 1, so straight away a lot of people jump to the conclusion that the usable capacity is less than a traditional SAN/NAS because *insert your favourite RAID level here* has less overhead.

Let’s say we have a Nutanix cluster with 100TB Raw storage using the most common node type, the NX3050.

Now let’s address the same points as we did in Part 1 for the SAN/NAS example:

So starting with the same 100TB RAW as we did for the SAN/NAS example and see where things end up on Nutanix.

1. Deducting hot spare drives

Nutanix does not use hot spare drives; data is balanced across all drives in the “Storage Pool”. To cater for failure, it is recommended to size for N+1 for Resiliency Factor 2 (RF2) deployments. If we were using NX3050 nodes (the most popular Nutanix node) then the overhead of N+1 would be ~4.8TB RAW.

100TB – N+1 Node (4.8TB RAW) = 95.2TB

2. RAID Overhead

Nutanix doesn’t use RAID, but Replication Factor 2 has an overhead of 50% (the same as RAID 10).

95.2TB – 50% (RF2) = 47.6TB remaining

3. Free Space on the platform required to ensure performance

For Nutanix, all write I/O goes to either the Extent Store or the Oplog, both of which are housed on the SSD tier. All random writes are serviced by the Oplog until it reaches 95% capacity, at which point the Oplog is bypassed.

As such, performance remains high until 95% capacity. Therefore only 5% free capacity is required to ensure high performance.

47.6TB – 5% (Free space for performance) = 45.2TB

FYI: Nutanix Performance and Engineering team members including myself typically conduct benchmarks at greater than 90% cluster capacity.

4. Free space per LUN

Nutanix does not use LUNs. Nutanix presents containers to the hypervisor. All containers are thin provisioned and all containers can use all available space in the storage pool. Meaning free space only needs to be managed at the Storage Pool layer, not at each individual container.

As we have already taken into account the 5% free space there is no need to take another 5% of space therefore we remain at 45.2TB usable.

5. Free space per VMDK

As with physical servers and SAN/NAS environments, we don’t want our VMs drives running out of capacity, as a result it is common to size VMDKs well above what is strictly required to make capacity management (operational tasks) easier.

As mentioned in Part 1, I typically see architects recommending upwards of 10-20% free space per VMDK over and above what is required to account for unexpected growth, OS patching etc. This makes perfect sense for the same reason as we have free space per LUN because if space runs out for a VM, it’s another bad day for I.T.

For this example, I will assume the same 10% free space per VMDK as I did for the SAN/NAS example. The difference with Nutanix is that performance remains the same regardless of the VMDK being Thick or Thin provisioned, so with every VM Thin Provisioned, no capacity needs to be reserved for free space within VMDK files as it would be for traditional environments requiring Eager Zero Thick VMDKs for performance.

So we’re still at 45.2TB usable.
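The arithmetic for steps 1 to 5 can be reproduced with a short sketch. The function and parameter names are hypothetical; the figures (100TB RAW, 4.8TB for N+1, RF2, 5% free space) are the ones used in the text.

```python
def nutanix_usable_tb(raw_tb: float, n_plus_1_tb: float, copies: int,
                      perf_free_fraction: float) -> float:
    """Apply the deductions from steps 1 to 3; steps 4 and 5 deduct nothing further."""
    after_spare_node = raw_tb - n_plus_1_tb        # step 1: N+1 instead of hot spares
    after_replication = after_spare_node / copies  # step 2: RF2 keeps two copies
    return after_replication * (1 - perf_free_fraction)  # step 3: free space for performance


usable = nutanix_usable_tb(raw_tb=100.0, n_plus_1_tb=4.8, copies=2, perf_free_fraction=0.05)
print(f"{usable:.2f} TB usable")  # 45.22 TB usable
```

Because containers and VMDKs are thin provisioned, there is no per-LUN or per-VMDK term in the function at all.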

Now where are we at?

So far, the first 5 points are fairly easy to calculate.

Next we will look at various factors which further reduce usable capacity for SAN/NAS and see how they apply to Nutanix.

6. Silos for Performance

Nutanix does not require nor recommend silos being created for performance reasons. All VMs can reside in a single container therefore no capacity is unusable as a result of performance requirements.

As no silos are required for maximum performance, we are still at 45.2TB usable.

7. Silos of (or Fragmented) Usable Capacity

Nutanix does not configure usable capacity to containers, a container can use all the available storage in the underlying Storage Pool. Where multiple containers are provisioned, each container can see the total capacity of the storage pool while providing logical separation of the VMs within the containers. This avoids the issue of fragmented free capacity.

The diagram below shows 5 containers hosted by an example Nutanix cluster (Storage Pool) with 100TB total capacity, each container has a capacity of 100TB and 25TB free space in alignment with the underlying storage pool.

NutanixFreeSpace

In this case, when creating a new VM, or adding or expanding VMDKs for existing VMs, it does not matter which container we place the VM in. As long as the VM needs less than the 25TB available in the pool, the choice of container makes no difference to capacity.

This removes the requirement for complex capacity management, or using Storage DRS and Storage vMotion.
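A minimal sketch of why placement becomes trivial: every container reports the free space of the shared pool, so the placement check is the same whichever container is chosen. The class, the container names and the 20TB VM size are hypothetical; the 100TB pool with 25TB free matches the diagram described above.

```python
from dataclasses import dataclass


@dataclass
class StoragePool:
    capacity_tb: float
    used_tb: float

    @property
    def free_tb(self) -> float:
        return self.capacity_tb - self.used_tb


def can_place(pool: StoragePool, vmdk_tb: float) -> bool:
    """Any container can host the VMDK as long as the shared pool has room."""
    return vmdk_tb <= pool.free_tb


pool = StoragePool(capacity_tb=100.0, used_tb=75.0)
containers = [f"container-{i}" for i in range(1, 6)]  # hypothetical names

# The answer is identical for every container because they share one pool.
for name in containers:
    print(name, can_place(pool, vmdk_tb=20.0))
```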

So we’re still at 45.2TB usable.

Other factors which reduce usable capacity?

8. LUN Provisioning Type

In many cases, especially when talking about high performance applications, storage vendors recommend using Thick Provisioned LUNs and as mentioned in Part 1, It’s anyone’s guess how much space is wasted as a result.

But with Nutanix, all containers are Thin Provisioned so no capacity is wasted on Thick Provisioning and performance is optimal.

9. Wasted Capacity from using SSDs as Cache

Nutanix does not use SSDs as Cache! The SSDs form part of the Extent Store which is for persistent data storage. The OpLog, which is also on SSD, is also persistent and not a “cache”. As such, no capacity is lost as a result of caching.

10. Snapshot Reserves

Nutanix does not use reserve capacity for snapshots. Snapshots simply use available capacity in the storage pool. If you don’t use snapshots, no space is wasted, if you do use snapshots, then the delta changes are stored. Simple as that.

Summary:

From the 100TB RAW, factoring in what is a realistic Nutanix configuration including N+1 to tolerate a node failure and allow the cluster to fully self heal, the effective usable capacity is 45.2TB, which is just under 50% of 100TB RAW.

This is a very simple configuration to manage from both a performance and capacity perspective, and one which is easily calculated and repeatable.

If the Resiliency Factor was 3 (which IMO is rarely if ever required) across the entire environment (which again would be extremely unusual as VMs which require RF3 can be configured in an RF3 container) then the usable capacity would be ~30TB, which is only slightly below the SAN/NAS example, and RF3 delivers higher resiliency.
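The ~30TB figure follows from the same deductions with three copies instead of two. This sketch keeps the 4.8TB N+1 deduction and the 5% free space from the RF2 walkthrough, which is an assumption about how the RF3 figure was derived.

```python
def usable_tb(raw_tb: float, spare_node_tb: float, copies: int, free_fraction: float) -> float:
    """Raw capacity minus the spare node, divided by the number of copies, minus free space."""
    return (raw_tb - spare_node_tb) / copies * (1 - free_fraction)


for label, copies in (("RF2", 2), ("RF3", 3)):
    print(label, f"{usable_tb(100.0, 4.8, copies, 0.05):.2f} TB")
```

With these inputs RF3 comes out at roughly 30.15TB against 45.22TB for RF2.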

In reality, >95% of workloads should be deployed on RF2, with a very small number of VMs possibly using RF3. In reality RF2 is extremely resilient and self healing so IMO RF3 is rarely required.

So in conclusion, Nutanix usable capacity is ~50% of RAW capacity. The difference between Nutanix and traditional SAN/NAS is that you can actually use almost all of the “usable” capacity and maintain optimal performance with little/no complexity.

Nutanix also has data reduction technologies such as Compression and De-duplication, along with intelligent cloning to increase the effective capacity of the storage pool.

While I believe Nutanix’ usable capacity today is excellent especially when considering how resilient RF2 is and comparing usable capacity to many products on the market, Nutanix has the advantage of not being constrained by legacy technologies such as RAID, so I’ll leave you with a little teaser:

Usable capacity will be improving significantly in upcoming releases of Nutanix Operating System. :)

Calculating Actual Usable capacity? It’s not as simple as you might think! – Part 1 SAN/NAS

Calculating the usable capacity for your next SAN/NAS is easy. Work out the number of drives you have, what RAID config you’re going to use and you’re done, right?!

Wrong! There are numerous factors which come into play to understand the ACTUAL or TRUE usable capacity of a SAN/NAS solution.

So let’s take an example of a traditional SAN/NAS using RAID and work out how much space we can actually use.

Note this is a simplified and generic example, which will vary from vendor to vendor.

Let’s say a SAN/NAS has 100 x 1TB drives (Note: The type of drive is not important for this example) and has the requirement to support mixed workloads such as MS SQL , MS Exchange and general server workloads.

As per vendor best practices, RAID 10 is used to maximize IOPS for SQL / Oracle and other storage intensive applications, RAID 5 is used for things like MS Exchange and RAID 6 (or DP) is used for general server workloads.

The vendor also recommends one hot spare drive per 2 disk shelves to ensure when drives fail, there are sufficient hot spares available.

So let’s start with 100TB RAW and see where things end up.

1. Deducting hot spare drives

So assuming 14 drives per shelf, that’s 7 drives (or 7TB RAW) dedicated to hot spares.

100TB – 7TB = 93TB

2. RAID Overhead

Let’s assume 20% of our workloads require RAID 10, so 20 drives are used. RAID 10 has a usable capacity of 50% so 20TB – 50% = 10TB

Next let’s say 40% of our workloads use RAID 5, so 40 drives broken up into 5 x RAID 5s each with 8 drives in a 7+1 Parity configuration. Therefore with 5 x RAID 5 volumes we lose 5 drives (5TB RAW) worth of capacity.

The final 40% of our workloads use RAID 6 (or DP), so 40 drives broken up into 5 x RAID 6s each with 8 drives in a 6+2 Parity configuration. Therefore with 5 x RAID 6s we lose 10 drives (10TB RAW) worth of capacity.

93TB – 10TB (RAID 10) – 5TB (RAID5) – 10TB (RAID6) = 68TB remaining
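The hot spare and RAID deductions so far can be checked with a few lines. The helper name is hypothetical; the drive counts and group layouts are the ones from the example.

```python
DRIVE_TB = 1  # the example uses 100 x 1TB drives


def raid_parity_loss_tb(groups: int, parity_drives_per_group: int) -> int:
    """Capacity consumed by parity across a number of identical RAID groups."""
    return groups * parity_drives_per_group * DRIVE_TB


raw_tb = 100 * DRIVE_TB
hot_spares_tb = 7 * DRIVE_TB
raid10_loss_tb = 20 * DRIVE_TB // 2  # mirroring halves the 20 RAID 10 drives
raid5_loss_tb = raid_parity_loss_tb(groups=5, parity_drives_per_group=1)  # 5 x (7+1)
raid6_loss_tb = raid_parity_loss_tb(groups=5, parity_drives_per_group=2)  # 5 x (6+2)

remaining_tb = raw_tb - hot_spares_tb - raid10_loss_tb - raid5_loss_tb - raid6_loss_tb
print(remaining_tb)  # 68
```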

3. Free Space on the platform required to ensure performance

For most traditional storage solutions, the vendors recommend ensuring a specific percentage of free space to ensure performance remains consistent.

For some vendors this is 20% and others say around 30%.

For this example, I will assume best case scenario of 20%.

68TB – 20% (Free space for performance) = 54.4TB

4. Free space per LUN

Vendors typically recommend having between 10-20% free space per LUN to account for unexpected growth, VM level snapshots etc. This makes perfect sense as if a LUN runs out of space, it’s a bad day for the I.T. dept.

For this example, I will assume only 10% free space per LUN but it could easily be 20% further reducing usable capacity.

54.4TB – 10% (Free space per LUN) = 48.96TB

5. Free space per VMDK

As with physical servers, we don’t want our VMs drives running out of capacity, as a result it is common to size VMDKs well above what is strictly required to make capacity management (operational tasks) easier.

I typically see architects recommending upwards of 10-20% free space per VMDK over and above what is required to account for unexpected growth, OS patching etc. This makes perfect sense for the same reason as we have free space per LUN because if space runs out for a VM, it’s another bad day for I.T.

For this example, I will assume only 10% free space per VMDK.

48.96TB – 10% (Free space per VMDK) = 44.064TB
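Steps 3 to 5 are successive percentage deductions, each applied to what remains after the previous one. A minimal sketch, with a hypothetical helper name and the percentages assumed in the text:

```python
def apply_free_space_reserves(capacity_tb: float, reserves: list[tuple[str, float]]) -> float:
    """Deduct each reserve in turn as a fraction of the capacity that is left."""
    for label, fraction in reserves:
        capacity_tb *= 1 - fraction
        print(f"after {label}: {capacity_tb:.3f} TB")
    return capacity_tb


reserves = [
    ("platform free space (20%)", 0.20),
    ("free space per LUN (10%)", 0.10),
    ("free space per VMDK (10%)", 0.10),
]
result_tb = apply_free_space_reserves(68.0, reserves)
```

Because the reserves compound, three modest-looking percentages remove about 35% of the 68TB that survived RAID.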

Now where are we at?

So far, the first 5 points are fairly easy to calculate, and whether or not you agree with the specific examples or percentage deductions, I’d suggest few would disagree these are factors which reduce usable disk space for traditional SAN/NAS deployments.

Next we will look at various factors which further reduce usable capacity. Each of these factors will vary from customer to customer, which further complicates the sizing exercise and results in lower usable capacity than you may believe.

6. Silos for Performance

In this example, we have assumed only 20% of our drives are configured for high I/O with RAID 10, but in many cases the drives required for performance could be a much higher percentage.

Now to get the IOPS required for these storage intensive applications, it’s common to see the capacity utilization of the LUNs be much lower than the usable capacity because the storage is IOPS constrained, not capacity constrained.

This leads to Silos of drives with low utilization, where the remaining capacity cannot (or at least should not) be shared with other VMs as this would likely impact the performance of the IO intensive VMs.

So for example, if our RAID 10 LUNs have 50% free space (which I personally have found to be common) then we’re effectively wasting 5TB (50% of the RAID 10s 10TB usable).

44.064TB – 10% (Wasted Capacity for Performance Silos) = 39.65TB

7. Silos of (or Fragmented) Usable Capacity

In this example, we have assumed 40% of our drives are configured for RAID 5 and the remaining 40% for RAID 6 (DP) to suit the different workloads in this environment, as a result we have 2 “Silos” of usable capacity.

In this post I have described 5 x 8 drive RAID 5s and 5 x 8 Drive RAID 6 volumes. The below diagram is an example of what an environment in this configuration may have with regards to free space per LUN.

LUNsFreeSpace

So we can see the average free space per LUN is 20%, but it varies from one LUN having only 5% free space and another having 35%.

In this case, when creating a new VM, or adding or expanding VMDKs for existing VMs, we have a situation where we will need to be careful about where we place a new VMDK from a capacity perspective but keeping in mind performance as well.

Now not all VMs or VMDKs are the same size, so if a new VMDK needs to be 500GB even though the environment may have well in excess of 500GB available, the fact that the free space is fragmented across multiple LUNs means we cannot create the new VMDK without first migrating VMs across the LUNs.
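The fragmentation problem is easy to demonstrate. In the sketch below the LUN names and free space values are made up for illustration; the 500GB VMDK is the one from the example.

```python
def luns_that_fit(lun_free_gb: dict[str, int], vmdk_gb: int) -> list[str]:
    """Return the LUNs with enough free space to hold the whole VMDK."""
    return [name for name, free in lun_free_gb.items() if free >= vmdk_gb]


# Hypothetical free space per LUN, in GB.
lun_free_gb = {"LUN-1": 100, "LUN-2": 350, "LUN-3": 400, "LUN-4": 250}

total_free_gb = sum(lun_free_gb.values())
print(total_free_gb)                    # 1100, more than double the 500GB needed
print(luns_that_fit(lun_free_gb, 500))  # [], no single LUN can take the VMDK
```

The environment has plenty of free space in aggregate, yet the VMDK cannot be created until something is migrated.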

Now Storage DRS can do a reasonable job of this, but that takes time and impacts performance (during the Storage vMotion) and depending on the size of the VMs in the environment may not always be able to solve the issue.

Best case scenario, in my experience is at least 10% of capacity is wasted simply because of the fact the drives are carved up into RAID groups and VMs don’t fit within the inflexible LUNs.

39.65TB – 10% (Wasted Capacity due to fragmented free space) = 35.68TB

Usable space so far from 100TB RAW is only 35.68TB or approx 1/3rd!
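For completeness, steps 6 and 7 as a sketch, starting from the 44.064TB left after step 5. The variable names are hypothetical and both 10% figures are the estimates used in the text.

```python
after_step_5_tb = 44.064
performance_silo_waste = 0.10  # step 6: capacity stranded in IOPS-bound RAID 10 LUNs
fragmentation_waste = 0.10     # step 7: free space scattered across inflexible LUNs

after_step_6_tb = after_step_5_tb * (1 - performance_silo_waste)
after_step_7_tb = after_step_6_tb * (1 - fragmentation_waste)

print(f"{after_step_6_tb:.2f} TB, {after_step_7_tb:.2f} TB")
print(f"{after_step_7_tb / 100:.0%} of the 100TB RAW")
```

Carried through without intermediate rounding, the result is about 35.69TB; the 35.68TB quoted above comes from using the truncated 39.65TB figure for step 6. Either way it is roughly one third of RAW.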

Other factors which reduce usable capacity?

8. LUN Provisioning Type

In many cases, especially when talking about high performance applications, storage vendors recommend using Thick Provisioned LUNs.

As a result limited or no overcommitment can be achieved which reduces the usable capacity due to the thick provisioning.

It’s anyone’s guess how much space is wasted as a result.

Summary:

From the 100TB RAW, factoring in what I believe to be a realistic configuration of RAID, the impact of free space requirements, thick provisioning and capacity fragmentation, we end up with only 35.68TB usable capacity or approx 1/3rd of the RAW.

Now most vendors provide some form of data reduction such as compression/de-duplication, others recommend some thin provisioning, and these may increase the effective capacity. But this example shows it’s not as simple as you think to size for SAN/NAS storage, and the overhead of RAID is only one of the many factors which impact the effective usable capacity.

In Part 2, I will run through a similar example for Nutanix usable capacity.