Nutanix NC2 – Direct to Cloud Value – Part 1

Nutanix “Cloud Clusters”, a.k.a. “NC2”, was designed to enable customers to quickly and easily migrate from on-premises environments into public cloud providers such as Amazon EC2 and Microsoft Azure, and to benefit from these offerings’ long list of business, architectural and technical advantages.

Two of the advantages which stand out to me are:

  1. Well understood standard architecture
  2. Global availability

The well understood architecture of both the Amazon and Azure offerings is incredibly valuable as organisations are largely protected from the underestimated cost & impact of “tribal knowledge” being lost with inevitable employee turnover.

The standard architecture is also highly valuable as both Microsoft and Amazon have extensive training and certification programmes to ensure customers can validate skills of potential employees and enable their existing staff.

The standard architecture also reduces many of the risks & costs of bespoke or custom designed environments, where it’s almost impossible for customers (and even many vendors) to match the architectural and engineering rigour that large companies such as Microsoft and Amazon can invest thanks to their incredible scale.

Turning to the global availability of resources from both Amazon and Azure, this is extremely attractive as it enables customers to deploy almost anywhere in the world in a timely manner, reducing the risk of supply chain issues delaying projects and/or delaying the restoration of resiliency to production environments after hardware failure/s.

Let’s switch gears and look at NC2 and where it fits in.

For a long time now, I’ve been championing Nutanix HCI as the “standard platform for all workloads”, as it allows customers to benefit from a well understood architecture which simplifies the traditional datacenter. This reduces the risks & costs of bespoke or custom designed environments, and Nutanix training programmes ensure existing and future staff have, or can develop, the required skills.

The problem with any on-premises focused product, however, is that it is constrained by commercial challenges such as CAPEX and supply chain, as well as organisational challenges such as change requests/approvals/windows, and ultimately all of these negatively impact the time to value no matter how simple the deployment is once the equipment is delivered.

With the introduction of NC2, customers benefit from the best of all worlds: Nutanix NC2 as the standard platform for all workloads, spanning new/existing on-premises deployments and public cloud providers such as Amazon and Azure.

Leveraging NC2 on Amazon and Azure effectively eliminates the commercial challenges (CAPEX & supply chain) and ensures the fastest possible time to value, with new NC2 environments able to be deployed in under 60 minutes. This partnership also puts true global reach at customers’ fingertips.

The ability to scale resources in minutes, providing increased performance & capacity to all workloads cluster-wide, is also extremely valuable.

Summary

Nutanix NC2 provides a highly complementary offering to AWS and Azure which enables customers to enjoy a simple, standard platform for all workloads across private and multiple public cloud providers, and even to operate across and migrate/fail over between providers.

NC2 can also deliver higher performance and increased resiliency (business continuity) with lower risk, typically at a lower total cost of ownership (TCO), while providing a genuine and relatively simple public cloud provider exit strategy.

In Part 2 we will dive into a detailed cost comparison of NC2.

A TCO Analysis of Pure FlashStack & Nutanix Enterprise Cloud

In helping to prepare this TCO with Steve Kaplan here at Nutanix, I’ll be honest and say I was a little surprised at the results.

The Nutanix Enterprise Cloud platform is the leading solution in the HCI space, and while it aims to deliver great business outcomes and minimise CAPEX, OPEX and TCO, the platform is not designed to be “cheap”.

Nutanix is more like the top-of-the-range model from a car manufacturer, with options for different customer requirements ranging from high-end business critical application deployments to lower-end products for ROBO, such as the Nutanix Xpress model.

Steve and I agreed that our TCO report needed to give the benefit of the doubt to Pure Storage, as we do not claim to be experts in their specific storage technology. We also decided that, as experts in the Nutanix Enterprise Cloud platform and employees of Nutanix, we should minimise the potential for our biases towards Nutanix to come into play.

The way we tried to achieve the most unbiased view possible was to give no benefit of the doubt to the Nutanix Enterprise Cloud solution. While we both know the value that many Nutanix capabilities deliver (such as data reduction), we excluded these benefits and used configurations which could be argued to be excessive/unnecessary, such as vSphere or RF3 for data protection:

  1. No data reduction is assumed (Compression or Deduplication)
  2. No advantage for data locality in terms of reduced networking requirements or increased performance
  3. Only 20K IOPS @ 32K IO Size per All Flash Node
  4. Resiliency Factor 3 (RF3) data protection, which tolerates two concurrent failures but is the least capacity efficient configuration and therefore requires more hardware (see the capacity sketch after this list).
  5. No Erasure Coding (EC-X) meaning higher overheads for data protection.
  6. The CVM is measured as an overhead with no performance advantage assumed (e.g.: Lower latency, Higher CPU efficiency from low latency, Data Locality etc)
  7. Using vSphere which means Nutanix cannot take advantage of AHV Turbo Mode for higher performance & lower overheads
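
To put point 4 in perspective, here is a rough sketch of usable capacity under RF2 versus RF3, under the simplifying assumption of no data reduction and ignoring CVM overheads and capacity reserved for rebuilds (the 100TB raw figure is purely illustrative):

```python
# Rough usable-capacity sketch for Nutanix Resiliency Factors, ignoring
# data reduction, CVM overheads and capacity reserved for rebuilds.
# The 100TB raw figure is illustrative only.

raw_tb = 100
rf2_usable_tb = raw_tb / 2  # RF2 keeps two copies of every write (tolerates one failure)
rf3_usable_tb = raw_tb / 3  # RF3 keeps three copies (tolerates two concurrent failures)

print(f"RF2: {rf2_usable_tb:.1f}TB usable from {raw_tb}TB raw")
print(f"RF3: {rf3_usable_tb:.1f}TB usable from {raw_tb}TB raw")
```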

On the other hand, the benefit of the doubt has been given to Pure Storage at every opportunity in this comparison including the following:

  1. 4:1 data reduction efficiency as claimed
  2. Only 2 x 10Gb NICs required for VM and Storage traffic
  3. No dedicated FC switches or cables (same as Nutanix)
  4. 100% of claimed performance (IOPS capability) for M20, M50 and M70 models
  5. Zero cost for the project/change control/hands on work to swap Controllers as the solution scales
  6. IOPS based on the Pure Storage claimed average I/O size of 32K for all IO calculations

We invited DeepStorage and Vaughn Stewart of Pure Storage to discuss the TCO and help validate our assumptions, pricing, sizing and other details. Both parties declined.

Feedback/corrections regarding the Pure Storage sponsored technical report by DeepStorage were sent via email. DeepStorage declined to discuss the issues, and the report remains online with many factual errors and an array (pun intended) of misleading statements, which I covered in detail in my Response to: DeepStorage.net Exploring the true cost of Converged vs Hyperconverged Infrastructure.

It’s important to note that the Nutanix TCO report is based on the node configuration chosen by DeepStorage, with only one difference: Nutanix sized for the same usable capacity but went with an All Flash solution, because comparing hybrid and all flash is apples and oranges and a pointless comparison.

With that said, the configuration DeepStorage chose does not reflect an optimally designed Nutanix solution. An optimal design would likely use fewer nodes, with 14c or 18c processors to match the high RAM configuration (512GB) and lower capacity SSDs (such as 1.2TB or 1.6TB), which would deliver the same performance and still meet the capacity requirements, resulting in a further advantage in CAPEX, OPEX and TCO (Total Cost of Ownership).

The TCO shows that the CAPEX is typically in favour of the Nutanix all flash solution. We have chosen to show the costs at different stages of scaling from 4 to 32 nodes, the same as the DeepStorage report. The FlashStack product had slightly lower CAPEX on a few occasions, which is not surprising and also not something we tried to hide in order to make Nutanix always look cheaper.

One thing which was somewhat surprising is that even with the top-of-the-range Pure M70 controllers and a relatively low assumption of 250 IOPS per VM, above 24 nodes the Pure system could not support the required IOPS and an additional M20 needed to be added to the solution. What was not surprising is that when an additional pair of controllers and SSDs is added to the FlashStack solution, the Nutanix solution has vastly lower CAPEX/OPEX and, of course, TCO. However, I wanted to show what the figures look like if we assume IOPS is not a constraint for Pure FlashStack, as could be the case in some customer environments, since customer requirements vary.

[Figure: Pure FlashStack vs Nutanix comparison with lower IOPS]

What we see above is that the difference in CAPEX is still just 14.0863% at 28 nodes and 13.1272% at 32 nodes, in favour of Pure FlashStack.

The TCO, however, is still in favour of Nutanix: by 8.88229% at 28 nodes and by 9.70447% at 32 nodes.

If we talk about system performance capabilities, the Nutanix platform is never constrained by IOPS due to its scale-out architecture.

Based on Pure Storage’s advertised performance and a conservative 20K IOPS (@ 32K) per Nutanix node, we see (below) that Nutanix IO capability is always ahead of Pure FlashStack, with the exception of a 4 node solution under our conservative IO assumptions. In the real world, even if Nutanix were only capable of 20K IOPS per node, the platform vastly exceeds the requirements in this example (and, in my experience, in real world solutions) even at 4 node scale.

[Figure: Pure FlashStack vs Nutanix IOPS capability]
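
To make the scale-out argument concrete, here is a minimal sketch comparing required versus deliverable IOPS as a cluster grows. The 20K IOPS per node and 250 IOPS per VM figures are the assumptions used above; the VM density and the dual-controller ceiling are hypothetical values for illustration only, not published Pure Storage specifications:

```python
# Minimal sketch of the scale-out IOPS argument. The 20K IOPS (@32K) per
# Nutanix node and 250 IOPS per VM figures are the report's assumptions;
# the VM density and the dual-controller ceiling are hypothetical values
# for illustration, not published specifications.

NUTANIX_IOPS_PER_NODE = 20_000  # conservative per-node assumption
IOPS_PER_VM = 250               # per-VM requirement used in the comparison
VMS_PER_NODE = 100              # hypothetical VM density
ARRAY_IOPS_CEILING = 300_000    # hypothetical fixed dual-controller ceiling

for nodes in (4, 8, 16, 24, 28, 32):
    required = nodes * VMS_PER_NODE * IOPS_PER_VM
    nutanix = nodes * NUTANIX_IOPS_PER_NODE  # grows linearly as nodes are added
    array = ARRAY_IOPS_CEILING               # fixed until controllers are added/replaced
    print(f"{nodes:>2} nodes: required {required:>9,} | "
          f"Nutanix {nutanix:>9,} | dual-controller array {array:>9,}")
```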

I’ve learned a lot, as well as re-validated some things I’ve previously discovered, from the exercise of contributing to this Total Cost of Ownership (TCO) analysis.

Some of the key conclusions are:

  1. In many real world scenarios, data reduction is not required to achieve a lower TCO than a competing product which leverages data reduction.
  2. Even the latest/greatest dual controller SANs still suffer from the same problems as legacy storage when it comes to scaling to support capacity/IO requirements.
  3. The ability to scale without rip/replace of storage controllers greatly simplifies sizing for customers.
  4. Nutanix has a strong advantage in Power, Cooling, Rack Space and therefore helps avoid additional datacenter related costs.
  5. Even the top of the range All Flash array from arguably the top vendor in the market (Pure Storage) cannot match the performance (IOPS or throughput) of Nutanix.

The final point I would like to make is that the biggest factor dictating the cost of any platform, be it CAPEX, OPEX or TCO, is the requirements, constraints, risks and assumptions. Without these, and a detailed TCO, any discussion of cost has no basis and should be disregarded.

In our TCO, we have detailed the requirements, which are in line with the DeepStorage report but go further to give the solution context. The Nutanix TCO report covers the high level requirements and assumptions in the Use Case Descriptions.

Without further ado, here is the link to the Total Cost of Ownership comparison between Pure FlashStack and Nutanix Enterprise Cloud platform along with the analysis by Steve Kaplan.

Sizing infrastructure based on vendor Data Reduction assumptions – Part 1

One of the most common mistakes people make when designing solutions is making assumptions. Assumptions, in short, are things an architect has failed to investigate and/or validate, which puts a project at risk of not delivering the desired business outcome/s.

A great example of a really bad assumption to make is what data reduction ratio a storage platform will deliver.

But what if a vendor offers a data reduction guarantee and promises to give you as much equipment as required if the ratio is not achieved? You’re protected, right? The risk of your assumption being wrong is mitigated by the promise of free storage. Hooray!

Let’s explore this for a minute using an example of one of the more ludicrous guarantees going around the industry at the moment:

A guarantee of 10:1 data reduction!

Let’s say we have 100TB of data; that means we’d only need 10TB, right? This might only be, say, 4RU of equipment, which sounds great!

After deployment, we start migrating and we only get a more realistic 2:1 data reduction, at which point the project stalls due to lack of capacity.

I go back to the vendor and, let’s say best case scenario, they agree on the spot (HA!) to give you more equipment; it’s still unlikely to be delivered in less than 4 weeks.

So your project is delayed a minimum of 4 weeks until the equipment arrives. You now need to go through your change control process and if you’re doing this properly it would be documented with detailed steps on how to install the equipment, including appropriate back out strategies in the event of issues.

Typically, change control takes some time to prepare, go through approvals, documentation etc., especially in larger mission critical environments.

When installing any equipment you should also have documented operational verification steps to ensure the equipment has been installed correctly and is highly available, performing as expected etc.

Now that the new equipment is installed, the project continues and all 100TB of your data has been migrated to the new platform. Hooray!

Now let’s talk about the ongoing implications of the assumption of 10:1 data reduction only resulting in a much more realistic 2:1 ratio.

We now have 5x more equipment than we expected, so assuming the original 10TB was 4RU, we now have 20RU of equipment which is taking up valuable real estate in our datacenter, or which may even have required us to lease another rack.
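
The arithmetic behind this example is simple enough to sketch out; the figures below are the ones used above (100TB of data, a 10:1 assumption versus a 2:1 reality, and roughly 4RU per 10TB of usable capacity):

```python
# Sketch of the capacity/rack-space arithmetic from the example above.

logical_data_tb = 100        # data to be stored
assumed_ratio = 10           # vendor "guaranteed" 10:1 data reduction
actual_ratio = 2             # more realistic 2:1 data reduction
ru_per_10tb_usable = 4       # example assumption: ~10TB usable per 4RU

planned_tb = logical_data_tb / assumed_ratio  # 10TB
actual_tb = logical_data_tb / actual_ratio    # 50TB
shortfall_factor = actual_tb / planned_tb     # 5x more equipment than planned

planned_ru = ru_per_10tb_usable * planned_tb / 10  # 4RU
actual_ru = ru_per_10tb_usable * actual_tb / 10    # 20RU

print(f"Planned: {planned_tb:.0f}TB usable in ~{planned_ru:.0f}RU")
print(f"Actual:  {actual_tb:.0f}TB usable in ~{actual_ru:.0f}RU "
      f"({shortfall_factor:.0f}x the equipment)")
```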

If the product you purchased was a SAN/NAS, you now have lower IOPS/GB as you have just added a bunch more disk shelves to the existing controllers. This is because the controllers have a finite amount of performance, and you’ve just added more drives for them to manage. More drives on a traditional two controller SAN/NAS is only a good thing if the controllers are not maxed out, and with flash ever increasing in performance, the controllers will be (assuming they are not already) the bottleneck.
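
As a quick illustration of the IOPS/GB point, here is a sketch using an entirely hypothetical controller ceiling (real figures vary by array and workload):

```python
# IOPS/GB dilution on a dual-controller array: adding shelves grows capacity,
# but the controller pair's performance ceiling stays the same.
# The ceiling below is a hypothetical figure for illustration only.

CONTROLLER_PAIR_IOPS = 300_000

for usable_tb in (10, 20, 50):  # capacity before and after adding shelves
    iops_per_gb = CONTROLLER_PAIR_IOPS / (usable_tb * 1000)
    print(f"{usable_tb:>3}TB usable -> {iops_per_gb:5.1f} IOPS/GB")
```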

If the product was HCI, you now require considerably more network interfaces and, depending on the HCI platform, you may require more hypervisor licensing, further increasing CAPEX and OPEX.

Depending on the HCI product, can you even utilise the additional storage without changing the virtual machines’ configuration? It might sound silly, but some products don’t distribute data throughout the cluster, instead using mirrored objects, so you may even need to create more virtual disks or redistribute the VMs to make use of the new capacity.

Then you need to consider if the HCI product has any scale limitations, as these may require you to redesign your solution.

What about operational expenses? We now have 5x more equipment, so our environmental costs such as power & cooling will increase significantly, as will our maintenance windows, since in the case of HCI we now have to patch 5x more hypervisor nodes.

Typically, customers no longer size for 3-5 years up front now that HCI is becoming the platform of choice over SAN/NAS. This is great, but when your data reduction assumption is wrong (in this example, off by 5x), the ongoing impact is enormous.

This means as you scale, you need to scale at 5x the rate you originally designed for. That’s 5x more rack units (RU), 5x more power, 5x more cooling, and potentially even 5x more hypervisor licensing.

What does all of this mean?

Your Total Cost of Ownership (TCO) and Return on Investment (ROI) goes out the window!

Interestingly, Nutanix recently considered offering a data reduction guarantee and I was one of many who objected and strongly recommended we not drop to the levels of other vendors just because it makes the sales cycle easier.

All of the reasons above (and more) were put to Nutanix product management, and they made the right decision. Even though Nutanix data reduction (and avoidance) is very strong, we did not want to put customers in a position where their business outcomes were potentially at risk due to assumptions.

Summary:

While data reduction is a valuable part of a storage platform, the benefits (data reduction ratio) can and do vary significantly between customers and datasets. Making assumptions about data reduction ratios, even when vendors provide lots of data showing their averages and offer guarantees, does not protect you from potentially serious problems if the ratios are not achieved.

In Part 2, I will go through an example of how misleading data reduction guarantees can be.