Benchmark(et)ing Nonsense IOPS Comparisons, if you insist – Nutanix AOS 4.6 outperforms VSAN 6.2

As many of you know, I’ve taken a stand with many other storage professionals to try to educate the industry that peak performance is vastly different to real world performance. I covered this in a post titled: Peak Performance vs Real World Performance.

I have also given a specific example of Peak Performance vs Real World Performance with a Business Critical Application (MS Exchange), where I demonstrate that the first and most significant constraining factor for Exchange performance is compute (CPU/RAM). As a result, achieving more IOPS is unnecessary to achieve the business outcome (which is supporting a given number of Exchange mailboxes/messages per day).

However, vendors (all of them) who offer products which provide storage, whether as a component (such as in HCI) or as a dedicated storage offering, continue to promote peak performance numbers. They do this because the industry as a whole has promoted, and continues to promote, these numbers as if they were relevant, with vendors trying to one-up each other with nonsense comparisons.

VMware and the EMC federation have made a lot of noise around In-Kernel delivering better performance than Software Defined Storage running within a VM, which is referred to by some as a VSA (Virtual Storage Appliance). At the same time, the same companies/people are recommending business critical applications (vBCA) be virtualized. This is a clear contradiction, as I explain in an article I wrote titled In-Kernel versus Virtual Storage Appliance, which in short concludes by saying:

…a high performance (1M+ IOPS) solution can be delivered both In-Kernel or via a VSA, it's as simple as that. We are long past the days where a VM was a significant bottleneck (circa 2004 w/ ESX 2.x).

I stand by this statement, and the in-kernel vs VSA debate is another example of a nonsense comparison which has little/no relevance in the real world. I will now (reluctantly and quickly) cover off some marketing numbers before getting to the point of this post.

VMware VSAN 6.2

Firstly, congratulations to VMware on this release. I believe you now have a minimum viable product, thanks to the introduction of software based checksums, which are essential for any storage platform.

VMW Claim One: For the VSAN 6.2 release, “delivering over 6M IOPS with an all-flash architecture”

The basic math for a 64 node cluster works out to ~93,750 IOPS per node, but as I have seen a benchmark from Intel showing 6.7 million IOPS for a 64 node cluster, let's give VMware the benefit of the doubt and assume an even 7M IOPS, which equates to 109,375 IOPS per node.

Reference: VMware Virtual SAN Datasheet
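For clarity, the per-node figures above are just the claimed cluster totals divided by 64 nodes. A quick sketch of that arithmetic (using the 6M claimed and 7M benefit-of-the-doubt figures from above):

```python
# Per-node IOPS implied by the cluster-level claims above
nodes = 64

claimed_cluster_iops = 6_000_000   # VMware's ">6M IOPS" claim
assumed_cluster_iops = 7_000_000   # benefit-of-the-doubt figure based on the Intel benchmark

print(claimed_cluster_iops / nodes)  # 93750.0 IOPS per node
print(assumed_cluster_iops / nodes)  # 109375.0 IOPS per node
```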

VMW Claim Two: Highest Performance >100K IOPS per node

The graphic below (pulled directly from VMware’s website) shows their performance claims of >100K IOPS per node and >6 Million IOPS per cluster.

Reference: Introducing you to the 4th Generation Virtual SAN

Now what about Nutanix Distributed Storage Fabric (NDSF) & Acropolis Operating System (AOS) 4.6?

We're now at the point where the hardware is becoming the bottleneck, as we are saturating the performance of physical Intel S3700 enterprise-grade solid state drives (SSDs) on many of our hybrid nodes. As such, we have moved on to performance testing of our NX-9460-G4 model, which has 4 nodes running Haswell CPUs and 6 x Intel S3700 SSDs per node, all in 2RU.

With AOS 4.6 running ESXi 6.0 on an NX-9460-G4 (4 x NX-9040-G4 nodes), Nutanix is seeing in excess of 150K IOPS per node, which is 600K IOPS per 2RU (Nutanix Block).

The below graph shows performance per node and how the solution scales in terms of performance up to a 4 node / 1 block solution which fits within 2RU.

[Graph: AOS 4.6 IOPS per node, scaling up to 4 nodes / 1 block within 2RU]

So Nutanix AOS 4.6 provides approx. 36% higher performance than VSAN 6.2.

(>150K IOPS per NX-9040-G4 node compared to <=110K IOPS for an all-flash VSAN 6.2 node)
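The ~36% figure is simply the ratio of those two per-node numbers; a quick sketch of the arithmetic:

```python
# Ratio behind the "approx. 36% higher" statement
nutanix_iops_per_node = 150_000   # >150K IOPS per NX-9040-G4 node (AOS 4.6)
vsan_iops_per_node = 110_000      # <=110K IOPS per all-flash VSAN 6.2 node

advantage_pct = (nutanix_iops_per_node / vsan_iops_per_node - 1) * 100
print(f"~{advantage_pct:.0f}% higher")  # ~36% higher
```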

It should be noted that the above Nutanix performance numbers have already been improved upon in upcoming releases currently going through performance engineering and QA, so this is far from the best you will see.

But wait, there's more!

Enough with the nonsense marketing numbers! Let’s get to the point of the post:

These 4k 100% random read IOPS (and similar) tests are totally unrealistic.

Even if these 4k IOPS tests were realistic, to quote my previous article:

Peak performance is rarely a significant factor for a storage solution.

More importantly, SO WHAT if Vendor A (in this case Nutanix) has higher peak performance than Vendor B (in this case VSAN)!

What matters is customer business outcomes, not benchmark(eting)!


Wait a minute, the vendor with the higher performance is telling you peak performance doesn't matter, instead of bragging about it and trying to make it sound important?

Yes, you are reading that correctly: no one should care who has the highest unrealistic benchmark!

I wrote Things to Consider When Choosing Infrastructure a while back to highlight that choosing the "Best of Breed" for every workload may not be a good overall strategy, as it will require management of multiple silos, which leads to inefficiency and increased costs.

The key point is that if you can meet all the customer requirements (e.g.: performance) with a standard platform while working within constraints such as budget, power, cooling, rack space and time to value, you're doing yourself (or your customer) a disservice by not considering a standard platform for your workloads. So if Vendor X has 10% faster performance (even for your specific workload) than Vendor Y, but Vendor Y still meets your requirements, performance shouldn't be a significant consideration when choosing a product.

Both VSAN and Nutanix are software defined storage, and I expect both will continue to rapidly improve performance through tuning done completely in software. If we were talking about a product which is dependent on offloading to hardware, then sure, performance comparisons would be relevant for longer, but VSAN and Nutanix are both 100% software and can/do improve performance in software with every release.

In 3 months, VSAN might be slightly faster. Then 3 months later Nutanix will overtake them again. In reality, peak performance rarely if ever impacts real world customer deployments, and with scale out solutions it's even less relevant, as you can simply scale out.

If a solution can't scale, or only does so in 2-node mirror-type configurations, then peak performance is a much more critical consideration. I'd suggest that if you're looking at this (legacy) style of product, you have bigger issues.

Not only does performance in the software defined storage world change rapidly, so does the performance of the underlying commodity hardware, such as CPUs and SSDs. This is why it's important to consider products (like VSAN and Nutanix) that are not dependent on proprietary hardware, as hardware eventually becomes a constraint. This is why the world is moving towards software defined storage, networking, etc.

If more performance is required, the ability to add new nodes, form a heterogeneous cluster and distribute data evenly across the cluster (as NDSF does) is vastly more important than the peak IOPS difference between two products.

While you might think that this blog post is a direct attack on HCI vendors, the same principle holds true for any hardware or storage vendor out there. It is only a matter of time before customers stop getting trapped in benchmark(et)ing wars. They will instead identify their real requirements and readily embrace the overall value of dramatically simple on-premises infrastructure.

In my opinion, Nutanix is miles ahead of the competition in terms of value, flexibility, operational benefits, product maturity and market-leading customer service, all of which matter far more than peak performance (where Nutanix leads anyway).

Summary:

  1. Focus on what matters and determine whether or not a solution delivers the required business outcomes. Hint: This is rarely just a matter of MOAR IOPS!
  2. Don’t waste your time in benchmark(et)ing wars or proof of concept bake offs.
  3. Nutanix AOS 4.6 outperforms VSAN 6.2
  4. A VSA can outperform an in-kernel SDS product, so let's put that in-kernel vs VSA nonsense to rest.
  5. Peak performance benchmarks still don't matter, even when the vendor I work for has the highest performance. (a.k.a. My opinion doesn't change based on my employer's current product capabilities.)
  6. ALL storage vendors should stop with the peak IOPS nonsense marketing.
  7. Software-defined storage products like Nutanix and VSAN continue to rapidly improve performance, so comparisons are outdated soon after publication.
  8. Products dependent upon proprietary hardware are not the future.
  9. Put a high focus on the quality of vendors' support.

Related Articles:

  1. Peak Performance vs Real World Performance
  2. Peak performance vs Real World – Exchange on Nutanix Acropolis Hypervisor (AHV)
  3. The Key to performance is Consistency
  4. MS Exchange Performance – Nutanix vs VSAN 6.0
  5. Scaling to 1 Million IOPS and beyond linearly!
  6. Things to consider when choosing infrastructure.

The balancing act of choosing a solution based on Requirements & Budget Constraints

I am frequently asked how to architect solutions when you have constraint/s which prevent you from meeting requirements. It is not unusual for customers to expect that they need, and can get, the equivalent of a Porsche 911 Turbo for the price of a Toyota Corolla.

It’s also common for less experienced architects to focus straight away on budget constraints before going through a reasonable design phase and addressing requirements.

If you are constrained to the point you cannot afford a solution that comes close to meeting your requirements, the simple fact is that the customer is in pretty serious trouble no matter how you look at it. But this is rarely the case in my experience.

Typically customers simply need to sit down with an experienced architect and go through what business outcomes they want to achieve. Then the architect will ask a range of questions to help clarify the business goals and translate those into clearly defined Requirements.

I also find customers frequently think (or have been convinced by a vendor) that they need more than they actually do to achieve the outcomes, e.g.: thinking you need Active/Active datacenters with redundant 40Gb WAN links when all you need is VM High Availability and async replication to DR over 1Gb WAN links.

So my point here is always start with the following:

  • Step 1: What is the business problem/s the customer is trying to solve?

Until the customer, VAR, Solution Architect and vendor/s understand this and it is clearly documented, do not proceed any further. Without this information and a clear record of what needs to be achieved, the project will likely fail.

At this stage it's important to also understand any constraints such as Project Timelines, CAPEX budget and OPEX budget.

  • Step 2: Research potential solutions

Once you have a clear understanding of the problem and the desired outcome / requirements, then it's time to research (or engage a VAR to do this on your behalf) what potential products could provide a solution that addresses the business outcome, requirements etc.

  • Step 3: Provide the VAR and/or vendor/s with the detailed business problem/s, requirements and constraints.

This is where the customer needs to take some responsibility. A VAR or vendor who is kept in the dark is unlikely to be able to deliver anything close to the desired outcome. While customers don't like to give out information such as budget, without it, it just creates more work for everyone and ultimately drives up the cost to the VAR/vendor, which in turn gets passed on to customers.

With a detailed understanding of the desired business outcomes, requirements and constraints the VAR/vendor should provide a high level indicative solution proposal with details on CAPEX and ideally OPEX to show a TCO.

In many cases the lowest CAPEX solution has the highest OPEX which can mean a significantly higher TCO.
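As a trivial illustration of that point (the dollar figures below are purely hypothetical, just to show the mechanics), TCO is the CAPEX plus the OPEX over the expected life of the solution:

```python
# Hypothetical example: TCO = CAPEX + (annual OPEX * years of service)
years = 5

solution_a = {"capex": 500_000, "annual_opex": 120_000}  # lower CAPEX, higher OPEX
solution_b = {"capex": 700_000, "annual_opex": 40_000}   # higher CAPEX, lower OPEX

for name, s in (("A", solution_a), ("B", solution_b)):
    tco = s["capex"] + s["annual_opex"] * years
    print(f"Solution {name}: TCO over {years} years = ${tco:,}")

# Solution A: TCO over 5 years = $1,100,000
# Solution B: TCO over 5 years = $900,000
```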


  • Step 4: Customer evaluates High level Solution Proposal/s

The point here is to validate whether the proposed solution will provide the desired business outcome and whether it can do so within the budget. At this stage it is important to understand how the solution meets/exceeds each requirement and to be able to trace the design back to the business outcomes. This is why it's critical to document the requirements, so the high level design (and future detailed design) can address each criterion.

If the proposed solution/s do meet/exceed the business requirements, then the question is: do they fall within the allocated budget?

If so, great! Choose your preferred vendor solution and proceed.

If not, then a proposal needs to be put to the business for additional budget. If that is approved, again great and you can proceed.

If additional CAPEX/OPEX cannot be obtained then this is where the balancing act really gets interesting and an experienced architect will be of great value.

  • Step 5: Review business outcomes/requirements (and prioritise requirements!)

So let's say we have three requirements: R001, R002 and R003. All are important, but the simple fact at this point is the budget is insufficient to deliver them all.

This is where a good solution architect sets the customer's expectations as a trusted adviser. The expectation needs to be clearly set that the budget is insufficient to deliver all the desired business outcomes, and that priorities need to be set.

Sit with the customer and put all requirements into priority order, then it's back and forth with the vendors to provide a lower cost solution (CAPEX, OPEX or both), detailing the requirements which have and have not been met and any/all implications.

In my opinion the key in these situations is not to just "buy the best you can" but to document in detail what can and cannot be achieved, with detailed explanations of the implications. For example, an implication might be that the RPO goes from 1hr to 8hrs and the RTO for a business critical application is extended from 1hr to 4hrs. In day to day operations the customer would likely not notice the difference, but if/when an outage occurred, they would, and they need to be prepared. The cost of a single outage can in many cases be much more than the cost of the solution itself, which is why it's critical to document everything for the customer.
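To put a (purely hypothetical) number on that last point, the extra cost of a single outage under a relaxed RTO can be estimated as the additional downtime multiplied by the hourly cost of downtime to the business:

```python
# Hypothetical example: cost of a single outage vs savings from the cheaper option
hourly_cost_of_downtime = 50_000   # revenue/productivity impact per hour (hypothetical)
rto_cheaper_solution = 4           # hours to restore with the lower cost solution
rto_full_solution = 1              # hours to restore with the solution meeting all requirements

extra_outage_cost = (rto_cheaper_solution - rto_full_solution) * hourly_cost_of_downtime
print(f"Extra cost of one outage with the cheaper option: ${extra_outage_cost:,}")
# Extra cost of one outage with the cheaper option: $150,000
```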

In my experience, where the implications are significant and clearly understood by the customer (at both a technical and business level), it is not uncommon for customers to revisit the budget and come up with additional funds for the project. It is the job of the Solution Architect (who should always be the customer advocate) to ensure the customer understands what the solution can/cannot deliver.

Where the customer understands the implications (a.k.a. risks), it is important that they sign off on the risks prior to going with a solution that does not meet all the desired outcomes. The risks should be clearly documented, ideally with examples of what could happen in situations such as failure scenarios.

Basic Flowchart of the above described process:

The below is a basic flowchart showing a simplified process of gathering business requirements/outcomes through to an outcome.

Starting in the top left, we have the question "Are the business problem/s you are trying to solve and the requirements documented?"

From there it's a case of following the bouncing ball until we come to the dreaded "Is the solution/s within the budget constraints?" question. This area is coloured grey for a reason, as this is where the balancing act occurs and delays can happen (indicated in the flowchart with the delay symbol).

Once the customer fully understands what they are or are not getting AND IT IS FULLY DOCUMENTED, then and only then should a customer proceed (or a VAR/architect recommend proceeding) with a proposal.

[Flowchart: gathering business requirements/outcomes through to a signed-off solution]

The above is not an exhaustive process, but just something to inspire some thought next time you or your customers are constrained by budget.

Summary:

  • Don’t just “Buy what you can afford and hope for the best”
  • Document how requirements are (or are not) being met (That’s “Traceability” for all you VCDX/NPX candidates)
  • Document any risks/implications and mitigation strategies
  • Get sign off on any requirements which are not met
  • Always provide the customer with options to choose from; don't assume they won't invest more to address risks to the business
  • If you can’t meet all requirements Day 1, ensure the design is scalable. A.k.a Start small and scale if/when required.
  • Avoid low cost solutions that will likely end up having to be thrown out


Why Nutanix Acropolis hypervisor (AHV) is the next generation hypervisor – Part 10 – Cost

You may be surprised that cost is so far down the list, but as you have probably realized by reading the previous 9 parts, AHV is in many ways a superior virtualization platform to other products on the market. In my opinion, it would be a mistake to think AHV is a "low-cost option" or "a commodity hypervisor with limited capabilities" just because it happens to be included with Starter Edition (making it effectively free for all Nutanix customers).

Apart from the obvious removal of hypervisor and associated management component licensing/ELA costs, the real cost advantage of using AHV is the dramatic reduction in effort required in the design, implementation and operational verification phases, as well as in ongoing management.

This is due to many factors:

Simplified Design Phase

As all AHV management components are in-built, highly available and auto scaling, there is no need to engage a Subject Matter Expert (SME) to design the management solution. As a person who has designed countless highly available virtualization solutions over the years, I can tell you that AHV out of the box is what I could only dream of building with other products for customers in the past.

Simplified Implementation Phase

All management components (with the exception of Prism Central) are deployed automatically, removing the requirement for an engineer to install/patch/harden these components.

Building Acropolis and all management components into the CVM means there are fewer moving parts that can go wrong, and therefore fewer that need to be verified.

In my experience, Operational Verification is one of the areas regularly overlooked, and infrastructure is put into production without having proven it meets the design requirements and outcomes. With AHV management components deployed automatically, the risk of components not delivering is all but eliminated, and where Operational Verification is performed, it can be completed much faster than with traditional products due to having far fewer moving parts.

Simplified ongoing operations

Acropolis provides One-Click fully automated rolling upgrades for Acropolis Base Software (formerly known as NOS), the Acropolis Hypervisor, firmware and Nutanix Cluster Check (NCC). In addition, upgrades can be automatically downloaded, removing the risk of installing incompatible versions and the requirement to check things such as Hardware Compatibility Lists (HCLs) and interoperability matrices before upgrades.

AHV dramatically simplifies Capacity management by only requiring capacity management to be done at the Storage Pool layer; there is no requirement for administrators to manage capacity between LUNs/NFS mounts or Containers. This capability also eliminates the requirement for well-known hypervisor features such as vSphere’s Storage DRS.

Reduced 3rd party licensing costs

AHV includes all management components or, in the case of Prism Central, provides them as a prepackaged appliance. There is no need to license any operating systems. The highly resilient management components on every Nutanix node eliminate the requirement for 3rd party database products such as Microsoft SQL or Oracle or, in the best case scenario, the deployment of virtual appliances which may not be highly available and which need to be backed up and maintained.

Reduced Management infrastructure costs

It is not uncommon for virtualization solutions to require a dozen or more management components (each potentially on a dedicated VM) even for small deployments to get all the functionality such as centralized management, patching and performance/capacity management. As deployments grow or have higher availability requirements, the number of management VMs and their compute requirements tend to increase.

As all management components run within the Nutanix Controller VM (CVM) which resides on each Nutanix node, there is no need to have a dedicated management cluster. The amount of compute/storage resources required is also reduced.

The indirect cost savings for the reduced management infrastructure include:

  1. Less rack space (RU)
  2. Less power/cooling
  3. Fewer network ports
  4. Fewer compute nodes
  5. Lower storage capacity & performance requirements

Last but not least, what about the costs associated with maintenance windows or outages?

Because Acropolis provides fully non-disruptive one-click upgrades and removes numerous points of failure (e.g.: 3rd Party Databases) while providing an extremely resilient platform, AHV also reduces the cost to the customer of maintenance and outages.

Summary:

  1. No design required for Acropolis management components
  2. No ongoing maintenance required for management components
  3. Reduced complexity reduces the chance of downtime as a result of human error

Back to the Index