NetApp HCI Versus Nutanix – The Rebuttal

I was made aware of a recent article from Rob Klusman at Netapp titled “Netapp HCI Verses Nutanix” by a Nutanix Technology Champion (NTC) who asked us to respond to the article “cause there’s some b*llsh*t in it”.

** UPDATE **

Netapp have since removed the post; it can now be viewed via Google Cache here:

http://webcache.googleusercontent.com/search?q=cache:https://blog.netapp.com/netapp-hci-vs-nutanix/

I like it when people call it like it is, so here I am responding to the bullshit (article).

The first point I would like to address is the final statement in the article.

NetApp HCI is the first choice, and Nutanix is the second choice. Leading in an economics battle just doesn’t work if performance is lacking.

Rob rightly points out Nutanix leads the economics battle, so kudos for that, but he follows up by implying Nutanix performance is lacking. Wisely, Rob does not provide any supporting detail which could be discredited, so I will just leave you with these three posts from my Scalability, Resiliency & Performance blog series discussing how Nutanix scales performance for single VMs, Monster VMs and physical servers.

Part 3 – Storage Performance for a single Virtual Machine
Part 4 – Storage Performance for Monster VMs with AHV!
Part 5 – Scaling Storage Performance for Physical Machines

Rob goes on to make the claim:

Nutanix wants infrastructure “islands” to spread out the workloads

This is simply incorrect; not only that, Nutanix has been recommending mixed workload deployments for many years. Here is an article I wrote in July 2016 titled “The All-Flash Array (AFA) is Obsolete!” where I conclude with the following summary:

[Image: MixedWorkloads2016]

I specifically state that mixed workloads, including business critical applications, are supported without creating silos. It’s important to note this statement was made in July 2016, before Netapp had even started shipping (Oct 25th, 2017) their 3-tier architecture product, which they continue to incorrectly refer to as HCI.

Gartner supports my statement that the Netapp product is not HCI and states:

“NetApp HCI competes directly against HCI suppliers, but its solution does not meet Gartner’s functional definition of HCI.”

Mixed workloads are nothing new for Nutanix. Not only is mixing workloads supported, I frequently recommend it, as it increases performance and resiliency, as described in detail in my blog series Nutanix | Scalability, Resiliency & Performance.

Now let’s address the “Key Differences” Netapp claims:

User interface. Both products have an intuitive graphical interface that is well integrated into the hypervisor of choice. But what’s not obvious is that simplicity goes well beyond where you click. NetApp HCI has the most extensive API in the market, with integration that allows end users to automate even the most minute features in the NetApp HCI stack.

The philosophy behind the Nutanix GUI (whose intuitiveness Netapp concedes) is that every feature available in the GUI must also be available via an API. In the PRISM GUI, Nutanix provides the “REST API Explorer” (shown below), where users can easily discover the available operations and automate anything they choose.

[Image: RestAPIexplorer]

[Image: NutanixRESTAPI]
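
To illustrate the point that anything in the GUI can be automated, here is a minimal sketch of calling the Prism REST API with Python’s requests library. The cluster address, credentials and even the exact endpoint shown are assumptions for illustration only; use the REST API Explorer on your own cluster to confirm the available operations and their exact paths.

```python
# Minimal sketch: list VM names via the Prism v2.0 REST API.
# The address, credentials and endpoint below are placeholders; confirm the
# exact call in your cluster's REST API Explorer before relying on it.
import requests

PRISM = "https://prism.example.com:9440"   # hypothetical Prism address
AUTH = ("admin", "secret")                 # replace with real credentials

def list_vm_names():
    """Return the names of the VMs known to the cluster."""
    resp = requests.get(
        f"{PRISM}/api/nutanix/v2.0/vms",
        auth=AUTH,
        verify=False,   # Prism commonly uses self-signed certs; supply a CA bundle in production
        timeout=30,
    )
    resp.raise_for_status()
    return [vm.get("name") for vm in resp.json().get("entities", [])]

if __name__ == "__main__":
    for name in list_vm_names():
        print(name)
```

The same pattern applies to any operation surfaced in the Explorer: find the endpoint, inspect the sample request and response, and script it.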

Next up we have:

Versatile scale. How scaling is accomplished is important. NetApp HCI scales in small infrastructure components (compute, memory, storage) that are all interchangeable. Nutanix requires growth in specific block components, limiting the choices you can make.

When vendors attack Nutanix, I am always surprised they target scalability, because if anything this is one of Nutanix’s strongest areas.

I’ve already referenced my Scalability, Resiliency & Performance blog series, where I go into these topics in a lot of detail, but in short, Nutanix can scale:

  1. Storage Only by adding drives or nodes
  2. Compute Only by adding RAM or nodes
  3. Compute + Storage by adding drives and/or nodes

Back in mid-2013 when I joined Nutanix, Netapp’s claim was true, as only one node type (NX-3450) was available. Later that same year the 1000 and 6000 series were released, giving more flexibility, and things have continued to become more flexible over the years.

Today the flexibility (or versatility) in scale for Nutanix solutions is second to none.

Performance. Today, it’s an absolute requirement for HCI to have an all-flash solution. Spinning disks are slightly less expensive, but you’re sacrificing production workloads. NetApp HCI only offers an all-flash solution.

Congratulations Netapp, you do all flash, just like everyone else (but you came to the party years later). There are many use cases for bulk storage capacity, be it all flash or hybrid, and Nutanix provides NVMe+SATA-SSD, all SATA-SSD and SATA-SSD+SAS/SATA HDD options to cover all use cases and requirements.

Not only that, but Nutanix allows mixing of all-flash and hybrid nodes to further avoid the creation of silos.

Enterprise ready. This is an important test. One downfall of Nutanix software running on exactly the same CPU cores as your applications is the effect on enterprise readiness. Many of our customers have shifted away from Nutanix once they’ve seen what happens when a Nutanix component fails. It’s easier to move the VM workload off the current Nutanix system (the one that’s failing) than it is to wait for the fix. Nutanix does not run optimally in hardware-degraded situations. NetApp HCI has no such problem; it can run at full workloads, full bandwidth, and full speed while any given component has failed.

It’s a huge claim by Netapp to dispute Nutanix’s enterprise readiness, considering we have many more years of experience shipping product, but hey, Netapp’s article is proving to be without factual basis every step of the way.

The beauty of Nutanix is the ability to self-heal after failures (hardware or software) and then tolerate subsequent failures. Nutanix also has the ability to tolerate multiple concurrent failures, including up to 8 nodes and 48 physical drives (NVMe/SSD/HDD).

Nutanix can also tolerate one or more failures and FULLY self-heal without any hardware being replaced. This is critical, as I detailed in my post: Hardware support contracts & why 24×7 4 hour onsite should no longer be required.

For more details on these failure scenarios checkout the Resiliency section of my blog series Nutanix | Scalability, Resiliency & Performance.
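
To make the self-heal argument concrete, here is a deliberately simplified back-of-the-envelope sketch (not Nutanix’s actual sizing or rebuild logic) of the basic capacity check behind it: with replication factor 2, the surviving nodes need at least as much free capacity as the failed node was using in order to fully re-protect its data. The node counts and capacities are hypothetical.

```python
# Simplified RF2 self-heal capacity check; illustrative arithmetic only,
# not Nutanix's actual sizing or rebuild logic.

def can_fully_self_heal(used_tb, capacity_tb, failed_node):
    """True if the surviving nodes have enough free capacity to re-replicate
    the data that lived on the failed node (single failure, RF2 assumption)."""
    data_to_rebuild = used_tb[failed_node]
    surviving_free = sum(
        cap - used
        for i, (used, cap) in enumerate(zip(used_tb, capacity_tb))
        if i != failed_node
    )
    return surviving_free >= data_to_rebuild

# Hypothetical 4-node cluster: 20 TB usable and 12 TB used per node.
used = [12, 12, 12, 12]
capacity = [20, 20, 20, 20]
print(can_fully_self_heal(used, capacity, failed_node=0))  # True: 24 TB free >= 12 TB to rebuild
```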

Workload performance protection. No one should attempt an advanced HCI deployment without workload performance protection. Only NetApp HCI provides such a guarantee, because this protection is built into the native technology.

 

One critical factor in delivering consistently high performance is data locality. The further data is from the compute layer, the more bottlenecks there are to potentially impact performance. It’s important to evaluate Nutanix’s original and unique implementation of Data Locality to understand that features such as QoS for storage IO are critical for scale-up shared storage (a.k.a. SAN/NAS), but with a highly distributed scale-out architecture, noisy-neighbour problems are all but eliminated because there are more controllers and those controllers are local to the VMs.

Storage QoS adds complexity and is only required when a product such as a SAN/NAS has no choice but to deal with the IO blender effect, where sequential IO is received as random due to competing workloads. This effect is minimised with the Nutanix Distributed Storage Fabric.

Shared CPU cores. One key technical difference between the Nutanix product and NetApp HCI is the concept of shared CPU cores. Nutanix has processes running in the same cores as your applications, whereas NetApp HCI does not. There is a cost associated with sharing cores when applications like Oracle and VMware are licensed by core count. You actually pay more for those applications when Nutanix runs their processes on your cores. It’s important to do that math.

I’m very happy Rob raised the point regarding VMware’s licensing (part of what I’d call #vTAX); this is one of the many great reasons to move to Nutanix’s next-generation hypervisor, AHV (Acropolis Hypervisor).

In addition, for workloads like Oracle or SQL where per-core licensing is a concern, Nutanix offers two solutions:

  1. Compute Only Nodes running AHV
  2. Acropolis Block Services (ABS) to provide the Nutanix Distributed Storage Fabric (ADSF) to physical or virtual servers not running on Nutanix HCI nodes.

But what about the Nutanix Controller VM (CVM) itself? It is assigned vCPUs which share physical CPU cores with other virtual machines.

Sharing physical cores is a bad idea, as virtualisation has taught us over many years. Hold on, wait, no that’s not it (LOL!). Virtualisation has taught us we can share physical CPU cores very successfully, even for mission critical applications, where it’s done correctly.

Here is a detailed post on the topic titled: Cost vs Reward for the Nutanix Controller VM (CVM)
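
Since Rob insists “It’s important to do that math”, here is a hedged sketch of how that math might actually be done. Every figure below (core counts, CVM vCPUs, licence price) is a hypothetical placeholder; substitute your own numbers and, as the post above argues, weigh the result against what the CVM delivers in return (data locality, data reduction, resiliency) and against the cost of running a separate storage silo.

```python
# Hypothetical worked example of the per-core licensing "math".
# All numbers are placeholders; use your own licence pricing and CVM sizing.

def worst_case_extra_license_cost(cores_per_host, cvm_vcpus, hosts, cost_per_core):
    """Worst-case extra licence cost if every CVM vCPU is (conservatively)
    treated as a core unavailable to the per-core-licensed application."""
    displaced_cores = min(cvm_vcpus, cores_per_host) * hosts
    return displaced_cores * cost_per_core

# Hypothetical: 4 hosts, 40 cores each, 8 vCPU CVMs, $1,000 per core licence.
print(worst_case_extra_license_cost(cores_per_host=40, cvm_vcpus=8, hosts=4, cost_per_core=1000))
# -> 32000; compare this worst case against the cost of a dedicated storage
#    silo (and its licensing, hardware and ops) before drawing any conclusions.
```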

Asset fluidity. An important part of the NetApp scale functionality is asset fluidity – being able to move subcomponents of HCI around to different applications, nodes, sites, and continents and to use them long beyond the 3-year depreciation cycle.

This is possibly the weakest argument in Netapp’s post. Nutanix nodes can be removed non-disruptively from a cluster and added to any other cluster, including mixing all-flash and hybrid. Brand new nodes can be mixed with any other generation of nodes; I regularly form large clusters using multiple generations of hardware.

Here is a tweet of mine from 2016 showing a 22 node cluster with four different node types across three generations of hardware (G3 being the original NX-8150, G4 and G5).

Data Fabric. The NetApp Data Fabric simplifies and integrates data management across clouds and on the premises to accelerate digital transformation. To plan an enterprise rollout of HCI, a Data Fabric is required – and Nutanix has no such thing. NetApp delivers a Data Fabric that’s built for the data-driven world.

I had to look up what Netapp mean by “Data Fabric”, as it sounded to me like a nonsense marketing term, and surprise, surprise, I was right. Here is how Netapp describe “Data Fabric”:

Data Fabric is an architecture and set of data services that provide consistent capabilities across a choice of endpoints spanning on-premises and multiple cloud environments.

It’s a fluffy marketing phrase, but the same description could easily be applied to the Nutanix Distributed Storage Fabric (ADSF). ADSF is hypervisor agnostic, which straight away delivers a multi-platform solution (cloud or on-premises), including AWS and Azure (below).

[Image: CloudSite]

Nutanix can replicate and protect data including virtual machines across different hardware, clusters, hypervisors and clouds.

So the claim “Nutanix does not have a Data Fabric” is pretty laughable based on Netapp’s own description of “Data Fabric”.

Now the final point:

Choosing the Right Infrastructure for Your Enterprise

I’ve written about Things to consider when choosing infrastructure and my conclusion was:

[Image: ThingtoconsiderSummary]

Nutanix has for many years provided a platform which can be your standard for all workloads, and with all the enhancements we’ve made over the years, the niche workloads that genuinely cannot be supported are now very rare.

The best thing about Nutanix is that, with our world-class enterprise architect enablement and Nutanix Platform Expert (NPX) certification programmes, we ensure our field SEs, architects and certified individuals, who design and implement solutions for customers every day, know exactly when to say “No”.

This culture of customer success first, sales last, comes from our former President Sudheesh Nair, who wrote this excellent article during his time at Nutanix:

Quite possibly the most powerful 2-letter word in Sales – No

After addressing all the points raised by Netapp, it’s easy to see that Nutanix has a very complete solution thanks to years of development and experience with enterprise customers and their mission critical applications.

Have you read any other “b*llsh*t” you’d like Nutanix to respond to? If so, don’t hesitate to reach out.

The All-Flash Array (AFA) is Obsolete!

Over the last few years, I’ve had numerous customers ask how Nutanix can support bare metal workloads. Up until recently, I didn’t have an answer customers wanted to hear.

As a result, some customers have been stuck using their existing SAN or, worse still, have been forced to go out and buy a new SAN.

Many customers who have wanted to use, or have already deployed, hyperconverged infrastructure (HCI) for all other workloads are stuck managing an all-flash array silo just to service some bare metal workloads.

In June at .NEXT 2016, Nutanix announced Acropolis Block Services (ABS) which now allows bare metal workloads to be serviced by new or existing Nutanix clusters.

[Image: ABSoverview]

As Nutanix has both hybrid (SSD+SATA) and all-flash nodes, customers can choose the right node type/s for their workloads and present the storage externally for bare metal workloads, while also supporting virtual machines, Acropolis File Services (AFS) and containers.

So why would anyone buy an all-flash array? Let’s discuss a few scenarios.

Scenario 1: Bare metal workloads

Firstly, what applications even need bare metal these days? This is an important question to ask yourself. Challenge the requirement for bare metal and see if the justifications are still valid, and if so, whether anything has changed which would now allow the applications to be virtualized. But that is a topic for another post.

If a customer only needs new infrastructure for bare metal workloads, deploying Nutanix and ABS means they can start small and scale as required. This avoids one of the major pitfalls: having to size a monolithic, centralised, dual-controller storage array up front.

While some AFA vendors can and do allow non-disruptive controller upgrades, it’s still not a very attractive proposition, nor is it quick or easy, and it reduces resiliency during the process as one of the two controllers is offline. Nutanix, on the other hand, performs one-click rolling upgrades, which means the larger the cluster, the lower the impact of an upgrade, as it is performed one node at a time without disruption and without the risk of a subsequent failure taking storage offline.

If the environment will only ever be used for bare metal workloads, no problem. Acropolis Block Services offers all the advantages of an All Flash Array, with far superior flexibility, scalability and simplicity.

Advantages:

  1. Start small and scale granularly as required allowing customers to take advantage of newer CPU/RAM/Flash technologies more frequently
  2. Scale performance and capacity by adding node/s
  3. Scale capacity only with storage-only nodes (which come in all flash)
  4. Automatically scale multi-pathing as the cluster expands
  5. Solution can support future workloads including multiple hypervisors / VMs / file services & containers without creating a silo
  6. Hybrid nodes can be used to save cost while still delivering all-flash performance for the workloads which require it, by using VM flash pinning, which ensures all data is stored in flash and can be specified on a per-disk basis.
  7. The same ability as with an all-flash array to add compute-only nodes.

Disadvantages:

  1. Your all-flash array vendor reps will hound you.

Scenario 2: Mixed workloads including VMs and bare metal

As with scenario 1, deploying Nutanix and ABS means customers can start small and scale as required. This again avoids the major pitfall of having to size a monolithic centralised, dual controller storage array and eliminates the need for separate environments.

Virtual machines can run on compute+storage nodes, while bare metal workloads can have storage presented by all nodes within the cluster, including storage-only nodes. For those who are concerned about (potential but unlikely) noisy neighbour situations, specific nodes can also be nominated to serve those workloads, while maintaining all the advantages of Nutanix one-click, non-disruptive upgrades.

Advantages:

  1. Start small and scale granularly as required allowing customers to take advantage of newer CPU/RAM/Flash technologies more frequently
  2. Scale performance and capacity by adding node/s
  3. Scale capacity only with storage-only nodes (which also come in all flash)
  4. Automatically scale multi-pathing for bare metal workloads as the cluster expands
  5. Solution can support future workloads including multiple Hypervisors / VMs / file services & containers without creating a silo.

Disadvantages:

  1. Your All-Flash array vendor reps will hound you.

What are the remaining advantages of using an all flash array?

In all seriousness, I can’t think of any, but for fun let’s cover a few areas where you can expect all-flash array vendors to argue.

Performance

Ah the age old appendage measuring contest. I have written about this topic many times, including in one of my most popular posts “Peak performances vs Real world performance“.

The fact is, every storage product has limits, even all-flash arrays and Nutanix. The major difference is that Nutanix limits are per cluster rather than per dual-controller pair, and Nutanix can continue to scale the number of nodes in a cluster and thereby continue to increase performance. So if ultimate performance is actually required, Nutanix can continue to scale to meet any performance/capacity requirement.

In fact, with ABS the limit for performance is not even at the cluster layer as multiple clusters can provide storage to the same bare metal server/s while maintaining single pane of glass management through PRISM Central.

I recently completed some testing where I demonstrated the performance advantage of storage-only nodes for virtual machines, as well as how storage-only nodes improve performance for bare metal servers using Acropolis Block Services; I will be publishing the results in the near future.

Data Reduction

Nutanix has had support for deduplication and compression for a long time, and introduced Erasure Coding (EC-X) in mid-2015. Each of these technologies is supported when using Acropolis Block Services (ABS).

As a result, when comparing data reduction with all-flash array vendors, while the implementation of these data reduction technologies varies between vendors, they all achieve similar data reduction ratios when applied to the same dataset.

Beware of vendors who include things like backups in their deduplication or data reduction ratios; this is very misleading, and most vendors have the same capabilities. For more information on this see: Deduplication ratios – What should be included in the reported ratio?

Cost

Here we should think about the age-old problems with centralized shared storage (like AFAs). Things like choosing the right controllers, and the fact that when you add more capacity to the storage you’re not (or at least rarely) scaling the controller/s at the same time, come to mind immediately.

With Nutanix and Acropolis Block Services you can start your all-flash solution with three nodes, which means a low capital expenditure (CAPEX), and then scale either linearly (with the same node types) or non-linearly (with mixed node types or storage-only nodes) as you need to, without having to rip and replace (e.g. SAN controller head swaps).

Starting small and scaling as required also allows you to take advantage of newer technologies such as newer Intel chipsets and NVMe/3D XPoint to get better value for your money.

Starting small and scaling as required also minimizes – if not eliminates – the risk of oversizing and avoids unnecessary operational expenses (OPEX) such as rack space, power, cooling. This also reduces supporting infrastructure requirements such as networking.

Summary:

As shown below, the Nutanix Acropolis Distributed Storage Fabric (ADSF) can support almost any workload, from VDI to mixed server workloads, file, block, big data, business critical applications such as SAP / Oracle / Exchange / SQL, and bare metal workloads, without creating silos with point solutions.

[Image: NutanixSingleFabricAllWorkloads]

In addition to supporting all these workloads, the scalability of Nutanix ADSF, from both a capacity/performance and a resiliency perspective, ensures customers can start small and scale when required to meet their exact business needs without the guesswork.

With these capabilities, the All-Flash array is obsolete.

I encourage everyone to share (constructively) your thoughts in the comments section.

Note: You must sign in to comment using WordPress, Facebook, LinkedIn or Twitter, as anonymous comments will not be approved.

Related Articles:

  1. Things to consider when choosing infrastructure.

  2. Scale out performance testing with Nutanix Storage Only Nodes

  3. What’s .NEXT 2016 – Acropolis Block Services (ABS)

  4. Scale out performance testing of bare metal workloads on Acropolis Block Services (Coming soon)

  5. What’s .NEXT 2016 – Any node can be storage only

  6. What’s .NEXT 2016 – All Flash Everywhere!

Example Architectural Decision – Datastore (LUN) and Virtual Disk Provisioning (Thin on Thin)

Problem Statement

In a vSphere environment, what is the most suitable disk provisioning type to use for the LUN and the virtual machines to ensure minimum storage overhead and optimal performance?

Requirements

1. Ensure optimal storage capacity utilization
2. Ensure storage performance is both consistent & maximized

Assumptions

1. vSphere 5.0 or later
2. VAAI is supported and enabled
3. The time frame to order new hardware (eg: New Disk Shelves) is <= 4 weeks
4. The storage solution has tools for fast/easy capacity management

Constraints

1. Block Based Storage

Motivation

1. Increase flexibility
2. Ensure physical disk space is not unnecessarily wasted

Architectural Decision

“Thin Provision” the LUN at the Storage layer and “Thin Provision” the virtual machines at the VMware layer

(Optional) Do not present more LUNs (capacity) than you have underlying physical storage (Only over-commitment happens at the vSphere layer)
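
As a concrete illustration of the VM-layer half of this decision, below is a hedged pyVmomi sketch that attaches a thin-provisioned VMDK to an existing VM; the VM object, disk size and controller handling are simplified placeholders rather than production-ready code.

```python
# Sketch: add a thin-provisioned virtual disk to an existing VM using pyVmomi.
# The VM object is assumed to have been looked up already; sizes and controller
# handling are simplified for illustration.
from pyVmomi import vim

def add_thin_disk(vm, size_gb=50):
    """Attach a new thin-provisioned VMDK to 'vm' on its current datastore."""
    controller = next(d for d in vm.config.hardware.device
                      if isinstance(d, vim.vm.device.VirtualSCSIController))
    unit = len([d for d in vm.config.hardware.device
                if isinstance(d, vim.vm.device.VirtualDisk)
                and d.controllerKey == controller.key])
    if unit >= 7:
        unit += 1  # unit 7 is reserved for the SCSI controller itself

    disk = vim.vm.device.VirtualDisk()
    disk.capacityInKB = size_gb * 1024 * 1024
    disk.controllerKey = controller.key
    disk.unitNumber = unit
    backing = vim.vm.device.VirtualDisk.FlatVer2BackingInfo()
    backing.diskMode = "persistent"
    backing.thinProvisioned = True   # the VM-layer half of "thin on thin"
    disk.backing = backing

    change = vim.vm.device.VirtualDeviceSpec()
    change.operation = vim.vm.device.VirtualDeviceSpec.Operation.add
    change.fileOperation = vim.vm.device.VirtualDeviceSpec.FileOperation.create
    change.device = disk

    return vm.ReconfigVM_Task(spec=vim.vm.ConfigSpec(deviceChange=[change]))
```

The LUN-layer half (thin provisioning the LUN itself) is done with the storage vendor’s own tooling and is not shown here.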

Justification

1. Capacity can be easily managed using storage vendor tools, e.g. Netapp VSC / EMC VSI / Nutanix Command Center
2. Thin Provisioning minimizes the impact of situations where customers demand a lot of disk space up front when they only end up using a small portion of the available disk space
3. Increases flexibility as all unused capacity of all datastores and the underlying physical storage remains available
4. Creating VMs with “Thick Provisioned – Eager Zeroed” disks would unnecessarily increase the provisioning time for new VMs
5. Creating VMs as “Thick Provisioned” (Eager or Lazy Zeroed) does not provide any significant benefit (ie: Performance) but adds a serious capacity penalty
6. Using Thin Provisioned LUNs increases the flexibility at the storage layer
7. VAAI automatically raises an alarm in vSphere if a Thin Provisioned datastore usage is at >= 75% of its capacity
8. SCSI reservations causing performance issues (increased latency) when thin provisioned virtual machines (VMDKs) grow is no longer a problem, as the VAAI Atomic Test & Set (ATS) primitive alleviates the issue of SCSI reservations.
9. Thin provisioned VMs reduce the overhead for Storage vMotion , Cloning and Snapshot activities. Eg: For Storage vMotion it eliminates the requirement for Storage vMotion (or the array when offloaded by VAAI XCOPY Primitive) to relocate “White space”
10. Thin provisioning leaves maximum available free space on the physical spindles which should improve performance of the storage as a whole
11. Where there is a real or perceived issue with performance, any VM can be converted to Thick Provisioned using Storage vMotion non-disruptively.
12. Using Thin Provisioned LUNs with no actual over-commitment at the storage layer reduces the risk of out-of-space conditions while maintaining flexibility and efficiency, with significantly reduced risk and dependency on monitoring.
13. The VAAI UNMAP primitive provides automated space reclamation to reduce wasted space from files or VMs being deleted

Alternatives

1.  Thin Provision the LUN and thick provision virtual machine disks (VMDKs)
2.  Thick provision the LUN and thick provision virtual machine disks (VMDKs)
3.  Thick provision the LUN and thin provision virtual machine disks (VMDKs)

Implications

1. If the storage at the vSphere and array levels is not properly monitored, out-of-space conditions may occur, leading to downtime for VMs requiring additional disk space, although VMs not requiring additional disk space can continue to operate even when there is no available space on the datastore
2. The storage may need to be monitored in multiple locations, increasing BAU effort (see the monitoring sketch after this list)
3. It is possible for the vSphere layer to report sufficient free space when the underlying physical capacity is close to or entirely used
4. When migrating VMs from one thin provisioned datastore to another (ie: Storage vMotion), the storage vMotion will utilize additional space on the destination datastore (and underlying storage) while leaving the source thin provisioned datastore inflated even after successful completion of the storage vMotion.
5. While the VAAI UNMAP primitive provides automated space reclamation, this is a post-process; as such, you still need to maintain sufficient available capacity for VMs to grow prior to UNMAP reclaiming the dead space
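
Since implications 1 to 3 all hinge on monitoring over-committed, thin-provisioned datastores, here is a minimal pyVmomi sketch that reports used versus provisioned space per datastore so over-commitment is visible at the vSphere layer. The vCenter address, credentials and 75% threshold are placeholders, and the array-side capacity still needs to be monitored with the storage vendor’s own tools.

```python
# Minimal sketch: report thin-provisioning over-commitment per datastore.
# vCenter address, credentials and the 75% threshold are placeholders; the
# array layer must still be monitored with the storage vendor's tools.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

def report_datastores(host="vcenter.example.com",
                      user="administrator@vsphere.local", pwd="secret"):
    ctx = ssl._create_unverified_context()  # lab only; use valid certificates in production
    si = SmartConnect(host=host, user=user, pwd=pwd, sslContext=ctx)
    try:
        content = si.RetrieveContent()
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.Datastore], True)
        for ds in view.view:
            s = ds.summary
            used = s.capacity - s.freeSpace
            provisioned = used + (s.uncommitted or 0)  # includes thin-provisioned growth
            used_pct = 100.0 * used / s.capacity
            prov_pct = 100.0 * provisioned / s.capacity
            flag = "  <-- investigate" if used_pct >= 75 else ""
            print(f"{s.name}: used {used_pct:.0f}%, provisioned {prov_pct:.0f}%{flag}")
        view.DestroyView()
    finally:
        Disconnect(si)

if __name__ == "__main__":
    report_datastores()
```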

Related Articles

1. Datastore (LUN) and Virtual Disk Provisioning (Thin on Thick)