Public Cloud Challenges – Part 1: Network performance

Public Cloud products such as AWS EC2 have long offered the ability to rent “bare metal” servers on which customers and partners can deploy their chosen workloads/solutions.

This gives hyper-converged infrastructure (HCI) vendors another way to deliver their solutions to customers, with more flexibility to scale on demand than an on-prem environment typically offers, largely because additional hardware is on standby and can be deployed relatively quickly.

The problem with offerings like AWS EC2 is that you have very little control over the network, both in terms of the number of connections per bare metal instance and the bandwidth available.

Take the popular i3.metal instance, for example: it has 36 physical CPU cores, 512GB of memory and a claimed 25 Gigabit network performance, along with 8 x 1.9TB NVMe SSDs.

Reference: https://aws.amazon.com/ec2/instance-types/i3/

This configuration is a great fit for advanced HCI solutions like Nutanix, but may not be a great fit for products like vSAN which depend much more heavily on the network.

Why did I say “a claimed 25 Gigabit network performance”?

Well, because you’re not really getting 25Gbps: you get up to 10Gbps per connection if the i3.metal instances are within the same rack, or 5Gbps for instances across racks.

The test below, run on a brand new and idle cluster, shows an example of the significant variability which can (and does) occur in AWS EC2.
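If you want to reproduce this style of test, here is a minimal sketch of how point-to-point throughput could be measured with iperf3 from Python. It assumes iperf3 is installed on both instances, an iperf3 server (“iperf3 -s”) is running on the peer, and the peer IP shown is purely hypothetical; the JSON fields used match recent iperf3 releases.

```python
# Minimal sketch: measure point-to-point throughput between two EC2 instances
# using iperf3 (assumes iperf3 is installed and "iperf3 -s" is running on the peer).
import json
import subprocess

PEER_IP = "10.0.0.12"  # hypothetical private IP of the second i3.metal instance

def measure_gbps(peer: str, streams: int = 1, seconds: int = 10) -> float:
    """Run a single iperf3 test and return the received throughput in Gbps."""
    out = subprocess.run(
        ["iperf3", "-c", peer, "-P", str(streams), "-t", str(seconds), "-J"],
        capture_output=True, text=True, check=True,
    ).stdout
    result = json.loads(out)
    # "sum_received" reflects what actually arrived at the far end (TCP test).
    return result["end"]["sum_received"]["bits_per_second"] / 1e9

if __name__ == "__main__":
    # Repeat the test a few times to expose the run-to-run variability discussed above.
    for run in range(5):
        print(f"Run {run + 1}: {measure_gbps(PEER_IP):.2f} Gbps")
```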

Let’s talk about the implications of this network variability for HCI-style solutions deployed in AWS EC2.

1. Storage traffic vs Virtual Machine traffic

If storage traffic were to monopolise the network, there would be very little bandwidth available for virtual machine traffic, so even if the storage layer performed well, the overall performance from the perspective of end users or business-critical applications could be poor.

2. Infrastructure Operational Traffic

To perform operations such as vMotion, the cluster needs network bandwidth to ensure these “burst” style operations complete both successfully and in a timely manner.

Slow vMotions mean a higher impact on business applications and slower maintenance or performance-optimising operations via vSphere DRS or Nutanix ADS.

This also impacts the environment’s ability to perform upgrades of the hypervisor and storage layers.

3. Business Continuity traffic

What happens when a drive, node or even rack fails?

The platform needs to leverage the network to re-protect the data (replicas). If the network is heavily utilised by virtual machines, storage traffic and even operational traffic like vMotion, the platform’s ability to re-protect data is compromised, which ultimately increases the risk of performance impact to line-of-business applications and, worse still, of data loss.

How does Nutanix Enterprise Cloud architecture mitigate these risks?

1. Minimising the impact of a failure

With Nutanix, a single drive failure (in this case a 1.9TB NVMe SSD in the AWS i3.metal instance) will only ever result in up to that single drive’s capacity needing to be rebuilt.

This is because Nutanix does not use the concept of cache drives & disk groups, which increase risk because a single cache drive failure results in an entire disk group (up to 7 capacity drives, or 7 x 1.9TB) of data needing to be rebuilt.

If Deduplication & Compression are used with Cache Drives & Disk Groups, any single drive failure results in the entire disk group needing to be rebuilt, which is up to 13.3TB of data, compared to up to 1.9TB with Nutanix.

2. Minimising the dependency on the network

Nutanix AOS’ unique data locality functionality ensures that the majority of read I/O does not traverse the network and that typically only half of write I/O (one replica) is sent over the network.

As a result, with a 70/30 split between read and write I/O we end up with only 15% of the total I/O (half the write replicas) being sent over the network.
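As a quick sketch of that arithmetic, assuming a 70/30 read/write mix, two data copies (RF2) and all reads served locally:

```python
# Sketch of the "~15% of I/O on the network" arithmetic above: 70% reads
# (served locally thanks to data locality), 30% writes, and only one of the
# two write replicas (i.e. half of the write I/O) crosses the network.
read_share, write_share = 0.70, 0.30

# Reads add no network traffic; only the remote write replica does.
network_share = write_share / 2
print(f"I/O sent over the network: {network_share:.0%}")   # -> 15%
```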

Even after a VM moves to another host, Nutanix still services new writes locally with one replica sent over the network. This ensures subsequent reads are typically local as new data is often the most active data.

This is in contrast to a leading HCI provider (in terms of claimed revenue), whose virtualised storage typically sends the majority of read and write I/O across the network.

I’ve covered the advantages of data locality in the post: Think Data Locality is just about Storage Performance? Think again! which highlights that data locality is designed to address numerous potential constraints and is not just about storage performance.

The Proof is in the pudding.

The following chart shows read bandwidth achieved by Nutanix on i3.metal instances from 4 nodes through to 32 nodes.

What we see is not only the linear nature of the performance as the cluster scales, but also the incredible level of performance which is achieved.

It’s important to note that this result was achieved with VMs being migrated throughout the cluster. The reason Nutanix achieved such high performance is that new data is always written locally, so subsequent reads are always local.

Take the 4 node example (orange line), where throughput of 19.96GBps is achieved. This equates to roughly 159Gbps of potential network traffic, yet the 4 nodes only have a single 25Gbps NIC each, meaning this level of performance is not realistically possible without data locality, even if we assumed all 25Gbps was always available in AWS EC2, which of course it is not.

Even assuming 100% of the 25Gbps was available, we’re still only talking about a maximum of 3.125GBps of storage throughput per node, which in this case would drop our performance from 19.96GBps to 12.5GBps for the 4 nodes.

In AWS EC2 with i3.metal hosts, which are all-NVMe, any architecture dependent on the network will not be able to make the most efficient use of the hardware and will therefore have a higher TCO and/or lower ROI.

Keep in mind Nutanix AOS achieved the 19.96GBps with ZERO network usage thanks to data locality.

If we take into account the actual AWS network limits, and/or factors which may further impact network performance such as overlay technologies like VMware NSX, the potential storage performance is much lower.

For example, in the iPerf testing shown above, the best result was 9.16Gbps, which means only 1.145GBps of storage traffic (reads and writes) could be achieved per node without data locality.
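For clarity, here is the unit conversion behind those figures (bits per second on the wire to bytes per second of storage throughput), shown as a quick sketch using the numbers quoted above:

```python
# Unit conversion behind the figures above: network Gbps -> maximum GBps of
# storage throughput a node could source over the wire without data locality.
def gbps_to_gBps(gbps: float) -> float:
    return gbps / 8.0                      # 8 bits per byte

measured_read_gBps = 19.96                 # 4-node Nutanix result quoted above
nodes = 4

print(f"19.96 GBps as network traffic: {measured_read_gBps * 8:.1f} Gbps")    # ~159.7 Gbps
print(f"Per-node cap at 25 Gbps:       {gbps_to_gBps(25):.3f} GBps")          # 3.125 GBps
print(f"4-node cap at 25 Gbps:         {gbps_to_gBps(25) * nodes:.1f} GBps")  # 12.5 GBps
print(f"Per-node cap at 9.16 Gbps:     {gbps_to_gBps(9.16):.3f} GBps")        # ~1.145 GBps
```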

What about Write I/O?

Thanks to Nutanix data locality, one replica is written locally on the node hosting the VM, which means no network bandwidth is used for it; only the second replica is written across the network, meaning the theoretical maximum write throughput is equal to the available network bandwidth.

Without data locality, the theoretical maximum write throughput drops to HALF the available bandwidth due to the requirement to send both replicas over the network.

If this were VMware Cloud on AWS (VMConAWS) and the virtual machine/s had moved off the host their storage objects are hosted on, then VMC’s maximum write throughput would be half that of Nutanix, even assuming the entire network bandwidth is dedicated to storage, which of course it is not.
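Here is a quick sketch of that write-path maths, assuming two data copies (RF2 / FTT=1) and, generously, that the full quoted network bandwidth were available for storage traffic:

```python
# Theoretical maximum write throughput for a node, assuming two data copies
# (RF2 / FTT=1) and that the full network bandwidth were available for storage.
def max_write_gBps(network_gbps: float, local_replica: bool) -> float:
    network_gBps = network_gbps / 8.0
    # With data locality one replica is written locally, so only one copy uses
    # the network; without it, both copies must traverse the network.
    replicas_on_network = 1 if local_replica else 2
    return network_gBps / replicas_on_network

for bw in (25.0, 9.16):
    with_locality = max_write_gBps(bw, local_replica=True)
    without_locality = max_write_gBps(bw, local_replica=False)
    print(f"{bw:5.2f} Gbps link: {with_locality:.3f} GBps with locality, "
          f"{without_locality:.3f} GBps without")
```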

What happens when you have network congestion?

With Nutanix, data locality minimises the dependency on the network, ensuring the maximum available bandwidth for virtual machines, replication and vMotion, and ensuring the cluster can rebuild in a timely manner from drive or node failures.

Without data locality, products like VMware’s VMConAWS have the near-impossible task of trying to ensure functionality, resiliency and performance across what can be a highly variable network.

As a result of this and other resiliency & scalability issues with vSAN (the underlying storage for VMware Cloud on AWS), VMware recommends a Failures to Tolerate setting of 2 (FTT=2) for clusters of 6 nodes or more. This means 3 copies of data, or <33% usable capacity.

Reference: https://blogs.vmware.com/virtualblocks/2019/10/28/2-failure-toleration-requirements-within-vmware-cloud-on-aws
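As a rough sketch of what that replica count means for usable capacity, take an 8 node i3.metal cluster (8 x 1.9TB raw per node) and ignore filesystem overheads, data reduction savings and rebuild reservations:

```python
# Rough usable-capacity sketch for an 8-node i3.metal cluster (8 x 1.9TB per node),
# ignoring filesystem overheads, data-reduction savings and rebuild reservations.
NODES, DRIVES_PER_NODE, DRIVE_TB = 8, 8, 1.9
raw_tb = NODES * DRIVES_PER_NODE * DRIVE_TB          # 121.6 TB raw

for label, copies in (("Nutanix RF2 (2 copies)", 2), ("vSAN FTT=2 (3 copies)", 3)):
    usable_tb = raw_tb / copies
    print(f"{label}: ~{usable_tb:.1f} TB usable ({1 / copies:.0%} of raw)")
```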

Nutanix, on the other hand, doesn’t suffer the same resiliency issues as vSAN (VMC), as described in my Nutanix vs vSAN/VxRAIL series on drive and node failures. Nutanix also always maintains write I/O integrity, including during supported failure scenarios and upgrades, something vSAN/VMC does not do by default.

For vSAN, and therefore VMC, to support write integrity during upgrades, Full Data Evacuation mode needs to be used, which puts an additional >=N+1 capacity requirement onto the environment, not to mention the major impact on performance and upgrade duration when using this capability.

This however does not address the major risk of vSAN/VMC not maintaining data integrity for writes during drive or node failures.

Summary:

When comparing Nutanix Clusters (AOS on AWS) and VMware Cloud (VMC on AWS), it’s clear the two products are worlds apart in terms of network requirements and dependencies.

The Nutanix AOS architecture proactively minimises its dependency on the network to mitigate the potential risks in public cloud environments. vSAN/VMC, on the other hand, implements the same flawed architecture which is heavily dependent on the network (especially at scale) in an environment where network bandwidth is largely out of its control and can vary significantly.

Nutanix Clusters provides rack awareness by default, and this can be achieved without major impact from the network thanks to data locality. VMware VMC, on the other hand, being limited to 5Gbps between racks, will face implications especially during maintenance, failures and/or peak workload.

Imagine Nutanix Clusters vs VMC in, say, an 8 node environment: Nutanix using 2 replicas and always maintaining data integrity versus vSAN/VMC using 3 replicas (due to their own recommendations) and not being able to maintain write integrity during simple drive or node failure scenarios. Not to mention the additional capacity requirement for Full Data Evacuations to allow for data integrity during upgrades.

Combine that with vSAN/VMC’s slow rebuilds, which also have a conceptual flaw where an object is restored from one node to another node (as opposed to Nutanix’ fully distributed approach), and you see that vSAN is simply not an attractive architecture in public cloud, especially for production workloads.

Next let’s check out Public Cloud Challenges – Part 2: TCO/ROI & Storage Capacity

Related Posts:

  1. Public Cloud Challenges – Part 1 – Network performance
  2. Public Cloud Challenges – Part 2 – TCO/ROI & Storage Capacity
  3. Public Cloud Challenges – Part 3 – TCO/ROI & Storage Capacity at scale
  4. Public Cloud Challenges – Part 4 – Data Efficiency Technologies & Resiliency considerations.
  5. Public Cloud Challenges – Part 5 – Storage device failures & resiliency implications
  6. Public Cloud Challenges – Part 6 – Bare Metal Instance failures
  7. HCI Architecture Matters – Nutanix AOS vs the competition & their Cache Drives & Disk Groups
  8. Usable Capacity Comparison – Nutanix ADSF vs VMware vSAN
  9. Deduplication & Compression Comparison – Nutanix ADSF vs vSAN
  10. Erasure Coding Comparison – Nutanix ADSF vs vSAN
  11. Scaling Storage Capacity – Nutanix & vSAN
  12. Drive failure Comparison – Nutanix ADSF vs VMware vSAN
  13. Heterogeneous Cluster Support – Nutanix vs VMware vSAN
  14. Write I/O Path Comparison – Nutanix vs VMware vSAN
  15. Read I/O Path Comparison – Nutanix vs VMware vSAN
  16. Node Failure Comparison – Nutanix vs VMware vSAN/VxRAIL
  17. Storage Upgrade Comparison – Nutanix vs VMware vSAN/VxRAIL
  18. Usable Capacity Comparison PART 2 – Nutanix vs VMware vSAN/VxRAIL
  19. Memory Usage Comparison – Nutanix vs VMware vSAN/DellEMC VxRAIL
  20. Network Usage Comparison – Nutanix vs VMware vSAN/DellEMC VxRAIL
  21. Nutanix | Scalability, Resiliency & Performance
  22. Nutanix – Erasure Coding (EC-X) Deep Dive
  23. Performance impact & overheads of Inline Compression on Nutanix?
  24. My checkbox is bigger than your checkbox! by Hans De Leenheer
  25. Not all VAAI-NAS storage solutions are created equal.
  26. Automated Storage Reclaim on Nutanix Acropolis Hypervisor (AHV)