Nutanix AOS 5.5 delivers 1M read IOPS from a single VM, but what about 70/30 read/write?

I recently wrote “Nutanix AOS 5.5 delivers 1M IOPS from a single VM, but what happens when you vMotion?”, which showed the impact of a vMotion was a drop of around 10% for approx. 3 seconds before read performance returned to pre-migration levels.

In this post I will address the question of performance for a single VM with a more realistic 70% read, 30% write IO profile, tested using an 8k IO size, and what the impact is during and after a live migration.

While not surprising to Nutanix customers, this result shows a maximum starting baseline of 436k random read and 187k random write IOPS. Immediately following the migration, performance reduced to 359k read and 164k write IOPS before, within a few seconds, exceeding the original baseline at 446k read and 192k write IOPS.
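To put those figures side by side, here is a minimal sketch that totals the read and write IOPS quoted above and works out the read share and the size of the dip. The numbers are taken from this post; the function and variable names are purely illustrative.

```python
# IOPS figures quoted in this post, in thousands. Names are illustrative only.
SAMPLES = {
    "baseline": (436, 187),
    "post_migration": (359, 164),
    "recovered": (446, 192),
}


def summarize(read_k, write_k):
    """Return (total kIOPS, read share as a percentage)."""
    total = read_k + write_k
    return total, 100.0 * read_k / total


def percent_change(before, after):
    """Relative change from before to after, as a percentage."""
    return 100.0 * (after - before) / before


if __name__ == "__main__":
    totals = {}
    for name, (read_k, write_k) in SAMPLES.items():
        total, share = summarize(read_k, write_k)
        totals[name] = total
        print(f"{name}: {total}k IOPS total, {share:.1f}% read")

    dip = percent_change(totals["baseline"], totals["post_migration"])
    gain = percent_change(totals["baseline"], totals["recovered"])
    print(f"dip after migration: {dip:.1f}%")
    print(f"recovered vs baseline: {gain:+.1f}%")
```

Run as is, this gives totals of 623k, 523k and 638k IOPS, i.e. a dip of roughly 16% straight after the migration and a result a little over 2% above the baseline once things settle.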

So in comparison to 100% random read, which achieved just over 1 million 8k IOPS, the 70/30 mix achieves in the ballpark of 600k IOPS, which is very respectable. Not bad for a platform which Nutanix competitors continue to describe as only being good for VDI. Considering even the largest array from a leading all flash SAN vendor is only advertising performance in the hundreds of thousands of random read IOPS, it shows Nutanix’s unique hyper-converged architecture can achieve higher performance from a single VM than a monolithic all flash array.

This shows that with the unique Nutanix Acropolis Distributed Storage Fabric, very high performance at low latency can be achieved with real world IO patterns even during and after live migrating the virtual machine across a distributed platform.

This result is further evidence of the efficiency of the Nutanix Acropolis Hypervisor, AHV (which is included at no additional charge with AOS), as well as the IO path running in user space (not the much hyped in-kernel). This is in part thanks to AHV Turbo Mode, announced at .NEXT 2017 in Washington, which optimised the IO path. These excellent levels of performance can also be sustained when using data protection features such as snapshots, as shown in a recent post I wrote about the Nutanix X-Ray tool, where I used the Snapshot impact scenario to compare Nutanix AHV with a leading hypervisor and SDS product. If you don’t have time to read the post, in short, the competitor’s performance degraded as snapshots were taken while Nutanix AHV’s performance remained consistent, which is essential for real world scenarios, especially with business critical applications.

With Nutanix’s unique ability to scale out performance using storage only nodes, even higher performance can be achieved without modification to the virtual machine or applications, which gives Nutanix a further advantage over the competition.

Nutanix data locality delivers optimal performance by ensuring new data is always written local to the VM, while cold data can remain remote indefinitely and only hot data is migrated locally, if/when required, at a 1MB granularity. This is intelligent data locality, not the brute force locality it is frequently mistaken to be.

Back to Part 1

Nutanix AOS 5.5 delivers 1M IOPS from a single VM, but what happens when you vMotion?

For many years Nutanix has been delivering excellent performance across multiple hypervisors as well as hardware platforms including the native NX series, OEMs (Dell XC & Lenovo HX) and more recently software only options with Cisco and HPE.

Recently I tweeted (below) showing how a single virtual machine can achieve 1 million 8k random read IOPS and >8GBps throughput on AHV, the next generation hypervisor.

While most of the response to this was positive, the usual negativity came from some competitors who tried to spread fear, uncertainty and doubt (FUD) about the performance, including claims that it was not sustainable during/after a live migration (vMotion) and that it does not demonstrate the performance of the IO path.

Let’s quickly cover the IO path discussion of in-kernel vs a controller VM.

To test the IO path, which in the case of Nutanix runs via the Controller VM, you want to eliminate as many variables and bottlenecks as possible. This means a read/write test is not valid, as writes are dependent on factors such as the network. As this was on a node using NVMe, the bottleneck would quickly become the network and not the path between the user VM and controller VM.

I’ve previously tweeted (below) showing an example of the throughput capabilities of SATA SSD, NVMe and 3D XPoint, which clearly shows the network is the bottleneck with next generation flash.

I’ve also responded to 3rd party FUD about Nutanix data locality with a post which goes in depth about Nutanix’s original & unique implementation of data locality, which is how Nutanix minimises its dependency on the network to deliver excellent performance.

So we are left with read IO to actually test and possibly stress the IO path between a User VM and software defined storage, be that in-kernel or in user space which is where the Nutanix CVM runs.

The tweet showing >1 million 8k random read IOPS and >8GBps throughput shows that the IO path of Nutanix is efficient enough to achieve this at just 110 micro (not milli) seconds.
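As a sanity check on those figures, the sketch below derives throughput from IOPS and IO size, and uses Little's Law to estimate how many IOs need to be in flight to sustain that rate at the quoted latency. It assumes "8k" means 8 KiB (8,192 bytes), takes exactly 1,000,000 IOPS and 110 microseconds as inputs, and assumes the latency and IOPS were measured at the same point in the stack. The function names are illustrative.

```python
def throughput_gb_per_s(iops, io_size_bytes):
    """Throughput in decimal gigabytes per second."""
    return iops * io_size_bytes / 1e9


def outstanding_ios(iops, latency_seconds):
    """Little's Law: average IOs in flight = IO rate x average latency."""
    return iops * latency_seconds


if __name__ == "__main__":
    iops = 1_000_000
    io_size = 8 * 1024   # assumption: "8k" means 8 KiB
    latency = 110e-6     # 110 microseconds, as quoted above

    print(f"throughput: {throughput_gb_per_s(iops, io_size):.2f} GB/s")
    print(f"average IOs in flight: {outstanding_ios(iops, latency):.0f}")
```

With these inputs the result is about 8.2 GB/s, which lines up with the >8GBps figure, and roughly 110 IOs in flight on average.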

The next question from those who try to discredit Nutanix and HCI in general is what happens after a vMotion?

Let me start by saying this is a valid question, but even if performance dropped during/after a vMotion, is it even a major issue?

For business critical applications, it is common for vendors to recommend DRS should/must rules to prevent vMotion except in the event of maintenance or failure, regardless of whether the infrastructure is traditional/legacy NAS/SAN or HCI.

With a NAS/SAN, the best case scenario is 100% remote IO, whereas with Nutanix this is the worst case scenario. Let’s assume business as usual on Nutanix is 1M IOPS and that, during a vMotion and for a few minutes after, performance dropped by 20%.

That would still be 800k IOPS, which is higher than what most NAS/SAN solutions can deliver anyway.

But the fact is, Nutanix can sustain excellent performance during and after a vMotion, as demonstrated by the video below which was recorded in real time. Hint: Watch the values in the PuTTY session, as these show the performance as measured at the guest level, which is what ultimately matters.

Credit for the video goes to my friend and colleague Michael “Webscale” Webster (VCDX#66 & NPX#007).

The IO dropped below 1 million IOPS for approx 3 seconds during the vMotion with the lowest value recorded at 956k IOPS. I’d say an approx 10% drop for 3 seconds is pretty reasonable as the performance drop is caused by the migration stunning the VM and not by the underlying storage.

Over to our “friends” at the legacy storage vendors to repeat the same test on their biggest/baddest arrays.

Not impressed? Let’s see how a 70/30 read/write workload performs!