Nutanix Scalability – Part 3 – Storage Performance for a single Virtual Machine

Continuing on from Part 1 where we discussed how Nutanix can scale storage capacity separately to compute using storage only nodes, we will now cover how Nutanix can scale the storage performance of a Virtual server including beyond the capabilities of a single (scaled up) node.

Virtual machines, just like traditional physical servers benefit from having multiple storage controllers (e.g.: RAID controllers) and multiple drives regardless of their type (e.g.: HDD/SSD etc).

The same is true for VMs on Nutanix ADSF, more storage controllers and more virtual disks increase the storage performance.

For traditional hypervisors such as ESXi and Hyper-V, ensure you have the maximum four Paravritual SCSI controllers (PVSCSI) assigned to VM’s requiring the highest performance and low latency. Having multiple controllers means more queues are available to the virtual disks and therefore the less bottlenecks which can cause latency and inefficiency for the vCPUs assigned to the VM.

Because the benefits of multiple virtual SCSI adapters is so significant, Nutanix decided to ensure this functionality is achieved by default when using Nutanix’ next Generation Hypervisor, AHV with what is known as “Turbo Mode“.

This means virtual machines running on AHV are optimised by default at the virtual storage controller layer, removing the complexity for customers having to understand and configure virtual storage controllers.

Regarding Virtual disks, for optimal performance you’ll need to use at least four assuming you’re using the recommended four Paravirtual SCSI controllers (i.e.: One per SCSI controller) but since we’re talking about scaling performance, let’s talk about more extreme examples.

Let’s say we have an MS Exchange server with 20 databases, the performance requirements for each database is typically in the range of hundred of IOPS, in which case I would recommend one virtual disk (e.g.: VMDK) per database and another for the logs.

In the case of a large MS SQL server which may require tens or hundreds of thousands of IOPS to a single database, I recommend using multiple vDisks per database which involves Splitting SQL datafiles across multiple VMDKs to optimise VM performance.

In both examples, the virtual disks would be spread evenly across the four PVSCSI controllers if using ESXi or Hyper-V whereas AHV customers would just create the vDisks and each vDisk would by default enjoy a dedicated path direct to Nutanix ADSF’s I/O engine called “stargate”.

For more information about configuring virtual storage controllers and multiple virtual disks, see: SQL & Exchange performance in a Virtual Machine which goes through the process step by step.

At this stage we’ve learned to get optimal performance, we need to use multiple virtual disks regardless of hypervisor, and for traditional hypervisors (ESXi & Hyper-V) we need to assign/configure multiple PVSCSI adapters and spread out virtual disks evenly across them.

Let’s say you have a VM running a monster SQL workload, and the nodes/cluster has been sized correctly and the active working set (data) resides 100% in the SSD tier or it’s an all flash cluster.

The VM is also running on AHV enjoying Turbo Mode (or ESXi/Hyper-V with four PVSCSI controllers) and you’ve added 16 vDisks and spanned your database across the vDisks, BUT you still need more performance. What can we do next?

The good news is, Nutanix has lots of way’s to scale performance so let’s look at a few of them:

  • Increase the vCPU of the Nutanix Controller VM (CVM)

This is rarely required, but it’s important to understand that Nutanix is just software running inside a VM, so simply increasing the vCPUs assigned to the CVM gives more available power to drive front end I/O as well as background cluster functionality.

The CVM automatically allows N-2 of the CVMs vCPUs to stargate (the I/O engine) which means if you add more vCPUs to the CVM, you will get more potential front and back end IO.

If your application performance is being impacted due to the local CVM being saturated (firstly, well done as this is very rare), but adding say 2 more vCPUs to the CVM may be enough to alleviate the bottleneck and give you much improved performance. I’ve seen this situation before and for the relatively low “cost” of 2 vCPUs, it can be well wroth it.

It’s important to note you can increase the vCPUs of just a single CVM, multiple CVMs or all CVMs within the cluster depending on your requirements and cluster design. e.g.: In a mixed cluster of nodes with 22c processors and 10c processors, you may move the critical VMs to the nodes with 22c processors and increase the CVM by 2vCPUs while leaving the nodes with the smaller 10c processors at default CVM size. This would deliver increased performance for the entire cluster while the most benefits would be felt on the 22c nodes.

For those interested in the Pros and Cons of the CVM and it’s use of host resources, please review: Cost vs Reward for the Nutanix Controller VM (CVM)

  • Increase the vRAM of the Nutanix Controller VM (CVM)

Increasing the CVMs RAM is another quick and easy way to improve performance. The two main reasons adding RAM can improve performance is because part of the CVMs RAM acts as a read cache so depending on your application and dataset size, the additional read cache can make a huge difference.

The second reason is for CVM RAM allows additional medusa (metadata) cache which helps minimise read latency.

If you look at http://CVM_IP:2009/cache_stats (example below) and your “Range Cache Hit %” is 50%, Then you’re getting very good cache hits, whereas if it was just 5% then more RAM may result in significantly better read performance depending on the working set size.

The other critical factor for performance is the medusa cache. We want to see as close to 100% as possible for the “VDisk block map Cache Hit %” & “Extent group id map Cache Hit %”.

StargateCacheStats

The above is an example of a system which has an optimal CVM RAM size for the working set as the Range Cache Hit % of 50% and the “VDisk block map Cache Hit %” & “Extent group id map Cache Hit %” are both sitting consistently at 100%.

The above cache and medusa hit rates are from a test cluster and it was achieving the following performance for a database checksum task (100% read).

ExamplePerformanceWith100%MedusaHitRate

The key here is the very low read latency which peaked at 0.35ms and sustained around 0.18ms over the course of several hours.

Signs of insufficient CVM RAM can be inconsistent read latency so if you’re observing this issue, review http://CVM_IP:2009/cache_stats and contact support for advise on CVM RAM sizing.

Note: There is no “harm” in adding more CVM RAM as long as the CVM is sized within a NUMA node to ensure memory performance remains optimal, the only impact is less available RAM for other Virtual Machines.

Let’s recap where we’re at:

The VM is on AHV enjoying Turbo Mode (or ESXi/Hyper-V with four PVSCSI controllers) with 16 vDisks and spanned your database across the vDisks. We’ve increased the CVM vCPUs and verified we have 100% hit rates for medusa and respectable 50% read cache hit rates, but we still need more performance, what else can we do?

  • Add storage only nodes

The benefits of adding storage only nodes, especially to a busy cluster is not only immediate but obvious when we look at the total IOPS, read and write latency.

If you’ve not read my post titled “Scale out performance testing with Nutanix Storage Only Nodes” I will quickly recap it for you, but I recommend reading the full article.

In short, I ran an MS Exchange Jetstress workload on 4 VMs on an optimally configured 4 node hybrid (SSD+SATA) cluster and achieved the following results.

Jetstress4NodesSummary

Observations from the baseline test:

  1. We achieved the desired >1000 IOPS per VM
  2. Performance was consistent across all Jetstress instances
  3. Log writes were in the 1ms range as they were serviced by the ADSF Oplog (persistent write buffer)
  4. Database reads were on average just under 10ms which is well below the Microsoft recommended 20ms
  5. The Database creation time averaged 2hrs 24mins
  6. The duplication of 3 databases averaged 4hrs 17mins
  7. The database checksum took on average around 38mins

I then added 4 more nodes to the cluster and without making any changes to the Jetstress, virtual machine/s or the cluster configuration and the IOPS jumped by 2x!!

The results for each of the four Jetstress VMs are shown below including the average across the VMs for each of the difference metrics.

Jetstress8NodesSummary

In summary adding the 4 storage only nodes:

  • Achieved IOPS jumped by almost 2x
  • Log writes average latency was lower by 13%
  • Database write latency dropped by >20%
  • Database read latency dropped by almost 2x
  • The Database creation time was just under 15 mins faster
  • The duplication of 3 databases improved by almost 35 mins
  • The database checksum was 40 seconds faster.

As we can see from these results, adding storage only nodes can significantly increase the performance without any tuning. Had I tuned the Jetstress configuration, much higher performance and potentially lower read/write latency could have also been achieved.

In short, adding storage only nodes is a quick win for performance with the added advantage of increasing the resiliency and capacity of the cluster.

So we’ve now achieved much higher performance for our workload thanks to a combination of optimally configured VM, CVM and the addition of storage only nodes.

If at this stage you’re still not achieving the performance you require, you’re in the 1% where we may need to utilise Acropolis Block Service (ABS) to further improve performance.

  • Acropolis Block Services (ABS)

ABS was announced in 2016 to address the edge use cases as customers wanted to make Nutanix the standard platform for their datacenters, however they have not been able to realise this vision due to a number of reasons including:

  • The desire/requirement to re-use existing servers
  • Applications which are not virtual (for many reasons, mostly political)
  • Performance / Scalability of externally connected servers
  • Complexity including operational considerations of external iSCSI

For more detailed information about the release please review: What’s .NEXT 2016 – Acropolis Block Services (ABS)

ABS works by using In-guest iSCSI to present vDisks direct to the Guest OS. The vDisks are then automatically load balanced across the entire Nutanix cluster to provide optimal performance.

The below tweet answers the FAQ around how distributed is the workload when using ABS. As we see below, a 4 node cluster uses 4 paths and when the cluster is expanded to 8 nodes ABS automatically (and almost instantly) expands to use 8 paths (or CVMs).

The downside of ABS is the loss of data locality, but if we can’t have data locality, the next best thing is a highly scalable, resilient and dynamic distributed storage fabric.

ABS can scale performance in a linear manner which is only limited by the network bandwidth and number of nodes to drive the IO, so a physical server with say 100GB NICs and a cluster of 32 nodes would produce ridiculous levels of performance in the multi-millions of IOPS range.

The In-guest iSCSI setup is also very simple, just set the iSCSI Target as the Nutanix Cluster IP and the load balancing is dynamically calculated, when the cluster size increases, the vDisks are automatically balanced across the new nodes without user intervention, the same is true for node removals, maintenance, upgrades, failures etc. Everything is managed automatically so ABS is a very simple iSCSI implementation for admins.

Summary:

Nutanix provides excellent scalability for Virtual Machines and provides ABS for niche workloads which may require more performance than a single node can offer.

Up next, Part 4 where we cover the latest and most exciting development for scaling storage Performance for Monster VMs.

Back to the Scalability, Resiliency and Performance Index.

Nutanix X-Ray Benchmarking tool – Snapshot Impact Scenario

In the first part of this series, I introduced Nutanix X-Ray benchmarking tool which has been designed very differently to traditional benchmarking tools as the performance of the app is the control and the variable is the platform,not the other way around.

This is done by generating realistic IO patterns (e.g.: Not 100% 4k read) and then performing functions against the platform to see how the control (the VM application performance) is impacted by the underlying platforms functionality.

A great example of this is performing snapshots as the first step in a space efficient backup solution.

X-Ray has a built in test which generates an OLTP workload which is ran for 8 hours which for an all flash platform generates 6000 IOPS across the database and 400 IOPS for the logs. The scenario is detailed in the X-Ray report shown below.

XraySnapshotImpactDescription

The Snapshot impact scenario is then ran against multiple platforms and using the Analysis functionality within X-ray. we can generate a report which overlays the results from multiple platforms.

The below example is GA Acropolis Hypervisor (AHV) on AOS 5.1.1 verses a leading hypervisor and SDS platform showing the snapshot impact scenario.

XraySnapshotImpact

Each of the red lines indicate a snapshot and what we observe is the performance of both platforms remains consistent until the 10th snapshot (shown below) where the Nutanix platform continues without impact and the leading hypervisor and SDS platform starts degrading significantly.

XraySnapshotImpactSnap10

In the real world, customers use the intelligent features of storage, SDS or hyper-converged platforms but rarely test how this functionality works prior to purchasing. This is because it’s difficult and time consuming to do so.

Nutanix X-Ray tool makes the process of validating a platforms performance under real world scenarios a quick and easy process and provides automatically generated reports where accurate comparisons can be made.

What this example shows is that while both platforms could achieve the required performance without snapshots, only Nutanix AHV & AOS could maintain the performance while utilising snapshots to achieve the type of recovery point objective (RPO) that is expected in production environments, especially with business critical workloads.

As part of the Nutanix Solutions and Performance engineering organisation, I can tell you that the focus for Nutanix is real world performance, using data reduction, leveraging snapshots, mixing workloads and testing a large scale.

In upcoming posts I will show more examples of X-Ray test scenarios as well as comparisons between GA Acropolis Hypervisor (AHV) & AOS 5.1.1 verses a leading hypervisor and SDS platform.

Related Articles:

Nutanix X-Ray Benchmarking tool Part 1 – Introduction

Nutanix X-Ray Benchmarking tool Part 3 – Extended Node Failure Scenario

The truth about storage benchmarking

Recently I was asked to review some performance testing done by an external party and my initial impression was the performance was well below what I expected.

So over the weekend I setup a block in my lab to reproduce the tests to see if the results were firstly repeatable, and if so, what performance would I get with and without tuning.

The only significant difference between my hardware and the hardware used by the 3rd party was that I used old dual socket Ivy Bridge E5-2670 2.6Ghz 8c processors and the 3rd party had a much newer dual Broadwell E5-2640 v4 2.4Ghz processors.

If we compare the two processors using CPUBoss.com we see the following:

CPUboss1

Not surprisingly the Broadwell E5-2640 v4 processor is faster, but possibly less than you would expect with a 16.28% better PassMark per core, and in my opinion, the per core value quite importaint especially when considering business critical applications.

None the less, a 16.28% performance improvement per core will be a significant factor for a benchmark with Nutanix as the Controller VM (CVM) is powered by the CPU of the host.

I thought I would whip up a quick post about performance benchmarking to show how different performance results can be on the same hardware depending on just a few factors and why storage benchmarking, especially competitive benchmarking, cannot and should not be trusted when making purchasing decisions.

This test was for a 10k user MS Exchange deployment and the hardware used for testing was performed on in both cases was 1 x 1.92TB SSD and 3 x 4TB SATA drives and they were both tested on the same GA Nutanix AOS build.

The required (or Target) IOPS was just 216 per MS Exchange instance (VM) as shown below by the Jetstress report.

TargetIO

This target is calculated by Jetstress when using the “Exchange Mailbox Profile” test scenario with the following configuration:

MailboxProfile2

The resulting SSD vs SATA ratio makes this test largely about the limitations of SATA performance as >87% of data is being read from the SATA tier.

TierUsage

Test 1: The Jetstress dataset was created and then the performance test was immediately ran for 2hrs with no pre-warming of the metadata or read cache.

Achieved Transactional I/O: 200.663
Avg Log Write Latency: 1.06ms
Avg DB Write Latency: 1.4ms
Avg DB Read Latency: 14ms

This result was 15.61% lower than the 3rd parties result and interestingly if we correct for CPU core performance, it’s less than 1% difference. As this was in line with my expectation knowing the importance of CPU clock speed, I would say for this testing that the baseline results were comparable.

Test 2: The Nutanix tiering was tuned to suit large working sets (which vastly exceed the SSD tier) and the Jetstress dataset was created and the performance test was immediately ran for 2hrs again with no pre-warming of the metadata or read cache.

Before we get to the results, I want to point out that Jetstress is in some ways is very good but in other ways a very unrealistic benchmarking tool as the entire dataset is “active” which is not the case in the real world. However, in one way this is a good thing because a passing Jetstress result in my experience means the production deployment performs very well from a storage perspective especially when using tiered storage which is built around the assumption not all data is active. As a result, a Jetstress test could be considered a “worse case scenario” style test for intelligent tiered storage.

Achieved Transactional I/O: 249.623
Avg Log Write Latency: 0.99ms
Avg DB Write Latency: 1.5ms
Avg DB Read Latency: 12ms

Test 3: I then setup Jetstress as per Nutanix MS Exchange best practices and ran the test again with no pre-warming of the metadata or read cache.

Achieved Transactional I/O: 389.753
Avg Log Write Latency: 0.95ms
Avg DB Write Latency: 2.0ms
Avg DB Read Latency: 17ms

Test 4: I then lowered the Jetstress thread count to the lowest value (roughly 33% lower) which I estimated would achieve the target IOPS (this is to simulate real world requirements) and ran the test again with no pre-warming of the metadata or read cache.

Achieved Transactional I/O: 300.254
Avg Log Write Latency: 0.94ms
Avg DB Write Latency: 1.5ms
Avg DB Read Latency: 12ms

Note: Test 4 achieved the highest I/O per thread.

Test 5: The same configuration as Test 4 but with pre-warming of the metadata cache.

Achieved Transactional I/O: 334
Avg Log Write Latency: 0.98ms
Avg DB Write Latency: 1.9ms
Avg DB Read Latency: 12.4ms

Some of you might be asking, how did test 4 achieve higher transactional I/O and with lower read and write latency than Test 1 & 2 with less threads. Shouldn’t a higher thread count achieve higher IOPS?

The reasons is because the original thread count was pushing the SATA drives past their capabilities, leading to excessive latency. Lowering the thread count allowed the SATA drives to operate at somewhere around their most efficiency range leading to lower latency.

Test 6: The same configuration as Test 5 but with tuned extent cache (RAM read cache) and 100% medadata cached.

Achieved Transactional I/O: 362.729ms
Avg Log Write Latency: 0.92ms
Avg DB Write Latency: 1.7ms
Avg DB Read Latency: 12ms

As we can see from Test 1 through to Test 6, the performance differs by up to 81% depending on how the platform is configured.

Side note and future looking statement. Many of the optimisations I performed above wont be required for long as many of the areas these optimisations help improve are being addressed in upcoming code. In saying that, for a business critical application like Exchange, I don’t think it’s a problem doing some optimisation as long as 90% of the workloads run well by default and we’re only tuning for the 10% (vBCA) workloads.

But out of interest, what would happen if we enabled data reduction? How much of a performance hit would that take?

Test 7: The same configuration as Test 6 but with In-line compression enabled.

Achieved Transactional I/O: 751.275
Avg Log Write Latency: 0.97ms
Avg DB Write Latency: 3.4ms
Avg DB Read Latency: 5.9ms

That’s a 107.46% increase in transactional IO and with in-line compression! Log write latency remained sub ms and read latency has almost halved.

Note: As Jetstress data is highly compressible, (Nutanix achieves 8:1 or higher with non default settings), I tuned the compression slice size to give a more realistic data reduction ratio. The ratio for this test was 3.99:1 and the ratio of SSD to SATA was almost exactly 50% as shown below.

TierUsageAfterCompression

Why did performance improve so much with In-Line compression? Well there is two main reasons:

  1. More data is being served from the SSD tier as compression allows more effective SSD tier capacity.
  2. Reads from SATA are faster as less physical data needs to be read to service an I/O due to it being compressed. The higher the compression ratio, the more this can improve.

As we can see, the results varied significantly and had I wanted to optimise the test further, I could have achieved even higher performance but there was no need. The requirements for the solution were already achieved and in the case of Test 7, the requirements were exceeded by 247% meaning the solution had heaps of headroom.

Nutanix best practice is to enable In-line compression for MS Exchange and other databases such as Oracle and SQL as per my tweet below.

This testing was performed on Nutanix Acropolis Hypervisor (AHV) but was not using the upcoming Turbo mode, which will further improve performance and lower overheads.

This is a key point many people forget when benchmarking. If we assume the platforms in question are scalable (e.g.: Like Nutanix), it doesn’t matter if one platform does 100k IOPS and another does 200k IOPS if your requirements are 20k IOPS. Both platforms capabilities vastly exceed the requirement (10k IOPS) from a performance perspective, so performance is not longer a significant factor in your purchasing decision.

Question: Are the above performance results genuine?

All of the above results could be argued to be genuine results, at the same time none of the above represent the best performance that could be achieved, yet the results could be used to try and create FUD if they are improperly represented (which is almost always the case with competitive comparisons whether intentional or otherwise).

Let’s say this was your proof of concept, What should be the take away from benchmarking results like this?

Simple: The solution meets/exceeds your performance requirements.

Now for the point of this article: The truth about storage benchmarking is that there are so many variables that can affect the results that unless you’re truely experienced in benchmarking your applications AND an expert in the platforms you’re benchmarking, your results are unlikely to be indicative of the platforms capabilities and therefore of very little value.

If you’re benchmarking Vendor A vs Vendor B, it’s a waste of time doing “Like for Like” benchmarking because the Virtual machine and application settings which are optimal for one vendor, will likely be different for the other vendor. e.g.: SAN vs HCI.

On the other hand, a more valid test would be vendor A’s best practices vs vendor B best practices, but again if one vendor Jetstress achieves 500 and the other achieves 400, that 20% higher performance is all but irrelevant if your requirements are say, 216 like in this case.

A very good example of invalid “like for like” benchmarking would be to size the active working set (i.e.: The capacity of the data you plan to benchmark against) to fit within the cache/SSD tier of one platform, but exceed the cache/SSD capacity of the other platform. The results will be vastly different and will not be indicative of real world performance. This is what vendors do when competitive benchmarking and it’s likely one of the main reasons we see End User License Agreements (EULA) from most if not all storage vendors preventing publishing benchmark results without written agreement.

So the (unpopular) truth about storage benchmarking is it’s not as easy and building a VM and running I/O meter with the same profile on multiple system like some vendors and even 3rd party storage analysts would have you believe. The vast majority of people (customers, analysts and even vendors) doing benchmarking don’t have the skill/experience to produce repeatable or meaningful results, especially on multiple platforms.

In fact it’s unrealistic/unreasonable to expect a person (customer, vendor, consultant) to be an expert in multiple platforms, and very few people are!

Related Articles:

  1. Peak Performance vs Real World Performance
  2. The Key to performance is Consistency