Predictable & Scalable MS Exchange 2016 Performance on Nutanix with AHV

I’ve been doing some testing recently with Nutanix’s latest GA code (AOS 5.8) and I decided to do some quick MS Exchange Jetstress performance tests as part of a larger piece of work.

In short, I wanted to check how well Exchange storage performance scaled, so I performed three tests. I started with 4 threads, then increased to 8 and finally to 12 threads using Jetstress with Exchange 2016 ESE database modules.

For this testing I disabled the Nutanix in-memory read cache to ensure all read IO was serviced by the physical SSDs, so the results are not artificially improved by caching.

I also disabled Compression, Erasure Coding and Deduplication as these also artificially improve performance due to Jetstress data being highly compressible & dedupable.

The hardware used was a NX-8150 with 6 x SSDs and Intel Broadwell processors. This is why the database size was only 1.7TB as that’s just below the total usable capacity of the node. The performance over larger database sizes remains the same when the metadata cache (in the Nutanix Controller VM) is sized for the desired working set size as shown by our ESRP certification.

The hypervisor is Acropolis Hypervisor (AHV) which is fully certified for Microsoft Windows under the MS SVVP programme as well as MS ESRP certified for MS Exchange.

So here is the result for 4 threads.

Jetstress2016_4Threads

5580 IOPS with just 4 threads is very good performance and is sufficient for at least five thousand mailboxes with hundreds of messages per day, which is the maximum recommended number of active users per Exchange MSR server.

The next question is: What’s the latency for the database reads and log writes? (These are two of the critical performance metrics for Jetstress Pass/Fail results)

Jetstress2016_4Threads_Latency

Here we can see log write latency average across all four log drives is below 1ms (0.99ms) and database read latency at 1.16ms.

Next up, here is the result for 8 threads.

Jetstress2016_8Threads

10147 IOPS with 8 threads is excellent performance and shows Nutanix easily has headroom for more than ten-thousand mailboxes with hundreds of messages per day which easily exceeds the requirements for the maximum recommended active users per Exchange MSR server.

Again, let’s check out the latency. Here we can see the average log write latency across all four log drives is still below 1ms (0.99ms) and database read latency is 1.29ms. That’s just 0.13ms higher latency for reads and exactly the same write latency while achieving almost DOUBLE the IOPS.

Jetstress2016_8Threads_Latency

Lastly here is the result for 12 threads.

Jetstress2016_12Threads

14351 IOPS with 12 threads proves how scalable the Nutanix platform is as this is almost a linear increase in IOPS.

Again, let’s check out the latency. Here we can see the average log write latency across all four log drives is still below 1ms (0.98ms) and database read latency is 1.42ms. That’s just 0.13ms higher latency for reads and slightly lower write latency while achieving an almost linear improvement in IOPS.

Jetstress2016_12Threads_Latency
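To put a number on "almost linear", here is a minimal sketch that takes the three IOPS results above and compares each against perfectly linear scaling from the 4 thread run. The function name and the choice of the 4 thread run as the baseline are my own assumptions for illustration, not something Jetstress reports.

```python
# Thread counts and achieved IOPS as reported in the three Jetstress runs above.
results = {4: 5580, 8: 10147, 12: 14351}

# Assumption: treat the 4 thread run as the baseline for "linear" scaling.
base_threads = 4
base_iops = results[base_threads]


def scaling_efficiency(threads: int, iops: float) -> float:
    """Achieved IOPS as a fraction of perfectly linear scaling from the baseline run."""
    ideal_iops = base_iops * threads / base_threads
    return iops / ideal_iops


for threads, iops in results.items():
    per_thread = iops / threads
    efficiency = scaling_efficiency(threads, iops)
    print(f"{threads:>2} threads: {iops:>6} IOPS, "
          f"{per_thread:7.1f} IOPS/thread, {efficiency:.1%} of linear")
```

Read alongside the latency figures, this shows how much per thread throughput is given up as the thread count rises while read latency only moves by fractions of a millisecond.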

Summary:

Nutanix provides extremely high, predictable performance for even the most demanding MS Exchange environments.

 

The truth about storage benchmarking

Recently I was asked to review some performance testing done by an external party and my initial impression was the performance was well below what I expected.

So over the weekend I set up a block in my lab to reproduce the tests to see firstly whether the results were repeatable and, if so, what performance I would get with and without tuning.

The only significant difference between my hardware and the hardware used by the 3rd party was that I used old dual socket Ivy Bridge E5-2670 2.6GHz 8c processors and the 3rd party had much newer dual Broadwell E5-2640 v4 2.4GHz processors.

If we compare the two processors using CPUBoss.com we see the following:

CPUboss1

Not surprisingly the Broadwell E5-2640 v4 processor is faster, but possibly by less than you would expect, with a 16.28% better PassMark per core. In my opinion, the per core value is quite important, especially when considering business critical applications.

Nonetheless, a 16.28% performance improvement per core will be a significant factor for a benchmark with Nutanix as the Controller VM (CVM) is powered by the CPU of the host.

I thought I would whip up a quick post about performance benchmarking to show how different performance results can be on the same hardware depending on just a few factors and why storage benchmarking, especially competitive benchmarking, cannot and should not be trusted when making purchasing decisions.

This test was for a 10k user MS Exchange deployment. The hardware used for testing in both cases was 1 x 1.92TB SSD and 3 x 4TB SATA drives, and both were tested on the same GA Nutanix AOS build.

The required (or Target) IOPS was just 216 per MS Exchange instance (VM) as shown below by the Jetstress report.

TargetIO

This target is calculated by Jetstress when using the “Exchange Mailbox Profile” test scenario with the following configuration:

MailboxProfile2

The resulting SSD vs SATA ratio makes this test largely about the limitations of SATA performance as >87% of data is being read from the SATA tier.

TierUsage

Test 1: The Jetstress dataset was created and then the performance test was immediately run for 2hrs with no pre-warming of the metadata or read cache.

Achieved Transactional I/O: 200.663
Avg Log Write Latency: 1.06ms
Avg DB Write Latency: 1.4ms
Avg DB Read Latency: 14ms

This result was 15.61% lower than the 3rd party’s result and, interestingly, if we correct for CPU core performance, it’s less than a 1% difference. As this was in line with my expectation, knowing the importance of CPU clock speed, I would say that for this testing the baseline results were comparable.

Test 2: The Nutanix tiering was tuned to suit large working sets (which vastly exceed the SSD tier), the Jetstress dataset was created and the performance test was immediately run for 2hrs, again with no pre-warming of the metadata or read cache.

Before we get to the results, I want to point out that Jetstress is in some ways very good but in other ways a very unrealistic benchmarking tool, as the entire dataset is “active”, which is not the case in the real world. However, in one way this is a good thing because a passing Jetstress result in my experience means the production deployment performs very well from a storage perspective, especially when using tiered storage which is built around the assumption that not all data is active. As a result, a Jetstress test could be considered a “worst case scenario” style test for intelligent tiered storage.

Achieved Transactional I/O: 249.623
Avg Log Write Latency: 0.99ms
Avg DB Write Latency: 1.5ms
Avg DB Read Latency: 12ms

Test 3: I then set up Jetstress as per Nutanix MS Exchange best practices and ran the test again with no pre-warming of the metadata or read cache.

Achieved Transactional I/O: 389.753
Avg Log Write Latency: 0.95ms
Avg DB Write Latency: 2.0ms
Avg DB Read Latency: 17ms

Test 4: I then lowered the Jetstress thread count to the lowest value (roughly 33% lower) which I estimated would achieve the target IOPS (this is to simulate real world requirements) and ran the test again with no pre-warming of the metadata or read cache.

Achieved Transactional I/O: 300.254
Avg Log Write Latency: 0.94ms
Avg DB Write Latency: 1.5ms
Avg DB Read Latency: 12ms

Note: Test 4 achieved the highest I/O per thread.

Test 5: The same configuration as Test 4 but with pre-warming of the metadata cache.

Achieved Transactional I/O: 334
Avg Log Write Latency: 0.98ms
Avg DB Write Latency: 1.9ms
Avg DB Read Latency: 12.4ms

Some of you might be asking: how did Test 4 achieve higher transactional I/O, and with lower read and write latency, than Tests 1 & 2 with fewer threads? Shouldn’t a higher thread count achieve higher IOPS?

The reason is that the original thread count was pushing the SATA drives past their capabilities, leading to excessive latency. Lowering the thread count allowed the SATA drives to operate somewhere around their most efficient range, leading to lower latency.

Test 6: The same configuration as Test 5 but with tuned extent cache (RAM read cache) and 100% metadata cached.

Achieved Transactional I/O: 362.729
Avg Log Write Latency: 0.92ms
Avg DB Write Latency: 1.7ms
Avg DB Read Latency: 12ms

As we can see from Test 1 through to Test 6, the performance differs by up to 81% (Test 6 compared with Test 1) depending on how the platform is configured.
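For anyone wanting to check the spread themselves, here is a quick sketch using the achieved transactional I/O figures from Tests 1 to 6 and the 216 IOPS target from the Jetstress report. The helper name `pct_change` is just something I made up for this example. Note that the 81% figure corresponds to Test 6 against Test 1; Test 3 achieved more raw IOPS than Test 6, but it did so with 17ms average database read latency.

```python
# Per-VM target calculated by Jetstress and the achieved transactional I/O
# for each test, copied from the results listed above.
target_iops = 216.0
achieved = {
    "Test 1": 200.663,
    "Test 2": 249.623,
    "Test 3": 389.753,
    "Test 4": 300.254,
    "Test 5": 334.0,
    "Test 6": 362.729,
}


def pct_change(new: float, old: float) -> float:
    """Percentage change of `new` relative to `old` (hypothetical helper)."""
    return (new - old) / old * 100


baseline = achieved["Test 1"]
for name, iops in achieved.items():
    vs_baseline = pct_change(iops, baseline)
    vs_target = pct_change(iops, target_iops)
    verdict = "meets target" if iops >= target_iops else "below target"
    print(f"{name}: {iops:8.3f} IOPS, {vs_baseline:+6.1f}% vs Test 1, "
          f"{vs_target:+6.1f}% vs target ({verdict})")
```

The same hardware goes from missing the target in Test 1 to comfortably exceeding it in every later test, which is the whole point of this post.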

Side note and future looking statement: many of the optimisations I performed above won’t be required for long, as many of the areas these optimisations help improve are being addressed in upcoming code. In saying that, for a business critical application like Exchange, I don’t think it’s a problem doing some optimisation as long as 90% of the workloads run well by default and we’re only tuning for the 10% (vBCA) workloads.

But out of interest, what would happen if we enabled data reduction? How much of a performance hit would that take?

Test 7: The same configuration as Test 6 but with In-line compression enabled.

Achieved Transactional I/O: 751.275
Avg Log Write Latency: 0.97ms
Avg DB Write Latency: 3.4ms
Avg DB Read Latency: 5.9ms

That’s a 107% increase in transactional IO over Test 6, and with in-line compression! Log write latency remained sub ms and read latency has almost halved.

Note: As Jetstress data is highly compressible, (Nutanix achieves 8:1 or higher with non default settings), I tuned the compression slice size to give a more realistic data reduction ratio. The ratio for this test was 3.99:1 and the ratio of SSD to SATA was almost exactly 50% as shown below.

TierUsageAfterCompression

Why did performance improve so much with In-Line compression? Well, there are two main reasons:

  1. More data is being served from the SSD tier as compression allows more effective SSD tier capacity.
  2. Reads from SATA are faster as less physical data needs to be read to service an I/O due to it being compressed. The higher the compression ratio, the more this can improve.
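The first reason can be sanity checked with some back-of-the-envelope arithmetic. The sketch below assumes (and this is a simplification of mine, not how the tiering logic actually works) that the share of the logical dataset held on SSD scales linearly with the compression ratio, and ignores metadata, Oplog and any other SSD consumers. Since more than 87% of data was on SATA before compression, I use 13% as the upper bound for the SSD share.

```python
def ssd_logical_fraction(uncompressed_fraction: float, compression_ratio: float) -> float:
    """Fraction of the logical dataset that fits on SSD after compression.

    Simplifying assumption: SSD holds `compression_ratio` times more logical
    data once compressed, capped at 100% of the dataset.
    """
    return min(1.0, uncompressed_fraction * compression_ratio)


before = 0.13   # at most ~13% on SSD, since >87% of data was on the SATA tier
ratio = 3.99    # compression ratio reported for Test 7

after = ssd_logical_fraction(before, ratio)
print(f"Estimated share of data on SSD after compression: {after:.0%}")
```

The estimate lands close to the roughly 50% SSD to SATA split observed in Test 7, which suggests the simple model is at least in the right ballpark for this dataset.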

As we can see, the results varied significantly and had I wanted to optimise the test further, I could have achieved even higher performance but there was no need. The requirements for the solution were already achieved and in the case of Test 7, the requirements were exceeded by 247% meaning the solution had heaps of headroom.

Nutanix best practice is to enable In-line compression for MS Exchange and other databases such as Oracle and SQL as per my tweet below.

This testing was performed on Nutanix Acropolis Hypervisor (AHV) but was not using the upcoming Turbo mode, which will further improve performance and lower overheads.

This is a key point many people forget when benchmarking. If we assume the platforms in question are scalable (e.g. like Nutanix), it doesn’t matter if one platform does 100k IOPS and another does 200k IOPS if your requirements are 20k IOPS. Both platforms’ capabilities vastly exceed the requirement from a performance perspective, so performance is no longer a significant factor in your purchasing decision.

Question: Are the above performance results genuine?

All of the above results could be argued to be genuine. At the same time, none of them represent the best performance that could be achieved, yet the results could be used to try to create FUD if they are improperly represented (which is almost always the case with competitive comparisons, whether intentional or otherwise).

Let’s say this was your proof of concept: what should be the takeaway from benchmarking results like this?

Simple: The solution meets/exceeds your performance requirements.

Now for the point of this article: The truth about storage benchmarking is that there are so many variables that can affect the results that unless you’re truly experienced in benchmarking your applications AND an expert in the platforms you’re benchmarking, your results are unlikely to be indicative of the platform’s capabilities and are therefore of very little value.

If you’re benchmarking Vendor A vs Vendor B, it’s a waste of time doing “Like for Like” benchmarking because the virtual machine and application settings which are optimal for one vendor will likely be different for the other vendor, e.g. SAN vs HCI.

On the other hand, a more valid test would be vendor A’s best practices vs vendor B’s best practices, but again, if one vendor’s Jetstress result is 500 and the other’s is 400, that 25% higher performance is all but irrelevant if your requirements are, say, 216 like in this case.
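To make that concrete, here is a small sketch that measures each result against the requirement rather than against the other vendor. The vendor names and the 500/400 figures are the hypothetical values from the paragraph above, and `headroom_pct` is a helper I made up for the example.

```python
# Requirement from the Jetstress report in this post.
required_iops = 216.0

# Hypothetical results for two vendors, each tested with its own best practices.
vendor_results = {"Vendor A": 500.0, "Vendor B": 400.0}


def headroom_pct(achieved: float, required: float) -> float:
    """How far the achieved IOPS exceeds the requirement, as a percentage."""
    return (achieved / required - 1) * 100


for vendor, iops in vendor_results.items():
    headroom = headroom_pct(iops, required_iops)
    status = "PASS" if headroom >= 0 else "FAIL"
    print(f"{vendor}: {iops:.0f} IOPS, {headroom:.0f}% headroom over requirement ({status})")
```

Both hypothetical vendors pass with plenty of headroom, so the gap between them tells you very little. Applying the same function to Test 7 (751.275 against 216) gives the roughly 247% headroom quoted earlier.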

A very good example of invalid “like for like” benchmarking would be to size the active working set (i.e.: The capacity of the data you plan to benchmark against) to fit within the cache/SSD tier of one platform, but exceed the cache/SSD capacity of the other platform. The results will be vastly different and will not be indicative of real world performance. This is what vendors do when competitive benchmarking and it’s likely one of the main reasons we see End User License Agreements (EULA) from most if not all storage vendors preventing publishing benchmark results without written agreement.

So the (unpopular) truth about storage benchmarking is it’s not as easy as building a VM and running Iometer with the same profile on multiple systems like some vendors and even 3rd party storage analysts would have you believe. The vast majority of people (customers, analysts and even vendors) doing benchmarking don’t have the skill/experience to produce repeatable or meaningful results, especially on multiple platforms.

In fact it’s unrealistic/unreasonable to expect a person (customer, vendor, consultant) to be an expert in multiple platforms, and very few people are!

Related Articles:

  1. Peak Performance vs Real World Performance
  2. The Key to performance is Consistency

Scale out performance testing with Nutanix Storage Only Nodes

At Nutanix’s inaugural user conference in 2015, Storage Only nodes were announced, which allowed customers for the first time to scale capacity without having to add compute nodes. This gives customers more flexibility and eliminates the need to license the storage nodes for vSphere, as storage only nodes run Acropolis Hypervisor (AHV) and are managed entirely through PRISM.

A common question from prospective and existing Nutanix customers is: what if my VM’s storage exceeds the capacity of a Nutanix node? The answer is detailed in this blog post but in short, as the Acropolis Distributed Storage Fabric (ADSF) distributes data throughout the cluster at a 1MB granularity, a VM’s storage can exceed the local node and performance even improves, including reads from the capacity (SAS/SATA) tier.

Storage only nodes were previously limited to the NX-6035C (and Dell XC/Lenovo HX equivalents) but at Nutanix’s .NEXT conference in Las Vegas in 2016, it was announced that any node (including all-flash) can be a storage only node.

This means even for high performance and/or high capacity environments, Nutanix clusters can be scaled without the need to add compute nodes or purchase additional licensing if you are running vSphere as the hypervisor.

However to date Nutanix are yet to publish any performance data showing the value of storage only nodes, so I decided to run a few tests and demonstrate the value of the Acropolis Distributed Storage Fabric (ADSF) and Storage Only Nodes.

Before we get to the performance data, to avoid competitors’ inevitable attempts to create FUD about Nutanix performance, I will not be publishing the exact specifications of the node types, drives or Jetstress configurations. I will be publishing the IOPS/latency and database creation, duplication and checksumming durations of the direct comparisons, which clearly show the performance advantage of storage only nodes.

Jetstress was not configured to demonstrate maximum performance of the underlying Nutanix solution, it was configured to achieve around 1000 IOPS which is typically higher than even a large Exchange deployment requires per instance. This also allows this test to demonstrate how performance improves when the cluster is performing real world levels of IO (at least in the case of Exchange for this example).

The performance advantage will vary between node types and based on how many storage only nodes are added to the cluster. But the point of this example is to show that ADSF is a truly distributed storage fabric and that the storage only nodes and additional Nutanix Controller VMs (CVMs) servicing replication (RF) traffic and remote reads significantly improve performance for VMs residing on the Compute+Storage nodes.

Test Overview:

The first test will be performed using four Jetstress VMs running on a four node cluster. The second test will be performed after an additional four storage only nodes are added to the cluster to form an eight node cluster. Before the second test the cluster will be wiped of all data with the exception of the Windows 2012 R2 template and all Jetstress DBs will be created from scratch so we can compare DB creation as well as performance and DB checksumming durations. Wiping all data also ensures there is no pre-warming of the extent cache (in memory read cache) or metadata cache.

Test Preparation:

I performed a cluster stop / cluster destroy / cluster create to ensure the cluster is totally clean and that we have a fair baseline for the test. The cluster was made up of four nodes.

I then created a base Windows 2012 R2 virtual machine with 4 PVSCSI adapters and 9 vDisks: one for the OS, 4 for the DBs and 4 for the logs. DB drives were formatted with a 64k allocation size and log drives with 4k, as the different allocation sizes and separate virtual disks have shown approx 25% performance improvement in my testing. Separate disks also suit my recommendation of In-Line compression and Erasure Coding (EC-X) for Exchange databases and no data reduction for logs.

Jetstress was configured to use 80% of the vDisks’ capacity, which resulted in approx 80% of the Nutanix storage pool capacity being utilised for the test. I will point out these were not low capacity nodes such as NX3060s, so the database creation time is significant because there was lots of data to create.

I then cloned the VM 3 times and spread the 4 VMs across 4 Nutanix Nodes running ESXi 5.5 Update 3.

Test 1: Create Databases and run 2hr test

The database creation phase creates one database, then Jetstress duplicates the database (in this case 3 times) and immediately after creation the performance test begins.

Note: No data reduction was used for this test as it will result in unrealistic data reduction and performance results as I described in the post Jetstress Testing with Intelligent Tiered Storage Platforms.

I configured Jetstress in this way to ensure the extent cache (in memory read cache) was not pre-warmed and so the results of the test would be fair and repeatable.

Once the performance test completed, I waited for the test to finish on every VM before allowing the database checksum validation task to run. (This is done by using the Multi-host option in Jetstress).

The results for each of the four Jetstress VMs are shown below, including the average across the VMs for each of the different metrics.

Jetstress4NodesSummary

Observations from Test 1:

  1. We achieved the desired >1000 IOPS per VM
  2. Performance was consistent across all Jetstress instances
  3. Log writes were in the 1ms range as they were serviced by the ADSF Oplog (persistent write buffer)
  4. Database reads were on average just under 10ms which is well below the Microsoft recommended 20ms
  5. The Database creation time averaged 2hrs 24mins
  6. The duplication of 3 databases averaged 4hrs 17mins
  7. The database checksum took on average around 38mins

Test 2: Delete all data, Add four nodes to the cluster & repeat test 1

All Jetstress VMs were deleted and a full curator scan manually initiated to ensure all data was fully removed from disk prior to beginning the next test which ensured a fair baseline.

Four Jetstress VMs were then deployed from the same template, powered on and the saved Jetstress configuration was applied before beginning the test.

Note: The Jetstress thread count was not changed and remains the same as for Test 1.

As with Test 1, the database creation phase created one database, then Jetstress duplicated the database 3 times and immediately after creation the performance test began and ran for the same 2hr duration.

The results for each of the four Jetstress VMs are shown below, including the average across the VMs for each of the different metrics.

Jetstress8NodesSummary

Observations from Test 2:

  1. Achieved IOPS jumped by almost 2x
  2. Log writes average latency was lower by 13%
  3. Database write latency dropped by >20%
  4. Database read latency dropped by almost 2x
  5. The Database creation time was just under 15 mins faster
  6. The duplication of 3 databases improved by almost 35 mins
  7. The database checksum was 40 seconds faster.

Without changing the Jetstress thread count, due to the improved performance of the cluster the achieved IOPS jumped by 2x!!
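To put the duration improvements in percentage terms, here is a rough sketch using the Test 1 averages and the time savings listed in the observations. Because the savings are quoted as "just under" and "almost", I've used the rounded figures as given, so treat the percentages as approximate upper bounds. The `pct_faster` helper is my own naming.

```python
from datetime import timedelta

# Average durations observed in Test 1 (four node cluster).
test1_durations = {
    "database creation": timedelta(hours=2, minutes=24),
    "duplication of 3 databases": timedelta(hours=4, minutes=17),
    "database checksum": timedelta(minutes=38),
}

# Approximate time saved in Test 2 (eight node cluster), rounded as quoted above.
time_saved = {
    "database creation": timedelta(minutes=15),
    "duplication of 3 databases": timedelta(minutes=35),
    "database checksum": timedelta(seconds=40),
}


def pct_faster(baseline: timedelta, saving: timedelta) -> float:
    """Time saved as a percentage of the baseline duration (hypothetical helper)."""
    return saving / baseline * 100


for task, baseline in test1_durations.items():
    improvement = pct_faster(baseline, time_saved[task])
    print(f"{task}: about {improvement:.1f}% faster than Test 1")
```

The throughput and latency gains are far larger than the gains in creation and checksum time, which fits with Jetstress being configured for real world IO levels rather than maximum performance.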

Summary:

These tests are a clear demonstration of the scalability advantage of the Acropolis Distributed Storage Fabric (ADSF) and storage only nodes for customers wanting to increase performance and/or capacity in their HCI environment.

The ability of ADSF to distribute write IO across all nodes within a cluster means write performance improves significantly with the addition of nodes (including storage only) to the cluster while reducing read and write latency due to the decreased workload on the compute + storage nodes servicing the VMs.

But data locality is lost with storage only nodes, right?

Wrong! Storage only nodes actually improve (yes, improve!) data locality by maximising the amount of available space on the compute+storage nodes. This is as a direct result of storage only nodes accepting replication data for write IO and storing the 2nd or 3rd copies (in the case of RF3) on the storage only nodes. This is also demonstrated by the lower read latency observed during this test.

Storage only nodes not only improve the performance and capacity for Virtual machines, but also for physical servers using Acropolis Block Services (ABS) and users of Acropolis File Services (AFS) both of which had enhancements announced at .NEXT 2016 this year.