The truth about storage benchmarking

Recently I was asked to review some performance testing done by an external party and my initial impression was the performance was well below what I expected.

So over the weekend I setup a block in my lab to reproduce the tests to see if the results were firstly repeatable, and if so, what performance would I get with and without tuning.

The only significant difference between my hardware and the hardware used by the 3rd party was that I used old dual socket Ivy Bridge E5-2670 2.6Ghz 8c processors and the 3rd party had a much newer dual Broadwell E5-2640 v4 2.4Ghz processors.

If we compare the two processors using we see the following:


Not surprisingly the Broadwell E5-2640 v4 processor is faster, but possibly less than you would expect with a 16.28% better PassMark per core, and in my opinion, the per core value quite importaint especially when considering business critical applications.

None the less, a 16.28% performance improvement per core will be a significant factor for a benchmark with Nutanix as the Controller VM (CVM) is powered by the CPU of the host.

I thought I would whip up a quick post about performance benchmarking to show how different performance results can be on the same hardware depending on just a few factors and why storage benchmarking, especially competitive benchmarking, cannot and should not be trusted when making purchasing decisions.

This test was for a 10k user MS Exchange deployment and the hardware used for testing was performed on in both cases was 1 x 1.92TB SSD and 3 x 4TB SATA drives and they were both tested on the same GA Nutanix AOS build.

The required (or Target) IOPS was just 216 per MS Exchange instance (VM) as shown below by the Jetstress report.


This target is calculated by Jetstress when using the “Exchange Mailbox Profile” test scenario with the following configuration:


The resulting SSD vs SATA ratio makes this test largely about the limitations of SATA performance as >87% of data is being read from the SATA tier.


Test 1: The Jetstress dataset was created and then the performance test was immediately ran for 2hrs with no pre-warming of the metadata or read cache.

Achieved Transactional I/O: 200.663
Avg Log Write Latency: 1.06ms
Avg DB Write Latency: 1.4ms
Avg DB Read Latency: 14ms

This result was 15.61% lower than the 3rd parties result and interestingly if we correct for CPU core performance, it’s less than 1% difference. As this was in line with my expectation knowing the importance of CPU clock speed, I would say for this testing that the baseline results were comparable.

Test 2: The Nutanix tiering was tuned to suit large working sets (which vastly exceed the SSD tier) and the Jetstress dataset was created and the performance test was immediately ran for 2hrs again with no pre-warming of the metadata or read cache.

Before we get to the results, I want to point out that Jetstress is in some ways is very good but in other ways a very unrealistic benchmarking tool as the entire dataset is “active” which is not the case in the real world. However, in one way this is a good thing because a passing Jetstress result in my experience means the production deployment performs very well from a storage perspective especially when using tiered storage which is built around the assumption not all data is active. As a result, a Jetstress test could be considered a “worse case scenario” style test for intelligent tiered storage.

Achieved Transactional I/O: 249.623
Avg Log Write Latency: 0.99ms
Avg DB Write Latency: 1.5ms
Avg DB Read Latency: 12ms

Test 3: I then setup Jetstress as per Nutanix MS Exchange best practices and ran the test again with no pre-warming of the metadata or read cache.

Achieved Transactional I/O: 389.753
Avg Log Write Latency: 0.95ms
Avg DB Write Latency: 2.0ms
Avg DB Read Latency: 17ms

Test 4: I then lowered the Jetstress thread count to the lowest value (roughly 33% lower) which I estimated would achieve the target IOPS (this is to simulate real world requirements) and ran the test again with no pre-warming of the metadata or read cache.

Achieved Transactional I/O: 300.254
Avg Log Write Latency: 0.94ms
Avg DB Write Latency: 1.5ms
Avg DB Read Latency: 12ms

Note: Test 4 achieved the highest I/O per thread.

Test 5: The same configuration as Test 4 but with pre-warming of the metadata cache.

Achieved Transactional I/O: 334
Avg Log Write Latency: 0.98ms
Avg DB Write Latency: 1.9ms
Avg DB Read Latency: 12.4ms

Some of you might be asking, how did test 4 achieve higher transactional I/O and with lower read and write latency than Test 1 & 2 with less threads. Shouldn’t a higher thread count achieve higher IOPS?

The reasons is because the original thread count was pushing the SATA drives past their capabilities, leading to excessive latency. Lowering the thread count allowed the SATA drives to operate at somewhere around their most efficiency range leading to lower latency.

Test 6: The same configuration as Test 5 but with tuned extent cache (RAM read cache) and 100% medadata cached.

Achieved Transactional I/O: 362.729ms
Avg Log Write Latency: 0.92ms
Avg DB Write Latency: 1.7ms
Avg DB Read Latency: 12ms

As we can see from Test 1 through to Test 6, the performance differs by up to 81% depending on how the platform is configured.

Side note and future looking statement. Many of the optimisations I performed above wont be required for long as many of the areas these optimisations help improve are being addressed in upcoming code. In saying that, for a business critical application like Exchange, I don’t think it’s a problem doing some optimisation as long as 90% of the workloads run well by default and we’re only tuning for the 10% (vBCA) workloads.

But out of interest, what would happen if we enabled data reduction? How much of a performance hit would that take?

Test 7: The same configuration as Test 6 but with In-line compression enabled.

Achieved Transactional I/O: 751.275
Avg Log Write Latency: 0.97ms
Avg DB Write Latency: 3.4ms
Avg DB Read Latency: 5.9ms

That’s a 107.46% increase in transactional IO and with in-line compression! Log write latency remained sub ms and read latency has almost halved.

Note: As Jetstress data is highly compressible, (Nutanix achieves 8:1 or higher with non default settings), I tuned the compression slice size to give a more realistic data reduction ratio. The ratio for this test was 3.99:1 and the ratio of SSD to SATA was almost exactly 50% as shown below.


Why did performance improve so much with In-Line compression? Well there is two main reasons:

  1. More data is being served from the SSD tier as compression allows more effective SSD tier capacity.
  2. Reads from SATA are faster as less physical data needs to be read to service an I/O due to it being compressed. The higher the compression ratio, the more this can improve.

As we can see, the results varied significantly and had I wanted to optimise the test further, I could have achieved even higher performance but there was no need. The requirements for the solution were already achieved and in the case of Test 7, the requirements were exceeded by 247% meaning the solution had heaps of headroom.

Nutanix best practice is to enable In-line compression for MS Exchange and other databases such as Oracle and SQL as per my tweet below.

This testing was performed on Nutanix Acropolis Hypervisor (AHV) but was not using the upcoming Turbo mode, which will further improve performance and lower overheads.

This is a key point many people forget when benchmarking. If we assume the platforms in question are scalable (e.g.: Like Nutanix), it doesn’t matter if one platform does 100k IOPS and another does 200k IOPS if your requirements are 20k IOPS. Both platforms capabilities vastly exceed the requirement (10k IOPS) from a performance perspective, so performance is not longer a significant factor in your purchasing decision.

Question: Are the above performance results genuine?

All of the above results could be argued to be genuine results, at the same time none of the above represent the best performance that could be achieved, yet the results could be used to try and create FUD if they are improperly represented (which is almost always the case with competitive comparisons whether intentional or otherwise).

Let’s say this was your proof of concept, What should be the take away from benchmarking results like this?

Simple: The solution meets/exceeds your performance requirements.

Now for the point of this article: The truth about storage benchmarking is that there are so many variables that can affect the results that unless you’re truely experienced in benchmarking your applications AND an expert in the platforms you’re benchmarking, your results are unlikely to be indicative of the platforms capabilities and therefore of very little value.

If you’re benchmarking Vendor A vs Vendor B, it’s a waste of time doing “Like for Like” benchmarking because the Virtual machine and application settings which are optimal for one vendor, will likely be different for the other vendor. e.g.: SAN vs HCI.

On the other hand, a more valid test would be vendor A’s best practices vs vendor B best practices, but again if one vendor Jetstress achieves 500 and the other achieves 400, that 20% higher performance is all but irrelevant if your requirements are say, 216 like in this case.

A very good example of invalid “like for like” benchmarking would be to size the active working set (i.e.: The capacity of the data you plan to benchmark against) to fit within the cache/SSD tier of one platform, but exceed the cache/SSD capacity of the other platform. The results will be vastly different and will not be indicative of real world performance. This is what vendors do when competitive benchmarking and it’s likely one of the main reasons we see End User License Agreements (EULA) from most if not all storage vendors preventing publishing benchmark results without written agreement.

So the (unpopular) truth about storage benchmarking is it’s not as easy and building a VM and running I/O meter with the same profile on multiple system like some vendors and even 3rd party storage analysts would have you believe. The vast majority of people (customers, analysts and even vendors) doing benchmarking don’t have the skill/experience to produce repeatable or meaningful results, especially on multiple platforms.

In fact it’s unrealistic/unreasonable to expect a person (customer, vendor, consultant) to be an expert in multiple platforms, and very few people are!

Related Articles:

  1. Peak Performance vs Real World Performance
  2. The Key to performance is Consistency

Storage Performance : ReFS vs NTFS

I am regularly asked by customers if they should use NTFS or the newer ReFS when formatting drives for applications like Microsoft Exchange and SQL.

Most customers are asking in the context of performance, so I thought I would share some recent testing results using MS Exchange Jetstress.

Firstly, what is ReFS and when/would you use ReFS?

What is ReFS?

Resilient File System (ReFS) is a new local file system. It maximizes data availability, despite errors that would historically cause data loss or downtime. Data integrity ensures that business critical data is protected from errors and available when needed. Its architecture is designed to provide scalability and performance in an era of constantly growing data set sizes and dynamic workloads.

The key features of ReFS are:

  • Integrity: ReFS stores data so that it is protected from many of the common errors that can cause data loss. File system metadata is always protected. Optionally, user data can be protected on a per-volume, per-directory, or per-file basis. If corruption occurs, ReFS can detect and, when configured with Storage Spaces, automatically correct the corruption. In the event of a system error, ReFS is designed to recover from that error rapidly, with no loss of user data.
  • Availability: ReFS is designed to prioritize the availability of data. With ReFS, if corruption occurs, and it cannot be repaired automatically, the online salvage process is localized to the area of corruption, requiring no volume down-time. In short, if corruption occurs, ReFS will stay online.
  • Scalability: ReFS is designed for the data set sizes of today and the data set sizes of tomorrow; it’s optimized for high scalability.
  • App Compatibility: To maximize AppCompat, ReFS supports a subset of NTFS features plus Win32 APIs that are widely adopted.
  • Proactive Error Identification: The integrity capabilities of ReFS are leveraged by a data integrity scanner (a “scrubber”) that periodically scans the volume, attempts to identify latent corruption, and then proactively triggers a repair of that corrupt data.

Source: Microsoft Technet – Resilient file system

From my perspective, ReFS makes sense when using physical servers with unintelligent storage such as JBOD or any storage which does not perform things such as checksums on both read and write IO and enforce Force Unit Access (FUA). However if you’re deploying MS Exchange / MS SQL etc on intelligent storage such as Nutanix Acropolis Distributed Storage Fabric (ADSF) then ReFS is not required as data integrity is already ensured by the storage layer. For example, in the event of silent data corruption, ADSF will detect the corruption on read and simply retrieve the data from the second copy which resides on a different physical drive on a different node within the cluster. This is also transparent to the Virtual Machine, OS and application and therefore compatible with any OS and application.

As a result ReFS (at least in its current version) is not required for deployments of Microsoft OS,Apps on Nutanix or other storage solutions if they have the same functionality.

None the less, this is not supposed to be a post about Nutanix, so let’s now look at the test bed and results of the performance comparison so you can make an informed decision about which to use

Test Bed Setup

The test bed setup is as follows:

Hypervisor: ESXi 5.5 Rel: 3248547

2 Virtual Machines cloned from the same template:
Windows 2012 R2 , 4 vCPUs , 24Gb RAM
4 Paravirtual SCSI adapters
1 vDisk for OS , 4 vDisks for DB, 4 vDisks for Logs

Both VMs are running on the same node, with only one VM running Jetstress at a time. All tests runs were back to back to ensure results would be fair and to check the consistency of the results.

The only difference between the two VMs is as follows:


4 vDisks formatted with NTFS and 64k allocation size for Database
4 vDisks formatted with NTFS and 4k allocation size for Logs


All 8 vDisks formatted with ReFS (64k)

Tests performed:

Three Jetstress runs per VM one after another, importantly with new databases created before each run to ensure a fair baseline. Doing this ensured the results were skewed by having the Extent Cache (In-Memory Read Cache) or the Medusa Cache (In-Memory Metadata Cache) pre-warmed.

Each run used 16 threads and resulted in the following results.

ReFS Jetstress Instance:

Run One: 6697 IOPS
Run Two: 6896 IOPS
Run Three: 6796 IOPS

Average: 6796 IOPS (approx +-3% between runs)

NTFS Jetstress Instance:

Run One: 7328 IOPS
Run Two: 7240 IOPS
Run Three: 7296 IOPS

Average: 7288 IOPS (approx +-1% between runs)


The difference being approx 7% higher performance and more consistency when using NTFS.

Additional Tests:

Out of interest I repeated the tests with a lower thread count (8) to see if the results were consistent as we decreased the threads.

8 Threads:

ReFS: 3921 IOPS

The result again went in favour of NTFS by approx 4%. This makes sense as the advantage would diminish as the pressure on the storage layer reduces.

Autotune Result:

I then repeated the test with Jetstress set to Autotune with the following results.

ReFS: 16673 IOPS @ 91 threads (Autotuned)
NTFS: 17758 IOPS @ 96 threads (Autotuned)

The autotune results again show that NTFS has an advantage over ReFS of approx 7% which is in line with the results using 16 threads manually configured.

CPU overheads comparisons

ReFS Jetstress Instance:

Run One:Avg 39.293% (Min 23.725 / Max 44.127)
Run Two:Avg 40.28% (Min 37.785 / Max 44.366)
Run Three: Avg 40.175% (Min 36.520 / Max 43.843)

Average: 39.916%

NTFS Jetstress Instance:

Run One: Avg 39.390% (Min 36.746 / Max 42.651)
Run Two: Avg 39.719% (Min 23.613 / Max 45.960)
Run Three:Avg 39.844% (Min 37.347 / Max 42.400)

Average: 39.651%

So NTFS achieved 7% better performance than ReFS using the same thread count even with the Data Integrity features turned off for ReFS volumes without using any more CPU.


Overall these tests demonstrate that NTFS consistently outperforms ReFS for MS Exchange type IO patterns. For intelligent storage, ReFS has no advantages and NTFS will provide better performance with roughly the same CPU overheads and without any risk of data integrity issues.

As the recommendation for ReFS is to disable the data integrity features for Exchange, I am yet to hear a good justification as to why ReFS is recommended, but I welcome any comments from those in the know and if the justifications are solid I will update the post to reflect these reasons.

Related Articles:

1. Jetstress Testing with Intelligent Tiered Storage Platforms

2. MS Exchange on Nutanix Acropolis Hypervisor (AHV)

3. How to successfully Virtualize MS Exchange

4. Deduplication and MS Exchange

MS Exchange on Nutanix now a MS validated ESRP solution

I am pleased to announce that Nutanix has successfully completed the Microsoft Exchange Solution Review Program requirements and are now listed as a validated solution at the following URL:

Exchange Solution Reviewed Program (ESRP) – Storage

The solution shows a dual site 24,000 1GB Mailbox solution running on just 8 NX-8150 nodes. This is a very highly resilient solution with N+1 availability at each site allowing for full self healing and failover in the event of a node failure.

Nutanix is also the FIRST and only hyper-converged platform to be validated under ESRP, further strengthening our leadership in the market.

The performance testing (using Jetstress) was with the nodes at around 90% capacity with 8.5TB per node, proving that Nutanix provided great performance even when running at high utilization and where the working set far exceeds the SSD tier. This is key to a truly enterprise solution for a business critical application such as Exchange.

The solution is running on Hyper-V with SMB 3.0 on the underlying Nutanix Distributed Storage Fabric. The same solution can also be deployed on vSphere or Acropolis Hypervisors using iSCSI in a fully supported configuration.

The above solution was validated without using Compression or Erasure Coding both of which improve performance and give significant capacity savings which allows for larger mailboxes. As a result, the Nutanix platform provides even more value than the ESRP submission shows.

If there was any doubt around if you should virtualize MS Exchange on Nutanix platform, the fact Nutanix is now validated by Microsoft should put your mind at ease.

Now you can move one step closer to a fully webscale datacenter by removing another application specific silo and enjoy improved resiliency/performance while reducing operational cost and complexity.