Nutanix Scalability – Part 5 – Scaling Storage Performance for Physical Machines

Part 3 and Part 4 have taught us that Nutanix provides excellent scalability for virtual machines, and provides ABS and the Volume Group Load Balancer (VG LB) for niche workloads which may require more performance than a single node can provide.

Now that we’ve learned how to scale a virtual machine’s performance, let’s see how the same rules apply to physical servers.

So you’ve got your physical server and a Nutanix cluster, now what?

As Part 3 and Part 4 explained, more virtual disks increase the storage performance for a virtual machine. The same is true for physical servers using ABS.

Virtual disks are presented to the physical server via iSCSI (ABS). For optimal performance you should have at least one virtual disk per node in your cluster, because each vDisk is managed by a stargate (Nutanix IO engine) instance, one of which runs in every Controller VM (CVM).

If you have a four node cluster, you need to use at least four virtual disks to utilise the four node cluster optimally. For an eight node cluster, eight or more virtual disks are required to ensure all CVMs (stargate instances) can actively provide a boost in performance.

The following tweet shows how the pathing started at four paths on the four node cluster and, when an additional four nodes were added, dynamically changed to use all eight nodes.

Therefore when using ABS for physical workloads, especially high end database servers, I recommend using a minimum of 8 vDisks. However, if your cluster size is greater than 8, match the number of vDisks to the cluster size as your starting point.

If you have an 8 node cluster, you could for example use 32 vDisks and these will spread evenly across the nodes, resulting in four per stargate instance which is perfectly fine.

Using more vDisks than your current cluster size also means when additional nodes are added, ABS can dynamically load balance the vDisks across the new and existing nodes to automatically scale your performance.
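As a rough illustration of the sizing guidance above, the sketch below picks a starting vDisk count and shows how vDisks could spread across nodes before and after a cluster grows. The function names and the simple round-robin placement are assumptions made for illustration only; ABS makes its own placement decisions.

```python
from collections import Counter


def recommended_vdisk_count(node_count: int, minimum: int = 8) -> int:
    """Starting point from the text: at least 8 vDisks, or one per node if larger."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return max(minimum, node_count)


def spread_vdisks(vdisk_count: int, node_count: int) -> Counter:
    """Hypothetical round-robin placement: vDisk i lands on node i % node_count."""
    return Counter(i % node_count for i in range(vdisk_count))


if __name__ == "__main__":
    print(recommended_vdisk_count(4))    # 8
    print(recommended_vdisk_count(12))   # 12
    # 32 vDisks on 8 nodes: four per stargate instance.
    print(sorted(spread_vdisks(32, 8).values()))
    # The same 32 vDisks after growing to 16 nodes: two per instance.
    print(sorted(spread_vdisks(32, 16).values()))
```

Starting with more vDisks than nodes leaves room for the count per node to drop as nodes are added, which is the behaviour the paragraph above describes.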

Let’s cover the same MS Exchange and MS SQL examples covered for Virtual Machines in Parts 3 and 4 but now specifically for physical servers using ABS.

Let’s say we have an MS Exchange server with 20 databases. The performance requirements for each database are typically in the range of hundreds of IOPS, in which case I would recommend one virtual disk (e.g.: VMDK) per database and another for the logs.

In the case of a large MS SQL server which may require tens or hundreds of thousands of IOPS to a single database, I recommend using multiple vDisks per database, which involves splitting SQL datafiles across multiple VMDKs to optimise physical server performance.

Sound familiar? The above two paragraphs are literally a copy/paste from Part 3 because the exact same rules apply to physical servers and virtual machines. Simple right!

Still need more performance?

Again, the exact same rules apply to physical servers with ABS as they do to virtual machines. In no particular order, as we’ve learned from Part 3 & 4:

  • Increase the vCPU of the Nutanix Controller VM (CVM)
  • Increase the vRAM of the Nutanix Controller VM (CVM)
  • Add storage only nodes

Can’t get much easier than that!

Summary:

From Parts 3, 4 and 5 we have learned that Nutanix provides the ability to scale the performance of individual servers, be they physical or virtual, using the same simple strategies of adding virtual disks, storage only nodes or Controller VM (CVM) resources, and that doing so increases performance to meet virtually (pun intended) any performance requirement.

Is there any reason you couldn’t confidently say Nutanix is doing 3 tier better than the SAN vendors? I’d love to hear if you have any corner cases.


Nutanix gets Microsoft blessing for unique ESRP for a real world MS Exchange ESRP solution on All Flash

I am pleased to announce that Microsoft have approved Nutanix’s latest ESRP (Exchange Storage Review Program) submission for a 50,000 user deployment of MS Exchange on the Nutanix NX-8150 all flash platform running the next generation hypervisor, AHV!

What’s unique about this you might ask?

  1. It’s the first hyper-converged (HCI) all flash ESRP solution (to complement Nutanix’s existing Hybrid ESRP solutions for 24k users on Hyper-V and 30k users on AHV)
  2. The first multiple Exchange VM per node solution!!
  3. The first ESRP to provide MS Exchange Server role requirements calculator solution design
  4. The solution was performance tested/validated with N-1 nodes to simulate performance in the event a node had failed and was not replaced
  5. The solution supports the 1GB mailboxes without any assumed data reduction from compression, deduplication or Erasure Coding (EC-X)

The last point is key. Many vendors/solutions assume high data reduction ratios when sizing, which adds risk to a project as I explained in Sizing infrastructure based on vendor Data Reduction assumptions. Nutanix (and me personally) would rather give customers a guaranteed business outcome, and while our data reduction is very effective, especially for MS Exchange data, it can and does vary between customers. An ESRP should be a guaranteed outcome, and that’s what this unique ESRP from Nutanix delivers.

A major problem with many, if not most ESRP submissions is that they are not real world solutions, just storage platforms which can deliver high enough IOPS to potentially support a real world solution.

When designing the solution I planned to put forward for ESRP, I used an actual real world design for a Nutanix customer and ensured it was sized to be 100% real world.

For example, from a compute perspective the solution was sized with no CPU overcommitment and within the recommended maximum of 24 CPUs both of which ensure optimal CPU performance.

CPU sizing also ensures Exchange VMs fit within the NUMA node of the Nutanix node which ensures optimal memory performance, which is another key area to ensure optimal Exchange performance.

In addition, the VMs are sized to be under the Microsoft recommended CPU utilization threshold for a “Worst Failure Mode” of ≤ 80 percent.

From a real world perspective, MS Exchange is dependent on Active Directory. As a result the solution is also sized to support all the required Active Directory Global Catalog cores running on the same infrastructure.

From an availability and resiliency perspective, the solution is sized for N+1 at the infrastructure layer to complement the N+1 at the MS Exchange DAG layer. This delivers customers a solution which has protection from multiple concurrent failures, which is essential for Mission Critical applications.
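To make the “Worst Failure Mode” check concrete, here is a minimal sketch. It assumes the load from failed Exchange servers redistributes evenly across the survivors, which is a simplification, and the server count and utilization figures are hypothetical rather than taken from the ESRP document.

```python
def worst_failure_utilization(normal_util: float, servers: int, failed: int = 1) -> float:
    """Estimate per-server CPU utilization after `failed` servers are lost.

    Assumes total load is unchanged and spreads evenly over the survivors.
    """
    survivors = servers - failed
    if survivors < 1:
        raise ValueError("at least one server must survive")
    return normal_util * servers / survivors


def within_threshold(normal_util: float, servers: int, failed: int = 1,
                     threshold: float = 0.80) -> bool:
    """True if post-failure utilization stays at or below the threshold."""
    return worst_failure_utilization(normal_util, servers, failed) <= threshold


if __name__ == "__main__":
    # Hypothetical: 8 Exchange VMs at 60% CPU, one lost.
    print(round(worst_failure_utilization(0.60, 8), 3))
    print(within_threshold(0.60, 8))   # stays under 80%
    print(within_threshold(0.75, 8))   # exceeds 80% after a failure
```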

In the real world, things change and having a solution which scales to support more users, more messages per day and greater mailbox capacity is essential.

The Nutanix NX-8150 All Flash ESRP discusses a scalable and repeatable model where the solution can be increased in size from supporting 1 GB mailboxes to >2 GB simply by choosing (configure to order) 3.84 TB drives vs. the 1.92 TB drives tested for this solution.
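The drive-size scaling described above is simple proportional arithmetic, sketched below. It assumes usable capacity scales linearly with drive size and nothing else changes (same drive count, same number of database copies); the function names are illustrative only.

```python
def raw_mailbox_data_tb(users: int, mailbox_gb: float) -> float:
    """Mailbox data for a single copy in TB (1 TB = 1000 GB), with no data reduction."""
    return users * mailbox_gb / 1000


def scaled_mailbox_gb(tested_mailbox_gb: float, tested_drive_tb: float,
                      new_drive_tb: float) -> float:
    """Mailbox size supportable if capacity scales linearly with drive size."""
    return tested_mailbox_gb * new_drive_tb / tested_drive_tb


if __name__ == "__main__":
    # 50,000 users with 1 GB mailboxes, as tested.
    print(raw_mailbox_data_tb(50_000, 1.0))
    # Moving from the tested 1.92 TB drives to 3.84 TB drives.
    print(scaled_mailbox_gb(1.0, 1.92, 3.84))
```

Doubling the drive size roughly doubles the mailbox size the same node count can hold, which lines up with the move from 1 GB to around 2 GB mailboxes.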

Another option is when the storage capacity is reaching a high threshold such as 80%+, customers can non disruptively add storage nodes to expand capacity. This can be done without any change at the OS or MS Exchange application layer and new capacity (and performance!) is available instantly.

Did you know Nutanix allows mixing all-flash & hybrid? This means the most active data (e.g.: Most recent email) is running in an all flash configuration and older mail is automatically and transparently migrated to the lower cost hybrid nodes.

From a storage performance perspective, the solution was tested with in-line compression enabled, which is Nutanix’s official recommendation for MS Exchange as it provides excellent data reduction with no significant overheads.

Another focus area for Nutanix in the real world is reducing CAPEX and OPEX. A great example of this is that the entire solution (excluding networking) uses just 10 rack units (RUs) per datacenter. While other vendors’ storage ESRPs will claim lower RU requirements, they exclude the physical servers required for the solution. Nutanix is advising the requirements for both the compute and storage for the solution to be totally transparent.

This means the solution does not require a large investment in your datacenter or co-location and is cost effective to power and cool making the solution environmentally friendly as well.

From a performance perspective, the Nutanix solution was tested in an N-1 configuration to show the performance which can be achieved after the failure of a node within the cluster.

Even with a failed node, the solution achieves excellent performance with average database read and log write latency in the low 1ms range, sustained for the 24 hour stress test required for ESRP submissions.

A few performance highlights:

  1. Nutanix achieved an average of 5172 IOPS per MS Exchange Jetstress instance with just 4 threads!
  2. Database read latency avg of just 1.05ms
  3. Log write latency avg of just 1.21ms
  4. Database backup performance of 215MB/sec per database which equates to more than 1.7GBps per node!
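The per-node backup figure follows from multiplying the per-database rate by the number of databases backed up concurrently on a node. The sketch below assumes eight concurrent databases per node, a count inferred from the two published figures rather than stated in this post.

```python
def node_backup_throughput_gbps(per_db_mb_per_sec: float, databases_per_node: int) -> float:
    """Aggregate backup throughput per node in GB/s (1 GB = 1000 MB)."""
    return per_db_mb_per_sec * databases_per_node / 1000


if __name__ == "__main__":
    # 215 MB/s per database is from the results; 8 databases per node is an assumption.
    print(node_backup_throughput_gbps(215, 8))  # 1.72
```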

While the achieved performance vastly exceeds the requirements for Exchange, the key factor is the reduced CPU WAIT time achieved, which results in much greater CPU efficiency than a physical Exchange server with JBOD storage. This means a virtualised Exchange server on Nutanix (even hybrid systems) is more efficient than Microsoft’s Preferred Architecture using physical servers and JBOD storage.

You may be asking yourself, why does this matter? The answer is simple. MS Exchange becomes inefficient when scaled up beyond 24 cores so the more efficient the usage of those cores, the more users, messages per day and better user experience can be achieved without scaling up or adding more servers.

So without further delay, I have provided the direct link to the document below for your convenience.

Nutanix ESRP – NX-8150-G5 All Flash 50,000 Users


Scale out performance testing with Nutanix Storage Only Nodes

At Nutanix’s inaugural user conference in 2015, Storage Only nodes were announced, which allowed customers for the first time to scale capacity without having to add compute nodes. This allows customers more flexibility and eliminates the need to license the storage nodes for vSphere, as storage only nodes run Acropolis Hypervisor (AHV) and are managed entirely through PRISM.

A common question from prospective and existing Nutanix customers is what if my VM’s storage exceeds the capacity of a Nutanix node? The answer is detailed in this blog post but in short, as the Acropolis Distributed Storage Fabric (ADSF) distributes data throughout the cluster at a 1MB granularity, a VM’s storage can exceed the local node and performance even improves, including reads from the capacity (SAS/SATA) tier.

Storage only nodes were previously limited to the NX-6035C (and Dell XC/Lenovo HX equivalents) but at Nutanix’s .NEXT conference in Las Vegas in 2016, it was announced that any node (including all-flash) can be a storage only node.

This means even for high performance and/or high capacity environments, Nutanix clusters can be scaled without the need to add compute nodes or purchase additional licensing if you are running vSphere as the hypervisor.

However to date Nutanix are yet to publish any performance data showing the value of storage only nodes, so I decided to run a few tests and demonstrate the value of the Acropolis Distributed Storage Fabric (ADSF) and Storage Only Nodes.

Before we get to the performance data, to avoid competitors’ inevitable attempts to create FUD about Nutanix performance, I will not be publishing the exact specifications of the node types, drive or Jetstress configurations. I will be publishing the IOPS/latency and database creation, duplication and checksumming durations of the direct comparisons, which clearly show the performance advantage of storage only nodes.

Jetstress was not configured to demonstrate maximum performance of the underlying Nutanix solution, it was configured to achieve around 1000 IOPS which is typically higher than even a large Exchange deployment requires per instance. This also allows this test to demonstrate how performance improves when the cluster is performing real world levels of IO (at least in the case of Exchange for this example).

The performance advantage will vary between node types and based on how many storage only nodes are added to the cluster. But the point of this example is to show that ADSF is a truly distributed storage fabric, and that the storage only nodes and additional Nutanix Controller VMs (CVMs) servicing replication (RF) traffic and remote reads significantly improve performance for VMs residing on the Compute+Storage nodes.

Test Overview:

The first test will be performed using four Jetstress VMs running on a four node cluster. The second test will be performed after an additional four storage only nodes are added to the cluster to form an eight node cluster. Before the second test the cluster will be wiped of all data with the exception of the Windows 2012 R2 template and all Jetstress DBs will be created from scratch so we can compare DB creation as well as performance and DB checksumming durations. Wiping all data also ensures there is no pre-warming of the extent cache (in memory read cache) or metadata cache.

Test Preparation:

I performed a cluster stop / cluster destroy / cluster create to ensure the cluster is totally clean and that we have a fair baseline for the test. The cluster was made up of four nodes.

I then created a base Windows 2012 R2 virtual machine with 4 PVSCSI adapters and 9 vDisks: one for the OS, 4 for the DBs and 4 for the logs. DB drives were formatted with a 64k allocation size and log drives with 4k, as the different allocation sizes and separate virtual disks have shown approx 25% performance improvement in my testing. In addition, I recommend In-Line compression and Erasure Coding (EC-X) for Exchange databases and no data reduction for logs.

Jetstress was configured to use 80% of the vDisks’ capacity, which resulted in approx 80% of the Nutanix storage pool capacity being utilised for the test. I will point out these were not low capacity nodes such as NX3060s, so the database creation time is significant because there was lots of data to create.

I then cloned the VM 3 times and spread the 4 VMs across 4 Nutanix Nodes running ESXi 5.5 Update 3.

Test 1: Create Databases and run 2hr test

The database creation phase creates one database, then Jetstress duplicates the database (in this case 3 times), and the performance test begins immediately after creation.

Note: No data reduction was used for this test as it will result in unrealistic data reduction and performance results as I described in the post Jetstress Testing with Intelligent Tiered Storage Platforms.

I configured Jetstress in this way to ensure the extent cache (in memory read cache) was not pre-warmed and so the results of the test would be fair and repeatable.

Once the performance test completed, I waited for each test to complete before allowing the database checksum validation task to complete. (This is done by using the Multi-host option in Jetstress).

The results for each of the four Jetstress VMs are shown below, including the average across the VMs for each of the different metrics.

[Image: Jetstress results summary for the four node cluster]

Observations from Test 1:

  1. We achieved the desired >1000 IOPS per VM
  2. Performance was consistent across all Jetstress instances
  3. Log writes were in the 1ms range as they were serviced by the ADSF Oplog (persistent write buffer)
  4. Database reads were on average just under 10ms which is well below the Microsoft recommended 20ms
  5. The Database creation time averaged 2hrs 24mins
  6. The duplication of 3 databases averaged 4hrs 17mins
  7. The database checksum took on average around 38mins

Test 2: Delete all data, Add four nodes to the cluster & repeat test 1

All Jetstress VMs were deleted and a full curator scan manually initiated to ensure all data was fully removed from disk prior to beginning the next test which ensured a fair baseline.

Four Jetstress VMs were then deployed from the same template, powered on and the saved Jetstress configuration was applied before beginning the test.

Note: The Jetstress thread count was not changed and remains the same as for Test 1.

As with Test 1, the database creation phase created one database, then Jetstress duplicated the database 3 times, and the performance test began immediately after creation and ran for the same 2hr duration.

The results for each of the four Jetstress VMs are shown below, including the average across the VMs for each of the different metrics.

[Image: Jetstress results summary for the eight node cluster]

Observations from Test 2:

  1. Achieved IOPS jumped by almost 2x
  2. Log writes average latency was lower by 13%
  3. Database write latency dropped by >20%
  4. Database read latency dropped by almost 2x
  5. The Database creation time was just under 15 mins faster
  6. The duplication of 3 databases improved by almost 35 mins
  7. The database checksum was 40 seconds faster.
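To put the duration improvements in proportion, the sketch below estimates the Test 2 durations and the percentage reductions from the Test 1 averages. The inputs are the rounded figures quoted in the two observation lists ("just under 15 mins", "almost 35 mins", "around 38mins"), so the outputs are approximations rather than measured values.

```python
from datetime import timedelta

# Test 1 averages as reported (approximate).
TEST1 = {
    "database creation": timedelta(hours=2, minutes=24),
    "database duplication": timedelta(hours=4, minutes=17),
    "database checksum": timedelta(minutes=38),
}

# Time saved in Test 2, rounded to the figures quoted in the observations.
SAVED = {
    "database creation": timedelta(minutes=15),
    "database duplication": timedelta(minutes=35),
    "database checksum": timedelta(seconds=40),
}


def improvement(phase: str) -> tuple[timedelta, float]:
    """Return (estimated Test 2 duration, percentage reduction) for a phase."""
    before, saved = TEST1[phase], SAVED[phase]
    return before - saved, 100 * (saved / before)


if __name__ == "__main__":
    for phase in TEST1:
        after, pct = improvement(phase)
        print(f"{phase}: {TEST1[phase]} -> ~{after} ({pct:.1f}% shorter)")
```

On these figures the bulk data phases shrink by roughly 10 to 14 percent, while the IOPS and read latency results improve far more, at close to 2x.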

Without changing the Jetstress thread count, due to the improved performance of the cluster the achieved IOPS jumped by 2x!!

Summary:

These tests are a clear demonstration of the scalability advantage of the Acropolis Distributed Storage Fabric (ADSF) and storage only nodes for customers wanting to increase performance and/or capacity in their HCI environment.

The ability of ADSF to distribute write IO across all nodes within a cluster means write performance improves significantly with the addition of nodes (including storage only) to the cluster while reducing read and write latency due to the decreased workload on the compute + storage nodes servicing the VMs.
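A simple model helps show why the compute + storage nodes do less work once storage only nodes join. The sketch below assumes each node running VMs keeps one copy of every write locally and sends the remaining copies to other nodes chosen uniformly at random. That placement rule is an assumption for illustration; ADSF weighs other factors when placing replicas.

```python
def replica_ingest_per_node(total_nodes: int, writer_nodes: int,
                            write_rate: float = 1.0, rf: int = 2) -> dict[str, float]:
    """Expected replica traffic received per node, in the units of write_rate.

    Model assumptions: every writer node writes at write_rate, keeps one copy
    locally and sends rf - 1 copies to uniformly chosen other nodes.
    """
    if rf < 2 or total_nodes < rf or not 1 <= writer_nodes <= total_nodes:
        raise ValueError("need rf >= 2, total_nodes >= rf, 1 <= writer_nodes <= total_nodes")
    # Share of one writer's replica traffic landing on any single other node.
    share = write_rate * (rf - 1) / (total_nodes - 1)
    result = {"writer_node": share * (writer_nodes - 1)}
    if total_nodes > writer_nodes:
        result["storage_only_node"] = share * writer_nodes
    return result


if __name__ == "__main__":
    # Four nodes, all running Jetstress VMs (Test 1 layout).
    print(replica_ingest_per_node(total_nodes=4, writer_nodes=4))
    # Eight nodes, four of them storage only (Test 2 layout).
    print(replica_ingest_per_node(total_nodes=8, writer_nodes=4))
```

Under this model the replica traffic landing on each compute + storage node falls from 1.0x its own write rate to about 0.43x once the four storage only nodes are added, with the remainder absorbed by the storage only nodes.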

But data locality is lost with storage only nodes, right?

Wrong! Storage only nodes actually improve (yes, improve!) data locality by maximising the amount of available space on the compute+storage nodes. This is as a direct result of storage only nodes accepting replication data for write IO and storing the 2nd or 3rd copies (in the case of RF3) on the storage only nodes. This is also demonstrated by the lower read latency observed during this test.

Storage only nodes not only improve the performance and capacity for Virtual machines, but also for physical servers using Acropolis Block Services (ABS) and users of Acropolis File Services (AFS) both of which had enhancements announced at .NEXT 2016 this year.