Nutanix X-Ray Benchmarking tool – Extended Node Failure Scenario

In the first part of this series, I introduced Nutanix X-Ray benchmarking tool which has been designed very differently to traditional benchmarking tools as the performance of the app is the control and the variable is the platform,not the other way around.

In the second part, I showed how Nutanix AHV & AOS could maintain the performance while utilising snapshots to achieve the type of recovery point objective (RPO) that is expected in production environments, especially with business critical workloads whereas a leading hypervisor and SDS platform could not.

In this part, I will cover the Extended Node Failure Scenario in X-Ray and again compare Nutanix AOS/AHV and a leading hypervisor and SDS platform in another real world scenario.

Let’s start by reviewing what the description of the X-ray Extended node failure scenario.

XrayExtendedNodeFailureScenario

I really like that X-ray has a scenario which shows a simulated node failure as this is bound to happen regardless of the platform you choose, and with hyperconverged platforms the impact of a node failure is arguably higher than traditional 3-tier as the nodes contain some data which needs to be recovered.

As such, it is critical before choosing a HCI platform to understand how it behaves in a failure scenario which is exactly what this scenario demonstrates.

XrayNodeFailureComparison

Here we can see the impact on the performance of the surviving VMs following the power being disconnected via the out of band management interface.

The Nutanix AOS/AHV platform continues to run at a very steady rate, virtually without impact to the VMs. On the other hand we see that after 1 hour the other platform has a high impact with significant degradation.

This clearly shows the Acropolis Distributed Storage Fabric (ADSF) to be a superior platform from a resiliency perspective, which should be a primary consideration when choosing a platform for any production environment.

Back in 2014, I highlighted the Problems with RAID and Object Based Storage for data protection and in a follow up post I discussed how Nutanix Acropolis Distributed Storage Fabric (ADSF) compares with traditional SAN/NAS RAID and hyper-converged solutions using Object storage for data protection.

The above results clearly demonstrate the problems I discussed back in 2014 are still applicable to even the most recent versions of a leading hypervisor and SDS platform. This is because the problem is the underlying architecture and bolting on new features is at best masking the constraints of the original architectural decision which has proven to be significantly flawed.

This scenario clearly demonstrates the criticality of looking beyond peak performance numbers and conducting a thorough evaluation of a platform prior to purchase as well as comprehensive operational verification prior to moving any platform into production.

Related Articles:

Nutanix X-Ray Benchmarking tool Part 1 – Introduction

Nutanix X-Ray Benchmarking tool Part 2 -Snapshot Impact Scenario

Nutanix X-Ray Benchmarking tool – Snapshot Impact Scenario

In the first part of this series, I introduced Nutanix X-Ray benchmarking tool which has been designed very differently to traditional benchmarking tools as the performance of the app is the control and the variable is the platform,not the other way around.

This is done by generating realistic IO patterns (e.g.: Not 100% 4k read) and then performing functions against the platform to see how the control (the VM application performance) is impacted by the underlying platforms functionality.

A great example of this is performing snapshots as the first step in a space efficient backup solution.

X-Ray has a built in test which generates an OLTP workload which is ran for 8 hours which for an all flash platform generates 6000 IOPS across the database and 400 IOPS for the logs. The scenario is detailed in the X-Ray report shown below.

XraySnapshotImpactDescription

The Snapshot impact scenario is then ran against multiple platforms and using the Analysis functionality within X-ray. we can generate a report which overlays the results from multiple platforms.

The below example is GA Acropolis Hypervisor (AHV) on AOS 5.1.1 verses a leading hypervisor and SDS platform showing the snapshot impact scenario.

XraySnapshotImpact

Each of the red lines indicate a snapshot and what we observe is the performance of both platforms remains consistent until the 10th snapshot (shown below) where the Nutanix platform continues without impact and the leading hypervisor and SDS platform starts degrading significantly.

XraySnapshotImpactSnap10

In the real world, customers use the intelligent features of storage, SDS or hyper-converged platforms but rarely test how this functionality works prior to purchasing. This is because it’s difficult and time consuming to do so.

Nutanix X-Ray tool makes the process of validating a platforms performance under real world scenarios a quick and easy process and provides automatically generated reports where accurate comparisons can be made.

What this example shows is that while both platforms could achieve the required performance without snapshots, only Nutanix AHV & AOS could maintain the performance while utilising snapshots to achieve the type of recovery point objective (RPO) that is expected in production environments, especially with business critical workloads.

As part of the Nutanix Solutions and Performance engineering organisation, I can tell you that the focus for Nutanix is real world performance, using data reduction, leveraging snapshots, mixing workloads and testing a large scale.

In upcoming posts I will show more examples of X-Ray test scenarios as well as comparisons between GA Acropolis Hypervisor (AHV) & AOS 5.1.1 verses a leading hypervisor and SDS platform.

Related Articles:

Nutanix X-Ray Benchmarking tool Part 1 – Introduction

Nutanix X-Ray Benchmarking tool Part 3 – Extended Node Failure Scenario

Nutanix gets Microsoft blessing for unique ESRP for a real world MS Exchange ESRP solution on All Flash

I am pleased to announce that Microsoft have approved Nutanix latest ESRP (Exchange Storage Review Program) submission for a 50,000 user deployment of MS Exchange on Nutanix NX-8150 all flash platform running the next generation hypervisor, AHV!

What’s unique about this you might ask?

  1. It’s the first hyper-converged (HCI) all flash ESRP solution (to compliment Nutanix existing Hybrid ESRP solutions for 24k users on Hyper-V and 30k users on AHV)
  2. The first multiple Exchange VM per node solution!!
  3. The first ESRP to provide MS Exchange Server role requirements calculator solution design
  4. The solution was performance tested/validated with N-1 nodes to simulate performance in the event a node had failed and was not replaced
  5. The solution supports the 1GB mailboxes without any assumed data reduction from compression, deduplication or Erasure Coding (EC-X)

The last point is key. Many vendors/solutions assume high data reduction ratios when sizing which adds risk to a project as I explained in Sizing infrastructure based on vendor Data Reduction assumptions. Nutanix (and me personally) rather give customers a guaranteed business outcome and while our data reduction is very effective especially for MS Exchange data, it can and does vary between customers. An ESRP should be a guaranteed outcome, and that’s what this unique ESRP from Nutanix delivers.

A major problem with many, if not most ESRP submissions is that they are not real world solutions, just storage platforms which can deliver high enough IOPS to potentially support a real world solution.

When designing the solution I planned to put forward for ESRP, I used an actual real world design for a Nutanix customer and ensures it was sized to be 100% real world.

For example, from a compute perspective the solution was sized with no CPU overcommitment and within the recommended maximum of 24 CPUs both of which ensure optimal CPU performance.

CPU sizing also ensures Exchange VMs fit within the NUMA node of the Nutanix node which ensures optimal memory performance, which is another key area to ensure optimal Exchange performance.

In addition, The VMs are sized to be under the Microsoft recommended CPU utilization threshold for a “Worst Failure Mode” of ≤ 80 percent.

From a real world perspective, MS Exchange is dependant on Active Directory. As a result the solution is also sized to support all the required Active Directory Global Catalog cores running on the same infrastructure.

From an availability and resiliency perspective, the solution is sized for N+1 at the infrastructure layer to compliment the N+1 at the MS Exchange DAG layer. This delivers customers a solution which has protection from multiple concurrent failures which is essential for Mission Critical applications.

In the real world, things change and having a solution which scales to support more users, more messages per day and greater mailbox capacity is essential.

The Nutanix NX-8150 All Flash ESRP discusses a scalable and repeatable model where the solution can be increased in size from supporting 1 GB mailboxes to >2 GB simply by choosing (configure to order) 3.84 TB drives vs. the 1.92 TB drives tested for this solution.

Another option is when the storage capacity is reaching a high threshold such as 80%+, customers can non disruptively add storage nodes to expand capacity. This can be done without any change at the OS or MS Exchange application layer and new capacity (and performance!) is available instantly.

Did you know Nutanix allows mixing all-flash & hybrid? This means the most active data (e.g.: Most recent email) is running in an all flash configuration and older mail is automatically and transparently migrated to the lower cost hybrid nodes.

From a storage performance perspective, the solution was tested with in-line compression enabled which is Nutanix official recommendation for MS Exchange as it provides excellent data reduction with no significant overheads.

Another focus are for Nutanix in the real world is reducing CAPEX and OPEX. A great example of this is the entire solution (excluding networking) uses just 10 rack units (RUs) per datacenter. While other vendors storage ESRPs will claim lower RU requirements, they excluding the physical servers required for the solution. Nutanix is advising the requirements for the compute and storage for the solution to be totally transparent.

This means the solution does not require a large investment in your datacenter or co-location and is cost effective to power and cool making the solution environmentally friendly as well.

From a performance perspective, the Nutanix solution was tested in an N-1 configuration to show the performance which can be achieved after the failure of a node within the cluster.

Even with a failed node, the solution achieves excellent performance with average database read and log write latency in the low 1ms range sustained for the 24 stress test required for ESRP submissions.

A few performance highlights:

  1. Nutanix achieved an average of 5172 IOPS per MS Exchange Jetstress instance with just 4 threads!
  2. Database read latency avg of just 1.05ms
  3. Log write latency avg of just 1.21ms
  4. Database backup performance of 215MB/sec per database which equates to more than 1.7GBps per node!

While the achieved performance vastly exceeds the requirements for Exchange, the key factor is the reduced CPU WAIT time achieved which results in much greater CPU efficiency than a physical Exchange server with JBOD storage. Meaning a virtualised exchange server on Nutanix (even hybrid systems) is more efficient than Microsoft Preferred architecture using physical servers and JBOD storage.

You may be asking yourself, why does this matter? The answer is simple. MS Exchange becomes inefficient when scaled up beyond 24 cores so the more efficient the usage of those cores, the more users, messages per day and better user experience can be achieved without scaling up or adding more servers.

So without further delay, I have provided the direct link to the document below for you convenience.

Nutanix ESRP – NX-8150-G5 All Flash 50,000 Users

ntnxallflashesrp