The Key to Performance is Consistency

In recent weeks I have been running many proofs of concept and performance tests using tools such as Jetstress (with great success, I might add).

What I have always told customers is to focus on choosing a solution which comfortably meets their performance requirements while also delivering consistent performance.

The key word here is consistency.

Many solutions can achieve very high peak performance, especially when only cache performance is tested, but this isn’t real world, as I discussed in Peak Performance vs Real World Performance.

So with two Jetstress VMs on a 3 node Nutanix cluster (N+1 configuration), I configured Jetstress to create multiple databases which used about 85% of the available capacity per node. The nodes used were hybrid, meaning a mix of SSD and SATA drives.

What this means is the nodes have ~20% of data within the SSD tier and the bulk of the data residing within the SATA tier, as shown below in the Storage tab of the Nutanix PRISM UI.

[Figure: Tier usage]

As Jetstress performs I/O across all data concurrently, things like caching and tiering become much less effective.

For this testing no tricks have been used, such as de-duplicating Jetstress DBs, which are by design duplicates. Doing this would result in unrealistically high dedupe ratios where all data would be served from SSD/cache, resulting in artificially high performance and low latency. That’s not how I roll; I only talk about real performance numbers which customers can achieve in the real world.

In this post I am not going to talk about the actual IOPS result, the latency figures or the time it took to create the databases, as I’m not interested in getting into performance bake-offs. What I am going to talk about is the percentage difference between the nodes for the following metrics observed during these tests:

1. Time to create the databases: 1.73%

2. IOPS achieved: 0.44%

3. Avg Read Latency: 4.2%

As you can see the percentage difference between the nodes for these metrics is very low, meaning performance is very consistent across a Nutanix cluster.
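The post does not state how these percentage differences were calculated, so the following is only a minimal sketch of one common approach: the spread between the highest and lowest node result, relative to the lowest. The function name and the sample IOPS figures are hypothetical and are not the results from these tests.

```python
def percent_difference(values):
    """Spread between the highest and lowest value, as a percentage of the lowest."""
    lowest, highest = min(values), max(values)
    if lowest <= 0:
        raise ValueError("values must be positive")
    return (highest - lowest) / lowest * 100.0


# Hypothetical per-node IOPS figures, for illustration only.
node_iops = [10000.0, 10044.0]
print(f"IOPS difference between nodes: {percent_difference(node_iops):.2f}%")
```

The same function could be applied to database creation time or average read latency per node. Other definitions (for example, relative to the mean) would give slightly different numbers.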

Note: All testing was performed concurrently, and background tasks performed by the Nutanix “Curator” function, such as ILM (Tiering) and Disk Balancing, were all running during these tests.

What does this mean?

Running business critical workloads on the same Nutanix cluster does not cause any significant noisy neighbour type issues, which can and do occur in traditional centralised shared storage solutions.

VMware have attempted to mitigate this issue with technologies such as Storage I/O Control (SIOC) and Storage DRS (SDRS), but these issues are natively eliminated thanks to the scale out, shared nothing architecture of the Nutanix Xtreme Computing Platform (XCP).

Customers can be confident that performance achieved on one node is repeatable as Nutanix clusters are scaled, even with business critical applications whose large working sets easily exceed the SSD tier.

It also means performance doesn’t “fall off the cache cliff” and become inconsistent, which has long been a fear with systems dependent on cache for performance.

Nutanix has chosen not to rely on caching to achieve high read/write performance. Instead, we tune our defaults for consistent performance across large working sets and to ensure data integrity, which means we commit writes to persistent media before acknowledging them and perform checksums on all read and write I/O. This is key for business critical applications such as MS SQL, MS Exchange and Oracle.

Advanced Storage Performance Monitoring with Nutanix

Nutanix provides excellent performance monitoring and analytic capabilities through our HTML 5 based PRISM UI, but what if you want to delve deeper into the performance of a specific business critical application?

Nutanix also provides advanced storage performance monitoring and workload profiling through port 2009 on any CVM, which shows very granular details for virtual disks.

By default, Nutanix secures the CVM and the http://CVM_IP:2009 page is not accessible, but for advanced troubleshooting access can be enabled by running the following command on the CVM.

sudo iptables -t filter -A WORLDLIST -p tcp -m tcp --dport 2009 -j ACCEPT
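After adding the rule, it can be useful to confirm that the port is actually reachable from your workstation. The following is a minimal sketch using only the Python standard library. The function name is made up for this example, and it only checks that a TCP connection can be opened; it does not validate anything about the page itself.

```python
import socket


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

You would call it as is_port_open(CVM_IP, 2009), substituting the IP address of one of your own CVMs as a string.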

 

When accessing the 2009 page (which is part of the Nutanix process called “Stargate”) you will see things like Extent (in-memory read) cache usage and hits, as well as much more.

On the main 2009 page you will see a section called “Hosted VDisks” (shown below), which lists all the VDisks (the equivalent of a VMDK in ESXi) currently hosted on that node.

[Figure: Hosted VDisks]

The Hosted VDisks section shows high level details about each VDisk, such as outstanding operations, capacity usage, read/write breakdown and how much data is in the OpLog (persistent write cache).

If you need more information, you can click on the “VDisk Id” to open a page titled “VDisk XXXXX Stats”, where XXXXX is the VDisk ID.

Below is some of the information available on the VDisk Stats page.

VDisk Working Set Size (WSS)

The working set size can be thought of as the data which you would ideally want to fit within the SSD tier of a Nutanix node, which would result in all-flash type performance.

In the example below, the VDisk had a combined (or union) working set of 6.208GB over the last 2 minutes and over 111GB over the last hour.

[Figure: Working set size, Exchange]
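As a rough illustration of how a working set figure might be used, the sketch below estimates what fraction of a working set could sit in the SSD tier. The working set sizes are the ones quoted above; the SSD tier size and the function name are hypothetical, and the calculation ignores other VDisks sharing the tier as well as the space used by the OpLog and metadata.

```python
def working_set_fit(working_set_gb, ssd_tier_gb):
    """Fraction of the working set (0.0 to 1.0) that could reside in the SSD tier."""
    if working_set_gb <= 0:
        return 1.0
    return min(1.0, ssd_tier_gb / working_set_gb)


# Hypothetical SSD tier capacity available to this VDisk, in GB.
ssd_tier_gb = 100.0

# Working set sizes quoted above for the 2 minute and 1 hour windows.
for label, wss_gb in [("2 min", 6.208), ("1 hr", 111.0)]:
    fit = working_set_fit(wss_gb, ssd_tier_gb)
    print(f"{label} working set of {wss_gb}GB: {fit:.0%} could fit in the SSD tier")
```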

 

 

VDisk Read Source

The Read Source is simply the tier of storage servicing the VDisk’s I/O requests. In the example below, 41% came from the Extent Cache (in memory), 7% from the SSD Extent Store and 52% from the SATA Extent Store.

[Figure: Read source]

The above example was an Exchange 2013 workload where the total dataset was approx 5x the size of the SSD tier. The important point here is that it’s not always possible to have all data in the SSD tier, but it’s critical to ensure consistent performance. If 90% was being served from SATA and performance was not acceptable, you could use this information to select a better node to migrate (vMotion) the VM to, or to help decide whether to purchase a new node.
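To see why the read source mix matters, a rough weighted average can be sketched. The percentages are the ones from the example above; the per-tier latencies and the function name are purely hypothetical placeholders, not measured Nutanix figures.

```python
def blended_read_latency_ms(read_mix, tier_latency_ms):
    """Weighted average read latency for a read source mix (fractions summing to 1)."""
    total = sum(read_mix.values())
    if abs(total - 1.0) > 1e-6:
        raise ValueError("read mix fractions must sum to 1.0")
    return sum(share * tier_latency_ms[tier] for tier, share in read_mix.items())


# Read source mix from the example above.
read_mix = {"extent_cache": 0.41, "ssd": 0.07, "sata": 0.52}

# Hypothetical per-tier latencies in milliseconds, for illustration only.
tier_latency_ms = {"extent_cache": 0.1, "ssd": 1.0, "sata": 10.0}

latency = blended_read_latency_ms(read_mix, tier_latency_ms)
print(f"Blended read latency: {latency:.2f} ms")
```

Changing the mix to 90% SATA with the same placeholder latencies shows how quickly the average moves towards the slowest tier, which is the situation described above.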

VDisk Write Destination

The Write Destination is fairly self explanatory: if it’s OpLog, the I/O is random and is being written to SSD; if it’s straight to the Extent Store (SSD), the I/O is either sequential or, in rare cases, the OpLog is being bypassed because the SSD tier has reached 95% full (which is generally prevented by the Nutanix ILM tiering process).

[Figure: Write destination]

VDisk Write Size Distribution

The Write Size Distribution is key to determining things like the Windows allocation unit size when formatting drives, as well as to understanding the workload.

[Figure: Write size distribution, overall]

VDisk Read Size Distribution

The Read Size Distribution is similar to Write Size in that it’s key to determining things like the Windows allocation unit size when formatting drives, as well as to understanding the workload. In this case, a 64K allocation unit size would be ideal, as both the writes (shown above) and the reads (below) are >32K and <64K 86% of the time (which is expected, as this was an Exchange 2013 workload).

[Figure: Read size distribution, Exchange]
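The reasoning above can be sketched as a simple heuristic: find the I/O size bucket with the largest share and pick the smallest allocation unit that covers it. This is only an illustration, not a Microsoft or Nutanix recommendation. The function name is hypothetical, and apart from the 86% share in the 32K to 64K bucket, the histogram values are invented for the example.

```python
# A subset of the allocation unit sizes (in KB) available when formatting an NTFS volume.
NTFS_ALLOCATION_UNITS_KB = [4, 8, 16, 32, 64]


def suggest_allocation_unit_kb(size_histogram):
    """Pick the smallest allocation unit that covers the most common I/O size bucket.

    size_histogram maps (low_kb, high_kb) buckets to the percentage of I/Os in that bucket.
    """
    (_low_kb, high_kb), _share = max(size_histogram.items(), key=lambda item: item[1])
    for unit_kb in NTFS_ALLOCATION_UNITS_KB:
        if unit_kb >= high_kb:
            return unit_kb
    # Fall back to the largest unit in the list for very large I/O sizes.
    return NTFS_ALLOCATION_UNITS_KB[-1]


# Only the 86% figure comes from the example; the other buckets are hypothetical.
io_sizes = {(0, 4): 3.0, (4, 8): 4.0, (8, 16): 3.0, (16, 32): 4.0, (32, 64): 86.0}
print(f"Suggested allocation unit size: {suggest_allocation_unit_kb(io_sizes)}K")
```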

VDisk Write Latency

The Write Latency shows the percentage of write I/Os serviced within each of the latency ranges shown. In this case, 52% of writes are sub-millisecond. It also shows that, for this VDisk, 1% of I/Os are outliers served in 5-10ms. Outside of a lab, if the outliers made up a significant percentage, this could be investigated to ensure the VM disk configuration (e.g. PVSCSI and number of VMDKs) is optimal.

[Figure: Write latency]

VDisk Ops and Randomness

Here we see the number of IOPS, the Read/Write split, MB/s and the split between Random and Sequential.

[Figure: VDisk ops]

Summary

For any enterprise grade storage solution, it is important that performance monitoring is easy, as it is with Nutanix via the PRISM UI, but also that you can quickly and easily dive deep into very granular details about a specific VM or VDisk. The above shows just a glimpse of the information which is tracked by default for all VDisks, allowing customers, partners and Nutanix support to quickly and easily monitor and profile workloads.

Importantly, these capabilities are hypervisor agnostic, giving customers the same capabilities no matter what choice/s they make.

 


Nutanix – Improving Resiliency of Large Clusters with Erasure Coding (EC-X)

As cluster sizes increase, it is important to understand that the chance of multiple concurrent failures also increases, and to architect solutions to ensure resiliency is maintained.
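A simple probability model illustrates the point. The sketch below assumes node failures are independent and that each node has the same chance of being down at any given moment. Both are simplifying assumptions, and the per-node probability and function name are hypothetical rather than Nutanix figures.

```python
from math import comb


def prob_at_least_k_failures(nodes, k, p_node):
    """P(at least k of `nodes` are failed at once), assuming independent failures."""
    p_fewer = sum(
        comb(nodes, i) * p_node**i * (1 - p_node) ** (nodes - i) for i in range(k)
    )
    return 1.0 - p_fewer


# Hypothetical chance that any one node is down at a given moment.
P_NODE_DOWN = 0.001

for nodes in (4, 8, 16, 32, 64):
    p2 = prob_at_least_k_failures(nodes, 2, P_NODE_DOWN)
    print(f"{nodes:>2} nodes: P(2 or more concurrent failures) = {p2:.6%}")
```

Whatever value is chosen for the per-node probability, the chance of two or more concurrent failures grows as nodes are added, which is why the data protection level needs to be considered alongside cluster size.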

Because scalability is one of many strengths of the Nutanix Distributed Storage Fabric, Nutanix supports multiple data protection levels (RF2 and RF3) to ensure resiliency can be scaled with cluster size.

However, using RF3 reduces the usable capacity to approximately 33% of the formatted capacity of the drives within the cluster, which means it is sometimes considered undesirable.

But because some customers require the ability to support multiple concurrent node failures without the chance of data loss or unavailability, RF3 has been required.

Enter Nutanix Erasure Coding (EC-X)!

Now let’s say you have a 32 node cluster where each node has 10TB RAW.

With RF3 we would have approx 3.33TB usable per node for a total of 106.56TB in the cluster.

With EC-X enabled (assuming EC-X has been applied to all data) the usable capacity would DOUBLE to 6.66TB per node and 213.12TB for the cluster.
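The arithmetic above can be reproduced with a short sketch. The usable fractions are the ones described in this post (RF3 at roughly a third of RAW, EC-X doubling that when applied to all data), with RF2 at roughly 50% of RAW included for comparison. The names are made up for this example. Because the code does not round the per-node figure to 3.33TB first, its cluster totals come out slightly higher than the rounded numbers quoted above.

```python
# Approximate usable fraction of RAW capacity per data protection scheme.
USABLE_FRACTION = {
    "RF2": 1 / 2,         # two copies of all data
    "RF3": 1 / 3,         # three copies of all data
    "RF3 + EC-X": 2 / 3,  # double RF3, assuming EC-X has been applied to all data
}


def cluster_usable_tb(nodes, raw_tb_per_node, scheme):
    """Approximate usable capacity in TB for a cluster under the given scheme."""
    return nodes * raw_tb_per_node * USABLE_FRACTION[scheme]


NODES = 32
RAW_TB_PER_NODE = 10.0

for scheme in USABLE_FRACTION:
    total = cluster_usable_tb(NODES, RAW_TB_PER_NODE, scheme)
    print(f"{scheme:<10}: {total / NODES:.2f}TB per node, {total:.2f}TB for the cluster")
```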

Here’s how it works.

For RF3, the Nutanix Distributed Storage Fabric writes and maintains three copies of each piece of data. The below shows three copies of data “A” and “B”.

[Figure: RF3]

The below is a simplified example of what the Nutanix Distributed Storage Fabric looks like once EC-X is applied to RF3 data.

[Figure: RF3 plus EC-X]

As you can see, we now store twice the amount of data as RF3 while still having dual parity. As a result, RF3 + EC-X gives customers with large clusters MORE usable capacity than RF2 (~50% of RAW) while providing dual parity (which tolerates the loss of two nodes without data loss/unavailability).

Not bad for a software only upgrade!

So what do I recommend for customers who are running 32 node or larger clusters?

1. For customers running RF3 already, consider enabling EC-X.
2. For customers running RF2, consider enabling RF3 and EC-X.