Nutanix X-Ray Benchmarking tool – Introduction

I’ve been wanting to write about X-Ray for a while now, but I’ve not had the time. Now an opportunity has presented itself to kill two birds with one stone: I’m doing some performance comparisons between Nutanix AHV Turbo Mode and other platforms on the same underlying hardware, so what better time to review X-Ray as part of this process?

So for those of you who have not heard of X-Ray, it wouldn’t be unreasonable to assume it’s just another benchmarking tool to further muddy the waters when comparing different platforms.

However, X-Ray takes a different approach. To quote Paul Updike, who is part of Nutanix Technical Marketing Engineering:

Normally performance is your test variable and you measure the effect on the system. X-ray is upside down, performance of an app in a VM is the control and our test variable is the system. We measure the effect on the control.

So if all you want is “hero numbers”, you’ve come to the wrong place. X-Ray does have a peak performance micro-benchmark test built in, but it’s far from real world in comparison to the other tests within X-Ray.

It is recommended that the X-Ray virtual appliance be run on a cluster which is not the target for the testing, such as a management cluster. But for those environments where this additional hardware may not be available, it can also be deployed on VirtualBox or VMware Workstation on your PC or laptop.

Also if you have an Intel NUC, you could deploy Nutanix Community Edition (CE) and run X-Ray on CE which is based on AHV.

In addition to the different approach X-Ray takes to benchmarking, I like that X-Ray performs fully automated testing across multiple hypervisors, including ESXi and AHV, as well as different underlying storage. This helps ensure consistent and fair comparisons between platforms, or even between Nutanix node types if you decide to compare models before making a purchasing decision.

X-Ray has several built-in tests which are focused not just on outright performance, but on how a system functions and performs during node failures, with snapshots, and during rolling upgrades.

Nutanix took this approach because it is much more real world than simply firing up Iometer with lots of outstanding I/O and a 100% random 4k read workload. In the real world, customers perform upgrades (hopefully regularly, to take advantage of new functionality and performance!), hardware does fail when we can least afford it, and using space-efficient snapshots as part of an overall backup strategy makes a lot of sense.

Now let’s take a look at the X-Ray interface starting with an overview:

XrayOverview

X-Ray is designed to be similar to PRISM to keep that great Nutanix look and feel. The tool is very simple to use, with three sections: Tests, Analyses and Targets.

Getting started is quick and easy: just open the “Targets” view (shown below) and select “New Target”.

XrayTargets

In the “Create Target” popup, you simply provide a name for the target (e.g.: “Nutanix NX-3460 Cluster AHV”) and select the Manager type, being either vCenter for ESXi environments or PRISM for AHV.

Then select the cluster type, being Nutanix (i.e.: A Nutanix NX, Dell XC, Lenovo HX or HPE/Cisco software only) OR “Non-Nutanix” which is for comparisons with platforms not running Nutanix AOS such as VMware vSAN.

XrayCreateTarget

For VMware environments, you then provide the vCenter details. Regardless of the hardware type or platform, you also supply the out of band management (e.g.: IPMI) details. These allow X-Ray to perform simulated hardware failure tests, which are critical to any product evaluation and pre-production operational verification testing.

X-Ray then allows you to select the cluster, container (or datastore) and networking (e.g.: Port Group) to be used for the testing.

XrayCreateTarget_Cluster

X-Ray then discovers the nodes (e.g.: ESXi Hosts) and allows you to add nodes and confirm the IPMI type to ensure maximum compatibility.

XrayCreateTarget_Node

Now hit “Save” and you’re good to go! Pretty simple right?

Now to run a test, simply click the test you want to run and select “Add to Queue”.

Xray_RunTestVDISim

The beauty of this is that X-Ray allows you to queue as many tests as you want and leave the system to run them, say overnight or over a weekend, without requiring you to monitor them and start tests one by one.

In between tests the target systems are cleaned up (i.e.: data and VMs deleted) to ensure consistent / fair results even when running test packages one after another.
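The queue-then-clean-up behaviour described above can be sketched in a few lines of Python. This is purely an illustration of the idea and not X-Ray's implementation or API: the names `run_queue`, `run_test` and `cleanup` are hypothetical, and the two callables stand in for whatever actually drives a test and wipes the target.

```python
from collections import deque
from typing import Callable, Deque, Dict


def run_queue(tests: Deque[str],
              run_test: Callable[[str], str],
              cleanup: Callable[[], None]) -> Dict[str, str]:
    """Run queued tests one by one, cleaning the target after each.

    run_test and cleanup are hypothetical callables supplied by the caller.
    """
    results: Dict[str, str] = {}
    while tests:
        name = tests.popleft()
        try:
            results[name] = run_test(name)
        except Exception as exc:
            # A failed test is recorded and the queue keeps going.
            results[name] = f"failed: {exc}"
        finally:
            # Always clean up so the next test starts from the same state.
            cleanup()
    return results
```

Putting the cleanup in a `finally` block is the design choice that matters here: even a test that fails part way through leaves the target in a known state for the next one, which is what makes an unattended overnight run comparable to running each test by hand.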

Once a test has been run, you can view the results in the X-Ray GUI (as shown below):

XrayTestsOverview

You can also generate a PDF report for individual tests, or perform an analysis comparing two tests, including tests run on different platforms:

XrayAnalyses

The above results show an overlay between two platforms, the first being AHV (although it’s incorrectly named Turbo mode when it was run using the non Turbo mode AOS version 5.1.1). As we can see, AHV even without Turbo mode was more consistent than the other platform.
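If you want to put a number on “more consistent” outside of the GUI, one simple option is the coefficient of variation (standard deviation divided by mean) of the samples from each run. The sketch below is my own illustration and not how X-Ray performs its analysis; the function names and the sample values are made up for the example.

```python
from statistics import mean, pstdev
from typing import Dict, Sequence


def coefficient_of_variation(samples: Sequence[float]) -> float:
    """Population standard deviation divided by the mean (lower = steadier)."""
    avg = mean(samples)
    if avg == 0:
        raise ValueError("mean of samples is zero; CoV is undefined")
    return pstdev(samples) / avg


def more_consistent(runs: Dict[str, Sequence[float]]) -> str:
    """Return the name of the run with the lowest coefficient of variation."""
    return min(runs, key=lambda name: coefficient_of_variation(runs[name]))


if __name__ == "__main__":
    # Illustrative, made-up IOPS samples; NOT results from any real test.
    runs = {
        "platform_a": [4000, 4010, 3990, 4005],
        "platform_b": [4500, 3200, 4800, 3500],
    }
    for name, samples in runs.items():
        print(name, round(coefficient_of_variation(samples), 4))
    print("steadier:", more_consistent(runs))
```

Note that a platform can have a higher average and still lose on this measure, which is the point: the comparison is about how steady the control workload stays, not the peak it reaches.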

To create a PDF report, simply use the “Actions” drop down menu and select “Create Report”.

XrayCreateReport

The report covers details about X-Ray, the Target cluster/s, the scenario being tested and the test results.

XrayTOCReport

It will show simple results such as if the test passed (i.e.: Completed the required tasks) and things like test duration as shown below:

XrayReportTargetOverview

X-Ray also provides built-in tests for mixed workloads, which are much more realistic than testing peak performance for point (or siloed) solutions, which are becoming more and more rare these days.

XrayMixedWorkloads

X-Ray’s built-in tests also scale automatically based on the cluster size of the target and allow tuning of the scenario. For example, in the VDI simulator scenario, Task, Knowledge or Power Users can be selected.

XRayVDISimulator

Summary:

X-Ray is a free-of-charge, multi-hypervisor, multi-platform (including non-HCI) tool which is easy to use for proof of concepts, product comparisons, and real world operational verification.

I am working with the X-Ray team to develop new built-in test scenarios which simulate real world business critical applications, and which allow customers and 3rd parties to validate the benefits of functionality such as data locality.

The following is a series of posts covering Nutanix AHV Turbo Mode performance/functionality comparisons with other products.

Nutanix X-Ray Benchmarking tool Part 2 – Snapshot Impact Scenario

Nutanix X-Ray Benchmarking tool Part 3 – Extended Node Failure Scenario

Example Architectural Decision – Horizon View Desktop Power Policy for Linked Clones (1 of 2)

Problem Statement

In a VMware Horizon View environment using persistent Linked Clones, Disposable disks are being used to redirect transient paging and temporary files to a separate VMDK.

What is the most suitable Desktop Pool setting to ensure storage overheads are reduced?

Assumptions

1. VMware View 4.5 or later
2. Recompose / Refresh cycles are infrequent
3. Desktop Usage concurrency within the pool is less than 100%
4. Memory Reservations are not being used.

Requirements

1. The environment must deliver consistent performance
2. Minimize the cost/utilization of shared storage

Motivation

1. Reduce complexity where possible.
2. Maximize the efficiency of the infrastructure

Architectural Decision

Set the Power Policy for all Linked Clone desktop pools to “Power Off”

Justification

1. Using disposable disks can save storage space by slowing the growth of linked clones and reducing the space used by powered off virtual machines.
2. Using the “Power Off” policy for the pool means at user logoff (or shutdown) the disposable disk will be refreshed, therefore reducing the capacity usage at the storage layer.
3. “Powered Off” VMs do not have a Virtual Machine SWAP file which will also reduce storage consumption.

Implications

1. Setting the policy to “Power Off” will result in more frequent power operations which may impact the performance of the storage and vCenter.
2. When a user attempts to log in to a desktop which has been powered off, there will be a delay while the VM powers on and boots before the user is logged in.
3. The peak concurrency rate of users will need to be understood to allow accurate storage planning for the VSWAP file.
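As a rough aid for the vSwap planning mentioned in the last implication, the arithmetic can be sketched as below. It relies on the standard vSphere behaviour that a powered-on VM's swap file is sized at configured memory minus memory reservation, and that powered-off VMs have no swap file. The function name and the example figures (1,000 desktops, 80% peak concurrency, 2 GB per desktop) are assumptions for illustration only and do not come from this decision.

```python
def vswap_capacity_gb(total_desktops: int,
                      peak_concurrency_pct: int,
                      vm_memory_gb: float,
                      reservation_gb: float = 0.0) -> float:
    """Estimate vSwap capacity needed at peak concurrency.

    Each powered-on VM needs (configured memory - reservation) of swap;
    powered-off desktops consume none.
    """
    if not 0 <= peak_concurrency_pct <= 100:
        raise ValueError("peak_concurrency_pct must be between 0 and 100")
    if reservation_gb > vm_memory_gb:
        raise ValueError("reservation cannot exceed configured memory")
    # Integer ceiling division: round the powered-on count up.
    powered_on = -(-total_desktops * peak_concurrency_pct // 100)
    return powered_on * (vm_memory_gb - reservation_gb)


if __name__ == "__main__":
    # Hypothetical pool: 1,000 desktops, 80% peak concurrency, 2 GB RAM each.
    print(vswap_capacity_gb(1000, 80, 2.0), "GB of vSwap at peak")
```

Sizing for 100% concurrency instead would be the safe upper bound; the saving from the “Power Off” policy is only real if the peak concurrency figure is actually understood.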

Alternatives

1. Increase the frequency of Recompose / Refresh / Rebalance operations
2. Set the Policy to “Take no power action” and schedule an Administrator task to periodically change the Power Policy to “Powered Off” during a maintenance window.
3. Set the Policy to “Ensure desktops are always powered on” and schedule an Administrator task to periodically change the Power Policy to “Powered Off” during a maintenance window.
4. Set the Policy to “Suspend” and schedule an Administrator task to periodically change the Power Policy to “Powered Off” during a maintenance window, however this will consume extra storage for the Suspend File.
5. Use Memory Reservations to reduce storage requirements for vSwap and leave Power Policy to “Always On”.

Related Articles:

The example architectural decision was contributed to by Travis Wood (@vTravWood) and was inspired by the following article:

1. Understanding View Disposable Disks by @vTravWood (Double VCDX #97 Desktop/Datacenter Virtualization)

1. Transparent Page Sharing (TPS) Configuration for VDI (1 of 2)

2. Transparent Page Sharing (TPS) Configuration for VDI (2 of 2)

Example Architectural Decision – Transparent Page Sharing (TPS) Configuration for VDI (2 of 2)

Problem Statement

In a VMware vSphere environment, with future releases of ESXi disabling Transparent Page Sharing by default, what is the most suitable TPS configuration for a Virtual Desktop environment?

Assumptions

1. TPS is disabled by default
2. Storage is expensive
3. Two Socket ESXi Hosts have been chosen to align with a scale out methodology.
4. Average VDI user is a Task Worker with 1 vCPU and 2GB RAM.
5. Memory is the first compute level constraint.
6. HA Admission Control policy used is “Percentage of Cluster Resources reserved for HA”
7. vSphere 5.5 or earlier

Requirements

1. VDI environment costs must be minimized

Motivation

1. Reduce complexity where possible.
2. Maximize the efficiency of the infrastructure

Architectural Decision

Enable TPS and disable Large Memory pages

Justification

1. Disabling Large pages is essential to maximizing the benefits of TPS
2. Not disabling large pages would likely result in minimal TPS savings
3. With Kiosk and Task worker VDI profiles, the percentage of memory which is likely to be shared is higher than for Power users.
4. Existing shared storage has plenty of spare Tier 1 capacity for vSwap files

Implications

1. Sufficient capacity for VM swap files must be catered for.
2. VDI & Storage performance may be impacted significantly in the event of memory contention.
3. Decreased memory costs may result in increased storage costs.
4. After patching ESXi, operational verification is required to confirm that non default settings have not been reverted by the patch.
5. Additional CPU overhead on ESXi from enabling TPS.
6. HA admission control (when configured to Percentage of Cluster resources reserved for HA) will only calculate fail-over capacity based on 0MB + VM overhead for each VM, as no memory is reserved, which can lead to significantly degraded performance in a HA event.
7. Higher core count (and higher cost) CPUs may be desired to drive overcommitment ratios as RAM will be less likely to be a point of contention.
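To see how large the gap in the HA admission control implication can be, the sketch below compares what admission control counts (reservation plus overhead per VM) against the memory the desktops are actually configured with. The function names are hypothetical, and the 40MB per-VM overhead and the 100 desktop count are placeholder assumptions; real overhead varies with VM configuration and ESXi version.

```python
def ha_counted_memory_mb(vm_count: int,
                         reservation_mb: int,
                         overhead_mb: int) -> int:
    """Memory HA admission control accounts for: reservation + overhead per VM."""
    return vm_count * (reservation_mb + overhead_mb)


def configured_memory_mb(vm_count: int, vm_memory_mb: int) -> int:
    """Memory the VMs could actually demand if fully active."""
    return vm_count * vm_memory_mb


if __name__ == "__main__":
    # 100 Task Worker desktops with 2GB RAM each and no reservation.
    # The 40MB overhead per VM is an assumed placeholder value.
    counted = ha_counted_memory_mb(100, reservation_mb=0, overhead_mb=40)
    configured = configured_memory_mb(100, vm_memory_mb=2048)
    print(f"HA accounts for {counted} MB of {configured} MB configured "
          f"({counted / configured:.1%})")
```

With no reservation, admission control sees only a small fraction of the configured memory, so a cluster can pass admission control and still be heavily overcommitted after a host failure. Setting `reservation_mb` to the full VM memory makes the two figures converge, which is the trade-off behind the first alternative listed below.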

Alternatives

1. Use 100% memory reservation and leave TPS disabled (default)
2. Use 50% memory reservation and Enable TPS and disable large pages

Related Articles:

1. The Impact of Transparent Page Sharing (TPS) being disabled by default @josh_odgers (VCDX#90)

2. Example Architectural Decision – Transparent Page Sharing (TPS) Configuration for VDI (1 of 2)

3. Future direction of disabling TPS by default and its impact on capacity planning –@FrankDenneman (VCDX #29)

4. Transparent Page Sharing Vulnerable, Yet Largely Irrelevant – @ChrisWahl(VCDX#104)