Nutanix X-Ray Benchmarking tool – Snapshot Impact Scenario

In the first part of this series, I introduced the Nutanix X-Ray benchmarking tool, which is designed very differently to traditional benchmarking tools: the application's performance is the control and the platform is the variable, not the other way around.

This is done by generating realistic IO patterns (e.g. not 100% 4k read) and then performing operations against the platform to see how the control (the VM application performance) is impacted by the underlying platform's functionality.

A great example of this is performing snapshots as the first step in a space-efficient backup solution.

X-Ray has a built-in test which generates an OLTP workload and runs it for 8 hours; for an all-flash platform, the workload drives 6,000 IOPS across the database and 400 IOPS for the logs. The scenario is detailed in the X-Ray report shown below.

[Figure: X-Ray snapshot impact scenario description]
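To give a feel for what such a profile involves, below is a minimal, illustrative Python sketch. It is not X-Ray itself, nor its scenario definition format; it simply issues rate-limited random reads and writes against two local files standing in for the database and log vDisks. The file names, block sizes and read/write mix are assumptions purely for illustration.

```python
# Toy illustration only: approximate a fixed-rate OLTP-style IO profile against
# two demo files. X-Ray drives real worker VMs; this just shows the idea of
# targeting a set IOPS rate per disk. Block sizes and read/write mix are assumed.
import os
import random
import time

TARGETS = {
    "db.bin":  {"iops": 6000, "block": 8 * 1024,  "read_pct": 0.7},   # database disk
    "log.bin": {"iops": 400,  "block": 32 * 1024, "read_pct": 0.0},   # log disk (writes)
}
FILE_SIZE = 256 * 1024 * 1024   # small demo files; a real test uses full-size vDisks


def run(duration_s=30):
    handles = {}
    for name in TARGETS:
        with open(name, "wb") as f:     # pre-allocate (sparse) demo files
            f.truncate(FILE_SIZE)
        handles[name] = os.open(name, os.O_RDWR | getattr(os, "O_BINARY", 0))

    end = time.time() + duration_s
    while time.time() < end:
        second_start = time.time()
        for name, cfg in TARGETS.items():
            fd, block = handles[name], cfg["block"]
            for _ in range(cfg["iops"]):                 # one second's worth of IOs
                offset = random.randrange(0, FILE_SIZE - block)
                os.lseek(fd, offset, os.SEEK_SET)
                if random.random() < cfg["read_pct"]:
                    os.read(fd, block)
                else:
                    os.write(fd, os.urandom(block))
        # best-effort pacing: sleep out the remainder of the second, if any
        time.sleep(max(0.0, 1.0 - (time.time() - second_start)))

    for fd in handles.values():
        os.close(fd)


if __name__ == "__main__":
    run()
```

In practice a purpose-built generator (X-Ray's own worker VMs) drives the workload; the point here is only that the scenario holds the application workload constant while the platform underneath is exercised.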

The snapshot impact scenario is then run against multiple platforms and, using the Analysis functionality within X-Ray, we can generate a report which overlays the results from each platform.

The example below shows the snapshot impact scenario for GA Acropolis Hypervisor (AHV) on AOS 5.1.1 versus a leading hypervisor and SDS platform.

[Figure: Snapshot impact scenario results overlaid for both platforms]

Each red line indicates a snapshot. What we observe is that the performance of both platforms remains consistent until the 10th snapshot (shown below), at which point the Nutanix platform continues without impact while the leading hypervisor and SDS platform starts degrading significantly.

[Figure: Performance of both platforms from the 10th snapshot onwards]
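The overlay report itself is generated by X-Ray, but if you export the per-platform results you can reproduce a similar comparison chart yourself. The sketch below is a minimal example assuming two hypothetical CSV exports with "minute" and "iops" columns and an assumed snapshot cadence; adjust it to whatever your export actually contains.

```python
# Minimal sketch (not X-Ray's own Analysis feature): overlay per-platform IOPS
# time series and mark snapshot times with red lines, roughly reproducing the
# comparison chart. CSV paths, column names and snapshot cadence are assumptions.
import pandas as pd
import matplotlib.pyplot as plt

RESULTS = {
    "AHV / AOS 5.1.1": "ahv_oltp_snapshot.csv",
    "Other hypervisor + SDS": "other_oltp_snapshot.csv",
}
SNAPSHOT_EVERY_MIN = 30        # assumed cadence of snapshots taken during the run
RUN_MINUTES = 8 * 60           # the scenario runs for 8 hours

fig, ax = plt.subplots(figsize=(10, 4))
for label, path in RESULTS.items():
    df = pd.read_csv(path)                      # expects 'minute' and 'iops' columns
    ax.plot(df["minute"], df["iops"], label=label)

for m in range(SNAPSHOT_EVERY_MIN, RUN_MINUTES, SNAPSHOT_EVERY_MIN):
    ax.axvline(m, color="red", linewidth=0.8, alpha=0.6)   # one line per snapshot

ax.set_xlabel("Elapsed time (minutes)")
ax.set_ylabel("Achieved IOPS")
ax.set_title("Snapshot impact scenario: achieved IOPS over time")
ax.legend()
plt.tight_layout()
plt.show()
```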

In the real world, customers use the intelligent features of storage, SDS or hyper-converged platforms, but rarely test how this functionality performs prior to purchasing, because doing so is difficult and time-consuming.

The Nutanix X-Ray tool makes validating a platform's performance under real-world scenarios quick and easy, and provides automatically generated reports in which accurate comparisons can be made.

What this example shows is that while both platforms could achieve the required performance without snapshots, only Nutanix AHV & AOS could maintain the performance while utilising snapshots to achieve the type of recovery point objective (RPO) that is expected in production environments, especially with business critical workloads.

As part of the Nutanix Solutions and Performance engineering organisation, I can tell you that the focus for Nutanix is real-world performance: using data reduction, leveraging snapshots, mixing workloads and testing at large scale.

In upcoming posts I will show more examples of X-Ray test scenarios, as well as comparisons of GA Acropolis Hypervisor (AHV) and AOS 5.1.1 versus a leading hypervisor and SDS platform.

Related Articles:

Nutanix X-Ray Benchmarking tool Part 1 – Introduction

Nutanix X-Ray Benchmarking tool Part 3 – Extended Node Failure Scenario

What’s .NEXT 2016 – Acropolis File Services (AFS)

At .NEXT 2015 Nutanix announced the Scale out File Server Tech Preview, which was supported for AHV environments only. With the imminent release of AOS 4.7, the Scale out File Server has been renamed to Acropolis File Services (AFS) and will now be GA for both AHV and ESXi.

AFS provides what I personally refer to as an “invisible” file server experience because it can be set up with just a few clicks in PRISM, without the need to deploy operating systems.

AFS provides a highly available, distributed single namespace across 3 or more front-end VMs which are automatically deployed and maintained by ADSF. The diagram below shows a mixed cluster of 10 nodes, made up of 8 x NX3060 and 2 x NX6035C nodes, with the AFS UVMs spread across the cluster.

[Figure: AFS overview – mixed 10-node cluster with AFS UVMs spread across the nodes]

Data is then stored on the underlying Acropolis Distributed Storage Fabric (ADSF) in a Container, which can be configured with your desired level of resiliency (e.g. RF2 or RF3) as well as data reduction features such as Compression, Deduplication and Erasure Coding.

AFS inherits all of the resiliency that ADSF natively provides and supports operational tasks such as one-click rolling upgrades of AOS and hypervisor without impacting the availability of the file services.

Functionality

Backups

Nutanix will provide AFS with native support for local recovery points on the primary storage (cluster), and will support both Async-DR (60 mins) and Sync-DR (0 RPO) so that data can be backed up to a remote cluster.

For customers who employ 3rd-party backup tools, AFS can also simply be backed up as an SMB share, which is a common capability among backup vendors such as Commvault and NetBackup.

The diagram below shows, at a high level, what a 3rd-party backup solution looks like with AFS.

[Figure: High-level view of a 3rd-party backup solution with AFS]
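Because the share is just SMB, anything that can read a UNC path can back it up. As a trivial illustration, the sketch below copies the contents of a hypothetical AFS share to a backup staging area; the share and destination paths are assumptions, and a real deployment would use a proper backup product as described above.

```python
# Trivial illustration: an AFS share is a normal SMB share, so a backup host can
# crawl it like any other UNC path. Paths below are hypothetical placeholders.
import shutil

AFS_SHARE = r"\\afs01.example.com\home-drives"   # assumed AFS share UNC path
STAGING = r"D:\backups\afs\home-drives"          # assumed backup staging area

if __name__ == "__main__":
    # dirs_exist_ok allows repeated runs into the same staging folder (Python 3.8+)
    shutil.copytree(AFS_SHARE, STAGING, dirs_exist_ok=True)
```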

Quotas

AFS also allows administrators to set quotas to help with capacity management, especially in multi-tenant or departmental deployments, to avoid users monopolising capacity in the environment.

Patching/Upgrades

Acropolis File Server can be upgraded and patched separately from AOS and the underlying hypervisor. This ensures that the AFS version is not dependent on the AOS or hypervisor versions, which also makes QA easier and minimizes the chance of bugs since the AFS layer is abstracted from AOS and the hypervisor.

This is similar to how the AOS version is not dependent on a hypervisor version, ensuring maximum flexibility and stability for customers. It means that as new features and improvements are added, AFS can be upgraded via PRISM without worrying about interoperability and dependencies.

Patches and upgrades are one-click, rolling and non-disruptive, the same as for AOS.

Scaling

As the file serving workload increases, Acropolis File Server can be scaled out by simply adding instances across which the workload is balanced. If the Nutanix cluster has more nodes than AFS instances, this can be done quickly and easily through PRISM.

If, for example, the cluster has 4 nodes and 4 AFS instances are already deployed, then to scale the performance of the AFS environment the UVMs' vCPU/vRAM can be scaled up, OR additional nodes can be added to the cluster and the AFS instances scaled out.

When one or more additional AFS instances (UVM) are added, the workload is automatically balanced across all UVMs in the environment. ADSF will also automatically balance the new and existing file server data across the ADSF cluster to ensure even capacity utilization across nodes as well as consistent performance and linear scaling.

So in short, AFS provides both scale up and scale out options.

Interoperability with Storage Only nodes

Acropolis File Server is fully supported in environments using storage-only nodes. As the storage-only nodes provide a Nutanix CVM and underlying storage to ADSF, their capacity and performance are made available to AFS just as they are to any other VM. The only requirement is 3 or more Compute+Storage nodes in a cluster to support the minimum of 3 AFS UVMs.

AFS deployment examples

Acropolis File Services can be deployed on existing Nutanix clusters, which allows file data to be co-located on the same storage pool as existing data from virtual machines, as well as from physical or virtual servers utilising Acropolis Block Services (ABS).

[Figure: AFS deployed on an existing Nutanix cluster]

Acropolis File Services can also be deployed on dedicated clusters, such as storage-heavy and storage-only nodes, for environments which do not have virtual machines or for very large environments, while being centrally managed along with other Nutanix clusters via PRISM Central.

[Figure: AFS deployed on a dedicated Nutanix cluster]

Multi-tenancy

AFS also allows multiple separate file server instances to be deployed in the same Nutanix cluster to service different security zones, tenants or use cases. The following shows an example of a 4-node Nutanix cluster with two instances of AFS: the first has 4 UVMs and the second has just 3. Each instance can have different data reduction settings (Compression, Dedupe, EC-X) and be scaled independently.
[Figure: Multiple AFS file server instances in one cluster]

Summary:

  • AFS supports multiple hypervisors and can be deployed in minutes from PRISM
  • Can be scaled both up and out to support more users, capacity and/or performance
  • Interoperable with all OEMs and node types, including storage-only
  • Supports non-disruptive one-click rolling upgrades
  • Supports multiple AFS instances on the one cluster for multi-tenancy and security zone support
  • Has native local recovery point support as well as remote backup (Sync and Async) support
  • All data is protected by the underlying ADSF
  • Supports all ADSF data reduction technologies including Compression, Dedupe and Erasure Coding.
  • Eliminates the requirement for a separate silo for file sharing
  • Capacity available to AFS is automatically expanded as nodes are added to the cluster.


Example Architectural Decision – Guest OS Page File Storage in vSphere

Problem Statement

In a vSphere environment using deduplication and an array-snapshot-based backup solution, Guest OS page files are currently stored on the OS drive (VMDK), which reduces the effectiveness of deduplication and places an overhead on the controllers, which have to scan data that cannot be deduplicated.

As the Guest OS paging files are included in the snapshot process (with the guest OS), this also demands additional capacity for both primary and secondary disk storage for disk-to-disk backups.

How can this overhead be minimized or eliminated?

Requirements

1. Make the most efficient use of the available storage capacity
2. Maintain a consistent level of virtual machine / storage performance
3. Minimize the storage required for primary and secondary snapshot based backups
4. Maintain the array level snapshot based backup solution as it is required to meet RPO/RTOs
5. Maintain the use of deduplication, as this has proven to decrease storage requirements and improve performance

Assumptions

1. vSphere 5.0 or later
2. VMFS 5 Datastores which are Thin Provisioned
3. Deduplication is in use for Volumes where Guest OS virtual disks are stored
4. VAAI is supported by the array and enabled across the vSphere environment
5. All datastores are presented to all hosts within the cluster
6. Snapshot based backup solution is being used
7. Virtual Machines are right sized
8. Disk to disk backup data is replicated offsite

Constraints

1. None

Motivation

1. Optimize the storage performance
2. Ensure Tier 1 storage is not wasted with transient files
3. Minimize storage required for snapshot based backups

Architectural Decision

Separate OS page files onto a dedicated VMDK, which will be located on a datastore (or datastore cluster) that is:
1. Not Protected by the array level snapshot backup solution
2. Not running deduplication
3. Not running data compression

Justification

1. Allows page files to be stored on different underlying storage, including (optionally) high-capacity, lower-cost SATA disk
2. Relocating Guest OS page files to another datastore (or datastore cluster) not protected by snapshots dramatically reduces the amount of data being protected by the snapshot-based backup solution
3. Reduces the amount of data being replicated to secondary disk backup location/s thus minimizing the bandwidth requirements between datacenters
4. (Optionally) Ensures Tier 1 storage is only used for high performance guests
5. As the Virtual Machines are right-sized, the performance impact and frequency of paging should be minimal
6. Reduces the CPU cycles required for deduplication as data which cannot be deduplicated will not be scanned
7. Reduces the CPU cycles on the storage controllers by not attempting to compress page file data

Alternatives

1. Leave page files within the virtual machine's primary VMDK and accept the overhead on the backup solution
2. Turn off paging within the Guest OS (no page file)

Implications

1. The additional steps of creating a dedicated VMDK for the VM and configuring the Guest OS to use the alternate page file location are required (a scripted example is sketched below)
2. Templates need to be updated to the above configuration
3. For environments using Site Recovery Manager, some manual steps are required when setting up protected virtual machines for the first time. This increases the work required during setup; however, as this is a one-time overhead, it is believed the benefit of reduced backup storage and replication traffic (for SRM) outweighs it
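To give an idea of how the first implication could be scripted, below is a minimal pyVmomi sketch that adds a dedicated thin-provisioned page-file VMDK to an existing VM on a nominated (non-deduplicated, non-replicated) datastore. The vCenter address, credentials, VM name, datastore name and disk size are all placeholders, and this is an illustrative sketch under those assumptions rather than a hardened script.

```python
# Illustrative sketch (placeholder names/credentials): add a dedicated page-file
# VMDK to an existing VM, placed on a datastore that is excluded from
# deduplication and from the array-snapshot backup/replication policy.
import ssl

from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

VCENTER, USER, PWD = "vcenter.example.com", "administrator@vsphere.local", "password"
VM_NAME, DATASTORE, SIZE_GB = "app-server-01", "NoDedupe-NoSnap-DS01", 8


def find_by_name(content, vimtype, name):
    """Return the first managed object of the given type with a matching name."""
    view = content.viewManager.CreateContainerView(content.rootFolder, [vimtype], True)
    try:
        return next(obj for obj in view.view if obj.name == name)
    finally:
        view.Destroy()


def add_pagefile_disk(vm, datastore_name, size_gb):
    # Use the VM's first SCSI controller and the next free unit number (7 is reserved).
    controller = next(dev for dev in vm.config.hardware.device
                      if isinstance(dev, vim.vm.device.VirtualSCSIController))
    used_units = {dev.unitNumber for dev in vm.config.hardware.device
                  if getattr(dev, "controllerKey", None) == controller.key}
    unit = next(u for u in range(16) if u != 7 and u not in used_units)

    disk = vim.vm.device.VirtualDisk()
    disk.controllerKey = controller.key
    disk.unitNumber = unit
    disk.capacityInKB = size_gb * 1024 * 1024
    backing = vim.vm.device.VirtualDisk.FlatVer2BackingInfo()
    backing.diskMode = "persistent"
    backing.thinProvisioned = True
    backing.fileName = "[%s]" % datastore_name   # let vSphere generate the VMDK name
    disk.backing = backing

    change = vim.vm.device.VirtualDeviceSpec()
    change.operation = vim.vm.device.VirtualDeviceSpec.Operation.add
    change.fileOperation = vim.vm.device.VirtualDeviceSpec.FileOperation.create
    change.device = disk

    WaitForTask(vm.ReconfigVM_Task(spec=vim.vm.ConfigSpec(deviceChange=[change])))


if __name__ == "__main__":
    si = SmartConnect(host=VCENTER, user=USER, pwd=PWD,
                      sslContext=ssl._create_unverified_context())
    try:
        content = si.RetrieveContent()
        vm = find_by_name(content, vim.VirtualMachine, VM_NAME)
        add_pagefile_disk(vm, DATASTORE, SIZE_GB)
    finally:
        Disconnect(si)
```

Inside the guest, the page file is then moved onto the new drive (in Windows, typically via System Properties > Virtual memory or the PagingFiles registry value), which is the manual step referred to above and the part usually baked into templates.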
