Write I/O Performance & High Availability in a scale-out Distributed File System

Following on from my recent post titled “Data Locality & Why is important for vSphere DRS clusters” I would like to discuss at a high level how Write I/O works in the Nutanix Distributed File System, how the solution ensures high availability in the event of a node failure and what impact a failure has on performance.

Lets start with a typical Write operation.

The below diagram shows a three (3) node Nutanix cluster with a Guest VM starting to perform write I/O, this is represented in a simplistic manor by the three (3) Diamonds (Red, Yellow and Purple)

NutanixWriteIOstart

The write I/O is written to the local SSD tier (as is every Write in a Nutanix environment) as shown below.

NutanixWriteDataWrittenLocal

Before acknowledging the write the Nutanix Controller VM (CVM) then replicates a copy of the data across the Nutanix Distributed File System.

The below diagram illustrates what this looks like in a three node cluster.

NutanixWriteSyncToOtherNodes

Once the data in successfully written to other nodes within the cluster, the Write acknowledgement is given. This ensures data is consistent and always protected.

In a Nutanix cluster, as Controllers (Nutanix CVMs) are scaled linearly with the ESXi hosts, Write I/O is then spread over more controllers, reducing the chance of contention in the environment at both a storage controller and network layer as each controller shares 2 x 10Gb connections per node.

In the event of a node failure, in a vSphere cluster, HA will restart the failed VM/s onto a surviving node in the cluster.

The VM will start-up and operate as normal and where data is not local to the node (as discussed in detail in my post  “Data Locality & Why is important for vSphere DRS clusters“) the data will initially be accessed over 10Gb before being replicated locally for future reads.

NutanixHAAfterWithDataAccess

All future writes for the VM/s which have been restarted by HA on different nodes will perform at a similar rate (if not the same rate) as they did before the failure depending on how many nodes are in the cluster. Where the Network is not a bottleneck, there should be minimal/no difference in write performance after a node failure.

The Nutanix cluster will also detect a node has failed, and ensure two copies of all data are available, and in the above example where only one copy of the data exists, the cluster will replicate the required data to ensure High Availability (“Replication Factor” of 2) is maintained.

As this replication is done across multiple controllers and nodes, it is much faster and lower impact than a traditional RAID rebuild which most of us will be familiar with.

The end state of this process looks like this.

NutanixHAEndState

So in conclusion using a “scale-out” storage controller solution like Nutanix ensures consistent high write performance even immediately following a node failure by eliminating the requirement for RAID style rebuilds which are disk intensive and can lead to “Double Disk Failures” and data loss.

The replication of data being distributed across all nodes in the cluster ensures minimal impact to each Nutanix controller, ESXi host and the network while ensuring the data is re-protected as soon as possible.

Related Articles

1. Data Locality & Why is important for vSphere DRS clusters

 

Example Architectural Decision – Storage I/O control for IaaS solutions

Problem Statement

In vSphere clusters servicing IaaS or Cloud workloads where customers or departments have the ability to self provision virtual machines with varying storage I/O requirements, how can the cluster be configured to ensure the most consistent virtual machine performance from a storage perspective?

Assumptions

1. vSphere 5.1 or later (to support both VMFS and NFS datastores and SIOC Automatic latency threshold computation)

Motivation

1. Ensure consistent storage performance for all virtual machines
2. Prevent a single virtual machine preventing other virtual machines reasonable access to storage

Architectural Decision

Enable Storage I/O control for all datastores and leave the shares values at the default setting for all virtual machines.

Set Tier 1 storage congestion threshold to 10ms – eg: SSD or SAS 15k RPM
Set Tier 2 storage congestion threshold to 20ms – eg: 15k or 10k SAS
Set Tier 3 storage congestion threshold to 30ms – eg: 7.2k SATA

Justification

1. In a IaaS or Cloud environment, it is important to prevent intentional or unintentional DoS type attacks; Storage I/O control will prevent such activities by giving equal access to the storage for all virtual machines attempting concurrent access.
2. Ensure no virtual machine/s monopolize the available I/O of the underlying storage eg: The noisy neighbor issue
3. Storage I/O control ensures consistent access across all ESXi hosts with access to the datastore, not just a single host. This ensures equal I/O access across the environment, not just across a single ESXi host.
4. Tier 1 should maintain lower latency than lower Tier disk, as such, a lower congestion threshold is advisable to ensure optimal performance for virtual machines hosted on Tier 1
5. Virtual machines requiring significant I/O will not be significantly impacted by Storage I/O control (assuming the congestion threshold is reached) as other VMs requiring access to storage will be able to access storage (thanks to Storage I/O control) and complete any required I/O in a timely manner and once the I/O is completed, no longer impact performance at all.
6. Virtual Machine not accessing storage regularly will not impact the VMs accessing storage regularly as Storage I/O control only acts on VMs accessing storage concurrently.
7. Leaving VMs with the default share value decreases administrative overhead and prevents human error granting significantly higher (or lower) share values which may negatively impact performance for one or more VMs

Implications

1. When using Storage DRS with SIOC the Storage DRS I/O latency setting needs to be carefully considered. Setting these value below the SOIC values (assuming Manual latency values are set) is recommended to ensure Storage DRS can work towards evenly balancing the storage workload and improving overall performance & SIOC then can help ensure consistent performance by taking action when the congestion threshold is reached to minimize latency spikes.

Alternatives

1. For vSphere 5.1 environments use the “Automatic Latency Threshold” by selecting the “Percentage of Peak Throughput” and setting the percentage value to “90%”. This setting is designed to minimize the change of a misaligned congestion threshold being manually set, therefore potentially reducing the effectiveness of SIOC
2. Not enable Storage I/O control
3. Enable Storage I/O control and set higher than default share values on critical VMs