Write I/O Performance & High Availability in a scale-out Distributed File System

Following on from my recent post titled “Data Locality & Why is important for vSphere DRS clusters” I would like to discuss at a high level how Write I/O works in the Nutanix Distributed File System, how the solution ensures high availability in the event of a node failure and what impact a failure has on performance.

Lets start with a typical Write operation.

The below diagram shows a three (3) node Nutanix cluster with a Guest VM starting to perform write I/O, this is represented in a simplistic manor by the three (3) Diamonds (Red, Yellow and Purple)

NutanixWriteIOstart

The write I/O is written to the local SSD tier (as is every Write in a Nutanix environment) as shown below.

NutanixWriteDataWrittenLocal

Before acknowledging the write the Nutanix Controller VM (CVM) then replicates a copy of the data across the Nutanix Distributed File System.

The below diagram illustrates what this looks like in a three node cluster.

NutanixWriteSyncToOtherNodes

Once the data in successfully written to other nodes within the cluster, the Write acknowledgement is given. This ensures data is consistent and always protected.

In a Nutanix cluster, as Controllers (Nutanix CVMs) are scaled linearly with the ESXi hosts, Write I/O is then spread over more controllers, reducing the chance of contention in the environment at both a storage controller and network layer as each controller shares 2 x 10Gb connections per node.

In the event of a node failure, in a vSphere cluster, HA will restart the failed VM/s onto a surviving node in the cluster.

The VM will start-up and operate as normal and where data is not local to the node (as discussed in detail in my post  “Data Locality & Why is important for vSphere DRS clusters“) the data will initially be accessed over 10Gb before being replicated locally for future reads.

NutanixHAAfterWithDataAccess

All future writes for the VM/s which have been restarted by HA on different nodes will perform at a similar rate (if not the same rate) as they did before the failure depending on how many nodes are in the cluster. Where the Network is not a bottleneck, there should be minimal/no difference in write performance after a node failure.

The Nutanix cluster will also detect a node has failed, and ensure two copies of all data are available, and in the above example where only one copy of the data exists, the cluster will replicate the required data to ensure High Availability (“Replication Factor” of 2) is maintained.

As this replication is done across multiple controllers and nodes, it is much faster and lower impact than a traditional RAID rebuild which most of us will be familiar with.

The end state of this process looks like this.

NutanixHAEndState

So in conclusion using a “scale-out” storage controller solution like Nutanix ensures consistent high write performance even immediately following a node failure by eliminating the requirement for RAID style rebuilds which are disk intensive and can lead to “Double Disk Failures” and data loss.

The replication of data being distributed across all nodes in the cluster ensures minimal impact to each Nutanix controller, ESXi host and the network while ensuring the data is re-protected as soon as possible.

Related Articles

1. Data Locality & Why is important for vSphere DRS clusters

 

Data Locality & Why is important for vSphere DRS clusters

I have had a lot of people reach out to me since VMworld SFO, where I was interviewed by Eric Sloof (@esloof) on VMworldTV (interview can be seen here) about Nutanix.

So I thought I would expand on the topic of Data Locality and why it is so important for vSphere DRS clusters to maintain consistent high performance and low latency.

So first, the below diagram shows three (3) Nutanix nodes, and one (1) Guest VM.

NutanixLocalRead

The guest VM is reading data from the local storage in the Nutanix node and as a result this read access is very fast. The read I/O will be served from one of 4 places.

1. Extent Cache (DRAM – For “Active Working Set”)
2. Local SSD (For “Active Working Set”)
3. Local SATA (Only for “Cold” data)

and the forth we will discuss is a moment.

So as a result for Read I/O

1. There is no dependency on a Storage Area Network (FCoE, IP, FC etc)
2. Read I/O from one node does not contend with another node
3. Read I/O is very low latency as it does not leave the ESXi host
4. More Network bandwidth is available for Virtual Machine traffic, ESXi Mgmt, vMotion , FT etc

But wait, the what happens if DRS (or a vSphere admin) vMotion’s a VM to another node? – I’m glad you asked!

The below shows what happens immediately after a vMotion

NutanixAftervmotion

As you can see, only the Purple data is local to the new node, so transparently to the virtual machine, if/when remote data is required by the VM (ie: The VMs “Active Working Set”) the Nutanix controller VM (CVM) will get the requested data over the 10GB Network in 1MB extents. (It does not do a bulk movement or “Storage vMotion” type movement of all the VMs data EVER!)

And, all future Write I/O will be written local, so future Read I/O will all be local by default.

So, the worst case scenario for a read I/O in a Nutanix environment, is that the required data is not available locally and the CVM will access the data over a 10GB network.

This scenario will only occur in situations where

1. Maintenance is occurring and hosts are rebooted
2. A Host Failure (HA restarts VM on another node)
3. Following a vMotion

Generally in BAU (Business as Usual) operation Read I/O should be local in the high 90% range.

So the worst case scenario for Read I/O on a vSphere Cluster running on Nutanix, is actually the Best case scenario for a traditional storage array, because in a traditional array all data is accessed over some form of storage network and generally via a small number of controllers.

It is important to note, the Nutanix DFS (Distributed File System) only accesses data over the network when its required by the VM at a granular (1MB extent) level. So only the “Active Working Set” will be accessed over the 10Gb network, before being copied locally, again in 1MB extents. So if the data is not “Active” having it remotely does not impact performance at all so moving the data would create an overhead on the environment for no benefit.

In the event 90% of a VMs data is on a remote node, but the “Active Working Set” is local, read performance will all be at local speeds, again from Extent Cache (DRAM), Local SSD or Local SATA (for “cold” data).

Now some vendors are working on or have some local caching capabilities which in my experience are not widely deployed and have various caveats such as Operating System version, and in guest drivers, but for the vast majority of environments today, these technologies are not deployed.

The Nutanix DFS has data locality built in, it works with any hypervisor , Guest OS and does not require any configuration.

So now you know why ensuring the Active Working Set (data) is as close to the VM is essential for consistent high performance and low latency.

Related Articles

1. Write I/O Performance & High Availability in a scale-out Distributed File System

VUM issues? – Remediating an ESXi 5.x host fails (Error code:15)

I just had a most annoying issue with a fresh installed ESXi 5.1 host, release 1065491.

When remediating the Host using VUM I received the below error.

Remediating an ESXi 5.x host fails with the error: The host returns esxupdate error code:15. The package manager transaction is not successful (2030665)

After a quick google I came across this KB

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2030665

However one of the steps (below) required I copy files from a working 5.1 host to resolve the issue. Since I didn’t have a working 5.1 host in this environment I was stuck, but decided to proceed and see if I could resolve it without this step.

“5. Use WinSCP to copy the folders and files from the / locker/packages/5.0.0/ or / locker/packages/5.1.0/directory on a working host to the affected host.”

However if you skip the above step, and follow the instructions below you will be able to remediate your hosts.

Note: The below is a slightly modified version from the KB listed above.

  1. Put the host in the Maintenance Mode. (OPTIONAL – although recommended)
  2. Navigate to the /locker/packages/5.0.0/ folder on the host (or /locker/packages/5.1.0/ on an ESXi 5.1 host).
  3. Rename the folder to /locker/packages/5.0.0.old (or /locker/packages/5.1.0.old on an ESXi 5.1 host).
  4. Recreate the folder, as user root and run the command:

    For ESXi 5.0:

    mkdir / locker/packages/5.0.0/

    For ESXi 5.1:

    mkdir / locker/packages/5.1.0/

  • Verify and ensure that there is sufficient free space on root folder using this command:

    vdf -h

  • Check the locker location using this command:

    ls -ltr /

    If the locker is not pointing to a datastore:

    1. Rename the old locker file using this command:

      mv /locker /locker.old

    2. Recreate the symbolic link using this command:

      ln -s /store/locker /locker

Now retry remediating your hosts and you should be successful.