Nutanix – Erasure Coding (EC-X) Deep Dive

I published a post earlier this month during the .NEXT conference titled “What’s .NEXT? – Erasure Coding!” which covered the basics of the Nutanix EC-X implementation.

This post is a deep dive follow-up to answer numerous questions I have received about EC-X, such as:

1. Does it work with Compression and De-duplication?
2. Can I use EC-X to reduce the overhead of RF3?
3. Does it work on Hot or Cold data?
4. Does it work only on the SATA tier?
5. What is the performance impact?
6. When should I use/not use EC-X?
7. What’s different about Nutanix (Patent pending) EC-X compared to other EC algorithms?
8. How does EC-X impact Data Locality?
9. What Hypervisors is EC-X supported with?

So let’s start with: What’s different about Nutanix (patent pending) EC-X compared to other EC algorithms?

* Nutanix EC-X is optimized for a distributed platform: data is spread across nodes, not individual disks, to ensure optimal performance. This also means rebuilds are faster and lower impact, as the rebuild is performed across all the nodes/drives in the cluster.

* Nutanix EC-X is also performed as a background task, and only on Write Cold data. The configured RF write is completed as normal, and EC-X is then applied as a post process, ensuring the write path is not slowed by requiring numerous nodes within the cluster to participate in the initial write I/O.
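As a rough illustration of this post-process approach, here is a minimal sketch in Python. It is not Nutanix code; the threshold, strip width and helper names are assumptions purely for illustration: writes land at the configured RF immediately, and a background pass later strips together extent groups that have stayed Write Cold, adds parity, and reclaims the now-redundant replicas.

```python
import time

WRITE_COLD_SECS = 3600   # illustrative threshold, not the actual curator setting
STRIP_DATA_MEMBERS = 4   # e.g. 4 data + 1 parity for RF2; real strip width varies

def background_ecx_pass(extent_groups, compute_parity, now=None):
    """Hypothetical background pass: the RF write has already completed in the
    normal write path; this task only touches extent groups that are Write Cold."""
    now = now or time.time()
    cold = [eg for eg in extent_groups
            if (now - eg["last_write"]) >= WRITE_COLD_SECS and not eg["in_strip"]]
    # Form strips of N Write Cold extent groups (ideally from different nodes)
    for i in range(0, len(cold) - STRIP_DATA_MEMBERS + 1, STRIP_DATA_MEMBERS):
        strip = cold[i:i + STRIP_DATA_MEMBERS]
        compute_parity(strip)        # parity member placed on yet another node
        for eg in strip:
            eg["replicas"] = 1       # the extra RF copy can now be reclaimed
            eg["in_strip"] = True
```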

How does EC-X affect existing Nutanix Data Reduction technologies?

* Short answer: EC-X is complementary to both compression and deduplication, so you will get even more data reduction. Here is a sample screenshot from the Home screen in PRISM which shows a breakdown of Dedup, Compression and Erasure Coding savings.

[Screenshot: Capacity Optimization savings shown on the PRISM Home screen]

In the Storage Tab within PRISM, we can get further details on the capacity savings. Here we see an example Container with Compression and EC-X enabled:

[Screenshot: Container with Compression and EC-X savings highlighted in the PRISM Storage tab]

Does it work only on the SATA tier?

No, EC-X works on all tiers, which today means SSD and SATA. If newer storage technologies or more than two tiers are used in the future, EC-X will work across all of them.

Does EC-X work on Hot or Cold data?

EC-X waits until data written (via RF2 or RF3) is “Write Cold”, meaning the data is not being overwritten. The data might be white hot from a read I/O perspective, but as long as it’s not being overwritten, the extent group (4MB) will be a candidate for EC-X.

This means that for Write Cold data, the effective capacity of the SSD tier is increased, as EC-X requires less space to store the same data.
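To put some rough numbers on the capacity savings (and on question 2, reducing the overhead of RF3), here is a simple worked example. The 4+1 and 4+2 strip widths below are assumptions for illustration only; the actual strip width depends on cluster size.

```python
def usable_ratio(data_members, parity_members):
    """Fraction of raw capacity that is usable for an erasure coded strip."""
    return data_members / (data_members + parity_members)

print("RF2 usable:      %.0f%%" % (100 / 2))                    # 50%
print("RF3 usable:      %.0f%%" % (100 / 3))                    # ~33%
print("EC-X 4+1 usable: %.0f%%" % (100 * usable_ratio(4, 1)))   # 80%
print("EC-X 4+2 usable: %.0f%%" % (100 * usable_ratio(4, 2)))   # ~67%
```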

What is the performance impact?

As EC-X is a post process task which waits until data is “Write Cold” before encoding it, in general it will not impact write performance.

The exception is when data has been Write Cold for a period of time and is then overwritten: this “overwrite” incurs a higher penalty than a typical RF2/RF3 write. As such, some workloads may not be suitable for EC-X, which I will discuss later.
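As a simplified, hypothetical comparison of why the overwrite costs more (this is the classic erasure coding read-modify-write effect, sketched for illustration rather than a description of the actual internals):

```python
def rf_overwrite_ios(rf=2):
    """Plain RF overwrite: simply rewrite each replica."""
    return {"reads": 0, "writes": rf}

def ec_overwrite_ios(parity_members=1, rf=2):
    """Simplified EC overwrite: read the old data and old parity, recompute the
    parity, then write the new data (back at RF) plus the updated parity."""
    return {"reads": 1 + parity_members, "writes": rf + parity_members}

print(rf_overwrite_ios())   # {'reads': 0, 'writes': 2}
print(ec_overwrite_ios())   # {'reads': 2, 'writes': 3}
```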

Overall, if the workload is suitable, EC-X will keep the data in the SSD tier and the parity on the SATA tier, which effectively extends the usable capacity of the SSD tier and therefore helps increase performance (as with compression and dedup).

What Hypervisors is EC-X supported with?

Everything in the Nutanix Distributed Storage Fabric (part of the Nutanix Xtreme Computing Platform or XCP) is designed to be hypervisor agnostic. So whatever Hypervisor/s you choose, you can benefit from EC-X!

How does EC-X impact Data Locality?

As the initial write path is not impacted by enabling EC-X, Data Locality is still maintained: one copy of data is written to the local node where the VM is running, while a further one or two copies (depending on the RF configuration) are replicated throughout the cluster.

This means that newly written data, as well as data being overwritten at frequencies of less than 60 minutes, will always maintain data locality.
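A minimal sketch of that write path, with the node objects and their write() method assumed purely for illustration:

```python
def rf_write(block, local_node, peer_nodes, rf=2):
    """Illustrative RF write: one copy always lands on the node hosting the VM
    (preserving Data Locality), and rf-1 further copies are replicated to
    other nodes in the cluster."""
    local_node.write(block)
    for peer in peer_nodes[:rf - 1]:
        peer.write(block)
```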

For data which meets the criteria for EC-X, such as Read Hot or Write Cold data, Data Locality can only be partially maintained, as the data is by design striped across nodes. As a result, it is probable some Read I/O will be performed over the network.

Importantly though, Read Hot data will be maintained in the SSD tier and distributed throughout the cluster. This means a single VM’s read I/O can be served by multiple nodes concurrently, which can lead to increased performance.
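As a sketch of that effect (the extent placement map and fetch_extent function are assumptions for illustration), reads for extents that now live on different nodes can be issued in parallel:

```python
from concurrent.futures import ThreadPoolExecutor

def read_extents(extent_locations, fetch_extent):
    """Illustrative parallel read: extent_locations maps extent_id -> node, and
    fetch_extent(node, extent_id) returns that extent's bytes. Because EC-X
    spreads the data across nodes, these reads can be served concurrently."""
    with ThreadPoolExecutor(max_workers=len(extent_locations)) as pool:
        futures = {eid: pool.submit(fetch_extent, node, eid)
                   for eid, node in extent_locations.items()}
        return {eid: f.result() for eid, f in futures.items()}
```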

As EC-X also provides capacity savings, more data can be serviced by the SSD tier, which enables a larger active working set to perform at SSD speeds.

In summary, while Data Locality is not always maintained when using EC-X, the advantages of EC-X far outweigh the partial loss in Data Locality.

And finally: When should I use/not use EC-X?

As discussed earlier, EC-X is applied to Write Cold data, and if/when that data is overwritten, the write penalty is higher than a typical RF2 write I/O. So if your dataset has a high percentage of overwrites, it is recommended not to use EC-X. The good news is storage policy can be assigned at a per-VMDK level (or per vDisk at the NDFS layer), so one VM can use EC-X for some data and RF2/3 for other data, again giving customers the best of both worlds.
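As a very rough way to frame that decision (the threshold here is an arbitrary placeholder, not a Nutanix recommendation):

```python
def ecx_suitable(overwrite_ratio, threshold=0.1):
    """Illustrative heuristic: vDisks whose data is frequently overwritten are
    better left on plain RF2/RF3; mostly Write Cold vDisks benefit from EC-X."""
    return overwrite_ratio < threshold

print(ecx_suitable(0.02))  # True  - data that is rarely overwritten
print(ecx_suitable(0.40))  # False - a dataset with a high percentage of overwrites
```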

The best workloads for EC-X are:

1. File Servers
2. Backup
3. Archive
4. Email
5. Logging

Summary:

Nutanix EC-X gives customers more choice without compromising functionality or performance, while dramatically reducing the cost/GB of storage.

Related Articles:

1. Large scale clusters and increased resiliency with RF3 + EC-X
2. What I/O will Nutanix Erasure coding (EC-X) take effect on?
3. Sizing assumptions for solutions with Erasure Coding (EC-X)

Data Locality & Why it is important for vSphere DRS clusters

I have had a lot of people reach out to me since VMworld SFO, where I was interviewed by Eric Sloof (@esloof) on VMworldTV about Nutanix (the interview can be seen here).

So I thought I would expand on the topic of Data Locality and why it is so important for vSphere DRS clusters to maintain consistent high performance and low latency.

So first, the below diagram shows three (3) Nutanix nodes, and one (1) Guest VM.

[Diagram: A guest VM reading from local storage on one of three Nutanix nodes]

The guest VM is reading data from the local storage in the Nutanix node, and as a result this read access is very fast. The read I/O will be served from one of four places:

1. Extent Cache (DRAM – For “Active Working Set”)
2. Local SSD (For “Active Working Set”)
3. Local SATA (Only for “Cold” data)

and the fourth we will discuss in a moment.
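Putting those sources in order, here is a minimal sketch of the lookup (the names and dict-like stores are illustrative only, not the actual Nutanix read path):

```python
def serve_read(extent_id, extent_cache, local_ssd, local_sata, fetch_remote):
    """Illustrative read path: fastest local source first, with the fourth
    (remote) option as the last resort."""
    if extent_id in extent_cache:       # 1. Extent Cache (DRAM)
        return extent_cache[extent_id]
    if extent_id in local_ssd:          # 2. Local SSD
        return local_ssd[extent_id]
    if extent_id in local_sata:         # 3. Local SATA ("Cold" data)
        return local_sata[extent_id]
    return fetch_remote(extent_id)      # 4. Remote, discussed below
```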

So as a result, for Read I/O:

1. There is no dependency on a Storage Area Network (FCoE, IP, FC etc)
2. Read I/O from one node does not contend with another node
3. Read I/O is very low latency as it does not leave the ESXi host
4. More network bandwidth is available for Virtual Machine traffic, ESXi Mgmt, vMotion, FT etc

But wait, what happens if DRS (or a vSphere admin) vMotions a VM to another node? I’m glad you asked!

The diagram below shows what happens immediately after a vMotion:

[Diagram: Data layout immediately after a vMotion, with only part of the VM’s data local to the new node]

As you can see, only the purple data is local to the new node. Transparently to the virtual machine, if/when remote data is required by the VM (i.e. the VM’s “Active Working Set”), the Nutanix Controller VM (CVM) will get the requested data over the 10Gb network in 1MB extents. (It never does a bulk movement or “Storage vMotion” style movement of all the VM’s data!)
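A minimal sketch of that behaviour, with the local store and remote fetch function assumed purely for illustration:

```python
EXTENT_SIZE = 1024 * 1024   # data is localized in 1MB extents, never in bulk

def read_after_vmotion(extent_id, local_store, fetch_remote):
    """Hypothetical post-vMotion read: if the extent is not yet local, fetch
    just that 1MB extent from the node holding it and keep a local copy so
    subsequent reads of it are served locally."""
    data = local_store.get(extent_id)
    if data is None:
        data = fetch_remote(extent_id)   # single 1MB extent over the network
        local_store[extent_id] = data    # localize only what the VM touches
    return data
```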

And all future Write I/O will be written locally, so future Read I/O will be local by default.

So the worst case scenario for a read I/O in a Nutanix environment is that the required data is not available locally and the CVM will access the data over the 10Gb network.

This scenario will only occur in situations where:

1. Maintenance is occurring and hosts are rebooted
2. A Host Failure (HA restarts VM on another node)
3. Following a vMotion

Generally, in BAU (Business as Usual) operation, Read I/O should be local in the high 90% range.

So the worst case scenario for Read I/O on a vSphere cluster running on Nutanix is actually the best case scenario for a traditional storage array, because in a traditional array all data is accessed over some form of storage network, and generally via a small number of controllers.

It is important to note that the Nutanix DFS (Distributed File System) only accesses data over the network when it is required by the VM, at a granular (1MB extent) level. So only the “Active Working Set” will be accessed over the 10Gb network before being copied locally, again in 1MB extents. If the data is not “Active”, having it remote does not impact performance at all, so moving that data would create an overhead on the environment for no benefit.

Even in the event that 90% of a VM’s data is on a remote node, if the “Active Working Set” is local, read performance will all be at local speeds, again from the Extent Cache (DRAM), local SSD or local SATA (for “cold” data).

Some vendors are working on, or already have, local caching capabilities, but these come with various caveats such as supported Operating System versions and in-guest drivers, and in my experience, for the vast majority of environments today, these technologies are not deployed.

The Nutanix DFS has data locality built in: it works with any hypervisor and Guest OS, and does not require any configuration.

So now you know why ensuring the Active Working Set (data) is as close to the VM as possible is essential for consistent high performance and low latency.

Related Articles

1. Write I/O Performance & High Availability in a scale-out Distributed File System