Nutanix – Erasure Coding (EC-X) Deep Dive

I published a post earlier this month during the .NEXT conference titled “What’s .NEXT? – Erasure Coding!” which covered the basics of the Nutanix EC-X implementation.

This post is a deep dive follow-up to answer numerous questions I have received about EC-X, such as:

1. Does it work with Compression and De-duplication?
2. Can I use EC-X to reduce the overhead of RF3?
3. Does it work on Hot or Cold data?
4. Does it work only on the SATA tier?
5. What is the performance impact?
6. When should I use/not use EC-X?
7. What’s different about Nutanix (Patent pending) EC-X compared to other EC algorithms?
8. How does EC-X impact Data Locality?
9. What Hypervisors is EC-X supported with?

So let’s start with: What’s different about Nutanix (patent pending) EC-X compared to other EC algorithms?

* Nutanix EC-X is optimized for a distributed platform: data is spread across nodes, not individual disks, to ensure optimal performance. This also means rebuilds are faster and lower impact, as the rebuild is performed across all the nodes/drives in the cluster.

* Nutanix EC-X is also performed as a background task, and only on Write Cold data. The configured RF write is completed as normal, and EC-X is then applied as a post process, ensuring the write path is not slowed by requiring numerous nodes within the cluster to participate in the initial write I/O.
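As a rough illustration of this post-process approach, here is a minimal sketch in Python. It is not Nutanix code; the threshold, strip width and helper names are assumptions purely for illustration: writes land at the configured RF immediately, and a background pass later strips together extent groups that have stayed Write Cold, adds parity, and reclaims the now-redundant replicas.

```python
import time

WRITE_COLD_SECS = 3600   # illustrative threshold, not the actual curator setting
STRIP_DATA_MEMBERS = 4   # e.g. 4 data + 1 parity for RF2; real strip width varies

def background_ecx_pass(extent_groups, compute_parity, now=None):
    """Hypothetical background pass: the RF write has already completed in the
    normal write path; this task only touches extent groups that are Write Cold."""
    now = now or time.time()
    cold = [eg for eg in extent_groups
            if (now - eg["last_write"]) >= WRITE_COLD_SECS and not eg["in_strip"]]
    # Form strips of N Write Cold extent groups (ideally from different nodes)
    for i in range(0, len(cold) - STRIP_DATA_MEMBERS + 1, STRIP_DATA_MEMBERS):
        strip = cold[i:i + STRIP_DATA_MEMBERS]
        compute_parity(strip)        # parity member placed on yet another node
        for eg in strip:
            eg["replicas"] = 1       # the extra RF copy can now be reclaimed
            eg["in_strip"] = True
```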

How does EC-X affect existing Nutanix Data Reduction technologies?

* Short answer: EC-X is complementary to both compression and deduplication, so you will get even more data reduction. Here is a sample screenshot from the Home screen in PRISM which shows a breakdown of Dedup, Compression and Erasure Coding savings.

[Screenshot: Capacity Optimization savings shown on the PRISM Home screen]

In the Storage Tab within PRISM, we can get further details on the capacity savings. Here we see an example Container with Compression and EC-X enabled:

[Screenshot: Container with Compression and EC-X savings highlighted in the PRISM Storage tab]

Does it work only on the SATA tier?

No, EC-X works on all tiers, which today means SSD and SATA. If newer storage technologies or more than two tiers are used in the future, EC-X will work across all of them.

Does EC-X work on Hot or Cold data?

EC-X waits until data written (via RF2 or RF3) is “Write Cold”, meaning the data is not being overwritten. The data might be white hot from a read I/O perspective, but as long as it’s not being overwritten, the extent group (4MB) will be a candidate for EC-X.

This means that for Write Cold data, the effective capacity of the SSD tier is increased, as EC-X requires less space to store the same data.
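To put some rough numbers on the capacity savings (and on question 2, reducing the overhead of RF3), here is a simple worked example. The 4+1 and 4+2 strip widths below are assumptions for illustration only; the actual strip width depends on cluster size.

```python
def usable_ratio(data_members, parity_members):
    """Fraction of raw capacity that is usable for an erasure coded strip."""
    return data_members / (data_members + parity_members)

print("RF2 usable:      %.0f%%" % (100 / 2))                    # 50%
print("RF3 usable:      %.0f%%" % (100 / 3))                    # ~33%
print("EC-X 4+1 usable: %.0f%%" % (100 * usable_ratio(4, 1)))   # 80%
print("EC-X 4+2 usable: %.0f%%" % (100 * usable_ratio(4, 2)))   # ~67%
```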

What is the performance impact?

As EC-X is a post process task which waits until data is “Write Cold” before encoding it, in general it will not impact write performance.

The exception is when data has been Write Cold for a period of time and is then overwritten: this “overwrite” incurs a higher penalty than a typical RF2/RF3 write. As such, some workloads may not be suitable for EC-X, which I will discuss later.
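As a simplified, hypothetical comparison of why the overwrite costs more (this is the classic erasure coding read-modify-write effect, sketched for illustration rather than a description of the actual internals):

```python
def rf_overwrite_ios(rf=2):
    """Plain RF overwrite: simply rewrite each replica."""
    return {"reads": 0, "writes": rf}

def ec_overwrite_ios(parity_members=1, rf=2):
    """Simplified EC overwrite: read the old data and old parity, recompute the
    parity, then write the new data (back at RF) plus the updated parity."""
    return {"reads": 1 + parity_members, "writes": rf + parity_members}

print(rf_overwrite_ios())   # {'reads': 0, 'writes': 2}
print(ec_overwrite_ios())   # {'reads': 2, 'writes': 3}
```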

Overall, if the workload is suitable, EC-X will keep the data in the SSD tier and the parity on the SATA tier, which effectively extends the usable capacity of the SSD tier and therefore helps increase performance (as with compression and dedup).

What Hypervisors is EC-X supported with?

Everything in the Nutanix Distributed Storage Fabric (part of the Nutanix Xtreme Computing Platform or XCP) is designed to be hypervisor agnostic. So whatever Hypervisor/s you choose, you can benefit from EC-X!

How does EC-X impact Data Locality?

As the initial write path is not impacted by enabling EC-X, Data Locality is still maintained: one copy of data is written to the local node where the VM is running, while a further one or two copies (depending on the RF configuration) are replicated throughout the cluster.

This means that newly written data, as well as data being overwritten at frequencies of less than 60 minutes, will always maintain data locality.
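A minimal sketch of that write path, with the node objects and their write() method assumed purely for illustration:

```python
def rf_write(block, local_node, peer_nodes, rf=2):
    """Illustrative RF write: one copy always lands on the node hosting the VM
    (preserving Data Locality), and rf-1 further copies are replicated to
    other nodes in the cluster."""
    local_node.write(block)
    for peer in peer_nodes[:rf - 1]:
        peer.write(block)
```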

For data which meets the criteria for EC-X, such as Read Hot or Write Cold data, Data Locality can only be partially maintained, as the data is by design striped across nodes. As a result, it is probable some Read I/O will be performed over the network.

Importantly though, Read Hot data will be maintained in the SSD tier and distributed throughout the cluster. This means a single VM’s read I/O can be served by multiple nodes concurrently, which can lead to increased performance.
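As a sketch of that effect (the extent placement map and fetch_extent function are assumptions for illustration), reads for extents that now live on different nodes can be issued in parallel:

```python
from concurrent.futures import ThreadPoolExecutor

def read_extents(extent_locations, fetch_extent):
    """Illustrative parallel read: extent_locations maps extent_id -> node, and
    fetch_extent(node, extent_id) returns that extent's bytes. Because EC-X
    spreads the data across nodes, these reads can be served concurrently."""
    with ThreadPoolExecutor(max_workers=len(extent_locations)) as pool:
        futures = {eid: pool.submit(fetch_extent, node, eid)
                   for eid, node in extent_locations.items()}
        return {eid: f.result() for eid, f in futures.items()}
```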

As EC-X also provides capacity savings, more data can be serviced by the SSD tier, which enables a larger active working set to perform at SSD speeds.

In summary, while Data Locality is not always maintained when using EC-X, the advantages of EC-X far outweigh the partial loss in Data Locality.

And finally: When should I use/not use EC-X?

As discussed earlier, EC-X is applied to Write Cold data, and if/when that data is overwritten, the write penalty is higher than a typical RF2 write I/O. So if your dataset has a high percentage of overwrites, it is recommended not to use EC-X. The good news is storage policy can be assigned at a per-VMDK level (or per vDisk at the NDFS layer), so one VM can use EC-X for some data and RF2/3 for other data, again giving customers the best of both worlds.
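As a very rough way to frame that decision (the threshold here is an arbitrary placeholder, not a Nutanix recommendation):

```python
def ecx_suitable(overwrite_ratio, threshold=0.1):
    """Illustrative heuristic: vDisks whose data is frequently overwritten are
    better left on plain RF2/RF3; mostly Write Cold vDisks benefit from EC-X."""
    return overwrite_ratio < threshold

print(ecx_suitable(0.02))  # True  - data that is rarely overwritten
print(ecx_suitable(0.40))  # False - a dataset with a high percentage of overwrites
```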

The best workloads for EC-X are:

1. File Servers
2. Backup
3. Archive
4. Email
5. Logging

Summary:

Nutanix EC-X gives customers more choice without compromising functionality or performance, while dramatically reducing the cost/GB of storage.

Related Articles:

1. Large scale clusters and increased resiliency with RF3 + EC-X
2. What I/O will Nutanix Erasure coding (EC-X) take effect on?
3. Sizing assumptions for solutions with Erasure Coding (EC-X)

Data Locality & Why it is important for vSphere DRS clusters

I have had a lot of people reach out to me since VMworld SFO, where I was interviewed by Eric Sloof (@esloof) on VMworldTV about Nutanix (the interview can be seen here).

So I thought I would expand on the topic of Data Locality and why it is so important for vSphere DRS clusters to maintain consistent high performance and low latency.

So first, the below diagram shows three (3) Nutanix nodes, and one (1) Guest VM.

[Diagram: A guest VM reading from local storage on one of three Nutanix nodes]

The guest VM is reading data from the local storage in the Nutanix node, and as a result this read access is very fast. The read I/O will be served from one of four places:

1. Extent Cache (DRAM – For “Active Working Set”)
2. Local SSD (For “Active Working Set”)
3. Local SATA (Only for “Cold” data)

and the fourth we will discuss in a moment.
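Putting those sources in order, here is a minimal sketch of the lookup (the names and dict-like stores are illustrative only, not the actual Nutanix read path):

```python
def serve_read(extent_id, extent_cache, local_ssd, local_sata, fetch_remote):
    """Illustrative read path: fastest local source first, with the fourth
    (remote) option as the last resort."""
    if extent_id in extent_cache:       # 1. Extent Cache (DRAM)
        return extent_cache[extent_id]
    if extent_id in local_ssd:          # 2. Local SSD
        return local_ssd[extent_id]
    if extent_id in local_sata:         # 3. Local SATA ("Cold" data)
        return local_sata[extent_id]
    return fetch_remote(extent_id)      # 4. Remote, discussed below
```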

So as a result, for Read I/O:

1. There is no dependency on a Storage Area Network (FCoE, IP, FC etc)
2. Read I/O from one node does not contend with another node
3. Read I/O is very low latency as it does not leave the ESXi host
4. More network bandwidth is available for Virtual Machine traffic, ESXi Mgmt, vMotion, FT etc

But wait, what happens if DRS (or a vSphere admin) vMotions a VM to another node? I’m glad you asked!

The diagram below shows what happens immediately after a vMotion:

[Diagram: Data layout immediately after a vMotion, with only part of the VM’s data local to the new node]

As you can see, only the purple data is local to the new node. Transparently to the virtual machine, if/when remote data is required by the VM (i.e. the VM’s “Active Working Set”), the Nutanix Controller VM (CVM) will get the requested data over the 10Gb network in 1MB extents. (It never does a bulk movement or “Storage vMotion” style movement of all the VM’s data!)
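A minimal sketch of that behaviour, with the local store and remote fetch function assumed purely for illustration:

```python
EXTENT_SIZE = 1024 * 1024   # data is localized in 1MB extents, never in bulk

def read_after_vmotion(extent_id, local_store, fetch_remote):
    """Hypothetical post-vMotion read: if the extent is not yet local, fetch
    just that 1MB extent from the node holding it and keep a local copy so
    subsequent reads of it are served locally."""
    data = local_store.get(extent_id)
    if data is None:
        data = fetch_remote(extent_id)   # single 1MB extent over the network
        local_store[extent_id] = data    # localize only what the VM touches
    return data
```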

And all future Write I/O will be written locally, so future Read I/O will be local by default.

So the worst case scenario for a read I/O in a Nutanix environment is that the required data is not available locally and the CVM will access the data over the 10Gb network.

This scenario will only occur in situations where:

1. Maintenance is occurring and hosts are rebooted
2. A Host Failure (HA restarts VM on another node)
3. Following a vMotion

Generally, in BAU (Business as Usual) operation, Read I/O should be local in the high 90% range.

So the worst case scenario for Read I/O on a vSphere cluster running on Nutanix is actually the best case scenario for a traditional storage array, because in a traditional array all data is accessed over some form of storage network, and generally via a small number of controllers.

It is important to note that the Nutanix DFS (Distributed File System) only accesses data over the network when it is required by the VM, at a granular (1MB extent) level. So only the “Active Working Set” will be accessed over the 10Gb network before being copied locally, again in 1MB extents. If the data is not “Active”, having it remote does not impact performance at all, so moving that data would create an overhead on the environment for no benefit.

Even in the event that 90% of a VM’s data is on a remote node, if the “Active Working Set” is local, read performance will all be at local speeds, again from the Extent Cache (DRAM), local SSD or local SATA (for “cold” data).

Some vendors are working on, or already have, local caching capabilities, but these come with various caveats such as supported Operating System versions and in-guest drivers, and in my experience, for the vast majority of environments today, these technologies are not deployed.

The Nutanix DFS has data locality built in: it works with any hypervisor and Guest OS, and does not require any configuration.

So now you know why ensuring the Active Working Set (data) is as close to the VM as possible is essential for consistent high performance and low latency.

Related Articles

1. Write I/O Performance & High Availability in a scale-out Distributed File System