Nutanix – Erasure Coding (EC-X) Deep Dive

I published a post earlier this month during the .NEXT conference titled “What’s .NEXT? – Erasure Coding!” which covered the basics of Nutanix EC-X implementation.

This post is a deep dive follow-up to answer numerous questions I have received about EC-X, such as:

1. Does it work with Compression and De-duplication?
2. Can I use EC-X to reduce the overhead of RF3?
3. Does it work on Hot or Cold data?
4. Does it work only on the SATA tier?
5. What is the performance impact?
6. When should I use/not use EC-X?
7. What’s different about Nutanix (Patent pending) EC-X compared to other EC algorithms?
8. How does EC-X impact Data Locality?
9. What Hypervisors is EC-X supported with?

So let's start with: what's different about Nutanix (Patent pending) EC-X compared to other EC algorithms?

* Nutanix EC-X is optimized for a distributed platform, where data is spread across nodes rather than individual disks, to ensure optimal performance. This also means rebuilds are faster and lower impact, as the rebuild is performed across all the nodes/drives.

* Nutanix EC-X is also performed as a background task, and only on Write Cold data. The configured RF write is completed as normal, and EC-X is then applied as a post process, ensuring the write path is not slowed by requiring numerous nodes within the cluster to participate in the initial write I/O.
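
To make the post-process behaviour concrete, here is a minimal sketch in Python (hypothetical names and thresholds, not Nutanix code) of how a background scan might select Write Cold extent groups and erasure code them only after the normal RF write has already completed:

```python
import time

# Assumption: one hour without overwrites counts as "Write Cold" (illustrative only).
WRITE_COLD_SECONDS = 3600

def find_ecx_candidates(extent_groups, now=None):
    """Return extent groups that are Write Cold, i.e. not recently overwritten.
    Read activity does not disqualify a candidate."""
    now = now if now is not None else time.time()
    return [eg for eg in extent_groups
            if now - eg["last_write_ts"] >= WRITE_COLD_SECONDS]

def background_ecx_pass(extent_groups, encode_stripe):
    """Post-process pass: the normal RF2/RF3 write path has already completed,
    so only cold data is re-encoded into an erasure coded stripe here."""
    for eg in find_ecx_candidates(extent_groups):
        encode_stripe(eg)  # compute parity across nodes, then reclaim redundant RF copies
```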

How does EC-X affect existing Nutanix Data Reduction technologies?

* Short answer: EC-X is complementary to both compression and deduplication, so you will get even more data reduction. Here is a sample screenshot from the Home screen in PRISM which shows a breakdown of Dedup, Compression and Erasure Coding savings.

[Screenshot: Capacity Optimization breakdown on the PRISM Home screen]

In the Storage Tab within PRISM, we can get further details on the capacity savings. Here we see an example Container with Compression and EC-X enabled:

[Screenshot: Storage tab showing Container savings with Compression plus EC-X highlighted]
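
As a rough worked example of how the savings combine (the ratios below are assumptions for illustration, not measured Nutanix figures), compression shrinks the data first and EC-X then lowers the redundancy overhead on what remains:

```python
def physical_usage_gb(logical_gb, compression_ratio, redundancy_overhead):
    """Physical space consumed after compression and redundancy.

    compression_ratio: e.g. 2.0 means the data compresses to half its size.
    redundancy_overhead: 2.0 for RF2, or ~1.25 for an assumed 4 data + 1 parity
    EC-X stripe (both figures are illustrative only).
    """
    return logical_gb / compression_ratio * redundancy_overhead

logical = 1000  # GB of VM data (illustrative)
print(physical_usage_gb(logical, 2.0, 2.0))   # RF2 only:          1000.0 GB
print(physical_usage_gb(logical, 2.0, 1.25))  # RF2 + EC-X (4+1):   625.0 GB
```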

Does it work only on the SATA tier?

No. EC-X works across all tiers, which today means SSD and SATA, and if newer technologies or more than two tiers are used in the future, EC-X will work across those as well.

Does EC-X work on Hot or Cold data?

EC-X waits until data written (via RF2 or RF3) is “Write Cold”, meaning the data is not being overwritten. The data might be white hot from a read I/O perspective, but as long as it’s not being overwritten, the extent group (4MB) will be a candidate for EC-X.

This means that for Write Cold data, the effective capacity of the SSD tier is increased, as EC-X reduces the space that data requires.

What is the performance impact?

As EC-X is a post-process task performed only once data is “Write Cold”, in general it will not impact write performance.

The exception is when data that has been Write Cold for a period of time is subsequently overwritten; this “overwrite” incurs a higher penalty than a typical RF2/RF3 write. As such, some workloads may not be suitable for EC-X, which I will discuss later.
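
The sketch below (a minimal Python stand-in with simple XOR parity, not the actual extent store code) shows why such an overwrite costs more: an RF2/RF3 overwrite just writes the new data to each replica, while updating data that is already part of an erasure coded stripe also requires the old data and parity to be read and the parity recomputed:

```python
class Block:
    """Minimal stand-in for a stored slice of an extent group."""
    def __init__(self, data):
        self.data = data
    def read(self):
        return self.data
    def write(self, data):
        self.data = data

def rf_overwrite(new_data, replicas):
    # RF2/RF3 overwrite: simply write the new data to each replica location.
    for replica in replicas:
        replica.write(new_data)

def ecx_overwrite(new_data, data_block, parity_block):
    # Overwriting data already in an EC-X stripe (read-modify-write):
    # read the old data and old parity, recompute the parity (single XOR
    # parity shown as a stand-in), then write both back.
    old = data_block.read()
    old_parity = parity_block.read()
    new_parity = bytes(p ^ o ^ n for p, o, n in zip(old_parity, old, new_data))
    data_block.write(new_data)
    parity_block.write(new_parity)
```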

Overall, if the workload is suitable, EC-X will keep the data in the SSD tier and the parity on the SATA tier, which effectively extends the usable capacity of the SSD tier and therefore helps increase performance (as with compression and dedup).

What Hypervisors is EC-X supported with?

Everything in the Nutanix Distributed Storage Fabric (part of the Nutanix Xtreme Computing Platform or XCP) is designed to be hypervisor agnostic. So whatever Hypervisor/s you choose, you can benefit from EC-X!

How does EC-X impact Data Locality?

As the initial write path is not impacted by enabling EC-X, Data Locality is still maintained: one copy of data is written to the local node where the VM is running, while a further one or two copies (depending on the RF configuration) are replicated throughout the cluster.

This means that newly written data, as well as data being overwritten at frequencies of less than 60 minutes, will always maintain Data Locality.
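
A minimal sketch of that write path (node selection and data structures are simplified and purely illustrative):

```python
import random

def rf_write(data, local_node, remote_nodes, rf=2):
    """RF write path with Data Locality: one copy always lands on the node
    running the VM; the remaining rf - 1 copies go to other nodes."""
    placements = [local_node] + random.sample(remote_nodes, rf - 1)
    for node in placements:
        node.setdefault("extents", []).append(data)
    return [node["name"] for node in placements]

# Example: RF2 keeps one copy local and one elsewhere in the cluster.
local = {"name": "node-a"}
others = [{"name": "node-b"}, {"name": "node-c"}, {"name": "node-d"}]
print(rf_write(b"block", local, others, rf=2))
```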

For data which meets the criteria for EC-X (Write Cold, even if Read Hot), Data Locality can only be partially maintained, as the data is by design striped across nodes. As a result, it is probable that some Read I/O will be performed over the network.

Importantly, though, Read Hot data will be maintained in the SSD tier and distributed throughout the cluster. This means a single VM’s read I/O can be served by multiple nodes concurrently, which can lead to increased performance.
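
As a simple illustration of that fan-out (hypothetical node names and chunk layout, not the actual read path), each chunk of a striped extent can be fetched from a different node in parallel:

```python
from concurrent.futures import ThreadPoolExecutor

def read_striped_extent(stripe_layout, read_chunk):
    """Read an EC-X striped extent: each data chunk can be fetched from a
    different node in parallel, so one VM's read fans out across the cluster."""
    with ThreadPoolExecutor(max_workers=len(stripe_layout)) as pool:
        chunks = pool.map(lambda loc: read_chunk(*loc), stripe_layout)
        return b"".join(chunks)

# Illustrative usage: three chunks of one extent group live on three nodes.
layout = [("node-b", 0), ("node-c", 1), ("node-d", 2)]
print(read_striped_extent(layout, lambda node, idx: f"{node}:{idx};".encode()))
```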

As EC-X also provides capacity savings, more data can be serviced by the SSD tier, which enables a larger active working set to perform at SSD speeds.

In summary, while Data Locality is not always maintained when using EC-X, the advantages of EC-X far outweigh the partial loss in Data Locality.

And finally, When should I use/not use EC-X?

As discussed earlier, EC-X is applied to Write Cold data, and if/when that data is overwritten, the write penalty is higher than a typical RF2 write I/O. So if your dataset has a high percentage of overwrites, it is recommended not to use EC-X. The good news is that storage can be assigned at a per-VMDK level (or per vDisk at the NDFS layer), so you can have one VM using EC-X for some data and RF2/3 for other data, again giving customers the best of both worlds.
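
As a rule of thumb only (the 10% threshold below is an arbitrary illustrative figure, not a Nutanix recommendation), the decision can be thought of like this:

```python
def ecx_suitable(overwrite_fraction, threshold=0.10):
    """Rule-of-thumb check: if a large fraction of the working set is
    regularly overwritten after going cold, the higher EC-X overwrite
    penalty is paid too often. The threshold is purely illustrative."""
    return overwrite_fraction < threshold

print(ecx_suitable(0.02))  # e.g. file server / archive style data -> True
print(ecx_suitable(0.40))  # heavily overwritten data -> False
```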

The best workloads for EC-X are:

1. File Servers
2. Backup
3. Archive
4. Email
5. Logging

Summary:

Nutanix EC-X gives customers more choice without compromising functionality or performance, while dramatically reducing the cost/GB of storage.

Related Articles:

1. Large scale clusters and increased resiliency with RF3 + EC-X
2. What I/O will Nutanix Erasure coding (EC-X) take effect on?
3. Sizing assumptions for solutions with Erasure Coding (EC-X)

Acropolis: VM High Availability (HA)

This past week at Nutanix .NEXT, Acropolis was officially announced although it has actually been available and running in many customer environments (1200+ nodes globally) for a long time.

One of the new features is VM High Availability.

As with everything Nutanix, VM HA is a very simple yet effective feature. Let’s go through how to configure HA via the Acropolis/PRISM HTML 5 interface.

As shown below, using the “Options” menu represented by the cog, there is an option called “Manage VM High Availability”.

[Screenshot: “Manage VM High Availability” option in the PRISM cog menu]

The Manage VM High Availability menu has two simple options, shown below:

1. Enable VM High Availability (On/Off)
2. Best Effort / Reserve Space

Best Effort works as you might expect: in the event of a node failure, VMs are powered on throughout the cluster if resources are available.

In the event resources (e.g. memory) are not available, then some or all VMs may not be powered on.

[Screenshot: VM High Availability enabled with the Best Effort option]

Reserve Space also works as you might expect, by reserving enough compute capacity within the cluster to tolerate either one or two node failures. If RF2 is configured, one node is reserved; if RF3 is in use, two nodes are reserved.
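
A minimal sketch of the reserve-space idea (memory only; reserving the largest node’s worth of capacity is an assumption of this sketch, not a documented detail of the real admission control):

```python
def reserved_failover_memory(node_memory_gb, redundancy_factor=2):
    """Reserve Space HA: keep enough capacity free to restart the VMs from
    one failed node (RF2) or two failed nodes (RF3)."""
    nodes_to_tolerate = 1 if redundancy_factor == 2 else 2
    # Assumption: reserve the largest node(s) worth of memory.
    return sum(sorted(node_memory_gb, reverse=True)[:nodes_to_tolerate])

def can_power_on(vm_memory_gb, free_memory_gb, node_memory_gb, redundancy_factor=2):
    """Admission check: only power on a VM if the reserved failover
    capacity would still be free afterwards."""
    reserve = reserved_failover_memory(node_memory_gb, redundancy_factor)
    return free_memory_gb - vm_memory_gb >= reserve

# Four nodes of 256 GB each, 512 GB currently free, RF2 -> reserve 256 GB.
print(can_power_on(64, free_memory_gb=512, node_memory_gb=[256] * 4))
```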

Pretty simple, right?

[Screenshot: VM High Availability enabled with the Reserve Space option]

The best part about Reserve Space is that it’s like “Host failures cluster tolerates” in vSphere, but without the potentially inefficient slot size algorithm.

Once HA is enabled, it appears on the Home screen of PRISM and gives a summary of the VMs which are On, Off and Suspended, as shown below.

[Screenshot: HA summary on the PRISM Home screen]

HA can also be enabled/disabled on a per VM basis via the VMs tab. Simply highlight the VM and click “Update” as shown below.

[Screenshot: Updating a VM via the VMs tab in PRISM]

You will then see the “Update VM” popup appear, where you can simply enable HA.

[Screenshot: “Update VM” popup with the HA option]

In the above screenshot you can see that the popup also warns you if HA is disabled at the cluster level and allows you to jump straight to the Manage VM High Availability configuration menu.

So there you have it, Acropolis VM High Availability, simple as that.

Related Articles:

1. Acropolis: Scalability
2. What’s .NEXT? – Acropolis!
3. What’s .NEXT? – Erasure Coding!

Acropolis: Scalability

One of the major focuses for Nutanix, both for our Distributed Storage Fabric (part of the Nutanix Xtreme Computing Platform, or XCP) and for the management layer, has been scalability with consistent performance.

Predictable scalability is critical to any distributed platform, and that is just as true for the management layer as it is for storage.

This is one of the many strengths of the Acropolis management layer.

All components which are required to Configure, Manage, Monitor, Scale and Automate are fully distributed across all nodes within the cluster.

As a result, there is no single point of failure with the Nutanix/Acropolis management layer.

Let’s take a look at a typical four node cluster:

Below we see four Controller VMs (CVMs) which service one node each. In the cluster we have an Acropolis Master along with multiple Acropolis Slave instances.

[Diagram: Four node cluster with an Acropolis Master and Acropolis Slaves]

In the event the Acropolis Master becomes unavailable for any reason, an election will take place and one of the Acropolis Slaves will be promoted to Master.

This can be achieved because Acropolis data is stored in a fully distributed Cassandra database which is protected by the Distributed Storage Fabric.
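A greatly simplified sketch of that promotion (the lease/lock primitive below is a stand-in for whichever distributed mechanism is actually used, which this post does not describe):

```python
import time

def elect_master(slaves, try_acquire_lease):
    """Simplified leader election: surviving Acropolis Slaves race to acquire
    a lease-style lock backed by the cluster's distributed services; the
    winner is promoted to Master. 'try_acquire_lease' stands in for that
    primitive and is purely illustrative."""
    while True:
        for slave in slaves:
            if try_acquire_lease(slave):
                return slave  # this slave now acts as the Acropolis Master
        time.sleep(1)         # back off and retry until a lease is granted

# Illustrative use: pretend the lease is granted to the first healthy slave.
print(elect_master(["cvm-2", "cvm-3", "cvm-4"], lambda s: s == "cvm-2"))
```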

When an additional Nutanix node is added to the cluster, an Acropolis Slave is also added, which allows the workload of managing the cluster to be distributed, therefore ensuring management never becomes a point of contention.

[Diagram: Five node cluster with an additional Acropolis Slave]

Things like performance monitoring, stats collection and Virtual Machine console proxy connections are just a few of the management tasks serviced by the Master and Slave instances.
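
One simple way to picture that distribution (hashing VM ids to instances is purely an illustrative assumption, not the documented mechanism):

```python
from hashlib import md5

def assign_owner(vm_uuid, acropolis_instances):
    """Illustrative only: spread per-VM management work (stats collection,
    console proxy sessions, etc.) across all Acropolis instances by hashing
    the VM id, so adding a node (and its Slave) spreads the load further."""
    idx = int(md5(vm_uuid.encode()).hexdigest(), 16) % len(acropolis_instances)
    return acropolis_instances[idx]

instances = ["cvm-1 (Master)", "cvm-2", "cvm-3", "cvm-4"]
print(assign_owner("vm-0001", instances))
print(assign_owner("vm-0002", instances))
```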

Another advantage of Acropolis is that the management layer never needs to be sized or scaled manually. There are no vApps, Database Servers or Windows instances to deploy, install, configure, manage or license, therefore reducing cost and simplifying management of the environment.

Summary:

Acropolis Management is automatically scaled as nodes are added to the cluster, therefore increasing consistency, resiliency and performance, and eliminating the potential for architectural (sizing) errors which may impact manageability.

Note: For non-Acropolis deployments, PRISM is also scaled in the same manner as described above, however the scalability of Hypervisor management layers such as vCenter or SCVMM will need to be considered separately when not using Acropolis.