What’s .NEXT? – Erasure Coding!

Up to now, Nutanix has used a concept known as “Replication Factor” or “RF” to provide storage layer data protection as opposed to older RAID technologies.

RF allows customers to configure either 2 or 3 copies of data depending on how critical the data is.

When using RF2, the usable capacity is 50% of RAW (RAW divided by 2).

When using RF3, the usable capacity is 33% of RAW (RAW divided by 3).

While these sound like large overheads, in reality they are comparable to traditional SAN/NAS deployments, as explained in the two part post – Calculating Actual Usable capacity? It’s not as simple as you might think!

But enough on existing features, let's talk about an exciting new feature, Erasure Coding!

Erasure coding (EC) is a technology which significantly increases the usable capacity in a Nutanix environment compared to RF2.

The overhead for EC depends on the cluster size but for clusters of 6 nodes or more it results in only a 1.25x overhead compared to 2x for RF2 and 3x for RF3.

For clusters of 3 to 4 nodes, the overhead is 1.5x, and for clusters of 5 nodes it is 1.33x.
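To make the arithmetic concrete, here is a minimal Python sketch that turns the overhead figures quoted above into usable capacity. The function names and the 100TB RAW figure are hypothetical and purely illustrative; the overhead values are simply those stated in this post, not an official sizing formula.

```python
def ec_overhead(nodes: int) -> float:
    """Return the EC overhead multiplier quoted in this post for a cluster size."""
    if nodes < 3:
        raise ValueError("overheads are only quoted for clusters of 3 or more nodes")
    if nodes <= 4:
        return 1.5
    if nodes == 5:
        return 1.33
    return 1.25  # clusters of 6 nodes or more


def usable_tb(raw_tb: float, overhead: float) -> float:
    """Usable capacity is RAW divided by the overhead multiplier."""
    return raw_tb / overhead


if __name__ == "__main__":
    raw = 100.0  # hypothetical RAW capacity in TB
    rf2 = usable_tb(raw, 2.0)  # RF2 keeps 2 copies, so overhead is 2x
    for nodes in (3, 4, 5, 6, 8):
        ec = usable_tb(raw, ec_overhead(nodes))
        gain = (ec / rf2 - 1) * 100
        print(f"{nodes} nodes: RF2 {rf2:.1f}TB, EC {ec:.1f}TB (+{gain:.0f}%)")
```

For clusters of 6 nodes or more, 100TB RAW yields 80TB usable with EC versus 50TB with RF2.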

The following shows a comparison between RF2 and EC for various cluster sizes.

ErasureCoding

As you can see, the usable capacity is significantly increased when using Erasure Coding.

Now for more good news: in line with the Nutanix Uncompromisingly Simple philosophy, Erasure Coding can be enabled on existing Nutanix containers on the fly without downtime or the requirement to migrate data.

This means that with a simple one-click upgrade to NOS 4.5, customers can get up to a 60% increase in usable capacity in addition to existing data reduction savings, e.g. compression.

So there you have it: more usable capacity for Nutanix customers with a non-disruptive one-click software upgrade (you're welcome!).

For customers considering Nutanix, your cost per GB just dropped significantly!

Want more? Check out how to scale storage capacity separately from compute with Nutanix!

Related Articles:

1. Nutanix Erasure Coding (EC-X) Deep Dive

Example Architectural Decision – Transparent Page Sharing (TPS) Configuration for QA / Pre-Production Servers

Problem Statement

In a VMware vSphere environment, with future releases of ESXi disabling Transparent Page Sharing by default, what is the most suitable TPS configuration for an environment running Quality Assurance or Pre-Production server workloads?

Assumptions

1. TPS is disabled by default
2. Storage is expensive
3. Two Socket ESXi Hosts have been chosen to align with a scale out methodology.
4. Average Server VM is between 2-4 vCPU and 4-8GB RAM with some larger.
5. Memory is the first compute level constraint.
6. HA Admission Control policy used is “Percentage of Cluster Resources reserved for HA”
7. vSphere 5.5 or earlier

Requirements

1. The environment must deliver consistent performance
2. Minimize the cost of shared storage

Motivation

1. Reduce complexity where possible.
2. Maximize the efficiency of the infrastructure

Architectural Decision

Leave TPS disabled (default) and leave Large Memory pages enabled (default).

Justification

1. QA/Pre-Production environments should be as close as possible to the configuration of the actual production environment. This is to ensure consistency between QA/Pre-Production validation and production functionality and performance.
2. Setting 100% memory reservations ensures consistent performance by eliminating the possibility of swapping.
3. The 100% memory reservation also eliminates the capacity used by the vSwap file, which saves space on the shared storage as well as reducing the impact on the storage in the event of swapping.
4. RAM is cheaper than Tier 1 storage (which is recommended for vSwap storage to ensure minimal performance impact during swapping) so the increased cost of memory in the hosts is easily offset by the saving in Tier 1 shared storage.
5. Simplicity. Leaving default settings is advantageous from both an architectural and operational perspective.  Example: ESXi Patching can cause settings to revert to default which could negate TPS savings and put a sudden high demand on storage where TPS savings are expected.
6. TPS savings for server workloads are typically much lower than for desktop workloads and as a result less attractive.
7. The decision has been made to use 2 socket ESXi hosts and scale out so the TPS savings per host compared to a 4 socket server with double the RAM will be lower.
8. HA admission control (when using Percentage of cluster resources reserved for HA) will calculate fail-over requirements based on the full RAM reservation of every VM, so performance will be approximately the same in the event of a fail-over, leading to more consistent performance under a wider range of circumstances.
9. Lower core count (and lower cost) CPUs will likely be viable as RAM will likely be the first constraint for further consolidation.
10. Remove the real or perceived security risk of sensitive information being gathered from other VMs using TPS as described in VMware KB 2080735

Implications

1. Using 100% memory reservations requires ESXi hosts and the cluster be sized at a 1:1 ratio of vRAM to pRAM (Physical RAM) and should include N+1 so a host failure can be tolerated.
2. Increased RAM costs
3. No memory overcommitment can be achieved
4. Potential for lower CPU utilization / overcommitment as RAM may become the first constraint.
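As a rough illustration of the first implication, the following Python sketch sizes a cluster at a 1:1 ratio of vRAM to pRAM and then adds spare hosts for N+1. The function name and the example figures are hypothetical, and the sketch deliberately ignores hypervisor and per-VM memory overhead, which a real design would need to account for.

```python
import math


def hosts_required(total_vram_gb: float, host_pram_gb: float, spare_hosts: int = 1) -> int:
    """Hosts needed to back every GB of vRAM with physical RAM, plus spares.

    Assumption: hypervisor and per-VM memory overhead are ignored.
    """
    if total_vram_gb < 0 or host_pram_gb <= 0 or spare_hosts < 0:
        raise ValueError("capacities must be positive and spare_hosts non-negative")
    # 1:1 vRAM to pRAM, rounded up to whole hosts, then add N+x spare hosts.
    return math.ceil(total_vram_gb / host_pram_gb) + spare_hosts


if __name__ == "__main__":
    # Hypothetical example: 300 VMs at 8GB each on hosts with 256GB RAM.
    print(hosts_required(300 * 8, 256))
```

With these example figures, 2400GB of vRAM needs 10 hosts of 256GB, plus one spare host for N+1.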

Alternatives

1. Use 50% reservation and enable TPS
2. Use no reservation, Enable TPS and disable large pages

Related Articles:

1. Transparent Page Sharing (TPS) Example Architectural Decisions Register

2. The Impact of Transparent Page Sharing (TPS) being disabled by default @josh_odgers (VCDX#90)

3. Future direction of disabling TPS by default and its impact on capacity planning –@FrankDenneman (VCDX #29)

4. Transparent Page Sharing Vulnerable, Yet Largely Irrelevant – @ChrisWahl (VCDX#104)

How to Architect a VSA, Nutanix or VSAN solution for >=N+1 availability.

How to architect a VSA, Nutanix or VSAN solution for the desired level of availability (e.g. N+1, N+2, etc.) is a question I am asked regularly by customers and contacts throughout the industry.

This needs to be addressed in two parts.

1. Compute
2. Storage

Firstly, compute level resiliency. As a cluster grows, the chance of a failure increases, so the amount of resources reserved for HA should increase with the size of the cluster.

My rule of thumb (which is quite conservative) is as follows:

1. N+1 for clusters of up to 8 hosts
2. N+2 for clusters of >8 hosts but <=16
3. N+3 for clusters of >16 hosts but <=24
4. N+4 for clusters of >24 hosts but <=32

The above is discussed in more detail in: Example Architectural Decision – High Availability Admission Control Setting and Policy
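The rule of thumb can be expressed as a short Python sketch. The function names are hypothetical, and the percentage is simply spare hosts divided by total hosts, which is an assumption about how the reservation is derived; treat it as an approximation rather than the exact values to configure.

```python
def ha_spare_hosts(cluster_hosts: int) -> int:
    """Return x in N+x for the rule of thumb: one spare host per 8 hosts."""
    if not 2 <= cluster_hosts <= 32:
        raise ValueError("rule of thumb covers clusters of 2 to 32 hosts")
    # 2..8 -> 1, 9..16 -> 2, 17..24 -> 3, 25..32 -> 4
    return (cluster_hosts - 1) // 8 + 1


def ha_reserved_percent(cluster_hosts: int) -> float:
    """Approximate percentage of cluster resources to reserve for HA."""
    return ha_spare_hosts(cluster_hosts) / cluster_hosts * 100


if __name__ == "__main__":
    for hosts in (4, 8, 9, 16, 24, 32):
        print(f"{hosts} hosts: N+{ha_spare_hosts(hosts)}, "
              f"reserve about {ha_reserved_percent(hosts):.1f}%")
```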

The below table highlights in Green my recommended HA percentage configuration based on the cluster size, up to the current vSphere limit of 32 nodes.

HApercentages

Some of you may be thinking: if my Nutanix or VSAN cluster is only configured for RF2 (or FT1 for VSAN), I can only tolerate one node failure, so why am I reserving more than N+1?

In the case of Nutanix, after a node failure, the cluster can restore itself to a fully resilient state and tolerate subsequent failures. In fact, with “Block Awareness” a full 4 node block can be lost (so an N-4 situation). If tolerating this is a requirement, it needs to be considered for HA admission control reservations to ensure compute level resources are available to restart VMs.

Next, let's talk about the issue perceived to be more complicated: storage redundancy.

Storage redundancy for VSA, Nutanix or VSAN is actually not as complicated as most people think.

The following is my rule of thumb for sizing.

For N+1, ensure you have enough capacity remaining in the cluster to tolerate the largest node failing.

For N+2, ensure you have enough capacity remaining in the cluster to tolerate the largest TWO nodes failing.

The examples below discuss Nutanix nodes and their capacity, but the same is applicable to any VSA or VSAN solution where multiple copies of data are kept for data protection, as opposed to RAID.

Example 1: If you have 4 x Nutanix NX3060 nodes configured with RF2 (FT1 in VSAN terms) with 2TB usable per node (as shown below), in the event of a node failure, 2TB is no longer available. So the maximum storage utilization of the cluster should be <75% (6TB of the 8TB total) to ensure that in the event of any node failure, the cluster can be restored to a fully resilient state.

4node3060

Example 2: If you have 2 x Nutanix NX3060 nodes configured with RF2 (FT1 in VSAN terms) with 2TB usable per node and 2 x Nutanix NX6060 nodes with 8TB usable per node (as shown below), in the event of an NX6060 node failure, 8TB is no longer available. So the maximum storage utilization of the cluster should be 12TB (of the 20TB total) to ensure that in the event of any node failure (including the 8TB nodes), the cluster can be restored to a fully resilient state.

4nodemixed
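Both examples follow the same calculation: total usable capacity minus the capacity of the largest node(s) that must be able to fail. A minimal Python sketch, with a hypothetical function name and using the per-node usable figures from the two examples:

```python
from typing import Sequence


def max_safe_utilization_tb(node_usable_tb: Sequence[float], failures: int = 1) -> float:
    """Maximum capacity that can be used while still tolerating the loss
    of the largest `failures` nodes (N+1 by default)."""
    if failures < 0 or failures >= len(node_usable_tb):
        raise ValueError("failures must be non-negative and less than the node count")
    # Capacity lost in the worst case: the largest nodes fail.
    largest = sorted(node_usable_tb, reverse=True)[:failures]
    return sum(node_usable_tb) - sum(largest)


if __name__ == "__main__":
    example_1 = [2, 2, 2, 2]  # 4 x NX3060 at 2TB usable each
    example_2 = [2, 2, 8, 8]  # 2 x NX3060 and 2 x NX6060
    print(max_safe_utilization_tb(example_1))  # 6TB of 8TB
    print(max_safe_utilization_tb(example_2))  # 12TB of 20TB
```

Passing `failures=2` applies the same rule of thumb for N+2.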

For environments using Nutanix RF3 (3 copies of data) or VSAN (FT2) the same rule of thumb applies but the usable capacity per node would be lower due to the increased capacity required for data protection.

Specifically for Nutanix environments, the PRISM UI shows whether a cluster has sufficient capacity available to tolerate a node failure. If not, the following is displayed on the HOME screen, and alerts can be sent if desired.

CapacityCritical

In this case, the cluster has suffered a node failure, and because it was sized suitably, it shows “Rebuild Capacity Available” as “Yes” and advises an “Auto Rebuild in progress” meaning the cluster is performing a fully automated self heal. Importantly no admin intervention is required!

If the cluster status is normal, the following will be shown in PRISM.

CapacityOK

In summary: The smaller the cluster, the higher the proportion of capacity that needs to remain unused to enable resiliency to be restored in the event of a node failure, the same as the percentage of resources reserved for HA in a traditional compute only cluster.

The larger the cluster from both a storage and compute perspective, the lower the proportion of capacity that must remain unused for HA, so as has been a virtualization recommended practice for years: scale out!

Related Articles:

1. Scale Out Shared Nothing Architecture Resiliency by Nutanix

2. PART 1 – Problems with RAID and Object Based Storage for data protection

3. PART 2 – Problems with RAID and Object Based Storage for data protection