Calculating Actual Usable capacity? It’s not as simple as you might think! – Part 2 Nutanix

In Part 1, the example provided showed usable capacity for a SAN/NAS using a combination of RAID 10, RAID 5 and RAID 6 along with the various sizing considerations resulted in 35.68TB usable capacity or approx 1/3rd of the RAW 100TB.

In Part 2 we will discuss the misconception that Nutanix (a Hyper-converged platform) provides lower effective usable capacity compared to SAN or NAS solutions.

At a high level, Nutanix uses Replication Factor 2 (RF2) which has the same overhead as RAID 1 so straight away a lot of people jump to the conclusion that the usable capacity is less that a traditional SAN/NAS because *insert your favourite RAID level here* has less overhead.

Let’s say we have a Nutanix cluster with 100TB Raw storage using the most common node type, the NX3050.

Now let’s address the same points as we did in Part 1 for the SAN/NAS example:

So starting with the same 100TB RAW as we did for the SAN/NAS example and see where things end up on Nutanix.

1. Deducting hot spare drives

Nutanix does not use hot spare drives, data is balanced across all drives in the “Storage Pool”. To cater for failure, it is recommended to size for N+1 for Resiliency Factor 2 (RF2) deployments. If we we’re using NX3050 nodes (the most popular Nutanix node) then the overhead of N+1 would be ~4.8TB RAW.

100TB – N+1 Node (4.8TB RAW) = 95.2TB

2. RAID Overhead

Nutanix doesn’t use RAID, but the Replication Factor 2 has an overhead is 50% (the same as RAID10).

95.2TB – 50% (RF2) = 47.6TB remaining

3. Free Space on the platform required to ensure performance

For Nutanix all write I/O goes to either the Extent Storage or Oplog, both of which are housed on the SSD tier. All random writes are serviced by the Oplog until it reaches 95% capacity at which point the oplog is bypassed.

As such, performance remains high until 95% capacity. Therefore only 5% free capacity is required to ensure high performance.

47.6TB – 5% (Free space for performance) = 45.2TB

FYI: Nutanix Performance and Engineering team members including myself typically conduct benchmarks at greater than 90% cluster capacity.

4. Free space per LUN

Nutanix does not use LUNs. Nutanix presents containers to the hypervisor. All containers are thin provisioned and all containers can use all available space in the storage pool. Meaning free space only needs to be managed at the Storage Pool layer, not at each individual container.

As we have already taken into account the 5% free space there is no need to take another 5% of space therefore we remain at 45.2TB usable.

5. Free space per VMDK

As with physical servers and SAN/NAS environments, we don’t want our VMs drives running out of capacity, as a result it is common to size VMDKs well above what is strictly required to make capacity management (operational tasks) easier.

As mentioned in Part 1, I typically see architects recommending upwards of 10-20% free space per VMDK over and above what is required to account for unexpected growth, OS patching etc. This makes perfect sense for the same reason as we have free space per LUN because if space runs out for a VM, it’s another bad day for I.T.

For this example, I will assume the same 10% free space per VMDK as I did for SAN/NAS example, the difference with Nutanix is performance remains the same regardless of the VMDK being Thick or Thin provisioned, so with every VM Thin Provisioned, no capacity is required to be reserved for free space within VMDK files as it would be for traditional environments requiring Eager Zero Thick VMDKs for performance..

So we’re still at 45.2TB usable.

Now where are we at?

So far, the first 5 points are fairly easy to calculate.

Next we will look at various factors which further reduce usable capacity for SAN/NAS and see how they apply to Nutanix.

6. Silos for Performance

Nutanix does not require nor recommend silos being created for performance reasons. All VMs can reside in a single container therefore no capacity is unusable as a result of performance requirements.

As no silos are required for maximum performance, we are still at 45.2TB usable.

7. Silos of (or Fragmented) Usable Capacity

Nutanix does not configure usable capacity to containers, a container can use all the available storage in the underlying Storage Pool. Where multiple containers are provisioned, each container can see the total capacity of the storage pool while providing logical separation of the VMs within the containers. This avoids the issue of fragmented free capacity.

The diagram below shows 5 containers hosted by an example Nutanix cluster (Storage Pool) with 100TB total capacity, each container has a capacity of 100TB and 25TB free space in alignment with the underlying storage pool.

NutanixFreeSpace

In this case, when creating a new VM, or adding or expanding VMDKs for existing VMs, it does not matter which container we place the VM, as long as it is less than the 25TB available in the pool, it makes no difference to capacity.

This removes the requirement for complex capacity management, or using Storage DRS and Storage vMotion.

So we’re still at 45.2TB usable.

Other factors which reduce usable capacity?

8. LUN Provisioning Type

In many cases, especially when talking about high performance applications, storage vendors recommend using Thick Provisioned LUNs and as mentioned in Part 1, It’s anyone’s guess how much space is wasted as a result.

But with Nutanix, all containers are Thin Provisioned so no capacity is wasted on Thick Provisioning and performance is optimal

9. Wasted Capacity from using SSDs as Cache

Nutanix does not use SSDs as Cache! The SSD’s form part of the Extent Store which is for persistent data storage. The OpLog which is also on SSD is also persistent and not a “cache”. As such, no capacity is being reduced as a result of caching.

10. Snapshot Reserves

Nutanix does not use reserve capacity for snapshots. Snapshots simply use available capacity in the storage pool. If you don’t use snapshots, no space is wasted, if you do use snapshots, then the delta changes are stored. Simple as that.

Summary:

From the 100TB RAW factoring in what is a realistic Nutanix configuration including N+1 to tolerate a node failure and support the cluster being able to fully self heal the effective usable capacity is 45.2TB which is just under 50% of 100TB RAW.

This is a very simple configuration to manage from both a performance and capacity perspective, and one which is easily calculated and repeatable.

If the Resiliency Factor was 3 (which IMO is rarely if ever required) across the entire environment (which again would be extremely unusual as VMs which require RF3 can be configured in an RF3 container) then the usable capacity would be ~30TB which is only sightly below the SAN/NAS example and RF3 delivers higher resiliency.

In reality, >95% of workloads should be deployed on RF2, with a very small number of VMs possibly using RF3. In reality RF2 is extremely resilient and self healing so IMO RF3 is rarely required.

So in conclusion, Nutanix usable capacity is ~50% of RAW capacity, the difference between Nutanix and traditional SAN/NAS is you actually can use almost all the “usable” capacity and maintain optimal performance with little/no complexity.

Nutanix also has data reduction technologies such as Compression and De-duplication, along with intelligent cloning to increase the effective capacity of the storage pool.

While I believe Nutanix’ usable capacity today is excellent especially when considering how resilient RF2 is and comparing usable capacity to many products on the market, Nutanix has the advantage of not being constrained by legacy technologies such as RAID, so I’ll leave you with a little teaser:

Usable capacity will be improving significantly in upcoming releases of Nutanix Operating System. 🙂

Storage I/O Control (SIOC) configuration for Nutanix

Storage I/O Control (SIOC) is a feature introduced in VMware vSphere 4.1 which was designed to allow prioritization of storage resources during periods of contention across a vSphere cluster. This situation has often been described as the “Noisy Neighbor” issue where one or more VMs can have a negative impact on other VMs sharing the same underlying infrastructure.

For traditional centralized shared storage, enabling SIOC is a “No Brainer” as even using the default settings will ensure more Consistent performance during periods of storage contention which all but no downsides. SIOC does this by managing and potentially throttling the Device Queue depth based on “Shares” assigned to each Virtual Machine to ensure Consistent performance across ESXi hosts.

The below diagrams show the impact on three (3) identical VMs with the same Disk “Shares” values with and without SIOC in a traditional centralized storage environment (a.k.a SAN/NAS).

Without Storage I/O Control

NutanixWithoutSIOC

With Storage I/O Control

NutanixWithSIOC

 

As show in the above, where VMs have equal share values but reside on different ESXi hosts can result in an undesired result with one VM having double the available storage queue compared to the VMs residing on a different host.  In comparison, SIOC ensuring VMs with the same share value get equal access to the underlying storage queue.

While SIOC is an excellent feature, it was designed to address a problem which is no longer a significant factor with the Nutanix Scale out Shared nothing style architecture.

The issue of “noisy neighbour” or storage contention in the Storage Area Network (SAN) is all but eliminated as all Datastores (or “Containers” in Nutanix speak) are serviced by every Nutanix Controller VM in the cluster and under normal circumstances, upwards on 95% of read I/O is serviced by the local Controller VM, Nutanix refers to this feature as “Data Locality”.

Data Locality ensures data being written and read by a VM remains on the Nutanix node where the VM is running, thus reducing latency of accessing data across a Storage Area Network, and ensuring that a VM reading data on one node, has minimal or no impact on another VM on another node in the cluster.

As Write I/O is also distributed throughout the Nutanix cluster, which means no one single node is monopolized by (Write) replication traffic.

Storage I/O Control was designed around the concept of a LUN or NFS mount (from vSphere 5.0 onwards) where the LUN or NFS mount is served by a central storage controller, as is the most typical deployment in the past for VMware vSphere environment.

As such, SIOC limiting the LUN queue depth allowed all VMs on the LUN to have either an equal share of the available queue, OR by specifying “Share” values on a VM basis, ensure VMs can be prioritized based on importance.

By default, all Virtual Hard Disks have a share value of “Normal” (1000 shares). Therefore if a individual VM needs to be given higher storage priority, the Share value can be increased.

Note: In general, modifying VM Virtual disk share values should not be required.

As Nutanix has one Storage Controller (or CVM) per node which all actively service I/O to the Datastore, SOIC is not required, and provides no benefits.

For more information about SIOC in traditional NAS environments see: “Performance implications of Storage I/O Control – Enabled NFS Datastores in VMware vSphere 5.0

As such for Nutanix environments it is recommended SIOC control be disabled and DRS “anti-affinity” or “VM to host” rules be used to separate high I/O VMs.

Ensuring Data Integrity with Nutanix – Part 2 – Forced Unit Access (FUA) & Write Through

In the Integrity of Write I/O for VMs on NFS Datastores Series, I discussed Forced Unit Access (FUA) and Write Through in part 2 which covered a vendor agnostic view of FUA and Write through.

In this series, The goal is too explain how Nutanix can guarantee data integrity and how this is Nutanix Number #1 priority. In addition to this goal, I want to show how Nutanix supports Business Critical Applications such as MS SQL and MS Exchange which have strict storage requirements such as Write Ordering, Forced Unit Access (FUA) , SCSI abort/reset commands and to protect against Torn I/O.

Note: With Windows 2012 onwards FUA is no longer used in favour of issuing a “Flush” of the drives write cache. However this change makes no difference to Nutanix environments because regardless of FUA or a Flush being used, write I/O is not acknowledged until written to persistent media on 2 or more nodes which will be explained further later in this post.

Currently MS Exchange is not supported to run in a VMDK on an NFS datastore/s although interestingly Active Directory and MS SQL servers which have the exact same storage requirements (discussed earlier) are supported. This post will show why Microsoft should allow storage vendors to certify Exchange in a VMDK on NFS datastore deployments to prove compliance with the storage requirements stated earlier.

Note: Nutanix provides support for Exchange 2010/2013 deployments in VMDKs on NFS datastores. Customers can find this support statement on http://portal.nutanix.com/ under article number 000001303.

Firstly I would like to start by stating that FUA is fully supported by VMware ESXi.

In the Microsoft article, Deploying Transactional NTFS, it states:

“The caching control mechanism used by TxF is a flag known as the Force Unit Access (FUA) function. This flag specifies that the drive should write the data to stable media storage before signaling complete.”

Nutanix meets this requirement as all writes are written to persistent media (SSD) on at least two independent nodes and no write caching is performed at any layer including the Nutanix Controller CM (CVM), Physical Storage Controller card or the physical drives themselves.

For more information on how Nutanix is compliant with this requirement click here.

The article also states:

“Some Host Bus Adapters (HBAs) and storage controllers (for example, RAID systems) have built-in battery-backed caches. Because these devices preserve cached data if a power fault occurs, any disks connected to them are not required to honor the FUA flag. Further, a disk whose power supply is protected by an uninterruptable power supply (UPS) does not need to honor the FUA flag. This is because the UPS will maintain power long enough for the disk to flush its cache to the media.”

As discussed with the previous requirement, Nutanix meets this requirement as the write acknowledgement is not given until writes are successfully commited to persistent storage on at least two nodes. As a result, even without a UPS, data integrity can be guaranteed in a Nutanix environment.

For more information on how Nutanix is compliant with this requirement click here.

Another key point in the article is:

“Disabling a drive’s write cache eliminates the requirement for the drive to honor the FUA flag.”

All physical drives (SSD and SATA) in a Nutanix nodes have their write cache disabled, therefore removing the requirement of FUA.

The article concludes with the following:

“Note  For TxF to be capable of consistently protecting your data’s integrity through power faults, the system must satisfy at least one of the following criteria:

 

1. Use server-class disks (SCSI, Fiber Channel)

2. Make sure the disks are connected to a battery-backed caching HBA.

3. Use a storage controller (for example, RAID system) as the storage device.

4. Ensure power to the disk is protected by a UPS.

5. Ensure that the disk’s write caching feature is disabled.”

We have already discussed because Nutanix does not use a non persistent write cache there is no requirement for the OS to issue the FUA flag or the Flush command in Windows 2012 to ensure data is written to persistent media. But for fun lets see how many of the above Nutanix is compliant with.

1. YES – Nutanix uses enterprise grade Intel S3700 SSDs for all write I/O
2. N/A – There is no need for battery backed caching HBAs due to Nutanix write acknowledgement not being given until written to persistent media on two or more nodes
3. YES – Nutanix Distributed File System (NDFS) with Resiliency Factor (RF) 2 or 3
4. Recommended to ensure system uptime but not required to ensure data integrity as writes are not acknowledged until written to persistent media on two or more nodes
5. YESAll write caching features are disabled on all SSDs/HDDs

So to meet Microsoft’s FUA requirements, only one of the above is required. Nutanix meets 3 out of 5 outright, with a 4th being Recommended (but not required) and the final requirement not being applicable.

Write Cache and Write Acknowledgements.

Nutanix does not use a non persistent write cache, period.

When a I/O is issued in a Nutanix environment, if it is Random, it will be sent to the “OpLog” which is a persistent write buffer stored on SSD.

If the I/O is sequential, it is sent straight to the Extent Store which is persistent data storage, also located on SSD.

Both Random and Sequential I/O flows are shown in the below diagram from The Nutanix Bible by @StevenPoitras.

NDFS_IO_basev5

All Writes are also protected by Resiliency Factor (RF) of 2 or 3, meaning 2 or 3 copies of the data are synchronously replicated to other Nutanix nodes within the cluster prior to the write being acknowledged.

To be clear, Write acknowledgements are NOT sent until the data is written to 2 or 3 nodes OpLog or Extent Store (depending on the configured RF). What this means is the requirement for Forced Unit Access (FUA) is achieved as every write is written to persistent media before write acknowledgements are sent regardless of FUA (or Flush) being issued by the OS.

Importantly, this write acknowledgement process is the same regardless of the storage protocol (iSCSI , NFS , SMB 3.0) used to present storage to the hypervisor (ESXi , Hyper-V or KVM).

Physical Drive Configuration

As Nutanix does not use a non persistent write cache, and does not acknowledge writes until written to persistent media on 2 or 3 nodes, that’s the end of the problem right?

Not really, as physical drives also have write caches and in the event of a power failure, its possible (albeit unlikely) data in the cache may not be written to disk even after a write acknowledgement is written.

This is why all physical SSD / SATA drives in a Nutanix environment have the disks write caching feature disabled.

This ensures there is no dependency on Uninterruptable power supplies (UPS) to ensure data is successfully written to the disk in the event of a power failure.

This means Nutanix is compliant with the “Ensure that the disk’s write caching feature is disabled” requirement specified by Microsoft.

Uninterruptible Power Supplies (UPS)

As non persistent write caching is not used either at the Nutanix Controller VM (CVM), Physical Storage Controller OR the physical SSDs/HDDs, the use of a UPS is not a requirement for a Nutanix environment to ensure data integrity, however it is still recommended to use a suitable UPS to ensure uptime of the environment. Assuming a power outage is not catastrophic (e.g.: For a single node) and the cluster is still online, write acknowledgements are still not given until data is written to the configured RF policy as Nutanix nodes are effectively stateless.

The Microsoft article quoted earlier states:

“Further, a disk whose power supply is protected by an uninterruptable power supply (UPS) does not need to honor the FUA flag. This is because the UPS will maintain power long enough for the disk to flush its cache to the media.”

Even in the case a storage solution or disk is protected by a UPS, it requires sufficient time to allow all data in the cache to be written to persistent media. This is a potential risk to data consistency as a UPS is just another link in the chain which can go wrong. This is why Nutanix does not depend on UPS for data integrity.

Another Microsoft Article, Key factors to consider when evaluating third-party file cache systems with SQL Server gives two examples of how data corruption can occur:

“Example 1: Data loss and physical or logical corruption”

“Example 2: Suspect database”

So how does Nutanix protect against these issues?

The article states

“How to configure a product providing file cache from something like non-battery backed cache is specific to the vendor implementation. A few rules, however, can be applied:

1. All writes must be completed in or on stable media before the cache indicates to the operating system that the I/O is finished.

2. Data can be cached as long as a read request serviced from the cache returns the same image as located in or on stable media.”

Regarding the first point: All write I/O is written to persistent media (as is the intention of FUA) as described earlier in this article.

For the second point, the Nutanix Distributed File System (NDFS) read I/O can happen from one of the following places:

  1. “Extent Cache”, located in RAM.
  2. “Content Cache”, located on SSD as per earlier diagram.
  3. “Oplog”, the persistent write cache located on SSD as per earlier diagram.
  4. “Extent store” located on either SSD or SATA depending on if the data is “Hot” or ‘Cold”.
  5. A remote nodes Extent Cache, Content Cache, OpLog or Extent Store

To ensure the Extent Cache (in RAM) is consistent with the Content Cache or Extent Store located on the persistent media, when a write I/O occurs which modifies data which has been cached in the Extent Cache (in RAM), the corresponding data is discarded from the Extent Cache and only promoted back to the Extent Cache if the data profile remains Hot (i.e.: Frequently accessed).

In Summary:

The Nutanix write path guarantees (even without the use of a UPS) writes are written to persistent media and have at least 1 redundant copy on another node in the cluster for resiliency before acknowledging the I/O back to the hypervisor and onto the guest. This is critical to ensuring data consistency/resiliency.

This is in full compliance with the storage requirements of applications such as SQL, Exchange and Active Directory.

——————————————————–

Integrity of Write I/O for VMs on NFS Datastores Series

Part 1 – Emulation of the SCSI Protocol
Part 2 – Forced Unit Access (FUA) & Write Through
Part 3 – Write Ordering
Part 4 – Torn Writes
Part 5 – Data Corruption

Nutanix Specific Articles

Part 6 – Emulation of the SCSI Protocol (Coming soon)
Part 7 – Forced Unit Access (FUA) & Write Through
Part 8 – Write Ordering (Coming soon)
Part 9 – Torn I/O Protection (Coming soon)
Part 10 – Data Corruption (Coming soon)

Related Articles

1. What does Exchange running in a VMDK on NFS datastore look like to the Guest OS?
2. Support for Exchange Databases running within VMDKs on NFS datastores (TechNet)
3. Microsoft Exchange Improvements Suggestions Forum – Exchange on NFS/SMB
4. Virtualizing Exchange on vSphere with NFS backed storage