Cloning VMs – Why less (I/O & throughput) is better!

I’ve seen the picture below floating around Twitter and LinkedIn. It shows a 32GB VM being cloned in just 7 seconds on an All Flash Array (AFA), and it has received a lot of attention.

The AFA peaked at over 7000MB/s during this time, showing the AFA is capable of some serious throughput!

[Image: screenshot of the clone running on the AFA]

At this stage some people may be thinking I’m talking about Nutanix, so I would like to point out that the above AFA is not a Nutanix NX-9000 All Flash Node.

So why did I write this post?

I am still surprised that technical people find this sort of test and result impressive. To me, the fact the AFA used 7000MB/s of bandwidth to perform the clone means it has not performed the clone intelligently: the process has used additional capacity while potentially having a high impact on the other workloads using the storage.

At this stage I guess I should explain what I mean by an intelligent clone.

An intelligent clone in my mind is where:

a) The clone takes a few seconds to occur
b) The clone is offloaded to the storage layer
c) The clone uses almost zero I/O & bandwidth
d) The clone uses almost zero additional space

So in the above example, the solution has cloned the VM in a few seconds, so a) has been satisfied. Since there is no information provided, I’m going to give it the benefit of the doubt and say the clone was offloaded to the storage layer, so I’m assuming (rightly or wrongly) that b) is also satisfied.

But what about c) and d)?

If the clone uses 7000MB/s of bandwidth, that must have some impact (if not a significant impact) on other workloads running on the storage, even if it is only for 7 seconds.

The clone was also writing data throughout the 7 seconds, so it’s also duplicating the data.

So the net result is a fast yet high impact (capacity / performance) clone.
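As a rough sanity check on the numbers above, here is a back-of-the-envelope sketch. It assumes the full 32GB is actually copied and that 1GB = 1024MB; neither detail is stated in the original picture, so treat the result as indicative only.

```python
# Back-of-the-envelope numbers for a full-copy clone.
# Assumptions (not from the original picture): all 32GB is copied, 1GB = 1024MB.
VM_SIZE_GB = 32        # VM size from the example above
CLONE_SECONDS = 7      # clone duration from the example above
PEAK_MB_PER_S = 7000   # peak throughput reported for the array


def average_copy_rate_mb_s(size_gb: float, seconds: float) -> float:
    """Average rate needed to write size_gb of data in the given time."""
    return size_gb * 1024 / seconds


avg = average_copy_rate_mb_s(VM_SIZE_GB, CLONE_SECONDS)
print(f"Average write rate for a full copy: {avg:.0f} MB/s")
print(f"Share of the reported peak: {avg / PEAK_MB_PER_S:.0%}")
```

Under those assumptions a full copy needs an average write rate in the region of 4,700MB/s for the whole 7 seconds, which is consistent with a clone that physically duplicates the data rather than referencing it.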

Back in 2012, when I worked at IBM, I wrote this post (Netapp Edge VSA – Rapid Cloning Utility) about intelligent cloning, as a customer was suffering terrible VDI recompose times due to using a big dumb storage solution which had no intelligent cloning capabilities. The post shows that even an old IBM x3850 M2 with slow old 4 core processors, running a Virtual Storage Appliance on 3 pieces of spinning rust (146GB SAS disks), still completes the task in just 4.73 seconds per clone, in full compliance with the 4 items I identified as aspects of intelligent cloning (below).

a) The clone takes a few seconds to occur
b) The clone is offloaded to the storage layer
c) The clone uses almost zero I/O & bandwidth
d) The clone uses almost zero additional space

The reason intelligent cloning is so much faster is that there is no need to duplicate the VM. The intelligent cloning process simply creates pointers back to the original file (which remains Read Only) and only uses I/O & capacity when new data is created.
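To make the pointer idea concrete, here is a minimal toy model in Python. It is only a sketch of the general technique (a clone that copies the block map and redirects new writes), not how RCU, VAAI-NAS or any particular array implements it. The `BlockStore` class and the file names are hypothetical.

```python
class BlockStore:
    """Toy block store: a file is a list of block IDs, and blocks can be shared."""

    def __init__(self):
        self.blocks = {}        # block id -> bytes
        self.files = {}         # file name -> list of block ids (the "pointers")
        self.bytes_written = 0  # data written so far (stand-in for I/O and capacity)
        self._next_id = 0

    def _put(self, data: bytes) -> int:
        """Store a new block and account for the data written."""
        block_id = self._next_id
        self._next_id += 1
        self.blocks[block_id] = data
        self.bytes_written += len(data)
        return block_id

    def create(self, name, chunks):
        self.files[name] = [self._put(chunk) for chunk in chunks]

    def full_copy_clone(self, src, dst):
        """Non-intelligent clone: read and rewrite every block."""
        self.files[dst] = [self._put(self.blocks[b]) for b in self.files[src]]

    def pointer_clone(self, src, dst):
        """Intelligent clone: copy only the block map; no data is written."""
        self.files[dst] = list(self.files[src])

    def write(self, name, index, data):
        """New data goes to a new block; the original block stays untouched."""
        self.files[name][index] = self._put(data)


store = BlockStore()
store.create("gold.vmdk", [b"x" * 4096] * 8)   # hypothetical 32KB source file
base = store.bytes_written

store.pointer_clone("gold.vmdk", "clone1.vmdk")
pointer_cost = store.bytes_written - base

store.write("clone1.vmdk", 0, b"y" * 4096)
after_write = store.bytes_written - base

store.full_copy_clone("gold.vmdk", "clone2.vmdk")
full_cost = store.bytes_written - base - after_write

print("pointer clone wrote:", pointer_cost, "bytes")
print("first write to the clone wrote:", after_write, "bytes")
print("full copy clone wrote:", full_cost, "bytes")
```

In this model the pointer clone writes no data at all regardless of the size of the source, the first change to the clone costs one block, and the full copy costs the whole file again.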

The process is actually mostly dependent on vCenter registering the new VM, which is why it takes a couple of seconds; it takes almost no time at the storage layer. The size of the VM being cloned is irrelevant. (Note: In my post from 2012 it was a 10GB VM, although again the size has no impact on the speed of an intelligent clone.)

In the post from 2012, I made the following observation:

Even if you have the world’s fastest array (insert your favorite vendor here), storage connectivity and the biggest and most powerful ESXi hosts, the process of cloning a large number of virtual machines will still:

1. Take more time to complete than an intelligent cloning process like RCU

2. Impact the performance of your ESXi hosts and more than likely production VMs

3. Impact the performance of your storage network & array (and anything that uses it, physical or virtual).

So fast forward to 2015: we have lots of really fast All-Flash storage solutions, but for tasks like cloning, even these super fast all-flash solutions can’t outperform the single controller (2vCPU) Virtual Storage Appliance which used intelligent cloning on an old IBM x3850 M2 server in my test lab back in 2012.

I also wrote this article (Is VAAI beneficial with Virtual Storage Appliance (VSA) based solutions ?) recently explaining the benefits of VAAI-NAS and how VAAI-NAS supports intelligent cloning even with Virtual Storage Appliance solutions.

In Summary:

I find a clone taking a few seconds and using next to no throughput and capacity to be impressive. This is a perfect example of less I/O and throughput (to perform the same task) being better!

It’s great if a storage array has the capability to drive many GB/s of throughput, but it’s totally unnecessary for cloning and only demonstrates the storage solution’s lack of intelligent cloning capabilities.

In my opinion it’s much better for a storage solution to use its high performance capability for driving I/O to virtual machines servicing business applications than for tasks like cloning, which can be done intelligently.

To show off more of the real world performance capabilities of a storage solution (especially an All-Flash array), the example really has to include multiple workloads with different I/O characteristics. This is something the storage industry (all vendors) continues to fail to provide, and it’s something I would like to be a part of changing, as things like “Peak” performance are nowhere near as important as “consistent” performance.

Back on topic though: if cloning is something you or your customers require, say for VDI, a Cloud deployment or just rapid provisioning of testing & development VMs, consider a storage solution which has intelligent cloning capabilities such as VAAI-NAS, which integrates with products like Horizon View (VCAI Clones) and vCloud Director (FAST Provisioning).

Integrity of I/O for VMs on NFS Datastores – Part 5 – Data Corruption

This is the fifth part of a series of posts covering how the Integrity of Write I/O is ensured for Virtual Machines when writing to VMDK/s (Virtual SCSI Hard Drives) running on NFS datastores presented via VMware’s ESXi hypervisor as a “Datastore”.

This part will focus on Data Corruption.

As a reminder from the first post, this post is not talking about presenting NFS direct to Windows.

So why am I covering data corruption? Simple: there is a misconception that SCSI commands are not properly supported for VMs running on NFS datastores, which leads to corruption. This was covered in Part 1, so Part 5 will focus on data corruption which is not specific to NFS but can affect all storage platforms, how it occurs, and how storage solutions can mitigate the risk of data corruption issues.

The following data is a summary of the data provided in An analysis of data corruption in the storage stack.

NetApp conducted a large scale study into data corruption, which covered >1 Million HDDs across tens of thousands of NetApp systems over 41 months (2004 – 2007). Long story short, NetApp detected a level of data corruption which surprised me and seems to disprove many things like advertised MTBF for HDDs.

The following shows a breakdown of the problems found.

[Image: pie charts breaking down the failures found for Enterprise grade disks (left) and nearline disks (right)]

The first thing I noticed in the above pie charts is the vast difference between the percentage of failures in Enterprise grade disks (left) and nearline based disks (right).

It also shows physical interconnects to be a large percentage of failures, which highlights the need for simplicity in the storage solution. In addition, one of the more surprising results is the level of storage protocol and performance based failures being the cause of corruption.

Note: In this study, the majority of systems deployed were FC (block storage) based. This highlights that a storage protocol, regardless of being block or file based, can have issues if improperly implemented. So regardless of storage protocol, corruption can occur.

The below summary of corruption type and percentage of disks affected shows that SATA drives have dramatically more issues (10x) than Enterprise grade drives.

[Image: corruption type and percentage of disks affected, nearline (SATA) vs Enterprise grade drives]

The above also shows bit corruptions or Torn Writes affect more disks than lost or misdirected writes, which highlights the importance of Torn I/O Protection (covered in Part 4).

The article summarizes its findings in the following points:

[Image: summary points from the article]

The main takeaways from my perspective are:

1. The requirement to have corruption handling mechanisms for any environment running workloads which require data integrity.
2. Data should be spread out (ideally across disks) to minimize the chance of issues.

The article went on to form these conclusions:

[Image: conclusions from the article]

In Summary:

1. Data corruption can occur on JBOD, enterprise grade storage solutions and everything in between.
2. SATA drives have a much higher rate (~10x) of corruption.
3. Enterprise grade drives are much better from a data integrity perspective.
4. Corruption handling via sector and ideally block based checksums is essential on writes.
5. Using a checksum on Read helps detect corrupted data.
6. Corruption can occur even when no ECC errors are reported by a physical HDD.
7. Any storage protocol implementation can have bugs which can lead to corruption.
8. Backup / Recovery solutions are essential. Reliance solely on primary storage or application level backups using disks puts your data at risk.
9. Solutions solely dependent on application level data protection on disk are at risk of corrupted data being replicated to other active/passive or backup copies.

My final point: enterprise grade storage solutions which use checksums to verify data integrity on writes and reads have a much lower risk of data corruption, regardless of media type and storage protocol.
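As an illustration of checksumming on write and verifying on read, here is a minimal Python sketch. It uses CRC32 from the standard library purely as a stand-in, and the `ChecksummedStore` class is hypothetical; a real storage solution chooses its own checksum algorithm and decides where the checksum is stored.

```python
import zlib


class ChecksummedStore:
    """Toy block store that records a checksum on write and verifies it on read."""

    def __init__(self):
        self._data = {}   # block number -> bytearray (the "disk")
        self._sums = {}   # block number -> CRC32 recorded at write time

    def write(self, block_no: int, payload: bytes) -> None:
        self._data[block_no] = bytearray(payload)
        self._sums[block_no] = zlib.crc32(payload)

    def read(self, block_no: int) -> bytes:
        payload = bytes(self._data[block_no])
        if zlib.crc32(payload) != self._sums[block_no]:
            # The disk returned data without reporting an error, but it is wrong.
            raise IOError(f"checksum mismatch on block {block_no}")
        return payload

    def corrupt(self, block_no: int, offset: int) -> None:
        """Simulate silent corruption by flipping a single bit on the 'disk'."""
        self._data[block_no][offset] ^= 0x01


store = ChecksummedStore()
store.write(0, b"customer database page")
print(store.read(0))          # verified read succeeds

store.corrupt(0, 3)           # no error is reported at this point
try:
    store.read(0)
except IOError as err:
    print("detected:", err)   # the corruption is caught on read
```

Without the verification step in `read`, the corrupted block would simply be returned to the application, which is the situation described in point 6 above.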

JBOD style deployments using SATA drives have a significantly higher risk of data corruption, which is contributed to by the SATA drives’ 10x higher corruption rates and the lack of the enterprise grade checksum features found in some shared storage (SAN/NAS) solutions.

Integrity of Write I/O for VMs on NFS Datastores Series

Part 1 – Emulation of the SCSI Protocol
Part 2 – Forced Unit Access (FUA) & Write Through
Part 3 – Write Ordering
Part 4 – Torn Writes
Part 5 – Data Corruption

Nutanix Specific Articles

Part 6 – Emulation of the SCSI Protocol (Coming soon)
Part 7 – Forced Unit Access (FUA) & Write Through (Coming soon)
Part 8 – Write Ordering (Coming soon)
Part 9 – Torn I/O Protection (Coming soon)
Part 10 – Data Corruption (Coming soon)

Related Articles

1. What does Exchange running in a VMDK on NFS datastore look like to the Guest OS?
2. Support for Exchange Databases running within VMDKs on NFS datastores (TechNet)
3. Microsoft Exchange Improvements Suggestions Forum – Exchange on NFS/SMB
4. Virtualizing Exchange on vSphere with NFS backed storage

Example Architectural Decision – Transparent Page Sharing (TPS) Configuration for QA / Pre-Production Servers

Problem Statement

In a VMware vSphere environment, with future releases of ESXi disabling Transparent Page Sharing by default, what is the most suitable TPS configuration for an environment running Quality Assurance or Pre-Production server workloads?

Assumptions

1. TPS is disabled by default
2. Storage is expensive
3. Two Socket ESXi Hosts have been chosen to align with a scale out methodology.
4. Average Server VM is between 2-4 vCPU and 4-8GB RAM with some larger.
5. Memory is the first compute level constraint.
6. HA Admission Control policy used is “Percentage of Cluster Resources reserved for HA”
7. vSphere 5.5 or earlier

Requirements

1. The environment must deliver consistent performance
2. Minimize the cost of shared storage

Motivation

1. Reduce complexity where possible.
2. Maximize the efficiency of the infrastructure

Architectural Decision

Leave TPS disabled (default) and leave Large Memory pages enabled (default).

Justification

1. QA/Pre-Production environments should be as close as possible to the configuration of the actual production environment. This is to ensure consistency between QA/Pre-Production validation and production functionality and performance.
2. Setting 100% memory reservations ensures consistent performance by eliminating the possibility of swapping.
3. The 100% memory reservation also eliminates the capacity usage by the vswap file which saves space on the shared storage as well as reducing the impact on the storage in the event of swapping.
4. RAM is cheaper than Tier 1 storage (which is recommended for vSwap storage to ensure minimal performance impact during swapping) so the increased cost of memory in the hosts is easily offset by the saving in Tier 1 shared storage.
5. Simplicity. Leaving default settings is advantageous from both an architectural and operational perspective. Example: ESXi Patching can cause settings to revert to default which could negate TPS savings and put a sudden high demand on storage where TPS savings are expected.
6. TPS savings for server workloads are typically much less than with desktop workloads and as a result less attractive.
7. The decision has been made to use 2 socket ESXi hosts and scale out so the TPS savings per host compared to a 4 socket server with double the RAM will be lower.
8. HA admission control will calculate fail-over requirements (when using Percentage of cluster resources reserved for HA) based on the full RAM reserved for every VM, so performance will be approximately the same in the event of a fail-over, leading to more consistent performance under a wider range of circumstances.
9. Lower core count (and lower cost) CPUs will likely be viable as RAM will likely be the first constraint for further consolidation.
10. Remove the real or perceived security risk of sensitive information being gathered from other VMs using TPS as described in VMware KB 2080735.

Implications

1. Using 100% memory reservations requires ESXi hosts and the cluster be sized at a 1:1 ratio of vRAM to pRAM (Physical RAM) and should include N+1 so a host failure can be tolerated.
2. Increased RAM costs
3. No memory overcommitment can be achieved
4. Potential for lower CPU utilization / overcommitment as RAM may become the first constraint.
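
The 1:1 vRAM to pRAM sizing with N+1 described in the implications above can be sketched as simple arithmetic. This is only a rough illustration: the `size_cluster` helper and the example figures are hypothetical, and it ignores per-VM memory overhead and hypervisor overhead, which a real design would need to include.

```python
import math


def size_cluster(total_vram_gb, usable_ram_per_host_gb, spare_hosts=1):
    """Return (host count, HA reserved %) for 1:1 vRAM:pRAM sizing with N+spare.

    Assumption: every VM has a 100% memory reservation, so the hosts carrying
    the load must hold all configured vRAM in physical RAM.
    """
    hosts_for_load = math.ceil(total_vram_gb / usable_ram_per_host_gb)
    total_hosts = hosts_for_load + spare_hosts
    # Percentage of cluster resources to reserve so the spare capacity stays free.
    ha_reserved_pct = math.ceil(100 * spare_hosts / total_hosts)
    return total_hosts, ha_reserved_pct


# Hypothetical example: 1800GB of configured vRAM on hosts with 256GB usable RAM.
hosts, ha_pct = size_cluster(total_vram_gb=1800, usable_ram_per_host_gb=256)
print(f"Hosts required (N+1): {hosts}")
print(f"HA percentage of cluster resources reserved: {ha_pct}%")
```

With these example figures, eight hosts are needed to hold the reserved memory and a ninth provides the N+1 capacity, so the reserved percentage is rounded up to 12%.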

Alternatives

1. Use 50% reservation and enable TPS
2. Use no reservation, Enable TPS and disable large pages

Related Articles:

1. Transparent Page Sharing (TPS) Example Architectural Decisions Register

2. The Impact of Transparent Page Sharing (TPS) being disabled by default @josh_odgers (VCDX#90)

3. Future direction of disabling TPS by default and its impact on capacity planning –@FrankDenneman (VCDX #29)

4. Transparent Page Sharing Vulnerable, Yet Largely Irrelevant – @ChrisWahl (VCDX#104)