Cloning VMs – Why less (I/O & throughput) is better!

I’ve seen the picture below floating around Twitter and LinkedIn. It shows a 32GB VM being cloned in just 7 seconds on an All Flash Array (AFA), and it has attracted a lot of attention.

The AFA peaked at over 7000MB/s during this time, showing the AFA is capable of some serious throughput!

At this stage some people may be thinking I’m talking about Nutanix, so I would like to point out the above AFA is not a Nutanix NX-9000 All Flash Node.

So why did I write this post?

I am still surprised that technical people find this sort of test and result impressive. To me, the fact the AFA used 7000MB/s of bandwidth to perform the clone means it has not cloned the VM intelligently: the process has used additional capacity while potentially having a high impact on the other workloads using the storage.

At this stage I guess I should explain what I mean by intelligently clone.

An intelligent clone in my mind is where:

a) The clone takes a few seconds to occur
b) The clone is offloaded to the storage layer
c) Uses almost zero I/O & bandwidth to perform the clone
d) Uses almost zero additional space

So in the above example, the solution has cloned the VM in a few seconds, so a) has been satisfied. Since there is no information provided, I’m going to give it the benefit of the doubt and say the clone was offloaded to the storage layer, so I’m assuming (rightly or wrongly) that b) is also satisfied.

But what about c) and d)?

If the clone uses 7000MB/s of bandwidth that must have some impact (if not a significant impact) on other workloads running on the storage, even if it is only for 7 seconds.

The clone was also writing data throughout the 7 seconds, so it’s also duplicating the data.

So the net result is a fast yet high impact (capacity / performance) clone.
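As a rough back-of-the-envelope check on the figures above, here is a minimal Python sketch. It assumes the clone physically copies every block (which is my reading of the screenshot, not something the vendor states) and that 1GB = 1024MB. The function name full_copy_cost is hypothetical and purely for illustration.

```python
def full_copy_cost(vm_size_gb: float, duration_s: float) -> dict:
    """Rough cost of a clone that physically copies every block.

    Assumption: the whole VM is read once and written once.
    """
    size_mb = vm_size_gb * 1024  # assumes 1GB = 1024MB
    return {
        "avg_throughput_mb_s": size_mb / duration_s,
        "data_read_mb": size_mb,
        "data_written_mb": size_mb,
    }


# The 32GB VM cloned in 7 seconds from the example above
cost = full_copy_cost(32, 7)
print(round(cost["avg_throughput_mb_s"]))  # average MB/s over the 7 seconds
print(cost["data_written_mb"])             # extra capacity consumed, in MB
```

Under those assumptions the clone averages roughly 4,700MB/s for its whole duration (with the 7000MB/s figure being the peak) and consumes another 32GB of capacity, which is exactly the cost that points c) and d) are meant to avoid.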

Back in 2012, when I worked at IBM, I wrote this post (Netapp Edge VSA – Rapid Cloning Utility) about intelligent cloning, as a customer was suffering terrible VDI recompose times due to using a big dumb storage solution which had no intelligent cloning capabilities. The post shows that even an old IBM x3850 M2 with slow, old 4-core processors, running a Virtual Storage Appliance on 3 pieces of spinning rust (146GB SAS disks), completes the task in just 4.73 seconds per clone, in full compliance with the 4 aspects of intelligent cloning I identified (repeated below).

a) The clone takes a few seconds to occur
b) The clone is offloaded to the storage layer
c) Uses almost zero I/O & bandwidth to perform the clone
d) Uses almost zero additional space

The reason intelligent cloning is so much faster is that there is no need to duplicate the VM. The intelligent cloning process simply creates pointers back to the original file (which remains Read Only) and only uses I/O & capacity when new data is created.
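To make the pointer idea concrete, here is a toy Python model. It is only a sketch under my own assumptions, not how any particular array or VAAI primitive is implemented, and the class and method names (BlockStore, pointer_clone and so on) are hypothetical.

```python
class BlockStore:
    """Toy block store: physical blocks plus per-file block maps (pointers)."""

    def __init__(self):
        self.blocks = {}        # block id -> data
        self.files = {}         # file name -> list of block ids (pointers)
        self.next_id = 0
        self.bytes_written = 0  # physical bytes written so far

    def _alloc(self, data: bytes) -> int:
        """Write one new physical block and return its id."""
        block_id = self.next_id
        self.next_id += 1
        self.blocks[block_id] = data
        self.bytes_written += len(data)
        return block_id

    def create(self, name: str, chunks: list) -> None:
        self.files[name] = [self._alloc(c) for c in chunks]

    def full_copy(self, src: str, dst: str) -> None:
        """'Dumb' clone: every block is read and written again."""
        self.files[dst] = [self._alloc(self.blocks[b]) for b in self.files[src]]

    def pointer_clone(self, src: str, dst: str) -> None:
        """'Intelligent' clone: copy the pointers only, no data is written."""
        self.files[dst] = list(self.files[src])

    def write(self, name: str, index: int, data: bytes) -> None:
        """New data goes to a new block; shared blocks are never modified."""
        self.files[name][index] = self._alloc(data)

    def read(self, name: str, index: int) -> bytes:
        return self.blocks[self.files[name][index]]


store = BlockStore()
store.create("base.vmdk", [b"A" * 4096, b"B" * 4096])

before = store.bytes_written
store.pointer_clone("base.vmdk", "clone.vmdk")
clone_cost = store.bytes_written - before   # bytes written by the clone itself

store.write("clone.vmdk", 0, b"C" * 4096)   # only now is capacity consumed
after_write_cost = store.bytes_written - before
```

In this model the clone itself writes nothing regardless of how large the source file is, the original blocks are never touched, and capacity is only consumed when the clone writes new data. The full_copy method is there for comparison: it writes every block again, so its cost grows with the size of the VM.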

The process mostly depends on vCenter registering the new VM, which is why it takes a couple of seconds even though it takes almost no time at the storage layer. The size of the VM being cloned is irrelevant. (Note: In my post from 2012 it was a 10GB VM, although again the size has no impact on the speed of an intelligent clone.)

In the post from 2012, I made the following observation:

Even if you have the world’s fastest array (insert your favorite vendor here), storage connectivity and the biggest and most powerful ESXi hosts, the process of cloning a large number of virtual machines will still:

1. Take more time to complete than an intelligent cloning process like RCU

2. Impact the performance of your ESXi hosts and more than likely production VMs

3. Impact the performance of your storage network & array (and anything that uses it, physical or virtual).

So fast forward to 2015: we have lots of really fast All-Flash storage solutions, but for tasks like cloning, even these super fast all-flash solutions can’t outperform the single controller (2 vCPU) Virtual Storage Appliance that ran on an old IBM x3850 M2 server in my test lab, using intelligent cloning, back in 2012.

I also wrote this article (Is VAAI beneficial with Virtual Storage Appliance (VSA) based solutions ?) recently explaining the benefits of VAAI-NAS and how VAAI-NAS supports intelligent cloning even with Virtual Storage Appliance solutions.

In Summary:

I find a clone taking a few seconds and using next to no throughput and capacity to be impressive. This is a perfect example of less I/O and throughput (to perform the same task) being better!

It’s great if a storage array has the capability to drive many GB/s of throughput, but it’s totally unnecessary for cloning and only demonstrates the storage solution’s lack of intelligent cloning capabilities.

In my opinion it’s much better for a storage solution to use its high performance capability for driving I/O to virtual machines servicing business applications than for tasks like cloning, which can be done intelligently.

To show off the real world performance capabilities of a storage solution (especially an All-Flash array), the example really has to include multiple workloads with different I/O characteristics. This is something the storage industry (all vendors) continues to fail to provide, and it’s something I would like to be a part of changing, as things like “Peak” performance are nowhere near as important as “consistent” performance.

Back on topic though: if cloning is something you or your customers require, say for VDI, a Cloud deployment or just for rapid provisioning of testing & development VMs, consider a storage solution which has intelligent cloning capabilities such as VAAI-NAS, which integrates with products like Horizon View (VCAI Clones) and vCloud Director (FAST Provisioning).

5 thoughts on “Cloning VMs – Why less (I/O & throughput) is better!”

  1. Hi Josh –

    First of all, full disclosure, my name is Johnny Hatch and I am an SE working at EMC XtremIO.

    You make some excellent points regarding intelligent cloning and I am in full agreement with you about the 4 points you list. I’d like to provide some clarity on the screen shots regarding how XtremIO does indeed meet all 4 criteria.

    a & b) Not much to say here. Your assumptions are correct that XtremIO does satisfy both of these. We support the VAAI primitives and can offload the cloning function to the storage array.

    c) Uses almost zero I/O and bandwidth to perform the clone…

    One of the biggest differentiators for XtremIO is how we handle all meta data operations in memory. Couple this with inline deduplication and we’re able to accomplish some very cool things. The bandwidth being reported in the GUI is the effective throughput the customer is experiencing when in reality there is little to no bandwidth being consumed at all.

    Let me explain further. I mentioned that XtremIO performs all meta data operations in memory. When a VAAI X-Copy command is sent down to the storage array XtremIO will clone all the meta data then inline dedupe kicks in simply providing a pointer from the new meta data reference residing in DRAM to the same, duplicate physical block on SSD. So many vendors position their all flash array as fast because it’s all flash. What makes XtremIO fast isn’t the fact that it’s all flash (although it doesn’t hurt), rather it’s our in memory operations.

    This leads nicely to part d) being satisfied…

    d) Uses almost zero additional space…

    I would actually take this statement a step further and ask how the space efficiency is handled. Dedupe on XtremIO is always done in memory and always inline – it’s never a post processed activity. The benefit here is multifold:

    First, it’s more efficient. In the AFA space the SSDs are plenty performant so that’s not typically our bottleneck. The choke point is the storage processor. Trying to perform dedupe as a post process event steals precious CPU cycles away from processing front end, user I/O to go back and find duplicate blocks. Doing it inline and in memory is faster and a lot more consistent (I think consistency deserves an entire blog post).

    Second, post process dedupe causes excessive wear on the flash drives (NAND flash has a limited number of write cycles).

    Third, users must manage their space to a high water mark. Will the user have enough space to land the duplicate blocks down to disk / SSD in order for dedupe to come back through and dedupe it back down? Inline dedupe doesn’t have this issue because only unique blocks get written.

    Sorry to get sales-y on you. Similar to yourself, I love what I do.

    Johnny

    • Hey Johnny,

      Firstly thanks for the comments.

      I totally agree the storage processors are generally the bottleneck (not the SSDs, or even the HDDs in non AFAs). One key point I probably didn’t highlight enough is that intelligent cloning saves storage processor CPU (because it uses basically zero CPU cycles).

      Inline or Post-Process dedupe requires CPU cycles on the storage processors which none of us want to use (or at least we want to minimize).

      When truly intelligently cloning, we don’t duplicate data, which removes the need for inline (from memory or SSD/HDD) or post process dedupe and therefore saves storage processor CPU cycles. This leads to lower impact on storage processors and more consistent performance, which you highlighted needs a full post, and I agree.

      In fairness to XtremIO, Block Storage doesn’t have the equivalent intelligent cloning VAAI primitives that NFS does, so it’s more a storage protocol limitation than an XtremIO one. XCOPY is great to offload to the array (Point “b” in my post) but it’s not the same as the “Fast File Clone” primitive. XCOPY does things like performing Clones or Storage vMotions directly on the storage on behalf of the Data Mover, but the process as you explained still requires XtremIO to perform the inline dedupe (albeit quite fast in memory, which is better than from SSD or HDD), whereas the data is never duplicated (even in memory) with VAAI-NAS Fast File Clone.

      On the other hand, Block Storage (and XtremIO) has the advantage of XCOPY for Storage vMotion, whereas NFS does not support offloading Storage vMotion, but then again, the requirement for Storage vMotion in an NFS environment is significantly reduced or eliminated depending on platform.

      I have to give XtremIO credit as it clones in a very efficient way considering it’s based on Block storage, but my opinion is that it is constrained (at least with the current VAAI implementation) by being block storage and can’t compete with the efficiency of a VAAI-NAS clone, which as I mentioned earlier is a storage protocol issue, not an XtremIO one.

      In reality NetApp owned the cloning space, as they have done this very well for many years with FlexClone, which is protocol agnostic. IMO FlexClone would still have to be equal to the most efficient clone in the market, and it doesn’t use inline or post process dedupe.

      Always happy to have a chat with people passionate about tech.

      Cheers
