Hardware support contracts & why 24×7 4 hour onsite should no longer be required.

In recent weeks, I have seen numerous RFQs which have the requirement for 24×7 2 or 4hr onsite HW replacement, and while this is not uncommon I’ve been thinking why is this the case?

Over my I.T career spanning coming up on 15 years, in the majority of cases, I have strongly recommended in my designs and Bill of Materials (BoMs) that customers buy 24×7 4 hours onsite hardware maintenance contracts for equipment such as Compute, Storage Arrays , Storage Area Networking and IP network devices.

I have never found it difficult to justify this recommendation, because traditionally if a component in the datacenter fails, such as a Storage Controller, this generally has a high impact on the customers business and could cost tens or hundreds of thousands of dollars or even millions of dollars in revenue depending on the size of the customer.

Not only is loosing a Storage controller general a high impact, it is also a high risk as the environment may no longer have redundancy and a subsequent failure could (and would likely) result in a full outage.

So in this example, a typical storage solution has a Storage Controller failure resulting in degraded performance (due to loosing 50% of the controllers) and high impact/risk to the customer, a customer purchasing 24×7 4 Hour, or even 24×7 2hr support contract makes perfect sense! The question is, why choose HW (or a solution) which puts you at high risk after a single component failure in the first place?

With technology fast changing and over the last year or so, I’ve been involved in many customer meetings where I am asked what I recommend in terms of hardware maintenance contracts (for Nutanix customers).

Normally this question/conversation happens after the discussion about the technology, where I explain various failure scenarios and how resilient a Nutanix cluster is.

My recommendation goes something like this.

If you architect your solution for your desired level of availability (e.g.: N+2) there is no need to buy 24×7 4hr hardware maintenance contract, the default Next Business Day option is perfectly fine.

Justification:

1. In the event of even an entire node failure, the Nutanix cluster will have automatically self healed the configured resiliency factor (2 or 3) well before even a 2hr support contract can provide a technician to be onsite, diagnose the issue and replace hardware.

2. Assuming the HW is replaced on the 2hr mark (not typical in my experience), AND assuming Nutanix was not automatically self healing prior to the drive/node replacement, the replacement drive or node would then START the process of self healing. So the actual time to recovery would be greater than 2hrs. In the case of Nutanix, self heal begins almost immediately.

3. If a cluster is sized for the desired level of availability based on business requirements, say N+2, a Node can fail, Nutanix will automatically self heal and then tolerate a subsequent failure with the ability to full self heal the configured resiliency factor (2 or 3) again.

4. If a cluster is sized only to customer requirement of only N+1, a Node can fail, Nutanix will automatically and fully self heal. Then in the unlikely (but possible) event of a subsequent failure (i.e.: A 2nd node failure before the next business day warranty replaces the failed HW), the Nutanix cluster will still continue to operate.

5. The performance impact of a node failure in a Nutanix environment is N-1, so in a worst case scenario (3 node cluster) the impact is 33%, compared to a 2 controller SAN/NAS where the impact would be 50%. In a 4 node cluster the impact is only 25% and for customer with say 8 nodes only 12.5%. The bigger the cluster the lower the impact. Nutanix recommends N+1 up to 16 nodes, and N+2 up to 32 nodes. Beyond 32 nodes higher levels of availability may be desired based on customer requirements.

The risk and impact of the failure scenario/s is key, in the case of Nutanix, because of the self healing capability, and the fact all controllers and SSDs/HDDs in the cluster participate in the self heal, it can be done very quickly and with low impact. So the impact of the failure is low (N-1) and the recovery is done quickly, so the risk to the business is low, therefore dramatically reducing (and in my opinion potentially removing) the requirement for a 24×7 2 or 4hr support contract for Nutanix customers.

In Summary:

1. The decision on what hardware maintenance contract is appropriate is a business level decision which should be based in part on a comprehensive risk assessment done by an experienced enterprise architect, intimately familiar with all the technology being used.

2. If the recommendation from the trusted experienced enterprise architect is that the risk of HW failure causing high impact or outage to the business is so high that purchasing a 4hr or 2hr onsite HW replacement is required, my advise would be to reconsider if the proposed “solution” meets the business requirements. Only if you are constrained to that solution, purchase a 24×7 2 or 4hr support contract.

3. Being heavily dependant on Hardware being replaced to restore resiliency / performance for a solution, is in itself a high risk to the business.

AND

4. In my experience, it is not uncommon to have problems getting onsite support or hardware replacement regardless of the support contract / SLA. Sometimes this is outside a vendors control, but most vendors will experience one or more of these issues which I have personally experienced on numerous occasions in previous roles:

a) Vendors failing to meet SLA for onsite support.
b) Vendors failing to have the required parts available within the SLA.
c) Replacement HW being refurbished (common practice) and being faulty.
d) The more propitiatory the HW, the more likely replacement parts will not be available in a timely manner.

Note: Support contracts don’t promise a resolution by the 2hr / 4hr contract, they simply promise somebody will be onsite and in some cases this is only after you have gone through troubleshooting with the vendor on the phone, sent logs for analysis and so on. So the reality is, the 2hr or 4hr part doesn’t hold much value.

If you have accepted the solution being sold to you OR your an architect recommending a solution which is enterprise grade and highly resilient with self healing capabilities, then consider why you need a 24×7 2hr or 4hr hardware maintenance contract if the solution is architected for the required availability level (i.e.: N+1 / N+2 etc)

So with your next infrastructure purchase (or when making your recommendations if you’re an architect), carefully consider what solution your investing in (or proposing), and if you feel an aggressive 2hr/4hr HW support contract is required, I would recommend revisiting the requirements as you may well be buying (or recommending) something that isn’t resilient enough to meet the requirements.

Food for thought.

Is VAAI beneficial with Virtual Storage Appliance (VSA) based solutions ?

I saw a tweet recently (below) which inspired me to write this post as there is still a clear misunderstanding of the benefits VAAI provides (even with Virtual Storage Appliances).

vaaionvsatweet2

I have removed the identity of the individual who wrote the tweet and the people who retweeted this as the goal of this post is solely to correct what I believe is mis-information.

My interpretation of the tweet was (and remains) if a solution uses a Virtual Storage Appliance (VSA) which resides on the ESXi host then VAAI is not providing any benefits.

My opinion on this topic is:

Compared to a traditional centralised NAS (such as a Netapp or EMC Isilon) providing NFS storage with VAAI-NAS support, a Nutanix or VSA solution has exactly the same benefits from VAAI!

My 1st reply to the tweet was:

vaaionvsatweetconvojoshreply2

The test I was referring to with Netapp OnTap Edge can be found here which was posted in Jan 2013, well prior to my joining Nutanix when I was working for IBM where I had been evangelising VAAI/VCAI based solutions for a long time as VAAI/VCAI provides significant value to VMware customers.

The following shows the persons initial reply to my tweet.

vaaionvsatweetconvo32

I responded with the below mentioning I will do a blog which is what you’re reading now.

I went onto provide some brief replies as shown below.

repliesdetail

The main comments from this persons tweets I would summarize (rightly or wrongly) below:

  • VAAI is designed only to offload functions externally (or off the ESXi host)
  • He/She had not seen any proof of performance advantages from VAAI on VSAs
  • Its broken logic to use VAAI with a VSA

Firstly, I would like comment on VAAI being designed to offload functions externally (or off the ESXi host). I don’t disagree VAAI has some functions designed to offload to the (centralised) array but VAAI also has numerous functions which are designed to bring other efficiencies to a vSphere environment.

An example of a feature designed to offload to a central array is the “XCOPY” primitive.

A simple example of what “XCOPY” or Extended Copy provides is offloading a Storage vMotion on block based storage (i.e.: VMFS over iSCSI,FC,FCoE not NFS) to the array so the ESXi host does not have to process the data movement.

This VAAI primitive would likely be of little benefit in a VSA environment where the storage is presented is block based and Storage DRS for example was used. The data movement would be offloaded from ESXi to the VSA running on ESXi and host would still be burdened with the SvMotion.

However XCOPY is only one of the many primitives of VAAI, and VAAI does alot more than just offload Storage vMotions.

For the purpose of this post, I will be discussing VAAI with Nutanix whos Software defined storage solution runs in a VM on every ESXi host in a Nutanix cluster.
Note: This information is also relevant to other VSAs which support VAAI-NAS.

So what benefit does VAAI provide to Nutanix or a VSA solution running NFS?

Nutanix deploys by default with NFS and supports the VAAI-NAS primitives which are:

  • Full File Clone
  • Fast File Clone
  • Reserve Space
  • Extended Statistics

Note: XCOPY is not supported on NFS, importantly and specifically speaking for Nutanix it is not required as SvMotion will be rarely if ever used with Nutanix solutions.

See my post “Storage DRS and Nutanix – To use, or not to use, that is the question?” for more details on why SvMotion is rarely needed when using Nutanix.

For more details of VAAI primitives, Cormac Hogan (@CormacJHogan) wrote an excellent post which can be found here.

Now here is an example of a significant performance benefits of VAAI with Nutanix.

Lets look at Clone of a VM on a Nutanix platform, the VMs details are below.admin01vm

The VM I have used for this test resides on a datastore called “Management” (as per the above image) which presented via NFS and has VAAI (Hardware Acceleration) enabled as shown below.datastore

Now if I do a simple clone of a VM (as shown below) if the VM is turned on, VAAI-NAS is bypassed as the “Fast File Clone” primitive only works on VMs which are powered off.

clone

So a simple way to test the performance benefits of VAAI on any platform (including Hyper-converged such as Nutanix, a Virtual Storage Appliance (VSA) such as Netapp Ontap Edge or traditional centralised SAN or NAS) is to clone a VM while powered on then shut-down the VM and clone it again.

I performed this test and the first clone with the VM powered on started at 1:17:23 PM and finished at 1:26:12 PM, so a total of 8 mins 49 seconds.

Next I shut down the VM and repeated the clone operation.cloneresults

As we can see in the above screen capture from the 2nd clone started at 1:26:49 PM and finished at 1:26:54 PM, so a total of 5 seconds.

The reason for the huge difference in the speed of the two clones is because VAAI-NAS “Fast File Clone” primitive offloaded the 2nd clone to the Nutanix platform (which runs as a VM on the ESXi host) which has intelligently cloned the VM (using metadata resulting in almost zero data creation) as opposed to 1st clone where VAAI-NAS was not used which resulted in the hypervisor and storage solution having to read 11.18GB of data (being the source VM – Admin01) and write a full copy of the same data resulting in effectively >22GB of data movement in the environment.

Now from a capacity savings perspective, a simple way to demonstrate the capacity savings of VAAI on any platform is to clone a VM multiple times and compare the before and after datastore statistics.

Before I performed this test I captured a baseline of the Management datastore as shown below.

BeforeCloningCapacityVMcount

The above highlighted areas show:

  • Virtual Machines and Templates as 83
  • Capacity 8.49TB
  • Provisioned Space 7.09TB
  • Free Space 7.01TB

I then cloned the Admin01 VM a total of 7 times.clone7vmsrecenttasks

Immediately following the last clone completing I took the below screen shot of the Management datastores statistics.

AfterCloningCapacityVMcount

The above highlighted areas in the updated datastore summary show:

  • Virtual Machines and Templates INCREASED by 7 to 90 (as I cloned 7 VMs)
  • Capacity remained the same at 8.49TB
  • Provisioned Space INCREASED to 7.29TB as we cloned 7 x ~40Gb VMs (Total of ~280GB)
  • Free Space REMAINED THE SAME at 7.01TB due to VAAI-NAS Fast File Clone primitive working with the Nutanix Distributed File System.

So VAAI-NAS allowed a VM of ~11GB of used storage (~40GB provisioned) to be cloned without using any significant additional disk space and the clones were each done in between 5 and 7 seconds each.

So some of the benefits VAAI-NAS provides to Nutanix (which some people would term as a VSA type solution) include:

  • Near instant VM cloning via vSphere Client/s (as shown above)
  • Near instant Horizon View Linked Clone deployments (VCAI) – Similar to example shown.
  • Near instant vCloud Director clones (via FAST Provisioning) – Similar to example shown.
  • Major capacity savings by using Intelligent cloning rather than Full Clones (As shown above)
  • Lower CPU overhead for both ESXi hosts AND Nutanix Controller VM (CVM)
  • Ability to create EagerZeroThick VMDKs on NFS (e.g.: To support Fault Tolerance & clustered workloads such as Oracle RAC)
  • Enhanced ability to get statistics on file sizes , capacity usage etc on NFS

In Summary:

Overall I would say that VMware have developed an excellent API in VAAI and Nutanix along with VSA providers having support for VAAI provides major advantages and value to our joint customers with VMware.

It would be broken logic NOT to leverage the advantages of VAAI regardless of storage type (VSA, Nutanix or traditional centralized SAN/NAS) and for the vast majority of vSphere deployments, any storage solution not supporting (or having issues/bugs with) VAAI will have significant downsides.

I am looking forward to ongoing developments from VMware such as vVols and VASA 2.0 to continue to enhance storage of vSphere solutions in the future.

I hope customers and architects now have the correct information to make the most effective design and purchasing recommendations to meet/exceed customer requirements.

vBrownbag – vForum Sydney 2013 – View Composer for Array Integration (VCAI)

Recently I presented a short vBrownbag on the topic of View Composer for Array Integration (VCAI) with a focus on how VCAI benefits Horizon View environments, Storage Protocol choice for Horizon View deployments and how Nutanix leverage’s VCAI to optimize Horizon View Deployments.

The video is available here – View Composer for Array Integration VCAI) & Nutanix