Hardware support contracts & why 24×7 4 hour onsite should no longer be required.

In recent weeks, I have seen numerous RFQs with a requirement for 24×7 2hr or 4hr onsite HW replacement. While this is not uncommon, I have been asking myself: why is this the case?

Over an I.T. career spanning close to 15 years, I have in the majority of cases strongly recommended in my designs and Bills of Materials (BoMs) that customers buy 24×7 4hr onsite hardware maintenance contracts for equipment such as compute, storage arrays, storage area networking and IP network devices.

I have never found it difficult to justify this recommendation, because traditionally if a component in the datacenter fails, such as a storage controller, it generally has a high impact on the customer's business and could cost tens of thousands, hundreds of thousands or even millions of dollars in revenue, depending on the size of the customer.

Not only is losing a storage controller generally high impact, it is also high risk, as the environment may no longer have redundancy and a subsequent failure could (and likely would) result in a full outage.

So in this example, where a typical dual-controller storage solution suffers a controller failure resulting in degraded performance (due to losing 50% of the controllers) and high impact/risk to the customer, purchasing a 24×7 4hr, or even a 24×7 2hr, support contract makes perfect sense! The question is: why choose hardware (or a solution) which puts you at high risk after a single component failure in the first place?

Technology is changing fast, and over the last year or so I have been involved in many customer meetings where I am asked what I recommend in terms of hardware maintenance contracts for Nutanix customers.

Normally this question/conversation happens after the discussion about the technology, where I explain various failure scenarios and how resilient a Nutanix cluster is.

My recommendation goes something like this.

If you architect your solution for your desired level of availability (e.g. N+2), there is no need to buy a 24×7 4hr hardware maintenance contract; the default Next Business Day option is perfectly fine.

Justification:

1. In the event of even an entire node failure, the Nutanix cluster will have automatically self-healed back to the configured resiliency factor (2 or 3) well before even a 2hr support contract could get a technician onsite to diagnose the issue and replace the hardware.

2. Even assuming the HW is replaced right on the 2hr mark (not typical in my experience), if the platform did not self-heal automatically prior to the drive/node replacement, the replacement drive or node would only then START the process of restoring resiliency, so the actual time to recovery would be well beyond 2hrs. In the case of Nutanix, self-healing begins almost immediately after the failure.

3. If a cluster is sized for the desired level of availability based on business requirements, say N+2, a node can fail, Nutanix will automatically self-heal, and the cluster can then tolerate a subsequent failure and fully self-heal back to the configured resiliency factor (2 or 3) again.

4. If a cluster is sized to a customer requirement of only N+1, a node can fail and Nutanix will automatically and fully self-heal. Then, in the unlikely (but possible) event of a subsequent failure (i.e. a second node failure before the Next Business Day warranty replaces the failed HW), the Nutanix cluster will still continue to operate.

5. The performance impact of a node failure in a Nutanix environment is N-1, so in the worst case scenario (a 3 node cluster) the impact is 33%, compared to a 2 controller SAN/NAS where the impact would be 50%. In a 4 node cluster the impact is only 25%, and for a customer with, say, 8 nodes only 12.5%; the bigger the cluster, the lower the impact (see the sketch below). Nutanix recommends N+1 up to 16 nodes and N+2 up to 32 nodes; beyond 32 nodes, higher levels of availability may be desired based on customer requirements.
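To make the arithmetic in point 5 concrete, here is a minimal Python sketch. It is illustrative arithmetic only, not a Nutanix sizing tool; the cluster sizes are just examples, and "N+1 / N+2 usable" simply means holding back one or two nodes' worth of resources as self-heal headroom.

```python
# Illustrative arithmetic only: impact of a single node failure and the
# usable capacity implied by N+1 / N+2 sizing. Cluster sizes are examples.

def node_failure_impact(nodes: int) -> float:
    """Fraction of cluster resources lost when one node fails (N-1 impact)."""
    return 1.0 / nodes

def usable_fraction(nodes: int, spare_nodes: int) -> float:
    """Fraction of total cluster resources usable if 'spare_nodes' worth of
    capacity is held back so the cluster can self-heal after failures."""
    return (nodes - spare_nodes) / nodes

for n in (3, 4, 8, 16, 32):
    print(f"{n:>2} nodes: single node failure impact = {node_failure_impact(n):.1%}, "
          f"N+1 usable = {usable_fraction(n, 1):.1%}, "
          f"N+2 usable = {usable_fraction(n, 2):.1%}")
```

Running it shows the 3 node worst case at 33.3% impact versus 12.5% at 8 nodes, which is the comparison made against a dual-controller array above.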

The risk and impact of the failure scenario(s) is key. In the case of Nutanix, because of the self-healing capability, and because all controllers and SSDs/HDDs in the cluster participate in the self-heal, recovery is completed very quickly and with low impact. The impact of the failure is low (N-1) and recovery is fast, so the risk to the business is low, dramatically reducing (and, in my opinion, potentially removing) the requirement for a 24×7 2hr or 4hr support contract for Nutanix customers.

In Summary:

1. The decision on what hardware maintenance contract is appropriate is a business level decision which should be based in part on a comprehensive risk assessment done by an experienced enterprise architect, intimately familiar with all the technology being used.

2. If the recommendation from the trusted, experienced enterprise architect is that the risk of HW failure causing high impact or an outage to the business is so high that 4hr or 2hr onsite HW replacement is required, my advice would be to reconsider whether the proposed “solution” meets the business requirements. Only if you are constrained to that solution should you purchase a 24×7 2hr or 4hr support contract.

3. A solution that is heavily dependent on hardware replacement to restore resiliency/performance is in itself a high risk to the business.

AND

4. In my experience, it is not uncommon to have problems getting onsite support or hardware replacement regardless of the support contract/SLA. Sometimes this is outside a vendor's control, but most vendors will suffer one or more of the following issues, all of which I have personally experienced on numerous occasions in previous roles:

a) Vendors failing to meet SLA for onsite support.
b) Vendors failing to have the required parts available within the SLA.
c) Replacement HW being refurbished (common practice) and being faulty.
d) The more proprietary the HW, the more likely replacement parts will not be available in a timely manner.

Note: Support contracts don’t promise a resolution within the 2hr/4hr window; they simply promise somebody will be onsite, and in some cases only after you have gone through troubleshooting with the vendor on the phone, sent logs for analysis and so on. So the reality is that the 2hr or 4hr part doesn’t hold much value.

If you have accepted the solution being sold to you, OR you’re an architect recommending a solution which is enterprise grade and highly resilient with self-healing capabilities, then consider why you would need a 24×7 2hr or 4hr hardware maintenance contract if the solution is architected for the required availability level (i.e. N+1, N+2, etc.).

So with your next infrastructure purchase (or when making your recommendations, if you’re an architect), carefully consider what solution you’re investing in (or proposing). If you feel an aggressive 2hr/4hr HW support contract is required, I would recommend revisiting the requirements, as you may well be buying (or recommending) something that isn’t resilient enough to meet them.

Food for thought.

Example Architectural Decision – Transparent Page Sharing (TPS) Configuration for QA / Pre-Production Servers

Problem Statement

In a VMware vSphere environment, with future releases of ESXi disabling Transparent Page Sharing by default, what is the most suitable TPS configuration for an environment running Quality Assurance or Pre-Production server workloads?

Assumptions

1. TPS is disabled by default
2. Storage is expensive
3. Two Socket ESXi Hosts have been chosen to align with a scale out methodology.
4. The average server VM is 2-4 vCPUs and 4-8GB RAM, with some larger.
5. Memory is the first compute level constraint.
6. HA Admission Control policy used is “Percentage of Cluster Resources reserved for HA”
7. vSphere 5.5 or earlier

Requirements

1. The environment must deliver consistent performance
2. Minimize the cost of shared storage

Motivation

1. Reduce complexity where possible.
2. Maximize the efficiency of the infrastructure

Architectural Decision

Leave TPS disabled (default) and leave Large Memory pages enabled (default).

Justification

1. QA/Pre-Production environments should be as close as possible to the configuration of the actual production environment. This is to ensure consistency between QA/Pre-Production validation and production functionality and performance.
2. Setting 100% memory reservations ensures consistent performance by eliminating the possibility of swapping.
3. The 100% memory reservation also eliminates the capacity used by the vswap file, which saves space on the shared storage as well as removing any storage impact from swapping (see the sketch following this list).
4. RAM is cheaper than Tier 1 storage (which is recommended for vswap storage to ensure minimal performance impact during swapping), so the increased cost of memory in the hosts is easily offset by the saving in Tier 1 shared storage.
5. Simplicity. Leaving default settings is advantageous from both an architectural and operational perspective. For example, ESXi patching can cause settings to revert to default, which could negate TPS savings and put a sudden high demand on storage where TPS savings were expected.
6. TPS savings for server workloads are typically much lower than for desktop workloads, and as a result are less attractive.
7. The decision has been made to use 2 socket ESXi hosts and scale out, so the TPS savings per host will be lower than they would be on a 4 socket server with double the RAM.
8. With the “Percentage of Cluster Resources reserved for HA” admission control policy, the failover calculation accounts for the full RAM reservation of every VM, so performance will be approximately the same in the event of a failover, leading to more consistent performance under a wider range of circumstances.
9. Lower core count (and lower cost) CPUs will likely be viable, as RAM will likely be the first constraint on further consolidation.
10. Removes the real or perceived security risk of sensitive information being gathered from other VMs via TPS, as described in VMware KB 2080735.
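To make the storage and admission control arithmetic behind points 2-4 and 8 concrete, here is a minimal Python sketch. The VM count, average VM RAM and host count below are hypothetical examples chosen for illustration, not sizing recommendations.

```python
# Hypothetical figures illustrating the vswap saving from 100% memory
# reservations and the HA admission control percentage; not sizing guidance.

vm_count = 200
avg_vm_ram_gb = 8   # upper end of the stated 4-8GB average server VM

# With no reservation, each VM gets a .vswp file equal to its configured RAM.
# With a 100% memory reservation the .vswp file is effectively zero bytes.
vswap_saving_gb = vm_count * avg_vm_ram_gb
print(f"Tier 1 storage no longer required for vswap: {vswap_saving_gb} GB")

# "Percentage of Cluster Resources reserved for HA" sized for N+1 in an
# 8 host cluster: reserve one host's worth of resources.
hosts = 8
ha_reserved_pct = 100 / hosts
print(f"HA admission control reservation for N+1: {ha_reserved_pct:.1f}%")
```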

Implications

1. Using 100% memory reservations requires the ESXi hosts and the cluster to be sized at a 1:1 ratio of vRAM to pRAM (physical RAM), and the sizing should include N+1 so a host failure can be tolerated (see the sketch below).
2. Increased RAM costs
3. No memory overcommitment can be achieved
4. Potential for lower CPU utilization / overcommitment as RAM may become the first constraint.
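As a rough illustration of implication 1, the following sketch estimates the host count for a 1:1 vRAM:pRAM design including N+1. All inputs (VM count, average VM RAM, host RAM and the overhead allowance) are hypothetical examples, not recommendations.

```python
# Rough host count estimate for a 1:1 vRAM:pRAM design with N+1.
# All inputs are hypothetical examples.
import math

vm_count = 200
avg_vm_ram_gb = 8            # upper end of the stated 4-8GB average server VM
host_ram_gb = 512            # hypothetical host configuration
hypervisor_overhead_gb = 32  # rough allowance for ESXi and per-VM overheads

total_vram_gb = vm_count * avg_vm_ram_gb
usable_per_host_gb = host_ram_gb - hypervisor_overhead_gb

hosts_for_load = math.ceil(total_vram_gb / usable_per_host_gb)
hosts_required = hosts_for_load + 1  # N+1 so a host failure can be tolerated

print(f"Total reserved vRAM: {total_vram_gb} GB")
print(f"Hosts for load: {hosts_for_load}, with N+1: {hosts_required}")
```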

Alternatives

1. Use 50% reservation and enable TPS
2. Use no reservation, enable TPS and disable large pages

Related Articles:

1. Transparent Page Sharing (TPS) Example Architectural Decisions Register

2. The Impact of Transparent Page Sharing (TPS) being disabled by default – @josh_odgers (VCDX#90)

3. Future direction of disabling TPS by default and its impact on capacity planning –@FrankDenneman (VCDX #29)

4. Transparent Page Sharing Vulnerable, Yet Largely Irrelevant – @ChrisWahl (VCDX#104)

Transparent Page Sharing (TPS) Example Architectural Decisions Register

The following is a register of all Example Architectural Decisions related to Transparent Page Sharing on VMware ESXi following the announcement from VMware that TPS will be disabled by default in future patches and versions.

See The Impact of Transparent Page Sharing (TPS) being disabled by default for more information.

The goal of this series is to give the pros and cons of multiple TPS configuration options for a wide range of virtual workloads, from VDI to server, business critical applications, test/dev and QA/pre-production.

Business Critical Applications (vBCA):

1. Transparent Page Sharing (TPS) Configuration for Virtualized Business Critical Applications (vBCA)

Mixed Server Workloads:

1. Transparent Page Sharing (TPS) Configuration for Production Servers (1 of 2)

2. Transparent Page Sharing (TPS) Configuration for Production Servers (2 of 2) – Coming Soon!

Virtual Desktop (VDI) Environments:

1. Transparent Page Sharing (TPS) Configuration for VDI (1 of 2)

2. Transparent Page Sharing (TPS) Configuration for VDI (2 of 2)

Testing & Development:

1. Transparent Page Sharing (TPS) Configuration for Test/Dev Servers (1 of 2) – Coming Soon!

2. Transparent Page Sharing (TPS) Configuration for Test/Dev Servers (2 of 2) – Coming Soon!

QA / Pre-Production:

1. Transparent Page Sharing (TPS) Configuration for QA / Pre-Production Servers

Related Articles:

1. Example Architectural Decision Register

2. The Impact of Transparent Page Sharing (TPS) being disabled by default – @josh_odgers (VCDX#90)

3. Future direction of disabling TPS by default and its impact on capacity planning – @FrankDenneman (VCDX #29)

4. Transparent Page Sharing Vulnerable, Yet Largely Irrelevant – @ChrisWahl (VCDX#104)