Example Architectural Decision – VMware HA – Percentage of Cluster resources reserved for HA

Problem Statement

The decision has been made to use “Percentage of cluster resources reserved for HA” admission control setting, and use Strict admission control to ensure the N+1 minimum redundancy level is maintained. However, as most virtual machines do not use  “Reservations” for CPU and/or Memory, the default reservation is only 32Mhz and 0MB+overhead for a virtual machine. In the event of a failure, this level of resources is unlikely to provide sufficient compute to operate production workloads. How can the environment be configured to ensure a minimum level of performance is guaranteed in the event of one or more host failures?

Requirements

1. All Clusters have a minimum requirement of N+1 redundancy
2. In the event of a host failure, a minimum level of performance must be guaranteed

Assumptions

1. vSphere 5.0 or later (Note: This is Significant as default reservation dropped from 256Mhz to 32Mhz, RAM remained at 0MB + overhead)

2. Percentage of Cluster resources reserved for HA is used and set to a value as per Example Architectural Decision – High Availability Admission Control

3. Strict admission control is enabled

4. Target over commitment Ratios are <=4:1 vCPU / Physical Cores | <=1.5 : 1 vRAM / Physical RAM

5. Physical CPU Core speed is >=2.0Ghz

6. Virtual machines sizes in the cluster will vary

7. A limited number of mission critical virtual machines may be set with reservations

8. Average VM size uses >2GB RAM

9. Clusters compute resources will be utilized at >=50%

Constraints

1. Ensuring all compute requirements are provided to Virtual machines during BAU

Motivation

1. Meet/Exceed availability requirements
2. Minimize complexity
3. Ensure the target availability and performance is maintained without significantly compromising  over commitment ratios

Architectural Decision

Ensure all clusters remain configured with the HA admission control setting use
“Enable – Do not power on virtual machines that violate availability constraints”

and

Use “Percentage of Cluster resources reserved for HA” for the admission control policy with the percentage value based on the following Architectural Decision – High Availability Admission Control

Configure the following HA Advanced Settings

1. “das.vmMemoryMinMB” with a value of “1024″
2. “das.vmCpuMinMHz” with a value of “512”

Justification

1. Enabling admission control is critical to ensure the required level of availability.
2. The “Percentage of cluster resources reserved for HA” setting allows a suitable percentage value of cluster resources to reserved depending on the size of each cluster to maintain N+1
3.The potentially inefficient slot size calculation used with “Host Failures cluster tolerates” does not suit clusters where virtual machines sizes vary and/or where some mission Critical VMs require reservations

  • 4.
  • Using advanced settings “das.vmCpuMinMHz” & “das.vmMemoryMinMB” allows a minimum level of performance (per VM) to be guaranteed in the event of one or more host failures
  • 5.
  • Advanced settings have been configured to ensure the target over commit ratios are still achieved while ensuring a minimum level of resources in a the event of a host failure
  • 6.
  • Maintains an acceptable minimum level of performance in the event of a host failure without requiring the administrative overhead of setting and maintaining “reservations” at the Virtual machine level
  • 7.
  • Where no reservations are used, and advanced settings not configured, the default reservation would be 32Mhz and 0MB+ memory overhead is used. This would likely result in degraded performance in the event a host failure occurs.

Alternatives

1. Use “Specify a fail over host” and have one or more hosts specified
2. “Host Failures cluster tolerates” and set it to appropriate value depending on hosts per cluster without using advanced settings
3.Use higher Percentage values
4. Use Higher / Lower values for “das.vmMemoryMinMB” and “das.vmCpuMinMHz”
5. Set Virtual machine level reservations on all VMs

Implications

1. The “das.vmCpuMinMHz” advanced setting applies on a per VM basis, not a per vCPU basis, so VMs with multiple vCPUs will still only be guarenteed 512Mhz in a HA event

2. This will reduce the number of virtual machines that can be powered on within the cluster (in order to enforce the HA requirements)

CloudXClogo

 

 

Example Architectural Decision – HA Admission Control Policy with Software licensing constaints

High Availability Admission Control Setting & Policy with a Software Licensing Constraint

Problem Statement

The customer has a requirement to virtualize “Application X” which is currently running on physical servers. The customer is licensed for a maximum of 32 cores and the software vendor has strict licensing restrictions which do not recognize the use of DRS rules to restrict virtual machines to a sub-set of hosts within a cluster.

The application is Tier 1, and requires maximum availability. A capacity planner assessment has been conducted and found 32 cores and 256Gb RAM is sufficient to run all servers.

The servers requirements vary greatly from 1vCPU/2GB RAM to 8vCPU/64GB Ram with the bulk of the VMs 2vCPU or less with varying RAM sizes.

What is the most suitable hardware configuration and HA admission control policy / setting  that complies with the licensing restrictions while ensuring N+1 redundancy and minimizing the change of poor application performance?

Assumptions

1. None

Constraints

1. Software vendor has strict licensing requirements
2. Only 32 cores are licensed and the customer has no budget for further licenses
3. DRS rules cannot be used to isolate VMs onto one or more hosts due to software licensing agreement

Motivation

1. Ensure maximum availability for the Tier 1 application/s
2. Ensure optimal performance for Tier 1 application/s

Architectural Decision

Purchase a total of three (3) x Two (2) Way Servers, with 8 core CPUs and 128GB Ram each and form a cluster of three nodes.

For the HA Admission control setting use “Enable – Do not power on virtual machines that violate availability constraints”

For the HA admission control policy use “Specify a Failover Host” and select the third host in the cluster. (Leaving two active hosts in the cluster).

Justification

1. Enabling strict admission control is critical to ensure the required level of availability for the Tier 1 application
2. Ensure maximum CPU scheduling efficiency by having two hosts active within the cluster running virtual machines as opposed to a single large host
3. Having 2 active hosts in the cluster allows DRS some flexibility to load balance to resolve contention compared to using a single large 32 core host
4. N+1 redundancy is achieved as one host can fail and the “fail-over” host will become active and be able to take the failed hosts workloads without performance degrading
5. As only 32 cores ( 2 servers with 16 cores each) are active at any one time, the solution complies with the licensing constraint
6. Using CPUs with smaller numbers of cores (such as 5 x 2 way servers with 4 cores per socket) would result in larger VMs not fitting within NUMA nodes and potentially impacting memory performance. Although, with vNUMA in vSphere 5.0 this would be less of an issue.
7. All VMs will fit within a NUMA node thus giving the VMs maximum performance without the requirement for vNUMA which is only available in vSphere 5.0 or later
8. The compute resource supplied by the proposed cluster is sufficient to run the workloads as per the capacity planner assessment.

Implications

1. Additional networking and storage ports for three hosts as opposed to a two host cluster
2. If additional compute is required in the cluster, additional software licenses would need to be purchased. Alternativley if the application servers were redesigned to use a scale out methodology (especially for VMs with 4-8vCPUs) it would likley result in higher overcommitment ratios without significant contention and better utilization of the existing licensed cores
3. One host is sitting as a hot standby not servicing customer workloads and may be considered to be “waste”

Alternatives

1. Use 2 x 4 way 8 core ESXi hosts (32 cores per host) and set HA admission control to specify a fail over host
2. Use 5 x 2 Way 4 core ESXi hosts (8 cores per host) and set HA admission control to specify a fail over host

The Below is a basic diagram of the proposed solution.

FailoverHost

*Post updated February 11th to correct an error.