Example Architectural Decision – VMware HA – Percentage of Cluster resources reserved for HA

Problem Statement

The decision has been made to use “Percentage of cluster resources reserved for HA” admission control setting, and use Strict admission control to ensure the N+1 minimum redundancy level is maintained. However, as most virtual machines do not use  “Reservations” for CPU and/or Memory, the default reservation is only 32Mhz and 0MB+overhead for a virtual machine. In the event of a failure, this level of resources is unlikely to provide sufficient compute to operate production workloads. How can the environment be configured to ensure a minimum level of performance is guaranteed in the event of one or more host failures?

Requirements

1. All Clusters have a minimum requirement of N+1 redundancy
2. In the event of a host failure, a minimum level of performance must be guaranteed

Assumptions

1. vSphere 5.0 or later (Note: This is Significant as default reservation dropped from 256Mhz to 32Mhz, RAM remained at 0MB + overhead)

2. Percentage of Cluster resources reserved for HA is used and set to a value as per Example Architectural Decision – High Availability Admission Control

3. Strict admission control is enabled

4. Target over commitment Ratios are <=4:1 vCPU / Physical Cores | <=1.5 : 1 vRAM / Physical RAM

5. Physical CPU Core speed is >=2.0Ghz

6. Virtual machines sizes in the cluster will vary

7. A limited number of mission critical virtual machines may be set with reservations

8. Average VM size uses >2GB RAM

9. Clusters compute resources will be utilized at >=50%

Constraints

1. Ensuring all compute requirements are provided to Virtual machines during BAU

Motivation

1. Meet/Exceed availability requirements
2. Minimize complexity
3. Ensure the target availability and performance is maintained without significantly compromising  over commitment ratios

Architectural Decision

Ensure all clusters remain configured with the HA admission control setting use
“Enable – Do not power on virtual machines that violate availability constraints”

and

Use “Percentage of Cluster resources reserved for HA” for the admission control policy with the percentage value based on the following Architectural Decision – High Availability Admission Control

Configure the following HA Advanced Settings

1. “das.vmMemoryMinMB” with a value of “1024″
2. “das.vmCpuMinMHz” with a value of “512”

Justification

1. Enabling admission control is critical to ensure the required level of availability.
2. The “Percentage of cluster resources reserved for HA” setting allows a suitable percentage value of cluster resources to reserved depending on the size of each cluster to maintain N+1
3.The potentially inefficient slot size calculation used with “Host Failures cluster tolerates” does not suit clusters where virtual machines sizes vary and/or where some mission Critical VMs require reservations

  • 4.
  • Using advanced settings “das.vmCpuMinMHz” & “das.vmMemoryMinMB” allows a minimum level of performance (per VM) to be guaranteed in the event of one or more host failures
  • 5.
  • Advanced settings have been configured to ensure the target over commit ratios are still achieved while ensuring a minimum level of resources in a the event of a host failure
  • 6.
  • Maintains an acceptable minimum level of performance in the event of a host failure without requiring the administrative overhead of setting and maintaining “reservations” at the Virtual machine level
  • 7.
  • Where no reservations are used, and advanced settings not configured, the default reservation would be 32Mhz and 0MB+ memory overhead is used. This would likely result in degraded performance in the event a host failure occurs.

Alternatives

1. Use “Specify a fail over host” and have one or more hosts specified
2. “Host Failures cluster tolerates” and set it to appropriate value depending on hosts per cluster without using advanced settings
3.Use higher Percentage values
4. Use Higher / Lower values for “das.vmMemoryMinMB” and “das.vmCpuMinMHz”
5. Set Virtual machine level reservations on all VMs

Implications

1. The “das.vmCpuMinMHz” advanced setting applies on a per VM basis, not a per vCPU basis, so VMs with multiple vCPUs will still only be guarenteed 512Mhz in a HA event

2. This will reduce the number of virtual machines that can be powered on within the cluster (in order to enforce the HA requirements)

CloudXClogo

 

 

High Availability Admission Control Setting and Policy

High Availability Admission Control Setting and Policy

Problem Statement

In a self service IaaS cloud, the virtual machine numbers and compute requirements will likely vary significantly and without notice. The environment needs to achieve the maximum consolidation ratio possible without impacting the ability to provide redundancy at a minimum of

1. N+1 for clusters of up to 8 hosts
2. N+2 for clusters of >8 hosts but <=16
3. N+3 for clusters of >16 hosts but <=24
4. N+4 for clusters of >24 hosts but <=32

What is the most efficient HA admission control policy / setting and configuration for the vSphere cluster?

Assumptions

1. Virtual machine workloads will vary from small ie: 1 vCPU / 1GB RAM up to large VMs of >=8vCPU and >=64GB Ram
2. Redundancy is mandatory as per the problem statement
3. ESXi hosts can support the maximum VM size required by the offering
4. vSphere 5.0 or later is being used

Motivation

1. Ensure maximum consolidation ratios in the cluster
2. Ensure optimal compute resource utilization
3. Prevent HA overhead being increased by the potentially inefficient slot size based HA algorithms
4. Make maximum use of hardware investment

Alternatives

1. Use “Specify a fail over host”
2. Set “Host failure cluster tolerates” to 1, 2,3 or 4 depending on cluster size

Justification

1. Enabling admission control is critical to ensure the required level of availability
2. The admission control settings that rely on the Slot size based HA algorithms do not suit clusters with varying VM sizes
3. Percentage setting being rounded up adds minimal additional HA overhead and helps ensure performance in a HA event
4. Ensure maximum CPU scheduling efficiency by having all hosts within the cluster running virtual machines
5. Ensure optimal DRS flexibility by having all hosts within the cluster active to be able to run virtual machines

Architectural Decision

For the HA Admission control setting use “Enable – Do not power on virtual machines that violate availability constraints”

For the HA admission control policy use “Percentage of cluster resources reserved for HA” and set the percentage of cluster resources as per the below table.

Note: Percentage values that do not equate to a full number will be rounded up.

HAPercentages

Note: Check out this cool HA admission control percentage calculator by Samir Roshan of ThinkingLoudOnCloud

Implications

1. The Percentage of cluster resources reserved for HA uses VM level CPU and Memory reservations to calculate cluster capacity. If not reservations are set performance in the event of a failure may be impacted
2. The default Mhz reserved for HA is 32mhz per VM –  CPU Reservations should be considered for critical VMs to ensure performance is not significantly degraded in a HA event
3. The default memory reserved for HA is 0MB + VM overhead – Memory reservations should be considered for critical VMs to ensure performance is not significantly degraded in a HA event