Example Architectural Decision – VMware HA – Percentage of Cluster resources reserved for HA

Problem Statement

The decision has been made to use “Percentage of cluster resources reserved for HA” admission control setting, and use Strict admission control to ensure the N+1 minimum redundancy level is maintained. However, as most virtual machines do not use  “Reservations” for CPU and/or Memory, the default reservation is only 32Mhz and 0MB+overhead for a virtual machine. In the event of a failure, this level of resources is unlikely to provide sufficient compute to operate production workloads. How can the environment be configured to ensure a minimum level of performance is guaranteed in the event of one or more host failures?

Requirements

1. All Clusters have a minimum requirement of N+1 redundancy
2. In the event of a host failure, a minimum level of performance must be guaranteed

Assumptions

1. vSphere 5.0 or later (Note: This is Significant as default reservation dropped from 256Mhz to 32Mhz, RAM remained at 0MB + overhead)

2. Percentage of Cluster resources reserved for HA is used and set to a value as per Example Architectural Decision – High Availability Admission Control

3. Strict admission control is enabled

4. Target over commitment Ratios are <=4:1 vCPU / Physical Cores | <=1.5 : 1 vRAM / Physical RAM

5. Physical CPU Core speed is >=2.0Ghz

6. Virtual machines sizes in the cluster will vary

7. A limited number of mission critical virtual machines may be set with reservations

8. Average VM size uses >2GB RAM

9. Clusters compute resources will be utilized at >=50%

Constraints

1. Ensuring all compute requirements are provided to Virtual machines during BAU

Motivation

1. Meet/Exceed availability requirements
2. Minimize complexity
3. Ensure the target availability and performance is maintained without significantly compromising  over commitment ratios

Architectural Decision

Ensure all clusters remain configured with the HA admission control setting use
“Enable – Do not power on virtual machines that violate availability constraints”

and

Use “Percentage of Cluster resources reserved for HA” for the admission control policy with the percentage value based on the following Architectural Decision – High Availability Admission Control

Configure the following HA Advanced Settings

1. “das.vmMemoryMinMB” with a value of “1024″
2. “das.vmCpuMinMHz” with a value of “512”

Justification

1. Enabling admission control is critical to ensure the required level of availability.
2. The “Percentage of cluster resources reserved for HA” setting allows a suitable percentage value of cluster resources to reserved depending on the size of each cluster to maintain N+1
3.The potentially inefficient slot size calculation used with “Host Failures cluster tolerates” does not suit clusters where virtual machines sizes vary and/or where some mission Critical VMs require reservations

  • 4.
  • Using advanced settings “das.vmCpuMinMHz” & “das.vmMemoryMinMB” allows a minimum level of performance (per VM) to be guaranteed in the event of one or more host failures
  • 5.
  • Advanced settings have been configured to ensure the target over commit ratios are still achieved while ensuring a minimum level of resources in a the event of a host failure
  • 6.
  • Maintains an acceptable minimum level of performance in the event of a host failure without requiring the administrative overhead of setting and maintaining “reservations” at the Virtual machine level
  • 7.
  • Where no reservations are used, and advanced settings not configured, the default reservation would be 32Mhz and 0MB+ memory overhead is used. This would likely result in degraded performance in the event a host failure occurs.

Alternatives

1. Use “Specify a fail over host” and have one or more hosts specified
2. “Host Failures cluster tolerates” and set it to appropriate value depending on hosts per cluster without using advanced settings
3.Use higher Percentage values
4. Use Higher / Lower values for “das.vmMemoryMinMB” and “das.vmCpuMinMHz”
5. Set Virtual machine level reservations on all VMs

Implications

1. The “das.vmCpuMinMHz” advanced setting applies on a per VM basis, not a per vCPU basis, so VMs with multiple vCPUs will still only be guarenteed 512Mhz in a HA event

2. This will reduce the number of virtual machines that can be powered on within the cluster (in order to enforce the HA requirements)

CloudXClogo

 

 

Example Architectural Decision – Storage DRS Configuration for VMFS Datastores in a vCloud Environment

Problem Statement

In a production , self service vCloud Director environment, What is the most suitable Storage DRS configuration to improve storage utilization , performance, as well as reduce administrative effort for BAU staff?

Requirements

1. Make the most efficient use of the available storage capacity
2. Maintain consistent level of storage performance
3. Reduce the risk and overhead of capacity management
4. Reduce the risk of a unintentional or otherwise DoS event caused by self service

Assumptions

1. vSphere 5.0 or later
2. VMFS 5 Datastores which are Thick Provisioned
3. Deduplication is not in use
4. VAAI is supported by the array and enabled across the vSphere environment
5. All datastores in each respective Datastore clusters reside on the same RAID type with similar spindle types and count
6. All datastores are presented to all hosts within the cluster
7. Array level snapshots are not in use
8. IBM SVC Storage is being used
9. vCloud Director 5.1 or later
10. Storage I/O Control is enabled at set to 30ms

Constraints

1. IBM SVC storage does not currently support VASA (VMware API for Storage Awareness)

Motivation

1. Ensure production storage performance is not negatively impacted
2. Minimize the vSphere administrators workload where possible

Architectural Decision

Set the DRS automation setting to “Fully Automated”

  • Set “Utilized Space” threshold to 80%
  • Set “I/O latency” to 15ms
  • I/O Metric Inclusion – Enabled

Advanced Options

  • No recommendations until utilization difference between source and destination is: 10%
  • Evaluate I/O load every 8 Hours
  • I/O Imbalance threshold  4

Justification

1. Setting Storage DRS to “Fully Automated”  ensures that the administrator does not need to be concerned with initial placement of virtual machines as this will be dynamically and intelligently determined and executed

2. “XCOPY” is fully supported for Block based storage, as such, any Storage vMotion activity is offloaded to the array therefore removing the I/O overhead on the compute and storage fabric.

3. Where a significant I/O imbalance is detected by SDRS, the vSphere administrator is not required to take any action, Storage DRS can identify and remediate issues which fall outside parameters (which are determined by the VMware Architect) automatically. This improves the efficiency of the environment, and reduces the involvement of BAU.

4. SDRS provides valuable “initial placement” for new virtual machines which will help avoid a situation where datastores are unevenly balanced from a capacity perspective in the first place, therefore reducing the chance of virtual machines requiring migration.

5. Setting the “No recommendations until utilization difference between source and destination is” to 20% ensures that SDRS does not move virtual machines around where significant benefit is not realized  This prevents unnecessary Storage vMotion activity on the disk system, although this is offloaded from the host to the array, the I/O still may impact production performance for workloads on the same disk system.

6. Setting the “I/O Imbalance threshold (5 Aggressive / Conservative 1 ) to “4” (2nd most aggressive)  ensures that I/O imbalance should be addressed before significant imbalance is experienced by the end users. This level of “ aggressiveness” is acceptable as the Storage vMotion can be offloaded (via VAAI “XCOPY” primitive  and has almost zero impact on the host.  Setting this to “5” may result in minor I/O imbalances being corrected, at the cost of a Storage vMotion and as a result the impact of the more frequent Storage vMotion activity may negate the benefit of the I/O balancing.

7. Storage DRS will address I/O imbalance across the datastore cluster if the latency meets or exceeds the set value of 15ms (the default) and in the event of latency increasing during peak times to >=30ms , Storage I/O Control will ensure fair acess to the storage.

Alternatives

1. Use “No Automation (Manual Mode)”
2. Not use Storage DRS

Implications

1. When selecting datastores for the datastore cluster, having VASA enabled allows the “System Capability” column to be populated in the “New Datastore Cluster” wizard to ensure suitable datastores of similar performance, RAID type and features are grouped together. VASA is currently NOT supported by SVC, as such the datastore naming convension needs to accurately reflect the capabilities of the LUN/s to ensure suitable datastores are grouped together.

vmware_logo_ads

vCloud Suite 5.1 Upgrade Guide

I just came across an unofficial vCloud Suite 5.1 upgrade guide by Jad El-Zem which covers off the steps involved and a few gotchas to watch out for.

VMware Blogs – vCloud Suite 5.1 Solution Upgrade Guide