Example Architectural Decision – vSphere configuration for handling APD/PDL scenarios

Problem Statement

What is the best way to configure the vSphere environment to handle All Paths Down (APD) and Permanent Device Loss (PDL) situations where the environment uses Active/Active (IBM SVC) storage with FC connectivity via a dedicated highly available Storage Area Network (SAN) fabric?

Requirements

1. Ensure in the event of storage issues the impact to the vSphere environment is minimized.
2. Where possible have the environment automatically respond in the event of storage problems

Assumptions

1. vSphere 5.1 or later
2. The Storage Area Network (SAN)  fabric is highly available (>99.999% availability)
3. All storage is FC (block) based via an Active/Active Disk array (IBM SVC disk system)
4. All ESXi hosts have storage connectivity via multiple HBAs
5. All ESXi hosts are connected to two (2) physically separate FC switches
6. The Path Selection Plugin (PSP) being used is “VMW_PSP_RR” (Round Robin)

Constraints

1. None

Motivation

1. Minimize impact of APD and PDL situations

Architectural Decision

Configure the following advanced settings

Set “Misc.APDHandlingEnable” to 1 (0 is default which is Disabled)
Set “Misc.APDTimeout” to 20 (140 seconds is default)

Set “disk.terminateVMOnPDLDefault” to 1 (Enabled)
Set “das.maskCleanShutdownEnabled” to 1 (Enabled)

Justification

1. The storage array (IBM SVC) operates in an Active/Active manor where the Path Selection Plugin (PSP) is either “VMW_PSP_RR” (Round Robin), “VMW_PSP_MRU” (Most Recently Used) OR “VMW_PSP_FIXED_AP” (Note: Now included in VMW_PSP_FIXED in vSphere 5.1), in the event of one or more path failures, the PSP will handle this event and use a working path. Where an APD situation occurs in a highly available SAN fabric it is likely the issue is a catastrophic failure and it is ideal to terminate I/O as soon as possible. As such lowering the “Misc.APDTimeout” to 20 (minimum) allows for a short outage but does not allow the VM to continue attempting I/O where it cannot be committed to disk.

2. After 20 seconds, any I/O from the VMs will be “fast-failed” with a status of “No_Connect” to prevent “hostd” worker threads being exhausted and causing the “hostd” service to become hung thus increasing resiliency at the ESXi layer.

3. In the event not all hosts in the cluster are impacted by the PDL, HA can detect the PDL on one (or more) hosts and restart the virtual machines on one of the hosts in the cluster which do not have the PDL state on the datastore/s

  • 4. Having “disk.terminateVMOnPDLDefault” enabled , ensures VMs are shutdown in a PDL event
  • 5.

  • The “das.maskCleanShutdownEnabled” setting allows VMs shutdown as a result of a PDL to be automatically restarted by HA

5. Setting the Misc.APDTimeout to “20” does not impact the storage connectivity even in the event of a single SVC cluster node failing as all Storage is Active on all SVC cluster nodes. Note: Half the paths would be lost in the event of a failed SVC cluster node but this does not constitute an APD situation.

Alternatives

1. Leave “Misc.APDHandlingEnable” at 0 (default)
2. Leave “Misc.APDTimeout” at 140 (default) OR set a higher or lower value (20 Min / 99999 Max)
3. Set “das.maskCleanShutdownEnabled” to Disabled
4. Set “disk.terminateVMOnPDLDefault” to 0 (Disabled)
5. Various combinations of the above

Implications

1. After 20 seconds, any I/O from the VMs will be “fast-failed” with a status of “No_Connect”., in the unlikely event of an outage lasting >20 seconds manual intervention will be required.
2. In the event of APD situation, Virtual machines will not be restarted by HA even where other ESXi hosts are not impacted by the APD situation
3. Due to the nature of an APD situation, there is no clean way to recover. Once the issue is resolved at the SAN fabric or disk system layer, ESXi hosts may need to be rebooted.

Related Articles

1. Advanced Configuration options for VMware High Availability in vSphere 5.0 and 5.1 (2033250)

CloudXClogo

 

 

Example Architectural Decision – VMware HA – Percentage of Cluster resources reserved for HA

Problem Statement

The decision has been made to use “Percentage of cluster resources reserved for HA” admission control setting, and use Strict admission control to ensure the N+1 minimum redundancy level is maintained. However, as most virtual machines do not use  “Reservations” for CPU and/or Memory, the default reservation is only 32Mhz and 0MB+overhead for a virtual machine. In the event of a failure, this level of resources is unlikely to provide sufficient compute to operate production workloads. How can the environment be configured to ensure a minimum level of performance is guaranteed in the event of one or more host failures?

Requirements

1. All Clusters have a minimum requirement of N+1 redundancy
2. In the event of a host failure, a minimum level of performance must be guaranteed

Assumptions

1. vSphere 5.0 or later (Note: This is Significant as default reservation dropped from 256Mhz to 32Mhz, RAM remained at 0MB + overhead)

2. Percentage of Cluster resources reserved for HA is used and set to a value as per Example Architectural Decision – High Availability Admission Control

3. Strict admission control is enabled

4. Target over commitment Ratios are <=4:1 vCPU / Physical Cores | <=1.5 : 1 vRAM / Physical RAM

5. Physical CPU Core speed is >=2.0Ghz

6. Virtual machines sizes in the cluster will vary

7. A limited number of mission critical virtual machines may be set with reservations

8. Average VM size uses >2GB RAM

9. Clusters compute resources will be utilized at >=50%

Constraints

1. Ensuring all compute requirements are provided to Virtual machines during BAU

Motivation

1. Meet/Exceed availability requirements
2. Minimize complexity
3. Ensure the target availability and performance is maintained without significantly compromising  over commitment ratios

Architectural Decision

Ensure all clusters remain configured with the HA admission control setting use
“Enable – Do not power on virtual machines that violate availability constraints”

and

Use “Percentage of Cluster resources reserved for HA” for the admission control policy with the percentage value based on the following Architectural Decision – High Availability Admission Control

Configure the following HA Advanced Settings

1. “das.vmMemoryMinMB” with a value of “1024″
2. “das.vmCpuMinMHz” with a value of “512”

Justification

1. Enabling admission control is critical to ensure the required level of availability.
2. The “Percentage of cluster resources reserved for HA” setting allows a suitable percentage value of cluster resources to reserved depending on the size of each cluster to maintain N+1
3.The potentially inefficient slot size calculation used with “Host Failures cluster tolerates” does not suit clusters where virtual machines sizes vary and/or where some mission Critical VMs require reservations

  • 4.
  • Using advanced settings “das.vmCpuMinMHz” & “das.vmMemoryMinMB” allows a minimum level of performance (per VM) to be guaranteed in the event of one or more host failures
  • 5.
  • Advanced settings have been configured to ensure the target over commit ratios are still achieved while ensuring a minimum level of resources in a the event of a host failure
  • 6.
  • Maintains an acceptable minimum level of performance in the event of a host failure without requiring the administrative overhead of setting and maintaining “reservations” at the Virtual machine level
  • 7.
  • Where no reservations are used, and advanced settings not configured, the default reservation would be 32Mhz and 0MB+ memory overhead is used. This would likely result in degraded performance in the event a host failure occurs.

Alternatives

1. Use “Specify a fail over host” and have one or more hosts specified
2. “Host Failures cluster tolerates” and set it to appropriate value depending on hosts per cluster without using advanced settings
3.Use higher Percentage values
4. Use Higher / Lower values for “das.vmMemoryMinMB” and “das.vmCpuMinMHz”
5. Set Virtual machine level reservations on all VMs

Implications

1. The “das.vmCpuMinMHz” advanced setting applies on a per VM basis, not a per vCPU basis, so VMs with multiple vCPUs will still only be guarenteed 512Mhz in a HA event

2. This will reduce the number of virtual machines that can be powered on within the cluster (in order to enforce the HA requirements)

CloudXClogo

 

 

Example Architectural Decision – BC/DR Solution for vCloud Director

Problem Statement

What is the most suitable BC/DR solution for a vCloud director environment?

Requirements

1. Ensure the vCloud solution can tolerate a site failure in an automated manner
2. Ensure the vCloud solution meets/exceeds the RTO of 4hrs
3. Comply with all requirements of the Business Continuity Plan (BCP)
4. Solution must be a supported vSphere / vCloud Configuration
5. Ensure all features / functionality of the vCloud solution are available following a DR event

Assumptions

1. Datacenters are in an Active/Active configuration
2. Stretched Layer 2 network across both datacenters
3. Storage based replication between sites
4. vSphere 5.0 Enterprise Plus or later
5. VMware Site Recovery Manager 5.0 or later
6, vCloud Director 1.5 or later
7. There is no requirement for workloads proposed to be hosted in vCloud to be at one datacenter or another

Constraints

1. The hardware for the solution has already been chosen and purchased. 6 x 4 Way, 32 core Hosts w/ 512GB RAM and 4 x 10GB
2. The storage solution is already in place and does not support a Metro Storage Cluster (vMSC) configuration

Motivation

1. Meet/Exceed availability requirements
2. Minimize complexity

Architectural Decision

Use the vCloud DR solution as described in the “vCloud Director Infrastructure Resiliency Case Study” (By Duncan Epping @duncanyb and Chris Colotti @Ccolotti )

In Summary, Host the vSphere/vCloud Management virtual machines on an SRM protected cluster.

Use a dedicated cluster for vCloud compute resources.

Configure the vSphere cluster which is dedicated to providing compute resources to the vCloud environment (Provider virtual data center – PvDC) to have four (4) compute nodes  located at Datacenter A for production use and two (2) compute nodes located at Datacenter B (in ”Maintenance mode”) dedicated to DR.

Storage will not be stretched across sites; LUNs will be presented locally from “Datacenter A” shared storage to the “Datacenter A” based hosts. The “Datacenter A” storage will be replicated synchronously to “Datacenter B” and presented from “Datacenter B” shared storage to the two (2) “Datacenter B” based hosts. (No stretched Storage between sites)

In the event of a failure, SRM will recover the vSphere/vCloud Management virtual machines bringing back online the Cloud, then a script as the last part of the SRM recovery plan, Mounts the replicated storage to the ESXi hosts in “Datacenter B” and takes the two (2) hosts at “Datacenter B” out of maintenance mode. HA will then detect the virtual machines and power on them on.

Justification

1. Stretched Clusters are more suited to Disaster Avoidance than Disaster Recovery
2. Avoids complex and manual  intervention in the case of a disaster in the case of a stretched cluster solution
3. A Stretched cluster provides minimal control in the event of a Disaster where as in this case, HA simply restarts VMs once the storage is presented (automatically) and the hosts are taken out of Maintenance mode (also automated)
4. Having  two (2) ESXi hosts for the vCloud resource cluster setup in “Datacenter B” in “Maintenance Mode” and the storage mirrored as discussed  allows the virtual workloads to be recovered in an automated fashion as part of the VMware Site Recovery Manager solution.
5. Removes the management overhead of managing a strecthed cluster using features such as DRS affinity rules to keep VMs on the hosts on the same site as the storage
6. vSphere 5.1 backed resource clusters support >8 host clusters for “Fast provisioning”
7. Remove the dependency on the Metropolitan Area Data and Storage networks during BAU and the potential impact of the latency between sites on production workloads
8. Eliminates the chance of a “Split Brain” or a “Datacenter Partition” scenario where VM/s can be running at both sites without connectivity to each other
9. There is no specific requirement for non-disruptive mobility between sites
10. Latency between sites cannot be guaranteed to be <10ms end to end

Alternatives

1. Stretched Cluster between “Datacenter A” and “Datacenter B”
2. Two independent vCloud deployments with no automated DR
3. Have more/less hosts at the DR site in the same configuration

Implications

1. Two (2) ESXi hosts in the vCloud Cluster located in “Datacenter B” will remain unused as “Hot Standby” unless there is a declared site failure at “Datacenter A”
2. Requires two (2) vCenter servers , one (1) per Datacenter
3. There will be no non-disruptive mobility between sites (ie: vMotion)
4. SRM protection groups / plans need to be created/managed Note: This will be done as part of the Production cluster
5. In the event of a DR event, only half the compute resources will be available compared to production.
6. Depending on the latency between sites, storage performance may be reduced by the synchronous replication as the write will not be acknowledged to the VM at “Datacenter A” until committed to the storage at “Datacenter B”

CloudXClogo