Example Architectural Decision – vSphere configuration for handling APD/PDL scenarios

Problem Statement

What is the best way to configure the vSphere environment to handle All Paths Down (APD) and Permanent Device Loss (PDL) situations where the environment uses Active/Active (IBM SVC) storage with FC connectivity via a dedicated highly available Storage Area Network (SAN) fabric?

Requirements

1. Ensure that the impact to the vSphere environment is minimized in the event of storage issues.
2. Where possible, have the environment respond automatically in the event of storage problems.

Assumptions

1. vSphere 5.1 or later
2. The Storage Area Network (SAN) fabric is highly available (>99.999% availability)
3. All storage is FC (block) based via an Active/Active Disk array (IBM SVC disk system)
4. All ESXi hosts have storage connectivity via multiple HBAs
5. All ESXi hosts are connected to two (2) physically separate FC switches
6. The Path Selection Plugin (PSP) being used is “VMW_PSP_RR” (Round Robin)

Constraints

1. None

Motivation

1. Minimize impact of APD and PDL situations

Architectural Decision

Configure the following advanced settings

Set “Misc.APDHandlingEnable” to 1 (0 is default which is Disabled)
Set “Misc.APDTimeout” to 20 (140 seconds is default)

Set “disk.terminateVMOnPDLDefault” to 1 (Enabled)
Set “das.maskCleanShutdownEnabled” to 1 (Enabled)
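
The four values above can be treated as a desired state and compared against whatever a host or cluster currently reports. The following is a minimal sketch only: the function name `find_drift` and the `reported` dictionary are hypothetical, and the code does not talk to vSphere. How and where each value is applied can depend on the vSphere version and is not modeled here.

```python
# Desired values copied from the decision above.
DESIRED = {
    "Misc.APDHandlingEnable": 1,
    "Misc.APDTimeout": 20,
    "disk.terminateVMOnPDLDefault": 1,
    "das.maskCleanShutdownEnabled": 1,
}


def find_drift(reported):
    """Return {setting: (reported_value, desired_value)} for every mismatch.

    `reported` is a hypothetical dict of values gathered by whatever
    tooling is in use. A setting missing from it is reported as None.
    """
    drift = {}
    for name, want in DESIRED.items():
        have = reported.get(name)
        if have != want:
            drift[name] = (have, want)
    return drift


# Hypothetical host still using the default values quoted in this document.
example_host = {"Misc.APDHandlingEnable": 0, "Misc.APDTimeout": 140}
for name, (have, want) in sorted(find_drift(example_host).items()):
    print(f"{name}: reported={have} desired={want}")
```

An empty result from `find_drift` means the reported values match the decision.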

Justification

1. The storage array (IBM SVC) operates in an Active/Active manner, where the Path Selection Plugin (PSP) is either “VMW_PSP_RR” (Round Robin), “VMW_PSP_MRU” (Most Recently Used) OR “VMW_PSP_FIXED_AP” (Note: Now included in VMW_PSP_FIXED in vSphere 5.1). In the event of one or more path failures, the PSP will handle the event and use a working path. Where an APD situation occurs in a highly available SAN fabric, the issue is likely to be a catastrophic failure, and it is ideal to terminate I/O as soon as possible. Lowering “Misc.APDTimeout” to 20 seconds (the minimum) therefore allows for a short outage but does not allow the VM to continue attempting I/O that cannot be committed to disk.

2. After 20 seconds, any I/O from the VMs will be “fast-failed” with a status of “No_Connect”. This prevents the “hostd” worker threads from being exhausted and the “hostd” service from hanging, which increases resiliency at the ESXi layer.

3. Where not all hosts in the cluster are impacted by the PDL, HA can detect the PDL on one (or more) hosts and restart the virtual machines on a host in the cluster that does not have the datastore(s) in a PDL state.

4. Having “disk.terminateVMOnPDLDefault” enabled ensures VMs are shut down in a PDL event.

5. The “das.maskCleanShutdownEnabled” setting allows VMs shut down as a result of a PDL to be automatically restarted by HA.

6. Setting “Misc.APDTimeout” to 20 does not impact storage connectivity, even in the event of a single SVC cluster node failing, as all storage is active on all SVC cluster nodes. Note: Half the paths would be lost in the event of a failed SVC cluster node, but this does not constitute an APD situation.
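
The difference between losing some paths and losing all paths, and the effect of the 20 second timer, can be sketched as follows. This is a simplified model and not ESXi code: the function names, the state labels, the path counts in the example, and the choice to treat exactly 20 seconds as already timed out are all assumptions made for illustration.

```python
APD_TIMEOUT_SECONDS = 20  # the Misc.APDTimeout value chosen in this decision


def path_state(working_paths, total_paths):
    """Classify a device by how many of its paths still work."""
    if total_paths <= 0:
        raise ValueError("total_paths must be positive")
    if not 0 <= working_paths <= total_paths:
        raise ValueError("working_paths must be between 0 and total_paths")
    if working_paths == 0:
        return "APD"       # every path is down
    if working_paths < total_paths:
        return "DEGRADED"  # some paths lost; the PSP uses the remaining ones
    return "HEALTHY"


def io_handling(working_paths, total_paths, seconds_without_paths=0):
    """Describe what happens to VM I/O under this simplified model."""
    if path_state(working_paths, total_paths) != "APD":
        return "I/O sent on a working path"
    if seconds_without_paths < APD_TIMEOUT_SECONDS:
        return "I/O retried while the APD timer runs"
    return "I/O fast-failed with No_Connect"


# Hypothetical device with 8 paths: an SVC node failure removes half of them.
print(path_state(4, 8))
print(io_handling(0, 8, seconds_without_paths=5))
print(io_handling(0, 8, seconds_without_paths=25))
```

In this model a failed SVC cluster node only ever produces the degraded state, which is why the shorter timeout does not come into play for that failure.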

Alternatives

1. Leave “Misc.APDHandlingEnable” at 0 (default)
2. Leave “Misc.APDTimeout” at 140 (default) OR set a higher or lower value (minimum 20, maximum 99999)
3. Set “das.maskCleanShutdownEnabled” to Disabled
4. Set “disk.terminateVMOnPDLDefault” to 0 (Disabled)
5. Various combinations of the above

Implications

1. After 20 seconds, any I/O from the VMs will be “fast-failed” with a status of “No_Connect”. In the unlikely event of an outage lasting longer than 20 seconds, manual intervention will be required.
2. In the event of an APD situation, virtual machines will not be restarted by HA, even where other ESXi hosts are not impacted by the APD situation.
3. Due to the nature of an APD situation, there is no clean way to recover. Once the issue is resolved at the SAN fabric or disk system layer, ESXi hosts may need to be rebooted.
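
The expected behavior under this decision can be summarized as a small lookup, which may be useful in a runbook. This is a sketch of the outcomes described in this document only: the `Response` type, its field names, and `expected_response` are hypothetical, and the PDL restart assumes at least one host in the cluster is not affected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    vm_action: str
    ha_restart: bool           # does HA restart the VM on another host?
    manual_intervention: bool  # is operator action expected?


# Outcomes as described in the Justification and Implications sections.
RESPONSES = {
    "PDL": Response("VM shut down on the affected host", True, False),
    "APD": Response("VM I/O fast-failed after Misc.APDTimeout", False, True),
}


def expected_response(event):
    """Look up the expected outcome for 'APD' or 'PDL' (case-insensitive)."""
    try:
        return RESPONSES[event.upper()]
    except KeyError:
        raise ValueError(f"unknown storage event: {event!r}") from None


for event in ("PDL", "APD"):
    r = expected_response(event)
    print(f"{event}: {r.vm_action}; HA restart={r.ha_restart}; "
          f"manual intervention={r.manual_intervention}")
```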

Related Articles

1. Advanced Configuration options for VMware High Availability in vSphere 5.0 and 5.1 (2033250)
