Virtual machines lose network connectivity after HA failover (ESXi 5.5)

If you’re running an ESXi 5.5 build earlier than Update 2 (Release 2068190), either because you have not upgraded from the GA release or because you have avoided upgrading due to the NFS issue in Update 1, and your VMs are connected to a vSphere Distributed Switch (VDS), then read on:

A problem has been discovered with dvPortGroups configured with Static Port Binding, which is the default and recommended port binding setting (see VMware KB: 1022312). After an HA event, affected VMs cannot connect to the network and start up with their vmNIC in a “Disconnected” state.

You may receive the following error when trying to reconnect the vmNIC.

[Screenshot: DVSfailure2, the error shown when reconnecting the vmNIC]
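The conditions above can be sketched as a quick check. This is a minimal sketch only: the `HostInfo` record and its field names are hypothetical (in practice the values would come from your own inventory), and the only number taken from this post is the Update 2 build, 2068190.

```python
from dataclasses import dataclass

# Build number of ESXi 5.5 Update 2, as quoted in this post.
ESXI_55_U2_BUILD = 2068190


@dataclass
class HostInfo:
    """Hypothetical inventory record, not a VMware API object."""
    version: str       # e.g. "5.5.0"
    build: int         # ESXi build number
    uses_vds: bool     # True if VM networking is on a VDS
    port_binding: str  # "static", "dynamic" or "ephemeral"


def is_exposed(host: HostInfo) -> bool:
    """Return True if the host matches every condition described above."""
    return (
        host.version.startswith("5.5")
        and host.build < ESXI_55_U2_BUILD
        and host.uses_vds
        and host.port_binding == "static"
    )
```

A host on build 2068190 or later, or one using a Standard vSwitch or an Ephemeral Binding dvPortGroup, returns `False` here.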

If you are having this problem, you can connect the VMs to a dvPortGroup which uses Ephemeral Binding to get the environment online. Do this first for VMs such as Domain Controllers and vCenter (and its dependencies); once these VMs are online, all other VMs will be able to connect to the network normally.

Once everything is back online, I recommend you connect all VMs back to their original dvPortGroup/s with Static Binding.
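The recovery order described above can be written down as a simple plan builder. This is an illustrative sketch, not a script that talks to vCenter: the role labels and action names are made up for the example, and the actual moves would still be done in the vSphere client or with your tooling of choice.

```python
# Roles treated as infrastructure; these labels are illustrative, not a VMware API.
INFRA_ROLES = {"domain-controller", "vcenter", "vcenter-dependency"}


def recovery_plan(vms):
    """Build an ordered list of (action, vm_name) steps.

    `vms` is a list of (vm_name, role) tuples.
    """
    infra = [name for name, role in vms if role in INFRA_ROLES]
    others = [name for name, role in vms if role not in INFRA_ROLES]

    plan = []
    # 1. Bring infrastructure VMs online via the Ephemeral Binding dvPortGroup.
    plan += [("connect-to-ephemeral", name) for name in infra]
    # 2. With vCenter back, the remaining VMs can connect normally.
    plan += [("reconnect-on-original", name) for name in others]
    # 3. Finally, return the infrastructure VMs to their Static Binding dvPortGroup.
    plan += [("move-back-to-static", name) for name in infra]
    return plan


if __name__ == "__main__":
    inventory = [("app01", "app"), ("dc01", "domain-controller"), ("vc01", "vcenter")]
    for action, name in recovery_plan(inventory):
        print(f"{action}: {name}")
```

The point of the ordering is that Domain Controllers and vCenter come up first, so everything that depends on them can then connect normally.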

If you don’t have a dvPortGroup with Ephemeral Binding, create a Standard vSwitch and move a single physical NIC (pNIC) from the dvSwitch to it, then follow the same process. Once the VMs are online, migrate them back to the dvPortGroup and return the pNIC to the dvSwitch.

To future-proof the environment, you may choose to create a dvPortGroup with Ephemeral Binding for the infrastructure VMs, or simply keep a dvPortGroup with Ephemeral Binding ready to use just in case.

The good news is that VMware has already resolved this issue in ESXi 5.5 Update 2 (Release 2068190), so I recommend bypassing Update 1 (due to the NFS bug) and going straight to Update 2, which avoids the issue altogether.

Fight the FUD: vCenter on VDS on Nutanix NFS Datastore – Not a problem!

I saw this tweet (below) and was inspired to write this post, as it appears there is still a clear misunderstanding of how the VMware vSphere Distributed Switch (VDS) functions when vCenter is down.

[Image: screenshot of the tweet]

My interpretation was the tweet was suggesting/implying the following:

1. If vCenter (VC) is on a VDS there is a problem in the event of an outage

2. Having vCenter (VC) running on an NFS datastore is a problem

3. Nutanix environments have problems with VDS deployments

4. In the event of an outage where vCenter (VC) is on a VDS and the underlying storage is presented via NFS by Nutanix, the situation is somehow worse than if the storage was presented by another storage vendor.

Long story short, none of the above are problems and the author of the tweet is simply mistaken.

I highly recommend watching this recording of a VMworld session by @chriswahl (VCDX#104) & @thejasonnash (VCDX#49) which covers Distributed Switches in depth.

NET2745 – vSphere Distributed Switch: Technical Deep Dive

Here is a video showing how a Nutanix environment recovers with vCenter offline and everything, including the Nutanix CVMs, connected to a VDS.

In the video, the Nutanix Controller VM is using a dvPortGroup with Ephemeral Binding; however, Static Binding is also fully supported.

So we don’t need to imagine an outage: the above shows the process from start to finish, and it’s only a few minutes until the environment is fully operational!

No FUD!


Related Articles:

1. Example Architectural Decision – Port Binding Setting for a dvPortGroup
2. Distributed vSwitches and vCenter outage, what’s the deal? – @duncanyb (VCDX #007)

Example Architectural Decision – Port Binding Setting for a dvPortGroup

Problem Statement

In a VMware vSphere environment using vSphere Distributed Switches (VDS) where all VMs, including vCenter, are hosted on the VDS, what is the most suitable Port Binding setting for dvPortGroups to ensure maximum performance and availability?

Assumptions

1. Enterprise Plus Licensing
2. vCenter is hosted on the VDS

Requirements

1. The environment must have central management of vNetworking
2. All VMs must be able to be powered on in the event of a vCenter outage
3. Network connectivity must not be impacted if vCenter is down.

Motivation

1. Reduce complexity where possible.
2. Maximize the availability of the infrastructure

Architectural Decision

Use the default dvPortGroup Port Binding setting of “Static Binding”

Justification

1. A dvPortGroup port is assigned to a VM and reserved when a VM is connected to the dvPortGroup. This ensures connectivity at all times including when vCenter is down.
2. Using “Static Binding” ensures the vCenter VM can be powered on and connected to the dvPortGroup even after a failure/outage.
3. “Static Binding” is the default setting and there is no reason to modify this setting.

Implications

1. The number of VMs supported on the dvPortGroup / VDS is limited to the number of ports on the VDS (no overcommitment of ports is possible).
2. Number of ports configured on a dvPortGroup should be greater than the maximum number of VMs required to be supported.
3. Port Allocation should be left at the default of “Elastic” to ensure the number of ports is automatically expanded if/when required.
4. New virtual machines cannot be powered on and connected to a dvPortGroup (VDS) when vCenter is down.
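To make the justification and implications above concrete, here is a toy model of the two binding types discussed in this post. It is a deliberately simplified sketch, not how the VDS is implemented: the class and method names are hypothetical, Dynamic Binding is left out, and the “Elastic” behaviour is reduced to growing the port count by one when the group is full.

```python
class DvPortGroup:
    """Toy model of a dvPortGroup with Static or Ephemeral Binding."""

    def __init__(self, binding, num_ports, elastic=True):
        if binding not in ("static", "ephemeral"):
            raise ValueError("this sketch only models static and ephemeral binding")
        self.binding = binding
        self.num_ports = num_ports
        self.elastic = elastic
        self.reserved = set()  # VMs that already hold a statically bound port

    def connect(self, vm, vcenter_up):
        """Return True if `vm` can be connected to this port group."""
        if self.binding == "ephemeral":
            # Ports are created on demand, so vCenter is not needed.
            return True
        if vm in self.reserved:
            # The port was reserved earlier, so it survives a vCenter outage.
            return True
        if not vcenter_up:
            # A new static port assignment needs vCenter.
            return False
        if len(self.reserved) >= self.num_ports:
            if not self.elastic:
                return False
            self.num_ports += 1  # simplified "Elastic" expansion
        self.reserved.add(vm)
        return True


if __name__ == "__main__":
    static_pg = DvPortGroup("static", num_ports=2)
    print(static_pg.connect("vcenter", vcenter_up=True))   # True: port reserved
    print(static_pg.connect("vcenter", vcenter_up=False))  # True: already reserved
    print(static_pg.connect("new-vm", vcenter_up=False))   # False: needs vCenter
```

An existing VM keeps its reserved port through a vCenter outage, while a brand new VM has to wait for vCenter or use an Ephemeral Binding dvPortGroup.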

Alternatives

1. Set dvPortGroup Port Binding to “Dynamic binding”
2. Set dvPortGroup Port Binding to “Ephemeral binding”

Related Articles

1. Distributed vSwitches and vCenter outage, what’s the deal? – @duncanyb (VCDX #007)

2. Choosing a port binding type in ESX/ESXi (1022312)