Example Architectural Decision – Host Isolation Response for a Nutanix Environment

Problem Statement

What is the most suitable vSphere HA host isolation response when using Nutanix?

Assumptions

1. vSphere 5.0 or greater
2. Two (2) x 10Gb network interfaces are shared between Nutanix storage traffic and virtual machine traffic

Motivation

1. Minimize the chance of a false positive isolation response
2. Ensure that in the event the storage is unavailable, virtual machines are promptly shut down, enabling HA to recover the VMs in a timely manner (where other hosts are unaffected by the isolation) and preventing a “split brain” scenario
3. Ensure maximum availability

Architectural Decision

Turn off the default isolation address (das.usedefaultisolationaddress = false) and configure the isolation address specified below, which checks connectivity to the Nutanix (NDFS) cluster IP on the IP Storage VLAN.

Configure the following isolation address:

das.isolationaddress1 : NDFS Cluster IP Address

Configure Host Isolation Response to: Power Off

For the Nutanix Controller VMs, override the cluster setting and configure the Host Isolation Response to “Leave Powered On”.
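For those who prefer to script the change, below is a minimal pyVmomi (Python) sketch of the above decision. The vCenter address, credentials, cluster name, CVM name, NDFS cluster IP and the lookup helper are all placeholders for illustration only; the same settings can of course be applied through the vSphere Client.

```python
# A minimal pyVmomi sketch of the above decision, provided as an illustration only.
# vCenter address, credentials, cluster/CVM names and the NDFS cluster IP are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

def find_by_name(content, vimtype, name):
    """Return the first inventory object of the given type with a matching name."""
    view = content.viewManager.CreateContainerView(content.rootFolder, [vimtype], True)
    try:
        return next(obj for obj in view.view if obj.name == name)
    finally:
        view.Destroy()

si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="password", sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()
cluster = find_by_name(content, vim.ClusterComputeResource, "NTNX-Cluster")
cvm = find_by_name(content, vim.VirtualMachine, "NTNX-CVM-01")

das_config = vim.cluster.DasConfigInfo(
    option=[
        # Turn off the default isolation address (the management gateway)...
        vim.option.OptionValue(key="das.usedefaultisolationaddress", value="false"),
        # ...and use the NDFS cluster IP on the IP Storage VLAN instead (placeholder IP).
        vim.option.OptionValue(key="das.isolationaddress1", value="192.168.1.100"),
    ],
    # Cluster default isolation response: Power Off
    defaultVmSettings=vim.cluster.DasVmSettings(isolationResponse="powerOff"),
)

# Override the Nutanix CVM so it is left powered on during an isolation event.
cvm_override = vim.cluster.DasVmConfigSpec(
    operation="add",
    info=vim.cluster.DasVmConfigInfo(
        key=cvm,
        dasSettings=vim.cluster.DasVmSettings(isolationResponse="none"),
    ),
)

spec = vim.cluster.ConfigSpecEx(dasConfig=das_config, dasVmConfigSpec=[cvm_override])
cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)
Disconnect(si)
```

Repeat the per-VM override for each CVM in the cluster (or loop over them); as noted under Implications, this is a one-time task.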

Justification

1. ESXi Management traffic, virtual machine traffic and inter-node Nutanix storage traffic all run over the 2 x 10Gb connections. Using the ESXi management gateway (the default isolation address) to check for isolation is not suitable, as the management network can be offline without impacting the IP storage or data networks. This situation could lead to false positive isolation responses.
2. The isolation address chosen tests IP storage connectivity over the converged 10Gb network. In the event this is unavailable, there is no point testing further connectivity, as virtual machines cannot function without their storage.
3. In the event the Nutanix cluster IP address cannot be reached via ICMP, the node will not be able to function properly. As such, triggering the isolation response and powering off the VMs based on this criterion is logical, as the VMs will not be able to function under these conditions.
4. In the event the NDFS cluster IP address does not respond to ICMP from the management interface, it is likely there has been an isolation event OR a catastrophic failure in the environment, either in the network or the storage controllers themselves, in which case the safest option is to power off the VMs.
5. In the event the isolation response is triggered and the isolation does not impact all hosts within the cluster, the VMs can be restarted by HA onto a surviving host and resume functioning.
6. Using the Nutanix Controller VM (CVM) internal IP address (192.168.5.2) as the isolation address is not suitable, as this address exists on every ESXi host. Because the CVM is local to the host, the host will always be able to reach this address, even when the network is offline, so it is not a good candidate for isolation detection.
7. The Nutanix Controller VM accesses local storage and can continue to run locally even during an isolation event. Once the isolation event is over, the CVM will regain connectivity to the other CVMs in the Nutanix cluster.
8. Shutting down the CVM would only increase the recovery time once the isolation event is over and provides no added benefit.

Implications

1. In the event the host cannot reach any of the isolation addresses, virtual machines will be powered off.
2. Initial cluster setup requires the vSphere administrator to override the cluster setting for each Controller VM. Note: This is a one-time task (set and forget)

Alternatives

1. Set Host isolation response to “Leave Powered On”
2. Do not use Datastore heartbeating
3. Use the default isolation address
4. Leave the CVM on the default cluster setting and “Shutdown” on isolation

Related Articles

1. VMware Host Isolation Response in a Nutanix Environment #NoSAN

2. Storage DRS and Nutanix – To use, or not to use, that is the question?

3. VMware HA and IP Storage

Example VMware vNetworking Design w/ 2 x 10GB NICs (IP based or FC/FCoE Storage)

I have had a large response to my earlier example vNetworking design with 4 x 10Gb NICs, and I have been asked, “What if I only have 2 x 10Gb NICs?”, so below is an example of an environment which was limited to just two (2) x 10Gb NICs and used IP Storage.

If your environment uses FC/FCoE storage, the below still applies and the IP storage components can simply be ignored.

Requirements

1. Provide high performance and redundant access to the IP Storage (if required)
2. Ensure ESXi hosts could be evacuated in a timely manner for maintenance
3. Prevent significant impact to storage performance by vMotion / Fault Tolerance and virtual machine traffic
4. Ensure high availability for all network traffic

Constraints

1. Two (2) x 10Gb NICs

Solution

Use one dvSwitch to support all VMkernel and virtual machine network traffic, and use “Route Based on Physical NIC Load” (commonly referred to as “Load Based Teaming”, or LBT).

Use Network I/O Control (NIOC) to ensure that in the event of contention, all traffic types get appropriate network resources.

Configure the following Network Share Values (a quick bandwidth check follows the list below):

IP Storage traffic: 100
ESXi Management: 25
vMotion: 25
Fault Tolerance: 25
Virtual Machine traffic: 50
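As a rough sanity check of what these share values mean, the short Python sketch below works out the minimum bandwidth each traffic type would receive per 10Gb uplink if every traffic type were contending at once. The figures are illustrative only; with less contention each type's minimum rises, and with no contention any traffic type can burst towards line rate.

```python
# Back-of-the-envelope NIOC maths for the share values above, assuming a single
# 10Gb (10,000 Mb/s) uplink and that every traffic type is actively contending.
UPLINK_MBPS = 10_000

shares = {
    "IP Storage traffic": 100,
    "ESXi Management": 25,
    "vMotion": 25,
    "Fault Tolerance": 25,
    "Virtual Machine traffic": 50,
}

total_shares = sum(shares.values())  # 225
for traffic, share in shares.items():
    minimum = UPLINK_MBPS * share / total_shares
    print(f"{traffic:<24} {share:>3} shares -> ~{minimum:,.0f} Mb/s minimum under full contention")
```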

Configure two (2) VMkernel ports for IP Storage and set each on a different VLAN and subnet.

Configure VMkernel ports for vMotion (or Multi-NIC vMotion), ESXi Management and Fault Tolerance, and set them to active on both 10Gb interfaces (the default configuration).

All dvPortGroups for Virtual machine traffic (in this example VLANs 6 through 8) will be active on both interfaces.

The above utilizes LBT to load balance network traffic, dynamically moving workloads between the two 10Gb NICs once one or both network adapters reach >=75% utilization.
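For completeness, here is a minimal pyVmomi sketch of applying “Route Based on Physical NIC Load” to an existing dvPortGroup. The dvPortGroup object is assumed to have been located already (for example with a lookup helper like the one in the HA example earlier), and in most environments this policy is simply set in the vSphere Client when the dvPortGroups are created.

```python
from pyVmomi import vim

def enable_load_based_teaming(dv_portgroup):
    """Reconfigure an existing dvPortGroup to use Load Based Teaming (LBT)."""
    teaming = vim.dvs.VmwareDistributedVirtualSwitch.UplinkPortTeamingPolicy(
        # "loadbalance_loadbased" is the API name for Route Based on Physical NIC Load
        policy=vim.StringPolicy(value="loadbalance_loadbased")
    )
    port_config = vim.dvs.VmwareDistributedVirtualSwitch.VmwarePortConfigPolicy(
        uplinkTeamingPolicy=teaming
    )
    spec = vim.dvs.DistributedVirtualPortgroup.ConfigSpec(
        configVersion=dv_portgroup.config.configVersion,
        defaultPortConfig=port_config,
    )
    return dv_portgroup.ReconfigureDVPortgroup_Task(spec)
```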

[Figure: Example vNetworking design with 2 x 10Gb NICs]

Conclusion

Even when your ESXi hosts have only two (2) x 10Gb interfaces, VMware provides enterprise-grade features to ensure all traffic types (including IP Storage) get access to sufficient bandwidth to continue serving production workloads until contention subsides.

This design ensures that in the event a host needs to be evacuated, even during production hours, the evacuation will complete in the fastest possible time with minimal or no impact to production. The faster your vMotion activity completes, the sooner DRS can get your cluster running as smoothly as possible, and if you are patching, the sooner your maintenance can be completed and the patched hosts returned to the cluster to serve your VMs.

Related Posts

1. Example Architectural Decision – Network I/O Control for ESXi Host using IP Storage (4 x 10 GB NICs)
2. Network I/O Control Shares/Limits for ESXi Host using IP Storage

Example VMware vNetworking Design for IP Storage

On a regular basis, I am being asked how to configure vNetworking to support environments using IP Storage (NFS / iSCSI).

The short answer is, as always, it depends on your requirements, but the below is an example of a solution I designed in the past.

Requirements

1. Provide high performance and redundant access to the IP Storage (in this case it was NFS)
2. Ensure ESXi hosts could be evacuated in a timely manner for maintenance
3. Prevent significant impact to storage performance by vMotion / Fault Tolerance and virtual machine traffic
4. Ensure high availability for ESXi Management / VMKernel and Virtual Machine network traffic

Constraints

1. Four (4) x 10GB NICs
2. Six (6) x 1Gb NICs (Two onboard NICs and a quad port NIC)

Note: In my opinion the above NICs are hardly “constraining”, but they are still important to mention.

Solution

Use a standard vSwitch (vSwitch0) for the ESXi Management VMkernel port. Configure vmNIC0 (Onboard NIC 1) and vmNIC2 (Quad Port NIC – Port 1) as its uplinks.

ESXi Management will be active on both vmNIC0 and vmNIC2, although it will only use one path at any given time.

Use a Distributed Virtual Switch (dvSwitch-admin) for IP Storage, vMotion and Fault Tolerance.

Configure vmNIC6 (10Gb Virtual Fabric Adapter NIC 1 Port 1) and vmNIC9 (10Gb Virtual Fabric Adapter NIC 2 Port 2)

Configure Network I/O Control with NFS traffic having a share value of 100, and vMotion and FT each having a share value of 25.

Each VMkernel port for NFS will be active on one NIC and standby on the other.

vMotion will be active on vmNIC6 and standby on vmNIC9, and Fault Tolerance vice versa.

[Figure: Example dvSwitch-admin configuration]

Use a Distributed Virtual Switch (dvSwitch-data) for Virtual Machine traffic

Configure vmNIC7 (10Gb Virtual Fabric Adapter NIC 1 Port 2) and vmNIC8 (10Gb Virtual Fabric Adapter NIC 2 Port 1)
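To summarise the layout described above in one place, here it is expressed as a simple Python dictionary. This is purely a readability aid: the portgroup names are my own placeholders, and the teaming for dvSwitch-data is assumed to leave both uplinks active.

```python
# Summary of the example host networking layout; portgroup names are placeholders.
host_networking = {
    "vSwitch0 (standard)": {
        "uplinks": ["vmNIC0 (Onboard NIC 1)", "vmNIC2 (Quad Port NIC Port 1)"],
        # Both uplinks active, but only one path used at any given time.
        "ESXi Management": {"vmNIC0": "active", "vmNIC2": "active"},
    },
    "dvSwitch-admin": {
        "uplinks": ["vmNIC6 (VFA 1 Port 1)", "vmNIC9 (VFA 2 Port 2)"],
        "NIOC shares": {"NFS": 100, "vMotion": 25, "Fault Tolerance": 25},
        "NFS VMkernel 1": {"vmNIC6": "active", "vmNIC9": "standby"},
        "NFS VMkernel 2": {"vmNIC9": "active", "vmNIC6": "standby"},
        "vMotion": {"vmNIC6": "active", "vmNIC9": "standby"},
        "Fault Tolerance": {"vmNIC9": "active", "vmNIC6": "standby"},
    },
    "dvSwitch-data": {
        "uplinks": ["vmNIC7 (VFA 1 Port 2)", "vmNIC8 (VFA 2 Port 1)"],
        "Virtual Machine traffic": {"vmNIC7": "active", "vmNIC8": "active"},
    },
}
```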

Conclusion

While there are many ways to configure vNetworking, and there may be more efficient ways to achieve the requirements set out in this example, I believe the above configuration achieves all the customer requirements.

For example, it provides high performance and redundant access to the IP Storage by using two (2) VMkernel ports, each active on one 10Gb NIC.

IP Storage will not be significantly impacted during periods of contention, as Network I/O Control will ensure the IP Storage traffic receives ~66% of the available bandwidth (its 100 shares out of a total of 150).

ESXi hosts will be able to be evacuated in a timely manner for maintenance as:

1. vMotion is active on a 10Gb NIC, thus supporting the maximum of 8 concurrent vMotions
2. In the event of contention, worst case scenario vMotion will receive just short of 2Gb/sec of bandwidth (~1,750Mb/sec), as shown in the sketch below
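As a rough check of those contention figures, the sketch below applies the same share arithmetic used earlier, assuming a single 10Gb uplink and that NFS, vMotion and FT are all contending at the same time; the exact minimums in practice depend on which traffic types are actively pushing traffic.

```python
# Contention maths for dvSwitch-admin: NFS 100, vMotion 25, FT 25 shares on a 10Gb uplink.
UPLINK_MBPS = 10_000
shares = {"NFS": 100, "vMotion": 25, "Fault Tolerance": 25}
total_shares = sum(shares.values())  # 150

for traffic, share in shares.items():
    pct = 100 * share / total_shares
    minimum = UPLINK_MBPS * share / total_shares
    print(f"{traffic:<16} ~{pct:.0f}% of the uplink, ~{minimum:,.0f} Mb/s minimum")
# NFS works out to ~67% (~6,667 Mb/s); vMotion's minimum rises towards 2,000 Mb/s
# when Fault Tolerance traffic is not actively contending.
```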

High availability is ensured as each vSwitch and dvSwitch has two (2) connections from physically different NICs, and these connect to physically separate switches.

Hopefully you have found this example helpful. For an example Architectural Decision, see Example Architectural Decision – Network I/O Control for ESXi Host using IP Storage.