How to configure Network I/O Control (NIOC) for Nutanix (or any IP Storage)

This video shows how to configure Network I/O Control (NIOC) as per Nutanix best practices; however, this configuration is also applicable to any IP-based storage.

For more information see the Nutanix vNetworking Best Practices Guide.
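As a rough illustration of the NIOC settings discussed in the video, the Python sketch below uses the pyVmomi SDK to enable NIOC on a vSphere Distributed Switch and assign shares to the system traffic pools. The vCenter details, the switch name "dvSwitch0" and the share values are assumptions for this example only; use the values from the Best Practices Guide in a real deployment.

    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim
    import ssl

    # Placeholder connection details for a lab environment.
    si = SmartConnect(host="vcenter.lab.local",
                      user="administrator@vsphere.local",
                      pwd="VMware1!",
                      sslContext=ssl._create_unverified_context())
    content = si.RetrieveContent()

    # Locate the Distributed Switch (assumes one named "dvSwitch0" exists).
    dvs = None
    for dc in content.rootFolder.childEntity:
        if isinstance(dc, vim.Datacenter):
            for net in dc.networkFolder.childEntity:
                if isinstance(net, vim.DistributedVirtualSwitch) and net.name == "dvSwitch0":
                    dvs = net

    # Example share values only: give IP storage (NFS) the highest share so it is
    # never starved, with vMotion and VM traffic lower.
    shares = {"nfs": 100, "virtualMachine": 50, "vmotion": 25, "management": 25}

    dvs.EnableNetworkResourceManagement(enable=True)  # turn NIOC on for the vDS

    specs = []
    for pool in dvs.networkResourcePool:
        if pool.key in shares:
            alloc = vim.DVSNetworkResourcePoolAllocationInfo(
                shares=vim.SharesInfo(level="custom", shares=shares[pool.key]),
                limit=-1)  # -1 = no limit; shares only take effect under contention
            specs.append(vim.DVSNetworkResourcePoolConfigSpec(
                key=pool.key,
                configVersion=pool.configVersion,
                allocationInfo=alloc))

    dvs.UpdateNetworkResourcePool(specs)
    Disconnect(si)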

Related Articles:

1. Network I/O Control Shares/Limits for ESXi Host using IP Storage

2. Network I/O Control for ESXi Host using IP Storage (4x10Gb NICs)

3. Example VMware vNetworking Design for IP Storage

Example Architectural Decision – vMotion configuration for Cisco UCS

Problem Statement

In an environment where a customer has pre-purchased Cisco UCS to replace end-of-life equipment, what is the most suitable way to configure vMotion to make the most efficient use of the infrastructure?

Assumptions

1. vSphere 5.1 or greater
2. Two x 10Gb Network interfaces per UCS Blade (Cisco Palo Adapters)
3. Core & Edge Network topology is in place using Cisco Nexus
4. Cisco Fabric Interconnects are in use

Motivation

1. Optimize performance for vMotion without impacting other traffic
2. Reduce complexity where possible
3. Minimize network traffic across the Nexus core

Architectural Decision

Two (2) vNICs will be presented from the Cisco Fabric Interconnect to each blade (ESXi host), which will appear to the ESXi host as vmNIC0 and vmNIC1.

vNIC0 will be connected to “Fabric A” and vNIC1 will be connected to “Fabric B”.

The vMotion VMkernel (VMK) port for each ESXi host will be configured on a vSwitch (or Distributed vSwitch) with two (2) 10Gb network adapters, with vmNIC0 as “Active” and vmNIC1 as “Standby”.

Fabric failover will not be enabled on the Fabric Interconnect.

vmNIC Failback at the vSphere layer will be disabled.
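To illustrate how this could look on each host, the Python sketch below (using the pyVmomi SDK) applies an explicit failover order of vmnic0 Active / vmnic1 Standby with failback disabled to a standard vSwitch port group named "vMotion". The host object, port group name and vmnic names are assumptions for this example rather than part of the decision itself.

    from pyVmomi import vim

    def set_vmotion_teaming(host):
        # Assumes a standard vSwitch port group named "vMotion" already exists on the host.
        net_sys = host.configManager.networkSystem

        # Explicit failover order: vmnic0 active (Fabric A), vmnic1 standby (Fabric B).
        order = vim.host.NetworkPolicy.NicOrderPolicy(activeNic=["vmnic0"],
                                                      standbyNic=["vmnic1"])
        teaming = vim.host.NetworkPolicy.NicTeamingPolicy(
            policy="failover_explicit",  # explicit failover order, no load balancing
            nicOrder=order,
            rollingOrder=True,           # rollingOrder=True equates to Failback: No
            notifySwitches=True)

        # Re-apply the existing "vMotion" port group spec with the new teaming policy.
        for pg in net_sys.networkConfig.portgroup:
            if pg.spec.name == "vMotion":
                spec = pg.spec
                spec.policy.nicTeaming = teaming
                net_sys.UpdatePortGroup("vMotion", spec)

On a Distributed vSwitch, the equivalent Active/Standby and Failback settings would be applied to the distributed port group's teaming and failover policy instead.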

Justification

1. Under normal circumstances, vMotion traffic will only traverse Fabric A and will not impact Fabric B or the core network, thereby minimizing north-south traffic.
2. In the event Fabric A suffers a failure of any kind, the vMotion VMK will fail over to the standby vmNIC (vmNIC1), which results in an equally optimal configuration: traffic will only traverse Fabric B and not the core network, again minimizing north-south traffic.
3. The failover is handled by vSphere at the software layer, which removes the requirement for fabric failover to be enabled. This also gives the vSphere administrator visibility of the networking status without going into UCS Manager.
4. The operational complexity is reduced.
5. The solution is self-healing at the UCS layer, and this is transparent to the vSphere environment.
6. At the vSphere layer, failback is not required as using Fabric B for all vMotion VMK traffic is still optimal. In the event Fabric B then fails, the environment will automatically fail back to Fabric A.

Implications

1. Initial setup has a small amount of additional complexity; however, this is a one-time task (Set & Forget).
2. vNIC0 and vNIC1 need to be manually configured to Fabric A and Fabric B at the Cisco Fabric Interconnect via UCS Manager; however, this is also a one-time task (Set & Forget).

Alternatives

1. Use Route Based on Physical NIC Load and have the vMotion VMK managed automatically by LBT
2. Use vPC and Route Based on IP Hash for all vSwitch traffic (including the vMotion VMK)
3. Use the Fabric Failover option at the UCS layer using a single vNIC presented to ESXi for all traffic
4. Use the Fabric Failover option at the UCS layer using two vNICs presented to ESXi for all traffic – Each vNIC would be pinned to a single Fabric (A or B)

Thank you to Prasenjit Sarkar (@stretchcloud) for co-authoring this Example Architectural Decision.

Related Articles

1. Trade-off factor – Cisco UCS Fabric Failover OR OS based NIC teaming using dual fabric (Stretch-cloud – By Prasenjit Sarkar @stretchcloud)
2. Why You Should Pin vMotion Port Groups In Converged Environments (By Chris Wahl @ChrisWahl)

Example Architectural Decision – Host Isolation Response for a Nutanix Environment

Problem Statement

What is the most suitable HA host isolation response when using Nutanix?

Assumptions

1. vSphere 5.0 or greater
2. Two x 10Gb Network interfaces are shared for Nutanix Storage Traffic and Virtual Machine Traffic

Motivation

1. Minimize the chance of a false positive isolation response
2. Ensure that, in the event the storage is unavailable, virtual machines are promptly shut down to enable HA to recover them in a timely manner (where other hosts are unaffected by isolation) and to prevent a “split brain” scenario
3. Ensure maximum availability

Architectural Decision

Turn off the default isolation address and configure the isolation address specified below, which checks connectivity to the Nutanix Distributed File System (NDFS) cluster of Controller VMs (CVMs) on the IP Storage VLAN.

Configure the following isolation address:

das.isolationaddress1: NDFS Cluster IP Address

Configure Host Isolation Response to: Power Off

For the Nutanix Controller VMs, override the cluster setting and configure Host Isolation Response to “Leave Powered On”.
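For illustration only, the Python sketch below (pyVmomi) shows one way this cluster-level configuration could be applied: it disables the default isolation address via the HA advanced option das.usedefaultisolationaddress, sets das.isolationaddress1 to the NDFS Cluster IP, and sets the cluster default isolation response to Power Off. The cluster object and IP address are assumed inputs.

    from pyVmomi import vim

    def configure_ha_isolation(cluster, ndfs_cluster_ip):
        das = vim.cluster.DasConfigInfo()
        das.enabled = True
        das.option = [
            # Stop HA from using the management gateway as the isolation address.
            vim.option.OptionValue(key="das.usedefaultisolationaddress", value="false"),
            # Test isolation against the NDFS Cluster IP on the IP Storage VLAN instead.
            vim.option.OptionValue(key="das.isolationaddress1", value=ndfs_cluster_ip),
        ]
        # Cluster default: Power Off VMs when a host declares itself isolated.
        das.defaultVmSettings = vim.cluster.DasVmSettings(isolationResponse="powerOff")

        spec = vim.cluster.ConfigSpecEx(dasConfig=das)
        return cluster.ReconfigureComputeResource_Task(spec, modify=True)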

Justification

1. The ESXi management traffic, along with the virtual machine traffic and inter-Nutanix-node storage traffic, runs over 2 x 10Gb connections. Using the ESXi management gateway (the default isolation address) to check for isolation is not suitable as the management network can be offline without impacting the IP storage or data networks. This situation could lead to false positive isolation responses.
2. The isolation address chosen tests IP storage connectivity over the converged 10Gb network, and in the event this is unavailable, there is no point testing further connectivity as virtual machines cannot function without their storage.
3. In the event the Nutanix cluster IP address cannot be reached by ICMP, the node will not be able to function properly. As such, triggering the isolation response and powering off the VMs based on this criterion is logical, as the VMs will not be able to function under these conditions.
4. In the event the NDFS Cluster IP address does not respond to ICMP on the management interfaces, it is likely there has been an isolation event OR a catastrophic failure in the environment, either in the network or in the storage controllers themselves, in which case the safest option is to power off the VMs.
5. In the event the isolation response is triggered and the isolation does not impact all hosts within the cluster, the VMs can be restarted by HA onto a surviving host and resume functioning.
6. Using the Nutanix Controller VM (CVM) IP address (192.168.5.2) as the isolation address is not suitable as this address exists on each ESXi host; because the CVM is local to the host, the host will always be able to reach this address even when the network is offline, making it a poor candidate for isolation detection.
7. The Nutanix Controller VM accesses local storage and can continue to run locally even during an isolation event. When the isolation event is over, the CVM will regain connectivity to the other CVMs in the Nutanix cluster.
8. Shutting down the CVM would only increase the recovery time once the isolation event is over and has no added benefit.

Implications

1. In the event the host cannot reach the isolation address, virtual machines will be powered off.
2. Initial cluster setup requires the vSphere administrator to override the cluster setting for each Controller VM. Note: This is a one-time task (Set & Forget); a sketch of automating the override follows this list.
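A sketch of automating the Controller VM override with pyVmomi is shown below; identifying CVMs by "CVM" in the virtual machine name is an assumption for this example and should be adapted to the environment's naming convention.

    from pyVmomi import vim

    def override_cvm_isolation_response(cluster):
        overrides = []
        for host in cluster.host:
            for vm in host.vm:
                # Assumption: Nutanix Controller VMs have "CVM" in their name.
                if "CVM" in vm.name:
                    settings = vim.cluster.DasVmSettings(
                        isolationResponse="none")  # "none" = Leave Powered On
                    info = vim.cluster.DasVmConfigInfo(key=vm, dasSettings=settings)
                    overrides.append(
                        vim.cluster.DasVmConfigSpec(operation="add", info=info))

        spec = vim.cluster.ConfigSpecEx(dasVmConfigSpec=overrides)
        return cluster.ReconfigureComputeResource_Task(spec, modify=True)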

Alternatives

1. Set Host isolation response to “Leave Powered On”
2. Do not use Datastore heartbeating
3. Use the default isolation address
4. Leave the CVM on the default cluster setting and “Shutdown” on isolation

Related Articles

1. VMware Host Isolation Response in a Nutanix Environment #NoSAN

2. Storage DRS and Nutanix – To use, or not to use, that is the question?

3. VMware HA and IP Storage