Nutanix Data Protection Capabilities

There is a lot of misinformation being spread in the HCI space about Nutanix data protection capabilities. One such example (below) was published recently on InfoStore.

Evaluating Data Protection for Hyperconverged Infrastructure

When I see articles like this, it really makes me wonder about the accuracy of content on these types of websites, as it seems articles are published without so much as a brief fact check from InfoStore.

Nonetheless, I am writing this post to confirm the data protection capabilities Nutanix provides.

  • Native In-Built Data Protection

Prior to my joining Nutanix in mid-2013, Nutanix already provided a hypervisor-agnostic, integrated backup and disaster recovery solution with centralised consumer-grade management through our PRISM GUI, which is HTML5 based.

The built-in capabilities provide flexible, VM-centric policies to protect virtualized applications with different RPOs and RTOs, with or without application consistency.

The solution also supports local, remote, and cloud-based backups, as well as synchronous and asynchronous replication-based disaster recovery.

Currently supported cloud targets include AWS and Azure as shown below.

[Image: CloudBackup – supported cloud backup targets (AWS and Azure) in PRISM]

The video below shows, in real time, how to create application-consistent snapshots from the Nutanix PRISM GUI.

Nutanix can also perform one-to-one, one-to-many and many-to-one replication of application-consistent snapshots to onsite or offsite Nutanix clusters, as well as to cloud providers (AWS/Azure), ensuring choice and flexibility for customers.
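For readers who would rather script this than click through PRISM, the sketch below shows roughly how a protection domain with an hourly schedule and remote-site replication could be driven from Python. It assumes Prism Element's v2.0 REST API; the endpoint paths, payload fields, cluster address and credentials are illustrative assumptions rather than verified samples, so check the API Explorer in Prism before relying on them.

```python
# A minimal sketch (not official Nutanix sample code) of the native data
# protection workflow: create a protection domain, protect some VMs with
# application consistency, and add an hourly schedule that replicates to a
# remote site.  Endpoint paths and payload fields are assumptions based on
# Prism Element's v2.0 REST API and may differ between AOS versions.
import requests

PRISM = "https://prism.example.local:9440/PrismGateway/services/rest/v2.0"
AUTH = ("admin", "secret")  # hypothetical credentials for the example

def create_protection_domain(name: str) -> None:
    # A protection domain groups VMs that share snapshot/replication policy.
    r = requests.post(f"{PRISM}/protection_domains", json={"value": name},
                      auth=AUTH, verify=False)
    r.raise_for_status()

def protect_vms(pd_name: str, vm_names: list[str]) -> None:
    # app_consistent_snapshots asks for guest quiescing (e.g. VSS) where available.
    r = requests.post(f"{PRISM}/protection_domains/{pd_name}/protect_vms",
                      json={"names": vm_names, "app_consistent_snapshots": True},
                      auth=AUTH, verify=False)
    r.raise_for_status()

def add_hourly_schedule(pd_name: str, remote_site: str) -> None:
    # RPO of one hour: keep 24 snapshots locally and 48 on the remote site
    # (the remote site could equally be a cloud target such as AWS or Azure).
    schedule = {
        "pd_name": pd_name,
        "type": "HOURLY",
        "every_nth": 1,
        "retention_policy": {"local_max_snapshots": 24},
        "remote_site_list": [{"remote_site_name": remote_site,
                              "remote_max_snapshots": 48}],
    }
    r = requests.post(f"{PRISM}/protection_domains/{pd_name}/schedules",
                      json=schedule, auth=AUTH, verify=False)
    r.raise_for_status()

create_protection_domain("pd-exchange")
protect_vms("pd-exchange", ["exch-mbx-01", "exch-mbx-02"])
add_hourly_schedule("pd-exchange", "dr-cluster")
```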

Nutanix native data protection can also replicate between and recover VMs to clusters of different hypervisors.

  • Commvault IntelliSnap Integration

Nutanix also provides integration with Commvault IntelliSnap, which allows existing Commvault customers to continue leveraging their investment in the market-leading data protection product and to take advantage of other features where required.

The following shows how agentless backups of virtual machines are supported with the Acropolis Hypervisor (AHV). Note: Commvault is also fully supported with Hyper-V and ESXi.

Because Commvault calls the Nutanix Distributed Storage Fabric (NDSF) directly, snapshots are taken quickly and efficiently, without any dependency on the hypervisor.
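To make that distinction concrete, here is a conceptual sketch of the workflow. The function names are hypothetical and do not represent the Commvault or Nutanix APIs; the point is simply that the quiesce window is tiny because the storage fabric snapshot is metadata-only and no hypervisor delta disks are created or consolidated.

```python
# A conceptual sketch only -- the helper functions are hypothetical and do
# not represent the Commvault or Nutanix APIs.  It shows why a storage
# fabric snapshot is cheap: the quiesce window is brief and there are no
# hypervisor delta disks to create, consolidate or stun the VM for.
from contextlib import contextmanager

@contextmanager
def quiesced(vm: str):
    """Briefly quiesce the guest (e.g. via VSS) so the snapshot is
    application consistent, then resume immediately."""
    print(f"quiescing {vm}")
    try:
        yield
    finally:
        print(f"resuming {vm}")

def ndsf_snapshot(vm: str) -> str:
    # Metadata-only snapshot taken by the distributed storage fabric; the
    # backup application can later mount and read from it out of band.
    snap_id = f"{vm}-snap-0001"
    print(f"NDSF snapshot {snap_id} created (no hypervisor delta disks)")
    return snap_id

def intellisnap_style_backup(vm: str) -> None:
    with quiesced(vm):                      # held for seconds, not minutes
        snap = ndsf_snapshot(vm)
    print(f"backup application streams data from {snap}")

intellisnap_style_backup("sql-prod-01")
```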

  • Hypervisor-specific support such as VMware vStorage APIs for Data Protection (VADP)

Nutanix also supports solutions which leverage VADP, allowing customers with existing investments in products such as Veeam and NetBackup to continue with their existing strategy until such time as they want to migrate to Nutanix native data protection or solutions such as Commvault.

  • In-Guest Agents

Nutanix supports the use of in-guest agents, which are typically very inefficient with centralised SAN/NAS storage. However, thanks to data locality and NDSF being a truly distributed platform, in-guest incremental-forever backups perform extremely well on Nutanix, as the traditional choke points such as the network, storage controllers and RAID packs have been eliminated.
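As a rough illustration of why incremental-forever agents sit comfortably on such a platform, the sketch below fingerprints fixed-size blocks and ships only the blocks whose fingerprints changed since the last run; the 4 KB block size and the in-memory index are assumptions made purely for the example.

```python
# A rough sketch of incremental-forever behaviour: fingerprint fixed-size
# blocks and send only the blocks whose fingerprints changed since the last
# run.  The 4 KB block size and in-memory index are assumptions made purely
# for illustration.
import hashlib

BLOCK = 4096

def fingerprints(data: bytes) -> list[str]:
    return [hashlib.sha1(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)]

def incremental_backup(data: bytes, previous: dict[int, str]) -> dict[int, str]:
    """Return the new block index, 'sending' only the blocks that changed."""
    current = dict(enumerate(fingerprints(data)))
    changed = [i for i, digest in current.items() if previous.get(i) != digest]
    print(f"{len(changed)} of {len(current)} blocks sent to the backup target")
    return current

disk = bytearray(BLOCK * 8)                      # pretend guest disk
index = incremental_backup(bytes(disk), {})      # first run = full backup
disk[BLOCK * 3] = 0xFF                           # guest changes one block
index = incremental_backup(bytes(disk), index)   # only one block is sent
```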

Summary:

As one size does not fit all in the world of IT, Nutanix provides customers with choice to meet a wide range of market segments and requirements, with strong native data protection capabilities as well as third-party integration.

Example Architectural Decision – Virtual Switch Load Balancing Policy

Problem Statement

What is the most suitable network adapter load balancing policy to configure on the vSwitch and dvSwitch(es), where 10Gb adapters are used for the dvSwitches and 1Gb adapters for the vSwitch, which is only used for ESXi management traffic?

Assumptions

1. vSphere 4.1 or later

Motivation

1. Ensure optimal performance and redundancy for the network
2. Simplify the solution without compromising performance for functionality

Architectural Decision

Use “Route based on physical NIC load” for Distributed Virtual switches and “Route based on originating port ID” for vSwitches.

Justification

1. Route based on physical NIC load achieves both availability and performance
2. Requires only a basic switch configuration (802.1q and the required VLANs tagged)
3. Where a single pNIC’s utilization exceeds 75% (evaluated over a 30-second interval), “Route based on physical NIC load” will dynamically rebalance workloads across the uplinks to ensure the best possible performance, as illustrated in the sketch below
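The sketch below is a simplified illustration of that behaviour (it is not VMware's implementation): uplink utilization is sampled, and when an uplink exceeds the 75% threshold a virtual port is re-pinned to the least-loaded uplink.

```python
# A simplified illustration (not VMware's implementation) of "Route based on
# physical NIC load": uplink utilization is sampled, and when an uplink
# exceeds the 75% threshold a virtual port is re-pinned to the least-loaded
# uplink.  The utilization figures below are made up.
THRESHOLD = 0.75

def rebalance(uplinks: dict[str, float], port_map: dict[str, str]) -> None:
    """uplinks: uplink name -> utilization (0.0-1.0)
       port_map: virtual port -> uplink it is currently pinned to"""
    for uplink, load in uplinks.items():
        if load <= THRESHOLD:
            continue
        target = min(uplinks, key=uplinks.get)              # least-loaded uplink
        victim = next(p for p, u in port_map.items() if u == uplink)
        port_map[victim] = target
        print(f"moving {victim}: {uplink} ({load:.0%}) -> {target} ({uplinks[target]:.0%})")

uplinks = {"vmnic0": 0.82, "vmnic1": 0.30}
ports = {"vm-web-01": "vmnic0", "vm-db-01": "vmnic0", "vm-app-01": "vmnic1"}
rebalance(uplinks, ports)   # one of the ports on vmnic0 is re-pinned to vmnic1
```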

Implications

1. If NFS IP storage is used with a single VMkernel port, it will not use both connections concurrently. If using multiple 10Gb connections for NFS traffic is required, two or more VLANs should be created with one VMkernel port per VLAN. If only one VMkernel port is used, the only option for sending traffic down multiple uplinks is “Route based on IP hash” with Etherchannel configured on the physical switch.
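A small sketch of that implication, with made-up VLAN IDs, uplink names and datastore names: one VMkernel port pins all NFS traffic to a single uplink, while one VMkernel port per VLAN (with datastores mounted via the matching subnet) lets NFS traffic use both 10Gb uplinks without IP hash or Etherchannel.

```python
# A small sketch of the implication above; VLAN IDs, uplink names and
# datastore names are illustrative.  With a single VMkernel port all NFS
# traffic is pinned to one uplink; with one VMkernel port per VLAN (and each
# datastore mounted via the matching subnet) both 10Gb uplinks carry NFS
# traffic without needing IP hash / Etherchannel.
single_vmk = {
    "vmk1": {"vlan": 100, "active_uplink": "vmnic0",
             "datastores": ["nfs-ds-01", "nfs-ds-02"]},
}

dual_vmk = {
    "vmk1": {"vlan": 100, "active_uplink": "vmnic0", "datastores": ["nfs-ds-01"]},
    "vmk2": {"vlan": 101, "active_uplink": "vmnic1", "datastores": ["nfs-ds-02"]},
}

for label, layout in (("single VMkernel", single_vmk), ("one VMkernel per VLAN", dual_vmk)):
    uplinks_in_use = {vmk["active_uplink"] for vmk in layout.values()}
    print(f"{label}: NFS traffic uses {len(uplinks_in_use)} uplink(s)")
```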

Alternatives

1. Route based on the originating port ID

Pros: Chooses an uplink based on the virtual port where the traffic entered the virtual switch. The virtual machine outbound traffic is mapped to a specific physical NIC based on the ID of the virtual port to which this virtual machine is connected. This method is simple and fast, and does not require the VMkernel to examine the frame for necessary information.

Cons: When load is distributed in the NIC team using the port-based method, no single virtual machine NIC will ever get more bandwidth than a single physical adapter can provide. (See the sketch after this list for how each of the static policies maps traffic to an uplink.)

2. Route based on IP hash.

Pros: Chooses an uplink based on a hash of the source and destination IP addresses of each packet (for non-IP packets, whatever is at those offsets is used to compute the hash). This method gives a better distribution of traffic across physical NICs.

When load is distributed in the NIC team using the IP-based method, a single virtual machine NIC might use the bandwidth of multiple physical adapters.

Cons: This method has higher CPU overhead and is not compatible with all switches (it requires IEEE 802.3ad link aggregation support).

3. Route based on source MAC hash

Pros: Chooses an uplink based on a hash of the source Ethernet MAC address. This method is compatible with all physical switches. The virtual machine outbound traffic is mapped to a specific physical NIC based on the virtual NIC’s MAC address.

Cons: While this method has low overhead, it might not spread traffic evenly across the physical NICs.

When load is distributed in the NIC team using the MAC-based method, no single virtual machine NIC will ever get more bandwidth than a single physical adapter can provide.

4. Use explicit fail-over order

Pros: Always uses the highest order uplink from the list of Active adapters which passes failover detection criteria.

Cons: This setting is effectively a failover policy and is not strictly a load balancing policy.

5. Route based on Physical NIC load

Pros: The most efficient load balancing mechanism, because it is based on the actual physical NIC workload.

Cons: Not available on standard vSwitches
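To make the behaviour of the static policies above concrete, the sketch below models each selection rule as a simple function. ESXi's actual hashing is not published, so a CRC32-and-modulo stand-in is used purely to illustrate which traffic attributes each policy keys on.

```python
# A minimal sketch of how the static policies above select an uplink.  ESXi's
# real hashing is not published, so CRC32-and-modulo stand-ins are used here
# purely to illustrate which traffic attributes each policy keys on.
import zlib

UPLINKS = ["vmnic0", "vmnic1"]

def by_port_id(port_id: int) -> str:
    # Originating port ID: one fixed uplink per virtual port, no frame inspection.
    return UPLINKS[port_id % len(UPLINKS)]

def by_mac_hash(mac: str) -> str:
    # Source MAC hash: one fixed uplink per vNIC, works with any physical switch.
    return UPLINKS[zlib.crc32(mac.encode()) % len(UPLINKS)]

def by_ip_hash(src_ip: str, dst_ip: str) -> str:
    # IP hash: the uplink varies per source/destination pair, so a single vNIC
    # can spread across uplinks -- but 802.3ad is required on the physical switch.
    return UPLINKS[zlib.crc32(f"{src_ip}-{dst_ip}".encode()) % len(UPLINKS)]

print(by_port_id(17), by_port_id(17))                # always the same uplink
print(by_mac_hash("00:50:56:aa:bb:cc"))              # always the same uplink
print(by_ip_hash("10.0.0.10", "10.0.1.5"),
      by_ip_hash("10.0.0.10", "10.0.2.9"))           # may differ per flow
```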

For further information on the topic, check out the below two articles by a couple of very knowledgeable VCDXs:

Michael Webster – Etherchanneling or Load based teaming?
Frank Denneman – IP Hash versus LBT

Example Architectural Decision – Network I/O Control for ESXi Host using IP Storage

Problem Statement

With 10Gb connections, the proposed ESXi hosts will have fewer physical connections, but more bandwidth per connection, than a host with 1Gb NICs. In this case, 4 x 10Gb NICs need to cater for all traffic (including IP storage) for the ESXi hosts.

The design needs to ensure all types of traffic have sufficient burst and sustained bandwidth without negatively impacting other types of traffic.

How can this be achieved?

Assumptions

1. No additional network cards (1Gb or 10Gb) can be supported
2. vSphere 5.0 or later
3. 2 x 48-port 10Gb and 2 x 48-port 1Gb switches exist in the environment
4. ESXi hosts are 4-way servers with 512GB RAM which are expected to run large numbers of VMs with varying workloads
5. Multi-NIC vMotion is not required due to using 10Gb NICs

Motivation

1. When using bandwidth allocation, use “shares” instead of “limits”, as the former has greater flexibility for unused capacity redistribution.
2. Ensure IP Storage (NFS) performance is optimal
3. Ensure vMotion activities (including a host entering maintenance mode) can be performed in a timely manner without impact to IP Storage or Fault Tolerance
4. Fault tolerance is a latency-sensitive traffic flow, so it is recommended to always set the corresponding resource-pool shares to a reasonably high relative value in the case of custom shares.

Architectural Decision

Separate VMware infrastructure functions (VMkernel) from virtual machine network traffic by creating two (2) dvSwitches (each with 2 x 10Gb connections): dvSwitch-Admin and dvSwitch-Data.

Enable Network I/O Control, and configure NFS and/or iSCSI traffic with a share value of 100, and vMotion and FT with a share value of 25 (a worked example of what these share values mean under contention follows this decision).

Configure the two (2) VMkernel ports for IP storage on dvSwitch-Admin, set to Active on one 10Gb interface and Standby on the second.

Configure the VMkernel port for vMotion on dvSwitch-Admin as Active on one interface and Standby on the second, and vice versa for FT.

Configure all dvPortGroups for Virtual Machine data on dvSwitch-Data.
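To illustrate what the share values chosen above mean under contention, the sketch below computes the worst-case guaranteed bandwidth on a saturated 10Gb uplink of dvSwitch-Admin, assuming NFS, vMotion and FT are the only traffic types contending on it (shares have no effect while the link is uncongested).

```python
# A worked example of the share values chosen above.  Shares only matter when
# a 10Gb uplink on dvSwitch-Admin is saturated; the figures are the worst-case
# guaranteed bandwidth, assuming NFS, vMotion and FT are the only traffic
# types contending on that uplink.
LINK_GBPS = 10
shares = {"NFS": 100, "vMotion": 25, "FT": 25}

total = sum(shares.values())
for traffic, share in shares.items():
    guaranteed = LINK_GBPS * share / total
    print(f"{traffic:8s} {share:3d} shares -> ~{guaranteed:.1f} Gb/s minimum under contention")

# NFS      100 shares -> ~6.7 Gb/s minimum under contention
# vMotion   25 shares -> ~1.7 Gb/s minimum under contention
# FT        25 shares -> ~1.7 Gb/s minimum under contention
```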

Justification

1. The share values were chosen to ensure storage traffic is not impacted, as this can cause flow-on effects for the environment's performance. vMotion and FT are considered important, but during periods of contention should not monopolise or impact IP storage traffic.
2. IP Storage is more critical to ongoing cluster and VM performance than vMotion or FT
3. IP storage requires higher priority than vMotion which is more of a burst activity and is not as critical to VM performance
4. With a share value of 25, Fault Tolerance still has ample bandwidth to support the maximum of four FT-protected machines per host, even during periods of contention
5. With a share value of 25, vMotion still has ample bandwidth to support multiple concurrent vMotions during contention, and performance should not be impacted on a day-to-day basis. Up to 8 concurrent vMotions are supported as it is configured on a 10Gb interface (the limit is 4 on a 1Gb interface)
6. The environment required 1Gb switches to accommodate various devices, such as out-of-band management and IP KVM devices; as such, having ESXi management on 2 x 1Gb ports did not add significant cost to the solution

Implications

1. In the unlikely event of significant and ongoing contention, reduced performance for vMotion and FT may affect the ability to evacuate a host in a timely manner. This may impact the ability to perform scheduled maintenance.

Alternatives

1. Use all 4 x 10Gb NICs on a single dvSwitch, and use “Active” and “Standby” uplinks to ensure traffic remains on a specified NIC unless there is a failure. Leverage Network I/O Control in a similar way to the above example to ensure minimal impact from contention

See Example VMware vNetworking Design for IP Storage for an overview of the vNetworking design described in this example.