Example Architectural Decision – VMware DRS automation level for a Nutanix environment

Problem Statement

What is the most suitable DRS automation level and migration threshold for a vSphere cluster running on Nutanix?

Requirements

1. Ensure optimal performance for Business Critical Applications
2. Minimize complexity where possible

Assumptions

1. Workload types and size are unpredictable and workloads may vary greatly and without notice
2. The solution needs to be as automated as possible without introducing significant risk
3. vSphere 5.0 or later

Constraints

1. 2 x 10GB NICs per ESXi host (Nutanix node)

Motivation

1. Prevent unnecessary vMotion migrations which will impact host & cluster performance
2. Ensure the cluster standard deviation is minimal
3. Reduce administrative overhead of reviewing and approving DRS recommendations
4. Ensure optimal storage performance

Architectural Decision

Use DRS in Fully Automated mode with setting “3” – Apply priority 1,2 and 3 recommendations

Create a DRS “Should run on hosts in group” rule for each Business Critical Applications (BCAs) and configure each BCA to run on a single specified host (ensuring BCA’s are separated or grouped according to workload)

DRS Automation will be Disabled for all Controller VMs (CVMs)

Justification

1. Fully Automated DRS prevents excessive vMotion migrations that do not provide significant compute benefits to cluster balance as the vMotion itself will use cluster & network resources

2. Ensure the Nutanix Distributed File System , specifically the “Curator” component does not need to frequently relocate data between Nutanix nodes (ESXi hosts) direct attached storage to ensure virtual machine/s have local access to data. Doing so would put additional load on the Controller VM (and Curator service), local/remote storage and the network.

2. Ensure cluster remains in a reasonably load balanced state without resource being wasted on load balancing the compute layer to only achieve minimal improvement which may impact the storage/network layer/s.

3. Applying Level 1,2 and 3 recommendations means recommendations that must be followed to satisfy cluster constraints, such as affinity rules and host maintenance will be applied (Level 1) as well as applying recommendations with four or more stars (Level 2) that promise a significant improvement in the cluster’s load balance. In the event significant improvement to the clusters load balance will be achieved, movement of data at the storage layer (via the CVM / Network) can be justified

3. DRS is a low risk, proven technology which has been used in large production environments for many years

4. Setting DRS to manual would be a significant administrative (BAU) overhead and introduce additional risks such as human error and situations where contention may go unnoticed which may impact performance of one or more VMs

5. Setting a more aggressive DRS migration threshold may put an additional load on the cluster which will likely not result in significantly better cluster balance (or VM performance) and could result in significant additional workload for the ESXi hosts (compute layer), the Nutanix Controller VM (CVM) ,network & underlying storage.

6. By using DRS “Should run on this host” rules for Business Critical Applications (BCAs) will ensure consistent performance for these workloads (by keeping VMs on the same ESXi host/Nutanix node where its data is local) without introducing significant complexity or limiting vSphere functionally

Implications

1. In some circumstances the DRS cluster may have a low level of imbalance

2. DRS will not move workloads via vMotion where only a moderate improvement to the cluster will be achieved

3. At times, including after performing updates (via VUM) of ESXi hosts (Nutanix Nodes) the cluster may appear to be unevenly balanced as DRS may calculate minimal benefit from migrations. Setting DRS to “Use Fully automated and Migration threshold 3” for a short period of time following maintenance should result in a more evenly balanced DRS cluster with minimal (short term) increased workload for the Nutanix Controller VM (CVM) , network & underlying storage.

4. DRS rules will need to be created for Business Critical Applications

Alternatives

1.Use Fully automated and Migration threshold 1 – Apply priority 1 recommendations
2.Use Fully automated and Migration threshold 3 – Apply priority 1,2 recommendations
3. Use Fully automated and Migration threshold 4- Apply priority 1,2,3 and 4 recommendations
4.Use Fully automated and Migration threshold 5- Apply priority 1,2,3,4 & 5 recommendations
5. Set DRS to manual and have a VMware administrator assess and apply recommendations
6. Set DRS to “Partially automated”

Related Articles

1. Storage DRS and Nutanix – To use or not to use, That is the question

Storage DRS Configuration – Architectural Decision making flowchart

I was speaking to a number of people recently, who were trying to come up with a one size fits all Storage DRS configuration for a reference architecture document.

As Storage DRS is a reasonably complicated feature, it was my opinion that a one size fits all would not be suitable, and that multiple examples should be provided when writing a reference architecture.

A collegue suggested a flowchart would assist in making the right decision around Storage DRS, so I took up the challenge to put one together.

The below is my version 0.1 of the flowchart, which I thought I would post and hopefully get some good feedback from the community, and create a good guide for those who may not have the in-depth knowledge or experience, too choose what should be in most cases an appropriate configuration for SDRS.

This also compliments some of my previous example architectural decisions which are shown in the related topic section below.

As always, feedback is always welcomed.

I hope you find this helpful.

* Updated to include the previously missing “NO” option for Data replication.

SDRS flowchart V0.2

Related Articles

1. Example Architectural Decision – Storage DRS configuration for NFS datastores

2. Example Architectural Decision – Storage DRS configuration for VMFS datastores

Example Architectural Decision – DRS Automation Level

Problem Statement

What is the most suitable DRS automation level and migration threshold for a vSphere cluster running an IaaS offering with a self service portal w/ unpredictable workloads?

Assumptions

1. Workload types and size are unpredictable in a IaaS environment, workloads may vary greatly and without notice
2. The solution needs to be as automated as possible without introducing significant risk

Motivation

1. Prevent unnecessary vMotion migrations which will impact host & cluster performance
2.Ensure the cluster standard deviation is minimal
3. Reduce administrative overhead of reviewing and approving DRS recommendations

Alternatives

1.Use Fully automated and Migration threshold 1 – Apply priority 1 recommendations
2.Use Fully automated and Migration threshold 2- Apply priority 1 & 2 recommendations
3. Use Fully automated and Migration threshold 4- Apply priority 1,2,3 and 4 recommendations
4.Use Fully automated and Migration threshold 5- Apply priority 1,2,3,4 & 5 recommendations
5. Set DRS to manual and have a VMware administrator assess and apply recommendations

Justification

1. Prevent excessive vMotion migrations that do not provide significant benefit to cluster balance as the vMotion itself will use cluster and network resources
2. Ensure cluster remains in a reasonably load balanced state without resource being wasted on load balancing for minimal improvement
3. DRS is a low risk, proven technology which has been used in large production environments for many years
4. Setting DRS to manual would be a significant administrative overhead and introduce additional risk from human error
5. Setting a more aggressive DRS migration threshold would put an additional load on the cluster which will likely not result in significantly better balance

Architectural Decision

Use DRS in Fully Automated mode with setting “3” – Apply priority 1,2 and 3 recommendations

Implications

1. DRS will not move workloads via vMotion where only a moderate improvement to the cluster will be achieved
2. At times, including after performing updates (via VUM) of ESXi hosts the cluster may appear to be unevenly balanced as DRS may calculate minimal benefit from migrations. Setting DRS to “Use Fully automated and Migration threshold 5” for a short period of time following maintenance should result in a more evenly balanced cluster.