management cluster

Problem Statement

What is the most suitable BC/DR solution for a vCloud director environment?

Requirements

1. Ensure the vCloud solution can tolerate a site failure in an automated manner
2. Ensure the vCloud solution meets/exceeds the RTO of 4hrs
3. Comply with all requirements of the Business Continuity Plan (BCP)
4. Solution must be a supported vSphere / vCloud Configuration
5. Ensure all features / functionality of the vCloud solution are available following a DR event

Assumptions

1. Datacenters are in an Active/Active configuration
2. Stretched Layer 2 network across both datacenters
3. Storage based replication between sites
4. vSphere 5.0 Enterprise Plus or later
5. VMware Site Recovery Manager 5.0 or later
6, vCloud Director 1.5 or later
7. There is no requirement for workloads proposed to be hosted in vCloud to be at one datacenter or another

Constraints

1. The hardware for the solution has already been chosen and purchased. 6 x 4 Way, 32 core Hosts w/ 512GB RAM and 4 x 10GB
2. The storage solution is already in place and does not support a Metro Storage Cluster (vMSC) configuration

Motivation

1. Meet/Exceed availability requirements
2. Minimize complexity

Architectural Decision

Use the vCloud DR solution as described in the “vCloud Director Infrastructure Resiliency Case Study” (By Duncan Epping @duncanyb and Chris Colotti @Ccolotti )

In Summary, Host the vSphere/vCloud Management virtual machines on an SRM protected cluster.

Use a dedicated cluster for vCloud compute resources.

Configure the vSphere cluster which is dedicated to providing compute resources to the vCloud environment (Provider virtual data center – PvDC) to have four (4) compute nodes located at Datacenter A for production use and two (2) compute nodes located at Datacenter B (in ”Maintenance mode”) dedicated to DR.

Storage will not be stretched across sites; LUNs will be presented locally from “Datacenter A” shared storage to the “Datacenter A” based hosts. The “Datacenter A” storage will be replicated synchronously to “Datacenter B” and presented from “Datacenter B” shared storage to the two (2) “Datacenter B” based hosts. (No stretched Storage between sites)

In the event of a failure, SRM will recover the vSphere/vCloud Management virtual machines bringing back online the Cloud, then a script as the last part of the SRM recovery plan, Mounts the replicated storage to the ESXi hosts in “Datacenter B” and takes the two (2) hosts at “Datacenter B” out of maintenance mode. HA will then detect the virtual machines and power on them on.

Justification

1. Stretched Clusters are more suited to Disaster Avoidance than Disaster Recovery
2. Avoids complex and manual intervention in the case of a disaster in the case of a stretched cluster solution
3. A Stretched cluster provides minimal control in the event of a Disaster where as in this case, HA simply restarts VMs once the storage is presented (automatically) and the hosts are taken out of Maintenance mode (also automated)
4. Having two (2) ESXi hosts for the vCloud resource cluster setup in “Datacenter B” in “Maintenance Mode” and the storage mirrored as discussed allows the virtual workloads to be recovered in an automated fashion as part of the VMware Site Recovery Manager solution.
5. Removes the management overhead of managing a strecthed cluster using features such as DRS affinity rules to keep VMs on the hosts on the same site as the storage
6. vSphere 5.1 backed resource clusters support >8 host clusters for “Fast provisioning”
7. Remove the dependency on the Metropolitan Area Data and Storage networks during BAU and the potential impact of the latency between sites on production workloads
8. Eliminates the chance of a “Split Brain” or a “Datacenter Partition” scenario where VM/s can be running at both sites without connectivity to each other
9. There is no specific requirement for non-disruptive mobility between sites
10. Latency between sites cannot be guaranteed to be <10ms end to end

Alternatives

1. Stretched Cluster between “Datacenter A” and “Datacenter B”
2. Two independent vCloud deployments with no automated DR
3. Have more/less hosts at the DR site in the same configuration

Implications

1. Two (2) ESXi hosts in the vCloud Cluster located in “Datacenter B” will remain unused as “Hot Standby” unless there is a declared site failure at “Datacenter A”
2. Requires two (2) vCenter servers , one (1) per Datacenter
3. There will be no non-disruptive mobility between sites (ie: vMotion)
4. SRM protection groups / plans need to be created/managed Note: This will be done as part of the Production cluster
5. In the event of a DR event, only half the compute resources will be available compared to production.
6. Depending on the latency between sites, storage performance may be reduced by the synchronous replication as the write will not be acknowledged to the VM at “Datacenter A” until committed to the storage at “Datacenter B”

Problem Statement

When designing a VMware View environment, there are numerous management virtual machines which are required to run the environment, including but not limited to Domain Controllers, vCenter , VUM , View Connection Brokers , View Security Servers, View Transfer servers , View Composer. These servers are typically heavily utilized in larger View deployments and in the event of compute or storage contention, would likely impact the performance of the Virtual Desktop Infrastructure, especially where View Composer or virtual desktop power or provisioning operations are frequent.

How can the VDI environment be designed so management servers have a consistent high level of performance and ensure that high consolidation ratios can be achieved for desktops whilst maintaining a consistent end user experience?

Assumptions

1. One or more VMware View “Blocks”
2. ~2000 Users per Block
3. Using VMware View Linked Clones
4. Target overcommitment for Virtual desktops vCPU is >=6:1 – This is a conservative overcommitment ratio, >10:1 can be achieved
5. Target overcommitment for Virtual desktops vRAM is >=1.5:1 – This is a reasonable overcommitment ratio, although higher can be achieved
6. vSphere 4.1 or later
7. VMware View 4.5 or later
8. ESXi Hosts are large enough to support >200 users each (eg: At least 2 way / 256GB assuming 1vCPU/1GB RAM VDI VMs)
9. An existing vSphere cluster supporting server workloads is not available or is at or near capacity
10. Antivirus has been optimized for Virtual desktop environments, such as vShield Endpoint to offload AV scanning to the hypervisor

Motivation

1. Ensure consistent & optimal performance for Virtual desktops and VMware View Infrastructure VMs
2. Achieve the best ROI for the solution

Architectural Decision

Create a three (3) node “Management Cluster” with a scale out approach using 2 Way servers (as opposed to Four way servers like the VMware View Blocks) to ensure lower HA overhead (33% for N+1) and higher DRS efficiency than a two (2) node cluster. Have management virtual machines use different underlying storage, being either dedicated RAID packs or aggregates or for a large environments, storage controllers. Have a vCenter dedicated to running the Management infrastructure.

Justification

1. The CPU overcommitment ratio for Virtual desktops is generally much higher than for server workloads
2. Server workloads are less tolerant to high CPU overcommitment ratios than virtual desktops
3. CPU contention (a.k.a CPU Ready) will likely have significant impact on infrastructure VMs
4. If Management VMs we’re hosted within the VMware View Blocks, the overcommitment would have to be lower to enable adequate performance, thus reducing the ROI for the solution
5. Server and desktop workloads have very different compute and storage profiles and generally are not good candidates to share the same ESXi host or cluster
6. During VMware View Linked Clone deployments, or maintenance activities such as a “recompose”of one or more Pools, Management VMs such as vCenter and View Composer should have minimal or no compute contention to ensure timely completion of maintenance. This does not fit well in a cluster with >6:1 CPU overcommitment.
7. Having a management cluster minimizes or removes the requirement for complexity/overheads of setting CPU or Memory reservations in an attempt to ensure performance for management VMs competing for compute resources with virtual desktops. (See “Common Mistake – Using CPU reservations to solve CPU ready” for more information)
8. Maximize the efficiency of the CPU scheduler, as the majority of Virtual Desktops should be 1vCPU as compared to management VMs such as vCenter / SQL / Connection brokers which will likely be 2 and 4 vCPU. Scheduling VMs with higher vCPU numbers on an environment with >6:1 vCPU overcommitment is unlikely to result in acceptable performance for the management virtual machines.
9. Having a cluster/s dedicated to desktops will give more flexibility to use features such as Distributed Power Management (DPM) for VMware View Blocks which will help achieve a faster ROI
10. vCenter’s workload with virtual desktops is generally higher (compared to vCenter servers managing server workloads) due to increased frequency of things like power operations and provisioning operations from View Composer. One (1) vCenter should be used per Block, or up to 2000 users.
11. In the event of performance/stability issues in the View Block/s, if the management servers shared the cluster, the ability for vSphere/View administrators to access management servers will likely be impacted, which may delay the troubleshooting process and eventual resolution of the issue/s
12. Having a separate management cluster with dedicated storage (RAID packs/aggregates and/or storage controllers) prevents the IO load of the View Desktops impacting the ability to manage the environment, especially during recompose and provisioning operations.

Implications

1. Hardware will be required for the Management cluster – Although as the ESXi hosts in View Blocks (as they wont be hosting management workloads) should as a result achieve higher consolidation ratios which should close to if not entirely neutralize the cost of the Management Host Hardware
2. The storage solution will need to provide storage for Management virtual machines which is separate to Virtual desktops
3. The scale out approach for the management cluster may not achieve as higher memory savings form transparent page sharing due to having less virtual machines per host
4. Having an additional cluster is an additional administrative overhead, albeit minimal however this should reduce the risk in the environment leading to lower BAU effort/costs.

Alternatives

1. Run Management VMs in VMware View Blocks (with desktop workloads). – Not recommended
2. Run management VMs in an existing vSphere cluster running server workloads (if available)

A special Thanks to Michael Webster (VCDX#66) for his contribution to this example Architectural decision.

CloudXC

By Josh Odgers – VMware Certified Design Expert (VCDX) #90

Tag Archives: management cluster

Example Architectural Decision – BC/DR Solution for vCloud Director

Example Architectural Decision – Supporting VMware View Infrastructure Servers

Share this:

Share this: