Example Architectural Decision – Securing vMotion & Fault Tolerance Traffic in IaaS/Cloud Environments

Problem Statement

vMotion and Fault Tolerance (FT) logging traffic is unencrypted, so anyone with access to the same VLAN/network could potentially view and/or compromise it. How can the environment be made as secure as possible to ensure security between tenants/departments in a multi-tenant/multi-department environment?

Assumptions

1. vMotion and FT are required in the vSphere cluster/s (although FT is currently not supported for VMs hosted with vCloud Director)
2. IP Storage is being used and vNetworking has 2 x 10Gb NICs for non Virtual Machine traffic such as VMkernel ports & 2 x 10Gb NICs for Virtual Machine traffic (similar to Example vNetworking Design for IP Storage)
3. vSphere 4.0 or later (FT and distributed virtual switches are not available in VI3)

Motivation

1. Ensure maximum security and performance for vMotion and FT traffic
2. Prevent vMotion and/or FT traffic impacting production virtual machines

Architectural Decision

vMotion & Fault Tolerance logging traffic will each have a dedicated non-routable VLAN, hosted on a dvSwitch that is physically separate from the virtual machine distributed virtual switch.

Justification

1. vMotion / FT traffic does not require external (or public) access
2. A VLAN per function ensures maximum security / performance with minimal design / implementation overhead
3. Prevent vMotion and/or FT traffic and production virtual machine traffic from impacting each other, as could happen if they shared one or more broadcast domain/s
4. Ensure vMotion/FT traffic cannot leave its respective dedicated VLAN and potentially be sniffed

Implications

1. Two (2) VLANs with private IP ranges must be presented over 802.1Q connections to the appropriate pNICs
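
The implication above can be checked before the VLANs are requested from the network team. The following is a minimal Python sketch, not a definitive implementation: the VLAN IDs, subnets, and the names `VLAN_PLAN` and `validate_vlan_plan` are hypothetical and chosen only for illustration. It confirms that each function has its own VLAN ID and a private, non-overlapping IP range. It does not verify that the VLANs are non-routable, since that depends on the physical switch and router configuration.

```python
import ipaddress

# Hypothetical values for illustration only; substitute the site's own plan.
VLAN_PLAN = {
    "vMotion": {"vlan_id": 100, "subnet": "192.168.100.0/24"},
    "FT logging": {"vlan_id": 101, "subnet": "192.168.101.0/24"},
}


def validate_vlan_plan(plan):
    """Return a list of problems found in the plan (an empty list means OK)."""
    problems = []
    networks = {}
    for name, entry in plan.items():
        vlan_id = entry["vlan_id"]
        # 802.1Q reserves VLAN IDs 0 and 4095, leaving 1-4094 usable.
        if not 1 <= vlan_id <= 4094:
            problems.append(f"{name}: VLAN ID {vlan_id} is outside 1-4094")
        net = ipaddress.ip_network(entry["subnet"])
        # is_private covers RFC 1918 and other non-global ranges.
        if not net.is_private:
            problems.append(f"{name}: {net} is not a private range")
        networks[name] = net

    # Each function (vMotion, FT logging) needs its own VLAN.
    ids = [entry["vlan_id"] for entry in plan.values()]
    if len(set(ids)) != len(ids):
        problems.append("VLAN IDs are not unique per function")

    # The IP ranges of the dedicated VLANs should not overlap.
    names = list(networks)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if networks[first].overlaps(networks[second]):
                problems.append(f"{first} and {second} subnets overlap")
    return problems


if __name__ == "__main__":
    for problem in validate_vlan_plan(VLAN_PLAN) or ["Plan looks consistent"]:
        print(problem)
```

A check like this is only a sanity test of the addressing plan; the isolation itself still comes from the dedicated VLANs and the physically separate dvSwitch.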

Alternatives

1. vMotion / FT share the ESXi management VLAN. This would increase the risk of traffic being intercepted and sniffed
2. vMotion / FT share a dvSwitch with Virtual Machine networks while still running within dedicated non-routable VLANs over 802.1Q

Example Architectural Decision – DRS Automation Level

Problem Statement

What is the most suitable DRS automation level and migration threshold for a vSphere cluster running an IaaS offering with a self-service portal and unpredictable workloads?

Assumptions

1. Workload types and sizes are unpredictable in an IaaS environment; workloads may vary greatly and without notice
2. The solution needs to be as automated as possible without introducing significant risk

Motivation

1. Prevent unnecessary vMotion migrations which will impact host & cluster performance
2. Ensure the cluster standard deviation is minimal
3. Reduce administrative overhead of reviewing and approving DRS recommendations
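
To illustrate what a "minimal" cluster standard deviation means, here is a small Python sketch. It is an assumption-laden simplification: DRS calculates its own host load metric internally, whereas this example uses made-up CPU utilisation percentages per host as a stand-in, and the function name `load_std_dev` is hypothetical.

```python
from statistics import pstdev


def load_std_dev(host_loads):
    """Population standard deviation of per-host load figures.

    A lower value means load is spread more evenly across the hosts.
    """
    return pstdev(host_loads)


# Made-up utilisation percentages for a four-host cluster.
balanced = [52, 48, 50, 50]
# One host has just returned from maintenance and is still empty.
after_maintenance = [70, 65, 65, 0]

if __name__ == "__main__":
    print(f"balanced:          {load_std_dev(balanced):.2f}")
    print(f"after maintenance: {load_std_dev(after_maintenance):.2f}")
```

Both clusters average 50% utilisation, but the second has a much higher deviation, which is the kind of imbalance DRS migrations are intended to reduce.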

Alternatives

1. Use Fully Automated and migration threshold 1 (apply priority 1 recommendations)
2. Use Fully Automated and migration threshold 2 (apply priority 1 & 2 recommendations)
3. Use Fully Automated and migration threshold 4 (apply priority 1, 2, 3 & 4 recommendations)
4. Use Fully Automated and migration threshold 5 (apply priority 1, 2, 3, 4 & 5 recommendations)
5. Set DRS to manual and have a VMware administrator assess and apply recommendations

Justification

1. Prevent excessive vMotion migrations that do not provide significant benefit to cluster balance as the vMotion itself will use cluster and network resources
2. Ensure cluster remains in a reasonably load balanced state without resource being wasted on load balancing for minimal improvement
3. DRS is a low risk, proven technology which has been used in large production environments for many years
4. Setting DRS to manual would be a significant administrative overhead and introduce additional risk from human error
5. Setting a more aggressive DRS migration threshold would put an additional load on the cluster which will likely not result in significantly better balance

Architectural Decision

Use DRS in Fully Automated mode with migration threshold 3 (apply priority 1, 2 and 3 recommendations)
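
The relationship between the migration threshold and the recommendations that get applied can be sketched in a few lines of Python. This is only a model of the rule described in this decision (threshold N applies priority 1 through N recommendations); the function names and the sample recommendation data are hypothetical and do not come from any VMware API.

```python
def applied_priorities(migration_threshold):
    """Return the recommendation priorities applied at a given threshold."""
    if not 1 <= migration_threshold <= 5:
        raise ValueError("migration threshold must be between 1 and 5")
    return list(range(1, migration_threshold + 1))


def filter_recommendations(recommendations, migration_threshold):
    """Keep only the recommendations DRS would apply automatically."""
    allowed = set(applied_priorities(migration_threshold))
    return [rec for rec in recommendations if rec["priority"] in allowed]


# Hypothetical recommendations; lower priority numbers are more important.
sample = [
    {"vm": "vm-a", "priority": 1},
    {"vm": "vm-b", "priority": 3},
    {"vm": "vm-c", "priority": 4},
    {"vm": "vm-d", "priority": 5},
]

if __name__ == "__main__":
    for rec in filter_recommendations(sample, migration_threshold=3):
        print(rec["vm"], "would be migrated")
```

With threshold 3, the priority 4 and 5 recommendations in the sample are left unapplied, which is the trade-off this decision accepts in exchange for fewer vMotion migrations.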

Implications

1. DRS will not move workloads via vMotion where only a moderate improvement to the cluster will be achieved
2. At times, including after ESXi hosts have been updated (via VUM), the cluster may appear to be unevenly balanced because DRS may calculate minimal benefit from migrations. Temporarily setting DRS to Fully Automated with migration threshold 5 for a short period following maintenance should result in a more evenly balanced cluster.