Example Architectural Decision – ESXi Host Hardware Sizing (Example 1)

Problem Statement

What is the most suitable hardware specifications for this environments ESXi hosts?

Requirements

1. Support Virtual Machines of up to 16 vCPUs and 256GB RAM
2. Achieve up to 400% CPU overcommitment
3. Achieve up to 150% RAM overcommitment
4. Ensure cluster performance is both consistent & maximized
5. Support IP based storage (NFS & iSCSI)
6. The average VM size is 1vCPU / 4GB RAM
7. Cluster must support approx 1000 average size Virtual machines day 1
8. The solution should be scalable beyond 1000 VMs (Future-Proofing)
9. N+2 redundancy

Assumptions

1. vSphere 5.0 or later
2. vSphere Enterprise Plus licensing (to support Network I/O Control)
3. VMs range from Business Critical Application (BCAs) to non critical servers
4. Software licensing for applications being hosted in the environment are based on per vCPU OR per host where DRS “Must” rules can be used to isolate VMs to licensed ESXi hosts

Constraints

1. None

Motivation

1. Create a Scalable solution
2. Ensure high performance
3. Minimize HA overhead
4. Maximize flexibility

Architectural Decision

Use Two Socket Servers w/ >= 8 cores per socket with HT support (16 physical cores / 32 logical cores) , 256GB Ram , 2 x 10GB NICs

Justification

1. Two socket 8 core (or greater) CPUs with Hyper threading will provide flexibility for CPU scheduling of large numbers of diverse (vCPU sized) VMs to minimize CPU Ready (contention)

2. Using Two Socket servers of the proposed specification will support the required 1000 average sized VMs with 18 hosts with 11% reserved for HA to meet the required N+2 redundancy.

3. A cluster size of 18 hosts will deliver excellent cluster (DRS) efficiency / flexibility with minimal overhead for HA (Only 11%) thus ensuring cluster performance is both consistent & maximized.

4. The cluster can be expanded with up to 14 more hosts (to the 32 host cluster limit) in the event the average VM size is greater than anticipated or the customer experiences growth

5. Having 2 x 10GB connections should comfortably support the IP Storage / vMotion / FT and network data with minimal possibility of contention. In the event of contention Network I/O Control will be configured to minimize any impact (see Example VMware vNetworking Design w/ 2 x 10GB NICs)

6. RAM is one of the most common bottlenecks in a virtual environment, with 16 physical cores and 256GB RAM this equates to 16GB of RAM per physical core. For the average sized VM (1vCPU / 4GB RAM) this meets the CPU overcommitment target (up to 400%) with no RAM overcommitment to minimize the chance of RAM becoming the bottleneck

7. In the event of a host failure, the number of Virtual machines impacted will be up to 64 (based on the assumed average size VM) which is minimal when compared to a Four Socket ESXi host which would see 128 VMs impacted by a single host outage

8. If using Four socket ESXi hosts the cluster size would be approx 10 hosts and would require 20% of cluster resources would have to be reserved for HA to meet the N+2 redundancy requirement. This cluster size is less efficient from a DRS perspective and the HA overhead would equate to higher CapEx and as a result lower the ROI

9. The solution supports Virtual machines of up to 16 vCPUs and 256GB RAM although this size VM would be discouraged in favour of a scale out approach (where possible)

10. The cluster aligns with a virtualization friendly “Scale out” methodology

11. Using smaller hosts (either single socket, or less cores per socket) would not meet the requirement to support supports Virtual machines of up to 16 vCPUs and 256GB RAM , would likely require multiple clusters and require additional 10GB and 1GB cabling as compared to the Two Socket configuration

12. The two socket configuration allows the cluster to be scaled (expanded) at a very granular level (if required) to reduce CapEx expenditure and minimize waste/unused cluster capacity by adding larger hosts

13. Enabling features such as Distributed Power Management (DPM) are more attractive and lower risk for larger clusters and may result in lower environmental costs (ie: Power / Cooling)

Alternatives

1.  Use Four Socket Servers w/ >= 8 cores per socket , 512GB Ram , 4 x 10GB NICs
2.  Use Single Socket Servers w/ >= 8 cores , 128GB Ram , 2 x 10GB NICs
3. Use Two Socket Servers w/ >= 8 cores , 512GB Ram , 2 x 10GB NICs
4. Use Two Socket Servers w/ >= 8 cores , 384GB Ram , 2 x 10GB NICs
5. Have two clusters of 9 hosts with the recommended hardware specifications

Implications

1. Additional IP addresses for ESXi Management, vMotion, FT & Out of band management will be required as compared to a solution using larger hosts

2. Additional out of band management cabling will be required as compared to a solution using larger hosts

Related Articles

1. Example Architectural Decision – Network I/O Control for ESXi Host using IP Storage (4 x 10 GB NICs)

2. Example VMware vNetworking Design w/ 2 x 10GB NICs

3. Network I/O Control Shares/Limits for ESXi Host using IP Storage

4. VMware Clusters – Scale up for Scale out?

5. Jumbo Frames for IP Storage (Do not use Jumbo Frames)

6. Jumbo Frames for IP Storage (Use Jumbo Frames)

CloudXClogo

 

Example Architectural Decision – Host Isolation Response for IP Storage

Problem Statement

What are the most suitable HA / host isolation response when using IP based storage (In this case, Netapp HA Pair in 7-mode) when the IP storage runs over physically separate network cards and switches to ESXi management?

Assumptions

1. vSphere 5.0 or greater (To enable use of Datastore Heartbearting)
2. vFiler1 & vFiler2 reside on different physical Netapp Controllers (within the same HA Pair in 7-mode)
3. Virtual Machine guest operating systems with an I/O timeout of 190 seconds to allow for a Controller fail-over (Maximum 180 seconds)

Motivation

1. Minimize the chance of a false positive isolation response
2.Ensure in the event the storage is unavailable that virtual machines are promptly shutdown to minimize impact on the applications/data.

Architectural Decision

Turn off the default isolation address and configure the below specified isolation addresses, which check connectivity to multiple Netapp vFilers (IP storage) on the vFiler management VLAN and the IP storage interface.

Utilize Datastore heartbeating, checking multiple datastores hosted across both Netapp controllers (in HA Pair) to confirm the datastores themselves are accessible.

Services VLANs
das.isolationaddress1 : vFiler1 Mgmt Interface 192.168.1.10
das.isolationaddress2 : vFiler2 Mgmt Interface 192.168.2.10

IP Storage VLANs
das.isolationaddress3 : vFiler1 vIF 192.168.10.10
das.isolationaddress4 : vFiler2 vIF 192.168.20.10

Configure Datastore Heartbeating with “Select any of the clusters datastores taking into account my preference” and select the following datastores

  • One datastore from vFiler1 (Preference)
  • One datastore from vFiler2 (Preference)
  • A second datastore from vFiler1
  • A second datastore from vFiler2

Configure Host Isolation Response to: Power off.

Justification

1. The ESXi Management traffic is running on a standard vSwitch with 2 x 1GB connections which connect to different physical switches to the IP storage (and Data) traffic (which runs over 10GB connections). Using the ESXi management gateway (default isolation address) to deter main isolation is not suitable as the management network can be offline without impacting the IP storage or data networks. This situation could lead to false positives isolation responses.
2. The isolation addresses chosen test both data and IP storage connectivity over the converged 10Gb network
3. In the event the four isolation addresses (Netapp vFilers on the Services and IP storage interfaces) cannot be reached by ICMP, Datastore heartbeating will be used to confirm if the specified datastores (hosted on separate physical Netapp controllers) are accessible or not before any isolation action will be taken.
4. In the event the two storage controllers do not respond to ICMP on either the Services or IP storage interfaces, and both the specified datastores are inaccessible, it is likely there has been a catastrophic failure in the environment, either to the network, or the storage controllers themselves, in which case the safest option is to shutdown the VMs.
5. In the event the isolation response is triggered and the isolation does not impact all hosts within the cluster, the VM will be restarted by HA onto a surviving host.

Implications

1. In the event the host cannot reach any of the isolation addresses, and datastore heartbeating cannot access the specified datastores, virtual machines will be powered off.

Alternatives

1. Set Host isolation response to “Leave Powered On”
2. Do not use Datastore heartbeating
3. Use the default isolation address

For more details, refer to my post “VMware HA and IP Storage