Example Architectural Decision – Transparent Page Sharing (TPS) Configuration for VDI (2 of 2)

Problem Statement

In a VMware vSphere environment, with future releases of ESXi disabling Transparent Page Sharing by default, what is the most suitable TPS configuration for a Virtual Desktop environment?

Assumptions

1. TPS is disabled by default
2. Storage is expensive
3. Two Socket ESXi Hosts have been chosen to align with a scale out methodology.
4. Average VDI user is Task Worker with 1vCPU and 2GB Ram.
5. Memory is the first compute level constraint.
6. HA Admission Control policy used is “Percentage of Cluster Resources reserved for HA”
7. vSphere 5.5 or earlier

Requirements

1. VDI environment costs must be minimized

Motivation

1. Reduce complexity where possible.
2. Maximize the efficiency of the infrastructure

Architectural Decision

Enable TPS and disable Large Memory pages

Justification

1. Disabling Large pages is essential to maximizing the benefits of TPS
2. Not disabling large pages would likely result in minimal TPS savings
3. With Kiosk and Task worker VDI profiles, the percentage of memory which is likely to be shared is higher than for Power users.
4. Existing shared storage has plenty of spare Tier 1 capacity to vSwap files

Implications

1. Sufficient capacity for VM swap files must be catered for.
2. VDI & Storage performance may be impacted significantly in the event of memory contention.
3. Decreased memory costs may result in increased storage costs.
4. During patching, and operational verification that non default settings have not been reverted by the patching of ESXi.
5. Additional CPU overhead on ESXi from enabling TPS.
6. HA admission control will calculate fail-over requirements (when using Percentage of cluster resources reserved for HA) so that performance will be approximately the same in the event of a fail-over due to reserving the full RAM reserved for every VM,
6. HA admission control (when configured to Percentage of Cluster resources reserved for HA) will only calculate fail-over capacity based on 0MB + VM overhead for each VM which can lead to significantly degraded performance in a HA event.
7. Higher core count (and higher cost) CPUs may be desired to drive overcommitment ratios as RAM will be less likely to be a point of contention.

Alternatives

1. Use 100% memory reservation and leave TPS disabled (default)
2. Use 50% memory reservation and Enable TPS and disable large pages

Related Articles:

1. The Impact of Transparent Page Sharing (TPS) being disabled by default @josh_odgers (VCDX#90)

2. Example Architectural Decision – Transparent Page Sharing (TPS) Configuration for VDI (1 of 2)

3. Future direction of disabling TPS by default and its impact on capacity planning –@FrankDenneman (VCDX #29)

4. Transparent Page Sharing Vulnerable, Yet Largely Irrelevant – @ChrisWahl(VCDX#104)

The Impact of Transparent Page Sharing (TPS) being disabled by default

Recently VMware announced via the VMware Security Blog, that Transparent Page Sharing (TPS) will be disabled by default in an upcoming update of ESXi.

Since this announcement I have been asked how will this impact sizing vSphere solutions and as a result I’ve been involved in discussions about the impact of this on Business Critical Application, Server and VDI solutions.

Firstly what benefits does TPS provide? In my experience, in recent times with large memory pages essentially not being compatible with TPS, even for VDI environments where all VMs are running the same OS, the benefits have been minimal, in general <20% if that.

Memory overcommitment in general is not something that can achieve significant savings from because memory is much harder to overcommit than CPU. Overcommitment can be achieved but only where memory is not all being used by the VM/OS & Applications, in which case, simply right sizing VMs will give similar memory saving and likely result in better overall VM and cluster performance.

So to begin, in my opinion TPS is in most cases overrated.

Next Business Critical Applications (vBCA):

In my experience, Business Critical Applications such as MS Exchange, MS SQL , Oracle would generally have memory reservations, and in most cases the memory reservation would be 100% (All Memory Locked).

As a result, in most environments running vBCA’s, TPS has no benefits already, so TPS being disabled has no significant impact for these workloads.

Next End User Computing (EUC) Solutions:

There are a number of EUC solutions, such as Horizon View , Citrix XenDesktop and Citrix PVS which all run very well on vSphere.

One common issue with EUC solutions is architects fail to consider the vSwap storage requirements for Virtual Servers (for Citrix PVS) or VDI such as Horizon View.

As a result, a huge amount of Tier 1 storage can be wasted with vswap file storage. This can be up to the amount vRAM allocated to VMs less memory reservations!

The last part is a bit of a hint, how can we reduce or eliminate the need for Tier 1 storage of vSwap? By using Memory Reservations!

While TPS can provide some memory savings, I would invite you to consider the cost saving of eliminating the need for vSwap storage space on your storage solution, and the guarantee of consistent performance (at least from a memory perspective) outweigh the benefits of TPS.

Next Virtual Server Solutions:

Lets say we’re talking about general production servers excluding vBCAs (discussed earlier). These servers are providing applications and functions to your end users so consistent performance is something the business is likely to demand.

When sizing your cluster/s, architects should size for at least N+1 redundancy and to have memory utilization around the 1:1 mark in a host failure scenario. (i.e.: Size your cluster assuming a host failure or maintenance of one host is being performed).

As a result, any reasonable architectural assumption around TPS savings would be minimal.

As with EUC solutions, I would again invite you to consider the cost saving of eliminating the vSwap storage requirement and the guarantee of consistent performance outweigh the benefits of TPS.

Next Test/Dev Environments:

This is probably the area where TPS will provide the most benefit, where memory overcommitment ratios can be much higher as the impact to the applications(VMs) of memory saving techniques such as swapping/ballooning should not have as high an impact on the business as with vBCA, EUC or Server workloads.

However, what is Test/Dev for? In my opinion, Test/Dev should where possible simulate production conditions so the operational verification of an application can be accurately conducted before putting the workloads into production. As such, the Test/Dev VMs should be configured the same way as they are intended to be put into production, including Memory Reservations and CPU overcommitment.

So, can more compute overcommitment be achieved in Test/Dev, sure, but again is the impact of vSwap space, potentially inconsistent performance and the increased risk of operational verification not being performed to properly simulate product worth the minimal benefits of TPS?

Summary

If VMware believe TPS is a significant enough security issue to make it disabled by default, this is something architects should consider, however I would argue there are many other areas where security is a much larger issue, but that’s a different topic.

TPS being disabled by default is likely to only impact a small percentage of virtual workloads and with RAM being one of the most inexpensive components in the datacenter, ensuring consistent performance by using Memory Reservations and eliminating the architectural considerations and potentially high storage costs for VMs vSwap make leaving TPS disabled an attractive option regardless of if its truly a security advantage or not.

Related Articles:

1. Future direction of disabling TPS by default and its impact on capacity planning – @FrankDenneman (VCDX #29)

2. Transparent Page Sharing Vulnerable, Yet Largely Irrelevant – @ChrisWahl (VCDX#104)

Example Architectural Decision – Storage Protocol Choice for a Horizon View Environment

Problem Statement

What is the most suitable storage protocol for a Virtual Desktop (Horizon View) environment using Linked Clones?

Assumptions

1. VMware View 5.3 or later

Motivation

1. Minimize recompose (maintenance) window
2. Minimize impact on the storage array and HA/DRS cluster during recompose activities
3. Reduce storage costs where possible
4. Simplify the storage design eg: Number/size of Datastores / Storage Connectivity
5. Reduce the total solution cost eg: Number of Hosts required

Architectural Decision

Use Network File System (NFS)

Justification

1. Using native NFS snapshot (VCAI) offloads the creation of VMs to the array, therefore reducing the compute overhead on the ESXi hosts
2. Native NFS snapshots require much less disk space than traditional linked clones
3. Recomposition times are reduced due to the offloading of the cloning to the array
4. More virtual machines can be supported per NFS datastore compared to VMFS datastores (200+ for NFS compared to max recommended of 140, but it is generally recommended to design for much lower numbers eg: 64 per VMFS)
5. Recompositions/Refresh activities can be performed during business hours, or at Logoff (for Refresh) with minimal impact to the HA/DRS cluster, thus giving more flexibility to maintain the environment
6. Avoid’s potential VMFS locking issues – although this issue is not as important for environments using vSphere 4.1 onward with VAAI compatible arrays
7. When sizing your storage array, less capacity is required. Note: Performance sizing is also critical
8. The cost and complexity of a FC Storage Area Network can be avoided
9. Fewer ESXi hosts may be required as the compute overhead of driving cloning has been removed thus reducing cost
10. VCAI is fully supported feature in Horizon View 5.3

Implications

1. The Storage Array supports NFS native snapshot offload to enable the full benefit of NFS (VCAI clones) however all other benefits remain without VCAI support.

Alternatives

1. Use VMFS (block) based datastores via iSCSI or FC/FCoE and have more VMFS datastores – Note: Recompose activity will be driven by the host which adds an overhead to the cluster. (Not Recommended)