Example Architectural Decision – Transparent Page Sharing (TPS) Configuration for VDI (1 of 2)

Problem Statement

In a VMware vSphere environment, with future releases of ESXi disabling Transparent Page Sharing (TPS) by default, what is the most suitable TPS configuration for a Virtual Desktop Infrastructure (VDI) environment?

Assumptions

1. TPS is disabled by default.
2. Storage is expensive.
3. Two-socket ESXi hosts have been chosen to align with a scale-out methodology.
4. The HA Admission Control policy used is “Percentage of Cluster Resources Reserved for HA”.
5. vSphere 5.5 or earlier.

Requirements

1. The VDI environment must deliver consistent performance.
2. The VDI environment must support a high percentage of Power Users.

Motivation

1. Reduce complexity where possible.
2. Maximize the efficiency of the infrastructure.

Architectural Decision

Leave TPS disabled (default) and apply 100% Memory Reservations to VDI VMs and/or Golden Master Image.
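For anyone applying this decision at scale, the equivalent of the “Reserve all guest memory (All locked)” checkbox can be set via the vSphere API. Below is a minimal pyVmomi sketch; the vCenter address, credentials, and the “VDI-” naming prefix are hypothetical placeholders, not part of the original decision.

```python
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# Hypothetical connection details for illustration only.
si = SmartConnect(host="vcenter.example.com",
                  user="administrator@vsphere.local",
                  pwd="password",
                  sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

# Walk all VMs and lock the full memory reservation on the VDI desktops.
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.VirtualMachine], True)
for vm in view.view:
    if vm.name.startswith("VDI-"):  # hypothetical naming convention
        spec = vim.vm.ConfigSpec()
        # Equivalent to "Reserve all guest memory (All locked)": the
        # reservation always tracks the configured memory size.
        spec.memoryReservationLockedToMax = True
        vm.ReconfigVM_Task(spec)

view.DestroyView()
Disconnect(si)
```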

Justification

1. Setting 100% memory reservations ensures consistent performance by eliminating the possibility of hypervisor-level swapping.
2. A 100% memory reservation also eliminates the capacity used by the vswap file, which saves space on the shared storage as well as reducing the impact on the storage in the event of swapping (see the worked example after this list).
3. RAM is cheaper than Tier 1 storage (which is recommended for vswap storage to ensure minimal performance impact during swapping), so the increased cost of memory in the hosts is easily offset by the saving in shared storage.
4. Simplicity. Leaving default settings is advantageous from both an architectural and an operational perspective. Example: ESXi patching can cause settings to revert to default, which could negate TPS savings and put a sudden high demand on storage where TPS savings were expected.
5. TPS savings for desktops can be significant; however, with a high percentage of Power Users running desktops with >=4GB RAM and 2 vCPUs, the TPS savings are lower than for Kiosk or Task users, who typically run 1-2GB per desktop.
6. The decision has been made to use two-socket ESXi hosts and scale out, so the TPS savings per host will be lower than on a four-socket server with double the RAM, simply because fewer VMs per host means fewer pages that can be shared.
7. HA Admission Control (using “Percentage of Cluster Resources Reserved for HA”) calculates failover requirements from the full reservation of every VM, so performance will be approximately the same in the event of a failover, leading to more consistent performance under a wider range of circumstances.
8. Lower core count (and lower cost) CPUs will likely be viable, as RAM, rather than CPU, will likely be the first constraint on further consolidation.
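
To put Justification #2 into numbers: the vswap file consumes configured memory minus reservation per VM, so a 100% reservation shrinks it to zero. A quick illustrative calculation follows; the desktop count and memory size are hypothetical examples, not figures from the original post.

```python
# Illustrative vswap sizing; the desktop count and memory size are
# hypothetical examples.
desktops = 2000          # number of VDI desktops
vram_gb = 4              # configured RAM per desktop (Power User profile)

def vswap_gb(configured_gb, reservation_pct):
    """Per-VM vswap file size: configured memory minus the reservation."""
    return configured_gb * (1 - reservation_pct)

for pct in (0.0, 0.5, 1.0):
    total_tb = desktops * vswap_gb(vram_gb, pct) / 1024
    print(f"{pct:4.0%} reservation -> {total_tb:.2f} TB of Tier 1 vswap storage")

# Output:
#   0% reservation -> 7.81 TB of Tier 1 vswap storage
#  50% reservation -> 3.91 TB of Tier 1 vswap storage
# 100% reservation -> 0.00 TB of Tier 1 vswap storage
```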

Implications

1. Using 100% memory reservations requires that ESXi hosts and the cluster be sized at a 1:1 ratio of vRAM to pRAM (physical RAM), including N+1 capacity so a host failure can be tolerated (see the sizing sketch after this list).
2. Increased RAM costs.
3. No memory overcommitment can be achieved.
4. Potential for lower CPU utilization / overcommitment as RAM may become the first constraint.
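
Implication #1 can be made concrete with a simple sizing sketch. All figures below (desktop count, host RAM, overhead allowance) are hypothetical examples:

```python
import math

# Hypothetical cluster sizing at a 1:1 vRAM:pRAM ratio with N+1 hosts.
desktops = 2000            # VDI desktops to host
vram_gb = 4                # configured (and 100% reserved) RAM per desktop
host_ram_gb = 512          # physical RAM per two-socket host
overhead_pct = 0.10        # allowance for ESXi and per-VM memory overhead

usable_gb = host_ram_gb * (1 - overhead_pct)
hosts_for_load = math.ceil(desktops * vram_gb / usable_gb)   # -> 18
hosts_total = hosts_for_load + 1   # N+1: tolerate one host failure -> 19

# Matching "Percentage of Cluster Resources Reserved for HA" setting:
ha_pct = math.ceil(100 / hosts_total)                        # -> 6%
print(f"{hosts_total} hosts, reserve {ha_pct}% of cluster resources for HA")
```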

Alternatives

1. Use a 50% reservation and enable TPS.
2. Use no reservation, enable TPS, and disable large pages (see the sketch after this list).
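
Should Alternative 2 be chosen instead, large pages are disabled per host via the Mem.AllocGuestLargePage advanced setting (ESXi does not share 2MB large pages; it only breaks them into shareable 4KB pages under memory pressure, so disabling large pages lets TPS work proactively). A minimal pyVmomi sketch, reusing the hypothetical connection (“si”) from the earlier example; exact value typing can vary between pyVmomi versions:

```python
from pyVmomi import vim

# Assumes "si" is an existing pyVmomi connection (see the earlier sketch).
content = si.RetrieveContent()
hosts = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.HostSystem], True)

for host in hosts.view:
    # Mem.AllocGuestLargePage = 0 stops ESXi backing guest RAM with 2MB
    # large pages, so TPS can collapse identical 4KB pages proactively.
    opt = vim.option.OptionValue(key="Mem.AllocGuestLargePage", value=0)
    host.configManager.advancedOption.UpdateOptions(changedValue=[opt])

hosts.DestroyView()
```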

Related Articles:

1. The Impact of Transparent Page Sharing (TPS) being disabled by default – @josh_odgers (VCDX #90)

2. Example Architectural Decision – Transparent Page Sharing (TPS) Configuration for VDI (2 of 2)

3. Future direction of disabling TPS by default and its impact on capacity planning – @FrankDenneman (VCDX #29)

4. Transparent Page Sharing Vulnerable, Yet Largely Irrelevant – @ChrisWahl (VCDX #104)

3 thoughts on “Example Architectural Decision – Transparent Page Sharing (TPS) Configuration for VDI (1 of 2)”

  1. Another great post!

    Having to re-evaluate a design myself due to VMware’s latest statement about TPS, I have a few questions regarding your architectural decisions:

    1- Requirement #1, if you require overall consistent performance, why not have a 1:1 CPU ratio as well, especially with requirement #2?
    2- Requirement #2 looks more like a constraint than a requirement as it narrows down and limits your design options, doesn’t it?
    3- Justification #3, is memory really cheaper than Tier 1 storage?
    4- Justification #5, TPS achieves higher savings when VM profiles are largely identical (same OS and same set of applications), therefore a high percentage of power user profiles will likely produce greater TPS savings whatever the VM memory size, isn’t it?
    5- Justification #6, I didn’t know about this dependency between the number of sockets and TPS savings… could you elaborate?
    6- Justification #7, going for a percentage increases management overhead. The percentage has to be manually re-evaluated every time VMs and/or nodes are added. Wouldn’t it be easier to use the classic “Host Failures Cluster Tolerates” policy, especially with quite homogeneous VM profiles?
    7- Justification #8, you’re making an assumption that power users are memory bound. They could be CPU bound! I guess a capacity planning tool would have spotted this and that would end up as a constraint for the design…

    Thanks,
    Didier

    • Hey Didier,

      I’ll try to tackle each of your questions and comments one at a time.

      Q1: Requirement #1, if you require overall consistent performance, why not have a 1:1 CPU ratio as well, especially with requirement #2?

      A1: For VDI, a 1:1 vCPU-to-pCore ratio is really unnecessary and would likely kill the value proposition of VDI completely. CPU overcommitment, even for Power Users, can easily be >6:1 and as high as 12:1 while still providing excellent performance.
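
      For a sense of scale, here is a quick illustrative calculation; the core counts are hypothetical, not from the original reply:

      ```python
      # Illustrative CPU consolidation at an 8:1 vCPU:pCore ratio.
      pcores = 2 * 12          # two-socket host, 12 cores per socket (hypothetical)
      ratio = 8                # within the 6:1 to 12:1 range mentioned above
      vcpus = pcores * ratio   # 192 vCPUs available for desktops
      desktops = vcpus // 2    # 2 vCPU Power User desktops -> 96 per host
      print(f"{pcores} pCores at {ratio}:1 -> {desktops} x 2 vCPU desktops per host")
      ```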

      Q2: Requirement #2 looks more like a constraint than a requirement as it narrows down and limits your design options, doesn’t it?

      A2: I’m suggesting it’s a requirement, but whatever we want to call it, the key here is that the Power User profile has higher CPU/RAM requirements than a Kiosk or Task worker, and we want to consider that TPS will generally be less effective the larger the RAM per VM.

      Q3: Justification #3, is memory really cheaper than Tier 1 storage?

      A3: In my experience, yes. Happy to be proven wrong though 🙂 Swap files need to be on reasonably high performance disk (SAN/NAS/hyperconverged), and these solutions are generally high performance and much higher cost than the additional RAM needed to allow for partial or full memory reservations.

      Q4: TPS achieves higher savings when VM profiles are largely identical (same OS and same set of applications), therefore a high percentage of power user profiles will likely produce greater TPS savings whatever the VM memory size, isn’t it?

      A4: Power Users may have the same set of applications, or they may not. As a rule of thumb from what I’ve seen, the larger the RAM per VM, the smaller the percentage of RAM that can generally be shared. However, this may vary from customer to customer.

      Q5: I didn’t know about this dependency between the number of sockets and TPS savings… could you elaborate?

      A5: There isn’t a dependency on CPU sockets. My point here is that a 4 socket box can support, and will generally be configured with, much more memory than a 2 socket host, and as a result has a much higher chance of TPS providing higher savings due to the higher number of VMs per host which can share memory.

      Q6: going for a percentage increases management overhead. The percentage has to be manually re-evaluated every time VMs and/or nodes are added. Wouldn’t it be easier to use the classic “Host Failures Cluster Tolerates” policy, especially with quite homogeneous VM profiles?

      A6: True, it does. However, I would consider this overhead minimal compared to the advantages of the percentage-based reservation. “Host Failures Cluster Tolerates” is based on the slot sizing algorithm, which can be very inefficient with memory reservations, as the worked example below shows.
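
      A worked example of the slot-size problem (all numbers are hypothetical):

      ```python
      # Slot size under "Host Failures Cluster Tolerates" is derived from the
      # LARGEST reservation in the cluster, so one big reservation shrinks the
      # usable slot count for every VM.
      host_ram_gb = 512
      reservations_gb = [2] * 99 + [16]   # 99 VMs reserving 2GB, one reserving 16GB

      slot_gb = max(reservations_gb)      # slot size = 16GB
      print(host_ram_gb // slot_gb)       # -> 32 slots per host
      print(host_ram_gb // 2)             # -> 256 slots if the 16GB VM were excluded
      ```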

      Q7: you’re making an assumption that power users are memory bound. They could be CPU bound! I guess a capacity planning tool would have spotted this and that would end up as a constraint for the design…

      A7: I guess you’re right, I am making that assumption. For VDI this has far and away been my experience, as CPU can be heavily overcommitted (as mentioned earlier) whereas RAM is generally well under 1.5:1 even with TPS enabled and large memory pages disabled. Capacity planning is always important, as is running a Proof of Concept during the planning/design phase for any large scale project.

      Hope that helps, thanks for the comment.