Example Architectural Decision – Transparent Page Sharing (TPS) Configuration for Mixed Production Servers (1 of 2)

Problem Statement

In a VMware vSphere environment, with future releases of ESXi disabling Transparent Page Sharing by default, what is the most suitable TPS configuration for an environment running mixed production server workloads?

Assumptions

1. TPS is disabled by default
2. Storage is expensive
3. Two Socket ESXi Hosts have been chosen to align with a scale out methodology.
4. Average Server VM is between 2-4vCPU and 4-8GB Ram with some larger.
5. Memory is the first compute level constraint.
6. HA Admission Control policy used is “Percentage of Cluster Resources reserved for HA”
7. vSphere 5.5 or earlier

Requirements

1. The environment must deliver consistent performance
2. Minimize the cost of shared storage

Motivation

1. Reduce complexity where possible.
2. Maximize the efficiency of the infrastructure

Architectural Decision

Leave TPS disabled (default) and leave Large Memory pages enabled (default).

Justification

1. Setting 100% memory reservations ensures consistent performance by eliminating the possibility of swapping.
2. The 100% memory reservation also eliminates the capacity usage by the vswap file which saves space on the shared storage as well as reducing the impact on the storage in the event of swapping.
3. RAM is cheaper than Tier 1 storage (which is recommended for vSwap storage to ensure minimal performance impact during swapping) so the increased cost of memory in the hosts is easily offset by the saving in Tier 1 shared storage.
4. Simplicity. Leaving default settings is advantageous from both an architectural and operational perspective.  Example: ESXi Patching can cause settings to revert to default which could negate TPS savings and put a sudden high demand on storage where TPS savings are expected.
5. TPS savings for server workloads is typically much less than with desktop workloads and as a result less attractive.
6. The decision has been made to use 2 socket ESXi hosts and scale out so the TPS savings per host compared to a 4 socket server with double the RAM will be lower.
7. HA admission control will calculate fail-over requirements (when using Percentage of cluster resources reserved for HA) so that performance will be approximately the same in the event of a fail-over due to reserving the full RAM reserved for every VM leading to more consistent performance under a wider range of circumstances.
8. Lower core count (and lower cost) CPUs will likely be viable as RAM will likely be the first constraint for further consolidation.
9. Remove the real or perceived security risk of sensitive information being gathered from other VMs using TPS as described in VMware KB 2080735

Implications

1. Using 100% memory reservations requires ESXi hosts and the cluster be sized at a 1:1 ratio of vRAM to pRAM (Physical RAM) and should include N+1 so a host failure can be tolerated.
2. Increased RAM costs
3. No memory overcommitment can be achieved
4. Potential for lower CPU utilization / overcommitment as RAM may become the first constraint.

Alternatives

1. Use 50% reservation and enable TPS
2. Use no reservation, Enable TPS and disable large pages

Related Articles:

1. The Impact of Transparent Page Sharing (TPS) being disabled by default @josh_odgers (VCDX#90)

2. Example Architectural Decision – Transparent Page Sharing (TPS) Configuration for Production Servers (2 of 2)

3. Future direction of disabling TPS by default and its impact on capacity planning –@FrankDenneman (VCDX #29)

4. Transparent Page Sharing Vulnerable, Yet Largely Irrelevant – @ChrisWahl (VCDX#104)

The Impact of Transparent Page Sharing (TPS) being disabled by default

Recently VMware announced via the VMware Security Blog, that Transparent Page Sharing (TPS) will be disabled by default in an upcoming update of ESXi.

Since this announcement I have been asked how will this impact sizing vSphere solutions and as a result I’ve been involved in discussions about the impact of this on Business Critical Application, Server and VDI solutions.

Firstly what benefits does TPS provide? In my experience, in recent times with large memory pages essentially not being compatible with TPS, even for VDI environments where all VMs are running the same OS, the benefits have been minimal, in general <20% if that.

Memory overcommitment in general is not something that can achieve significant savings from because memory is much harder to overcommit than CPU. Overcommitment can be achieved but only where memory is not all being used by the VM/OS & Applications, in which case, simply right sizing VMs will give similar memory saving and likely result in better overall VM and cluster performance.

So to begin, in my opinion TPS is in most cases overrated.

Next Business Critical Applications (vBCA):

In my experience, Business Critical Applications such as MS Exchange, MS SQL , Oracle would generally have memory reservations, and in most cases the memory reservation would be 100% (All Memory Locked).

As a result, in most environments running vBCA’s, TPS has no benefits already, so TPS being disabled has no significant impact for these workloads.

Next End User Computing (EUC) Solutions:

There are a number of EUC solutions, such as Horizon View , Citrix XenDesktop and Citrix PVS which all run very well on vSphere.

One common issue with EUC solutions is architects fail to consider the vSwap storage requirements for Virtual Servers (for Citrix PVS) or VDI such as Horizon View.

As a result, a huge amount of Tier 1 storage can be wasted with vswap file storage. This can be up to the amount vRAM allocated to VMs less memory reservations!

The last part is a bit of a hint, how can we reduce or eliminate the need for Tier 1 storage of vSwap? By using Memory Reservations!

While TPS can provide some memory savings, I would invite you to consider the cost saving of eliminating the need for vSwap storage space on your storage solution, and the guarantee of consistent performance (at least from a memory perspective) outweigh the benefits of TPS.

Next Virtual Server Solutions:

Lets say we’re talking about general production servers excluding vBCAs (discussed earlier). These servers are providing applications and functions to your end users so consistent performance is something the business is likely to demand.

When sizing your cluster/s, architects should size for at least N+1 redundancy and to have memory utilization around the 1:1 mark in a host failure scenario. (i.e.: Size your cluster assuming a host failure or maintenance of one host is being performed).

As a result, any reasonable architectural assumption around TPS savings would be minimal.

As with EUC solutions, I would again invite you to consider the cost saving of eliminating the vSwap storage requirement and the guarantee of consistent performance outweigh the benefits of TPS.

Next Test/Dev Environments:

This is probably the area where TPS will provide the most benefit, where memory overcommitment ratios can be much higher as the impact to the applications(VMs) of memory saving techniques such as swapping/ballooning should not have as high an impact on the business as with vBCA, EUC or Server workloads.

However, what is Test/Dev for? In my opinion, Test/Dev should where possible simulate production conditions so the operational verification of an application can be accurately conducted before putting the workloads into production. As such, the Test/Dev VMs should be configured the same way as they are intended to be put into production, including Memory Reservations and CPU overcommitment.

So, can more compute overcommitment be achieved in Test/Dev, sure, but again is the impact of vSwap space, potentially inconsistent performance and the increased risk of operational verification not being performed to properly simulate product worth the minimal benefits of TPS?

Summary

If VMware believe TPS is a significant enough security issue to make it disabled by default, this is something architects should consider, however I would argue there are many other areas where security is a much larger issue, but that’s a different topic.

TPS being disabled by default is likely to only impact a small percentage of virtual workloads and with RAM being one of the most inexpensive components in the datacenter, ensuring consistent performance by using Memory Reservations and eliminating the architectural considerations and potentially high storage costs for VMs vSwap make leaving TPS disabled an attractive option regardless of if its truly a security advantage or not.

Related Articles:

1. Future direction of disabling TPS by default and its impact on capacity planning – @FrankDenneman (VCDX #29)

2. Transparent Page Sharing Vulnerable, Yet Largely Irrelevant – @ChrisWahl (VCDX#104)

VMworld then and now! (2013 vs 2014)

Last year I did an interview with Eric Sloof @esloof of VMworld TV (below) where we discussed the basics (or the 101) of Nutanix and this was the theme of questions from attendees throughout the Solutions Exchange.

Meet the team behind Nutanix VMworld 2013 – https://www.youtube.com/watch?v=T56KBaB3OUk

Jump forward to this years VMworld (2014) and I was lucky enough to get an opportunity to interview with  VMworld TV again. Eric and I agreed that we didn’t want a simple repeat of last years interview, but talk about more benefits of the platform (or the 201 level).

Nutanix speaks to VMworld TV about their exciting new products – https://t.co/brA15Zgcql

The interesting part of VMworld 2014 and my time on the booth, the theme of questions from attendees was significantly different from last year, and in large part was focused on Business Critical Applications and Server workloads from both prospective and existing customers.

One of my focusses over the last year has been Business Critical Applications and improving the Nutanix platform for these workloads. I am proud to say we (Nutanix) has made significant improvements in this area and we have a strong offering especially with the new NX-8150 platform which my team were responsible for designing.

I am looking forward to interviewing at next years VMworld and covering the Advanced/Expert level topics (301 level) with Eric and the fantastic VMworld TV crew.