Common Mistake: Using CPU reservations to solve CPU Ready

One of the more common problems I see in virtual environments is oversized virtual machines, which typically results in lower performance and, you guessed it, high CPU Ready.

What is CPU Ready?

CPU Ready is basically the time it takes a VM to be scheduled onto a physical core after it is placed in the CPU scheduling queue.

What is High CPU Ready?

In my opinion, during peak load, anything above 2% (or 400ms per 20-second real-time sample) is a concern and should be monitored. Above 5% will be impacting performance (resulting in lower CPU utilization), and 10% or more should be considered a serious problem and remediated immediately.
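As a rough sketch, the rule-of-thumb bands above can be expressed in a few lines of Python. The function names and band labels are hypothetical, chosen only for illustration, and the 20-second sample length is the real-time collection interval used later in this post.

```python
INTERVAL_MS = 20_000  # real-time sample length in ms (20 seconds), as used in this post


def ready_pct_to_ms(pct, interval_ms=INTERVAL_MS):
    """Convert a per-vCPU CPU Ready percentage into milliseconds per sample."""
    return pct * interval_ms / 100


def classify_ready(pct):
    """Map a per-vCPU CPU Ready percentage onto the bands described above.

    The band names are hypothetical labels for this sketch only.
    """
    if pct >= 10:
        return "serious"    # 10% or more: remediate immediately
    if pct > 5:
        return "impacting"  # above 5%: performance is being affected
    if pct > 2:
        return "concern"    # above 2%: monitor
    return "ok"


print(ready_pct_to_ms(2))   # 400.0 (ms per 20-second sample)
print(classify_ready(7))    # impacting
```

These cut-offs are my opinion for peak load, so treat the function as a starting point for alerting rather than a hard rule.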

Below is a screenshot showing CPU Ready from a recent test I conducted in my home lab.

To calculate the percentage of CPU Ready, we divide the VM’s “Summation” value (in the screenshot above, it’s the “W2K8 CPU TEST VM 1” line) by 20000 (ms), which is the statistics collection interval, then divide the result by the number of vCPUs in the VM.

So if we use the value from the “latest” column, it’s 7337 divided by 20000, which equals 0.36685. We then divide that by 2, as the VM has 2 vCPUs, and we end up with 0.183425.

That’s 18% CPU Ready, which basically means that 18% of the time the VM is waiting to be scheduled and is not doing anything!
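The calculation above can be sketched as a small Python helper. The function name is hypothetical, and the 20000 ms default assumes the real-time collection interval; other chart intervals would need a different value.

```python
def cpu_ready_percent(summation_ms, vcpus, interval_ms=20_000):
    """Return CPU Ready as a percentage per vCPU.

    summation_ms: the "Summation" value from the performance chart, in ms
    vcpus:        number of vCPUs assigned to the VM
    interval_ms:  statistics collection interval (assumed 20000 ms, real-time)
    """
    if vcpus < 1 or interval_ms <= 0:
        raise ValueError("vcpus must be >= 1 and interval_ms must be > 0")
    return summation_ms / interval_ms / vcpus * 100


# The example from this post: 7337 ms on a 2 vCPU VM
print(f"{cpu_ready_percent(7337, 2):.1f}%")  # 18.3%
```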

Note: CPU Ready % can be found using ESXTOP or RESXTOP via the vMA or on the ESXi host directly.

Now to try and diagnose the Performance/CPU ready issue, we need to work out if the VM is oversized and if so, Right Size the VM.

What is an Oversized VM?

Basically, a VM which has more compute resources assigned than it requires. For example, a VM which has 4 vCPUs and uses no more than 20% of its CPU.

What is Right Sizing?

In the above example, the VM is oversized as it doesn’t use more than 1 vCPU (or 25%) of its CPU resources, and therefore could be reduced to 1 vCPU and run at 80%.

So the VM is oversized and has high CPU Ready. What happens when we right size it from 4 vCPUs to 1 vCPU, and why does this help performance?

It’s pretty simple: the fewer vCPUs a VM has, the easier it is for the CPU scheduler to find enough physical cores to schedule the VM onto. If a cluster has a lot of oversized VMs, they are all competing for the same physical cores, making the scheduler’s job more and more difficult.
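The right-sizing arithmetic in the example above can be sketched as follows. Both function names are hypothetical, and the 80% ceiling is simply the figure from the example, not a rule. The sketch also assumes the VM’s CPU demand stays the same and scales linearly when vCPUs are removed, which is a simplification.

```python
import math


def projected_utilization(current_vcpus, peak_util_pct, new_vcpus):
    """Estimate peak utilization (%) after changing the vCPU count.

    Assumes the same total CPU demand is spread over the new vCPU count.
    """
    return current_vcpus * peak_util_pct / new_vcpus


def min_vcpus(current_vcpus, peak_util_pct, max_target_pct=80):
    """Smallest vCPU count that keeps projected peak utilization at or below
    max_target_pct (an assumed ceiling taken from the example above)."""
    return max(1, math.ceil(current_vcpus * peak_util_pct / max_target_pct))


# The example from this post: 4 vCPUs peaking at 20%
print(projected_utilization(4, 20, 1))  # 80.0
print(min_vcpus(4, 20))                 # 1
```

Use the peak utilization over a representative period (for example, one that includes end-of-month processing), not an average, when feeding numbers into something like this.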

But what about setting a CPU Reservation? Don’t reservations “guarantee” resources?

The answer is, Yes and No.

The reservation “reserves” CPU resources measured in MHz, but this has nothing to do with the CPU scheduler.

So setting a reservation will help improve performance for the VM you set it on, but will not “solve” CPU ready issues caused by “oversized” VMs, or by too high an overcommitment ratio of CPU resources.

In my testing I set a reservation of 80% of the VM’s 2 vCPUs’ worth of MHz. Prior to setting the reservation, CPU Ready was ~20%, and afterwards it did drop to around 10%. Note: This test was performed with only 25% overcommitment (5 vCPUs on 4 physical cores), using CPUBUSY to keep the CPUs running at 100% (measured within the guest by Windows Task Manager).

I then set a reservation of 100% of the VM’s 2 vCPUs’ worth of MHz. Prior to setting this reservation, CPU Ready was ~10%, and it did not get below 2.5% even with the 100% reservation.

The result would have been considerably worse had I tested with 50% or 100% overcommitment, which is generally easily achieved with VMware and a well-architected cluster. (I have seen well above these overcommitment numbers with no CPU Ready issues.)
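For reference, the overcommitment figure quoted for this test can be worked out as below. This is a minimal sketch using the convention from the test description (vCPUs allocated beyond the physical core count, as a percentage of the cores); the function name is hypothetical.

```python
def overcommit_percent(total_vcpus, physical_cores):
    """CPU overcommitment as a percentage of physical cores.

    0% means one vCPU per physical core; 100% means two vCPUs per core.
    """
    if physical_cores < 1:
        raise ValueError("physical_cores must be >= 1")
    return (total_vcpus - physical_cores) / physical_cores * 100


# The test host from this post: one 2 vCPU VM plus three 1 vCPU VMs on 4 cores
print(overcommit_percent(2 + 3 * 1, 4))  # 25.0
```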

Reducing CPU Ready down to 2.5% may sound like a pretty good result, but when we look at the other 3 x 1vCPU VMs on the host (4 core test ESXi 5 host) they had CPU ready of 40%!! Not to mention 2.5% is still not good!

If you have poor performance and you discover you have high CPU Ready, the best solution is to Right Size Your VMs!

I have recommended exactly that countless times, and customers never believe that performance can increase with fewer vCPUs until after the Right Sizing exercise.

If, after Right Sizing, you still have CPU Ready, your overcommitment on CPU is simply too high for the workloads within your cluster.

You can address this by

1. Adding additional compute to the cluster. (Duh!)

2. Using Affinity rules to locate complementary workloads together (lots of small 1 vCPU VMs which don’t have high CPU utilization will generally work well with a limited number of higher vCPU VMs).

3. Using Anti-Affinity rules to separate non-complementary workloads (eg: don’t place all your 8 vCPU VMs on one host with 300% overcommitment on CPU and expect them to work well).

4. Scaling out (not up) your VMs, ie: don’t have one 8 vCPU SQL DB server, use 4 smaller 2 vCPU VMs.

So now you know better than to use reservations to solve CPU contention.

It’s time to go Right Sizing!

This simple task is about the best bang for buck you will get in your data center since virtualizing on VMware in the first place.

24 thoughts on “Common Mistake: Using CPU reservations to solve CPU Ready”

  1. Because it’s come up a lot, management are slowly realising this.

    The trouble in trying to right size VM’s, is you need to check the usage for the previous month to ensure you capture end of month reporting etc, but the monthly perf charts are highly diluted after the rt/daily/weekly usage has been rolled up. 20% cpu usage on the monthly chart could be 40% in the weekly, and could be 60% on the daily. If the VM does less cpu after hours, it messes the averages.

    Of course it’s solved with vC OPS, but not everyone can get it in the budget.

    I just wish the perf charts could keep the smaller sample sizes for longer.

    • A free alternative to vC OPS would be to run VMware Capacity Planner, or just set the vCenter statistics collection settings to retain more information for longer. Beware when using the latter, as your vCenter DB will grow at a faster rate.

  2. Good article and pointer, but why would you not link the vCPU ready with a host pCPU being overcommitted as the first step, prior to right sizing the VM?

    • Good question Martin. Overcommitment is what we aim for when virtualizing, so we don’t want to discourage it, as high levels of overcommitment don’t necessarily translate into poor performance and/or CPU Ready.

      Even in an environment with excessive overcommitment, Right Sizing will help improve performance and doesn’t cost ($) the customer anything. So if Right Sizing is performed before going out and buying more hardware, it’s less likely any costly hardware will be required, and even if it is still required, the architect will have a better idea of how much additional hardware is required with Right Sized VMs.

  3. Pingback: High CPU Ready with Low CPU Utilization? « CloudXC

  4. Hi there, I am trying to figure out why a particular VM has high CPU Ready at my work and came across this. I had never heard of right sizing.

    You divided that 7337 figure by 20000, then by the number of vCPUs, to work out that 18% of the time the CPU is doing nothing.

    I don’t think that formula is correct, e.g. take a single 4 vCPU VM running on an ESX host with 16 CPUs available and no other VMs running. This VM would have low CPU Ready values no matter whether it’s under high or low load.

    So dividing those “low” values, e.g. 300 by 20000 then by 4 = 0.00375,
    according to you that means the CPU is doing nothing 0% of the time.

    Have I made a miscalculation?

    • Hi Daniel,

      In your example, with an ESXi host with 16 physical cores (HT or not) and one 4 vCPU VM, the VM should not experience any CPU Ready as there is no contention with other VMs for the physical cores. If you had 8 x 4 vCPU VMs, then you would expect some CPU Ready as you have 100% overcommitment (32 vCPUs on 16 cores). The only question would be how much CPU Ready you would have, and this would depend on how active the VMs are.

      Significant CPU ready should only occur when you have a high level of CPU overcommitment (say >400%) and/or in an environment where VMs are not right sized.

      Also, you can check CPU Ready % in the Overview tab, on the graph in the top left hand corner. This doesn’t require any calculations, but be aware the value you should be looking at is CPU Ready per vCPU, not the total aggregate of all vCPUs.

      Hope that helps.

    • I realize this is a little dated by now, but I wanted to clarify something. What your question indicated to me is that you mistakenly made a link between utilization and CPU Ready due to wording. Easy thing to do.

      In the first example, on average 18% of the time that one of the VM’s CPUs has processing to complete, it is waiting on resources to become available. Since it is waiting, it is “not doing anything” during that time, and currently DOES have processing to complete.

      In your second example at ~0% CPU Ready, this means that the CPU is waiting approximately 0% of the time it has processing to complete. The VM may simply not have any processing to complete at all, or it may be at 80% utilization and simply is not being forced to wait.

      CPU Ready and utilization are independent of one another, so 0% CPU Ready does not mean the CPU is doing nothing 0% of the time (or in other words, 100% utilized). It means that, of the time the CPU has processing to complete, it is being forced to wait for 0% of it.

      Hopefully that helps clarify for anyone reading in the future.

      ~Brandon

  5. Great points, I have been saying that to customers for quite a while. You can also get Veeam One relatively inexpensively, as it is priced by socket, not by VM. I use it all the time with customers, and it can look at data collected over time. It uses an SQL database and pulls stats from vCenter. Good reports, even one for right-sizing.

  6. Pingback: HOSTING IS LIFE! » Course Review: VMware vSphere Optimize and Scale 5.1

  7. Can you help me figure out why my guests are getting 100ms CPU Ready? I know it’s not really high, but just for my own understanding/peace of mind.

    We have one guest with random issues that the vendor is saying could be caused by being on a host with too many guests.

    So I moved this onto a host with only 4 VMs. This is a 16-core, 64GB RAM host.
    2 of the existing VMs are 4 vCPU, and another 2 vCPU. That totals 10 vCPUs used.

    When I move the VM in question to this host, all the VMs have a CPU Ready of around 100ms before and after I vMotion it.

    Why don’t these VMs have a figure of 1ms, as they are not having to wait for CPUs at all?

    • An easy test would be to move all VMs off one host, and leave the VM which the vendor is complaining about on a dedicated host. (No reservations, just 1 VM on one host in your cluster.)

      If the problem continues, you know it’s not compute contention (at least for that VM) and then you can look at storage/network.

      One caveat: if that VM is dependent on other VMs in your cluster, and they are experiencing contention, this could have a flow-on effect on the VM in question.

      Also, check the Overview tab under Performance for that VM, and check the percentage of CPU ready as this is easier to interpret.

      Remember, CPU Ready is contention scheduling onto physical cores, CPU Wait is when the CPU is trying to process and waiting for storage (ie: Storage contention).

  8. Thanks Josh,

    This information was a life saver during a recent VM performance crisis.

    I’m interested in your opinion regarding the recommendations in the following article re: setting the vCenter alerting levels according to the number of vCPUs allocated to the VM:

    http://en.community.dell.com/techcenter/virtualization/infrastructure/b/storage-blog/archive/2013/02/18/configuring-cpu-ready-alarms-in-vcenter

    Thanks again,

    Barry Sermon

  9. Hi Josh. I’m learning a lot here. Is it reasonable to expect a lightly loaded server exhibiting high average CPU Ready times (greater than 200ms) will see lower CPU Ready times as the load increases?

    • Hi Jeff,

      For lightly loaded servers, I would expect to see almost zero CPU ready.

      Have you checked your overcommitment via my vSphere cluster calculator?

      http://www.joshodgers.com/vsphere-cluster-sizing-calculator/

      I’d be interested to hear what your overcommitment ratio is.

      Double check the Performance Tab, Under “Overview” and see what the realtime graph tells you, as it measures CPU ready in percentage, which is an easier way to monitor Ready time IMO. But 200ms is quite high, most of my management style clusters with light load are <10ms

  10. If I reserve 100% of the CPU on a VM, will that not eliminate any CPU Ready contention issues on a VM? Ie: for a 4 core VM at 2 GHz a core, I reserve 8 GHz on that VM. Would this not eliminate contention on this VM regardless of how overallocated the host is?

    • Reservations only guarantee MHz/GHz once the vCPU/s have been scheduled onto a physical/logical core. A reservation does not guarantee a vCPU will ever be scheduled. Excessive overcommitment will impact the CPU scheduler regardless of reservations. I would argue that if you have reasonable levels of CPU overcommitment, CPU reservations have little/no value.