The ideal level of CPU utilization in a server is an interesting topic. A Google (or other search engine) search turns up a number of web pages advocating that 100% CPU utilization is ideal, that 95% CPU utilization is likely catastrophic, that significant queuing for CPU time begins when the CPUs are 75% to 80% busy, as well as a number of other interesting nuggets of information. It is important to keep in mind that at any one instant, a CPU (or core, or CPU instruction thread) is either 100% busy or 0% busy – at any one instant a CPU cannot be 75% busy. The 75% or 95% utilization figures found on various web sites, in books, and in presentations are actually average utilizations between two points in time – whether those two points in time are 0.000001 seconds apart, 24 hours apart, or somewhere in between could be very important when trying to determine whether excessive CPU utilization is causing service-level agreement problems (or the “slowness” problems reported by end users).
Assume that in a one-minute time period, the CPU utilization in a server averages 75% – is that good, is that bad, or is there not enough information to make an educated guess? Keep in mind that the CPU utilization is an average between a starting time point and an ending time point – much like with a Statspack/AWR report, if you look at too large a time period, significant problems may be masked (hidden from view) when the statistics from the intervals containing the problems are averaged over a long time period. The 75% CPU utilization could indicate that at three of every four points in time the CPU had work that needed to be performed. It might instead indicate that there was intense competition for CPU time by many tasks during the first 45 seconds, followed by a complete absence of demand for CPU time during the last 15 seconds of the minute. For the many tasks competing for CPU time in those first 45 seconds, work that might normally complete in one second could have required close to 45 seconds while the operating system divided the server’s CPU time among all of the tasks that needed it. The tasks queue up while waiting for their turn at the processor, in what is known as the CPU run queue. As more processes enter the run queue, each process takes longer to perform each unit of its normal work. This is where the topic of queuing theory becomes very important. Two very helpful books that discuss queuing theory as it applies to Oracle Database are “Optimizing Oracle Performance” (by Cary Millsap with Jeff Holt) and “Forecasting Oracle Performance” (by Craig Shallahamer).
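The masking effect of averaging can be shown with a minimal sketch (the one-second samples and the workload shapes are illustrative, not measured data):

```python
# Two very different workloads that both report 75% average CPU
# utilization over the same one-minute interval.

# Workload A: steady -- the CPU is busy at 3 of every 4 sample points.
steady = [1, 1, 1, 0] * 15          # 60 one-second samples

# Workload B: saturated for 45 seconds, then completely idle for 15.
bursty = [1] * 45 + [0] * 15        # also 60 one-second samples

def avg(samples):
    return sum(samples) / len(samples)

print(avg(steady), avg(bursty))     # both report 0.75
```

A monitoring tool sampling once per minute cannot distinguish the two, yet tasks running during workload B’s first 45 seconds experienced a saturated server.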
(Note: This example used one minute as the measurement interval in order to put the competition for CPU resources into easily understood terms – assuming that a given 3GHz processor is only able to perform one operation at a time, that processor is capable of performing 3,000,000,000 operations per second – 180,000,000,000 operations in that one minute.)
Queuing theory offers several tools, including the Erlang C formula, Little’s Law, and Kendall’s notation (which describes the queuing model itself). I will not go into significant detail here on the different queuing theory models, but I will provide a simple example. Assume that you enter a grocery store that has 10 checkout lanes (think of these as 10 CPUs in a database server). When it is time to pay for the items in your cart, a store employee directs you into one of the 10 checkout lanes. If anyone else is directed into the same checkout lane as you, you must alternate with that person at the checkout counter every 10 seconds – when your 10-second turn is up, you must reload everything you placed on the conveyor belt and allow the other person to unload their items onto the belt for their 10 seconds (this loading and unloading of items could be time consuming). Each additional person directed into your lane also receives 10-second turns at the counter. In short order, what would have required 5 minutes to complete now requires 30 minutes. If the line in one checkout lane grows too long, there might be a chance – perhaps once a minute – to jump into a different lane used by fewer people (some Linux kernels will potentially move a process from one CPU to a less busy CPU every 200ms). Jumping into a different checkout lane not only allows you to check out faster, but also allows the people who remain in the original line to check out faster. The above is a very rough outline of queuing theory. If a customer expects to check out in no more than 10 minutes, how many lanes are necessary, given that customers arrive at random and the 10-minute target must be met 99% of the time?
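Sketching an answer to that question requires a queuing model. The following uses the Erlang C formula for an M/M/c queue; the arrival rate (1.5 customers per minute), the 5-minute average service time, and the split of the 10-minute target into at most 5 minutes of waiting plus 5 minutes of service are all assumptions chosen for illustration, not figures from the article:

```python
import math

def erlang_c(c, a):
    """Probability that an arriving customer must wait (M/M/c queue).
    c = number of servers (checkout lanes), a = offered load (lambda/mu)."""
    if a >= c:
        return 1.0  # more load than servers: the queue grows without bound
    top = (a ** c / math.factorial(c)) * (c / (c - a))
    bottom = sum(a ** k / math.factorial(k) for k in range(c)) + top
    return top / bottom

def p_wait_exceeds(c, lam, mu, t):
    """P(queuing delay > t) for M/M/c: Erlang C times an exponential tail."""
    a = lam / mu
    if a >= c:
        return 1.0
    return erlang_c(c, a) * math.exp(-(c * mu - lam) * t)

# Assumed figures: customers arrive at 1.5 per minute; one lane serves a
# customer in 5 minutes on average (mu = 0.2 per minute). To finish within
# 10 minutes we allow at most 5 minutes of waiting, and no more than 1%
# of customers may exceed that wait.
lam, mu, max_wait, sla = 1.5, 0.2, 5.0, 0.01

lanes = 1
while p_wait_exceeds(lanes, lam, mu, max_wait) > sla:
    lanes += 1
print(lanes)  # smallest lane count that meets the 99% target
```

With these assumed rates the model settles on 11 lanes, even though the average offered load is only 7.5 lanes’ worth of work – meeting a response-time target requires headroom well beyond the average utilization.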
CPU queuing is not a linear problem – 100% CPU utilization is not twice as bad as 50% CPU utilization; it is much worse than that. A Google search found a couple of interesting articles/presentations that computer science professors assembled for various classes – you might find it worthwhile to read some of the documents found in the .edu domain (it appears that none of those links made it into this blog article). Some operating systems use a single run queue (for instance, Windows, and Linux prior to the 2.6 kernel release), which effectively distributes the CPU load evenly between CPUs but causes processes to constantly jump from one CPU to another (this likely reduces the effectiveness of the CPU caches – pulling everything off the conveyor belt in the analogy). Other operating systems maintain a separate run queue for each CPU, which keeps each process running on the same CPU. Quick quiz: if the 10-CPU server in this example has a run queue depth of 10, does that mean that one process is in each of the 10 CPU run queues, or could all 10 processes be in just one of the run queues, or something in between those two extremes? Are all three scenarios equally good or equally bad?
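The nonlinearity can be sketched with the classic M/M/1 response-time formula R = S / (1 − ρ), where S is the bare service time and ρ the average utilization; the 10 ms service time below is an illustrative assumption, not a measured value:

```python
# How queuing inflates response time as utilization climbs (M/M/1 model).

service_time_ms = 10.0  # CPU time a request needs when there is no queuing

response_ms = {}
for rho in (0.50, 0.75, 0.90, 0.95, 0.99):
    response_ms[rho] = service_time_ms / (1.0 - rho)
    print(f"{rho:.0%} busy -> {response_ms[rho]:7.1f} ms")

# Going from 50% to 99% busy inflates the 10 ms request by a factor of
# 50, not a factor of 2 -- the penalty explodes near 100% utilization.
```

At 50% utilization the request takes 20 ms; at 99% it takes roughly 1,000 ms, which is the “much worse than that” the paragraph above describes.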
Keep in mind that sessions in “wait events” are not necessarily idle from the server’s perspective – a session in an Oracle wait event might still drive a significant amount of system (kernel) mode CPU time on its behalf. Sending and receiving data through the network, disk accesses, checking the current date/time, and even reading eight bytes (a 64-bit word) from memory all consume the server’s CPU time. CPU saturation may lead to latch contention (note that latch contention may also lead to CPU saturation, due to sessions spinning while attempting to acquire a latch), long-duration log file waits (log file sync, log file parallel write), cluster-related waits, increased duration of single-block and multiblock reads, and significant increases in server response time.