Tuesday June 19, 2007
The CPU Caps project is now integrated into OpenSolaris, and I am working on back-porting it to an S10 update. I think this is a good time to put down some notes on its implementation details. The Implementation guide gives a good high-level overview, so here I'd like to concentrate on the bottom-up view.
Before we can penalize the CPU usage of some threads, we need to know how much CPU is consumed by every project. There are two main approaches available: sampling and monitoring. The difference can be illustrated with a freeway speed control example. Imagine that the local police decide to crack down on freeway speeders (in fact, this is exactly what happened in San Jose). The common method is to hide a police vehicle in the bushes and wait until some unlucky schmuck races by. The chances of catching every speeder are not very high, but over some period of time the method works because enough speeders are eventually caught. Some lucky ones, though, will miss their chance meeting with a friendly policeman.
Another approach is to tag each car with the location and the time when it enters and exits the freeway. Assuming that the speed is more or less constant, it is easy to calculate the average speed and penalize a speeding car when it exits the freeway. This method is much more accurate.
Using sampling, we can periodically check which threads are running on each CPU and estimate their CPU usage from that. For example, once every clock tick we find all threads running on a CPU and charge each of them one clock tick worth of CPU time. Some threads may have just arrived on the CPU while others may have been sitting there longer, but for long-running threads we should get a good enough estimate. This is the simplest approach and it was used in the initial CPU Caps prototype. The main trouble is that one tick is quite a long time on modern super-fast CPUs, and a lot of thread activity may happen in the meantime.
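To make the sampling idea concrete, here is a minimal user-space sketch. It is not the prototype's code: NCPU, NTHREADS, ticks_charged and sample_tick are hypothetical names invented for illustration, and threads are identified by small integer ids.

```c
#include <stddef.h>
#include <stdint.h>

#define NCPU     4	/* hypothetical number of CPUs */
#define NTHREADS 8	/* hypothetical number of threads */

/* Hypothetical per-thread tick counters; not a Solaris structure. */
static uint64_t ticks_charged[NTHREADS];

/*
 * Called once per clock tick. running[cpu] holds the id of the thread
 * found on that CPU at the moment of the tick, or -1 if the CPU is idle.
 * Whoever is caught on a CPU is charged one full tick, no matter how
 * long it has actually been running.
 */
void
sample_tick(const int running[NCPU])
{
	for (size_t cpu = 0; cpu < NCPU; cpu++) {
		int tid = running[cpu];

		if (tid >= 0 && tid < NTHREADS)
			ticks_charged[tid]++;
	}
}
```

A thread that gets on a CPU and leaves it again between two ticks is never seen by sample_tick, which is exactly the weakness described above.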
Thread monitoring allows us to know exactly how much CPU time was consumed by a thread. We do this by recording the time a thread boards a CPU and the time it leaves. Since we are only interested in short-term CPU usage (over a tick), we also need to check the threads that are currently running on a CPU and include their on-CPU time so far.
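The monitoring idea can be sketched in a few lines of user-space C. This is only an illustration under assumed names: thread_acct_t, acct_on_cpu, acct_off_cpu and acct_usage are hypothetical and do not exist in Solaris, and the caller supplies the timestamps.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-thread accounting record; not the Solaris layout. */
typedef struct {
	uint64_t total_ns;	/* time accumulated over completed runs */
	uint64_t start_ns;	/* when the current run began */
	bool	 on_cpu;	/* is the thread on a CPU right now? */
} thread_acct_t;

/* The thread boards a CPU: remember when. */
void
acct_on_cpu(thread_acct_t *a, uint64_t now_ns)
{
	a->start_ns = now_ns;
	a->on_cpu = true;
}

/* The thread leaves the CPU: charge it for the run that just ended. */
void
acct_off_cpu(thread_acct_t *a, uint64_t now_ns)
{
	if (a->on_cpu && now_ns > a->start_ns)
		a->total_ns += now_ns - a->start_ns;
	a->on_cpu = false;
}

/* Usage so far, including the run in progress if there is one. */
uint64_t
acct_usage(const thread_acct_t *a, uint64_t now_ns)
{
	uint64_t t = a->total_ns;

	if (a->on_cpu && now_ns > a->start_ns)
		t += now_ns - a->start_ns;
	return (t);
}
```

acct_usage covers the last point in the paragraph: a thread that is still on a CPU when we look gets its in-progress run counted without waiting for it to leave.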
Solaris kindly provides us with a convenient tool for such purposes, called micro-state accounting. It takes very accurate nanosecond-granularity timestamps whenever a thread changes its state. The CPU Caps code uses this facility to calculate the CPU usage of each thread. This is done by the mstate_thread_onproc_time() routine:
hrtime_t
mstate_thread_onproc_time(kthread_t *t)
{
    hrtime_t aggr_time;
    hrtime_t now;
    hrtime_t state_start;
    struct mstate *ms;
    klwp_t *lwp;
    int mstate;

    /* Ignore kernel threads */
    if ((lwp = ttolwp(t)) == NULL)
        return (0);

    /* Get the current thread state */
    mstate = t->t_mstate;
    ms = &lwp->lwp_mstate;

    /* time when thread entered this state */
    state_start = ms->ms_state_start;

    /* Thread's user + system + trap time */
    aggr_time = ms->ms_acct[LMS_USER] +
        ms->ms_acct[LMS_SYSTEM] + ms->ms_acct[LMS_TRAP];

    /* current time */
    now = gethrtime_unscaled();

    /*
     * NOTE: gethrtime_unscaled on X86 taken on different CPUs is
     * inconsistent, so it is possible that now < state_start.
     */
    if ((mstate == LMS_USER || mstate == LMS_SYSTEM ||
        mstate == LMS_TRAP) && (now > state_start)) {
        /* Add time spent on CPU in the current state */
        aggr_time += now - state_start;
    }

    /* Convert the unscaled total to nanoseconds */
    scalehrtime(&aggr_time);
    return (aggr_time);
}
This function returns the time spent on CPU by a user-land thread since its
birth (kernel threads have no LWP and always get 0). The
t->t_lwp->lwp_mstate.ms_acct array contains the aggregate time spent by the
thread in each of the possible states:
LMS_USER - running in user mode
LMS_SYSTEM - running in system call or page fault
LMS_TRAP - running in other trap
LMS_TFAULT - asleep in user text page fault
LMS_DFAULT - asleep in user data page fault
LMS_KFAULT - asleep in kernel page fault
LMS_USER_LOCK - asleep waiting for user-mode lock
LMS_SLEEP - asleep for any other reason
LMS_WAIT_CPU - waiting for CPU (latency)
LMS_STOPPED - stopped (/proc, jobcontrol, lwp_suspend)
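How such a per-state array gets filled can be shown with a simplified user-space model. This is a sketch, not the Solaris implementation: the ST_* states, msacct_t, ms_new_state and ms_onproc_time are hypothetical names, only a subset of the states is modelled, and timestamps are passed in by the caller.

```c
#include <stdint.h>

/* Simplified, hypothetical subset of states; not the LMS_* values. */
enum { ST_USER, ST_SYSTEM, ST_TRAP, ST_SLEEP, ST_WAIT_CPU, ST_NSTATES };

typedef struct {
	int	 state;			/* current state */
	uint64_t state_start;		/* when the current state was entered */
	uint64_t acct[ST_NSTATES];	/* aggregate time per state */
} msacct_t;

/*
 * On every state change, charge the elapsed time to the state being
 * left, then start the clock for the new state. The now > state_start
 * guard mirrors the check in the routine above.
 */
void
ms_new_state(msacct_t *ms, int new_state, uint64_t now)
{
	if (now > ms->state_start)
		ms->acct[ms->state] += now - ms->state_start;
	ms->state = new_state;
	ms->state_start = now;
}

/* On-CPU time over completed state intervals: user + system + trap. */
uint64_t
ms_onproc_time(const msacct_t *ms)
{
	return (ms->acct[ST_USER] + ms->acct[ST_SYSTEM] + ms->acct[ST_TRAP]);
}
```

Unlike mstate_thread_onproc_time(), ms_onproc_time in this sketch only counts intervals that have already ended; the kernel routine additionally adds the time spent so far in the current state when that state is an on-CPU one.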
The function above is the foundation of the thread accounting done by CPU
caps. The CPU-caps specific monitoring is implemented by each scheduling class
which supports CPU caps. For each thread, scheduling classes keep a little
bit of state, which is updated by the caps_charge_adjust function via
cpucaps_charge.
The caps_charge_adjust function calculates the time spent on CPU
since a thread was last checked and updates its total on-CPU time. We will
take a closer look at it next time.
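Until then, the "time since last checked" idea can be illustrated with a tiny sketch. It is an assumption about the general shape only, not the real caps_charge_adjust: charge_state_t and charge_delta are hypothetical names, and the on-CPU total is passed in rather than read from micro-state accounting.

```c
#include <stdint.h>

/* Hypothetical per-thread state: on-CPU total seen at the last check. */
typedef struct {
	uint64_t last_onproc_ns;
} charge_state_t;

/*
 * Given the thread's current total on-CPU time, return how much was
 * consumed since the previous check and remember the new total.
 * A total that has not advanced yields a charge of 0.
 */
uint64_t
charge_delta(charge_state_t *cs, uint64_t onproc_now_ns)
{
	uint64_t delta = 0;

	if (onproc_now_ns > cs->last_onproc_ns) {
		delta = onproc_now_ns - cs->last_onproc_ns;
		cs->last_onproc_ns = onproc_now_ns;
	}
	return (delta);
}
```

In this model the value passed as onproc_now_ns would come from something like mstate_thread_onproc_time(), and the returned delta is what would be charged against the cap.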
( Jun 19 2007, 06:16:28 PM PDT )