Thursday July 19, 2007 Recently I was investigating a very interesting CPU Caps test failure. One of the tests was failing when it was running early during zone boot. It turned out to be a very interesting bug, indeed. For once, it was a great exercise in using DTrace for solving complicated problem and it also exposed a generic weakness of Solaris scheduler. The bug is 6577453 Java CPU hogs can escape CPU Caps enforcement by sleeping a lot.
In a nutshell, a program may behave in such a way that it is never seen on CPU by clock() thread while it uses a noticeable chunk of CPU resources. CPU Caps were accurate in charging its CPU time (see the description of CPU Caps accounting mechanism), the policing was done by clock(). As a result some threads enjoyed unlimited access to CPU resources while blocking everyone else. To test the fix I wrote a little nasty program that demonstrates the problem.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <poll.h>
#include <sys/time.h>
#define NSEC_IN_MSEC (NANOSEC / MILLISEC)
#define DELTA (9 * NSEC_IN_MSEC)
int f(int x)
{
return (x+1);
}
int main(int argc, char *argv[])
{
for (;;) {
int i;
int y = 0;
hrtime_t t1, t2;
int rc = poll(NULL, 0, 1);
t1 = gethrtime();
t2 = t1 + DELTA;
while (gethrtime() < t2) {
y += f(i);
}
}
}
The program consumes 50% of a CPU and is never seen by a clock() thread. It starts after the clock tick and goes to sleep right before the next one. Running it demonstrates the general problem with Solaris scheduling class implementation - while all the micro-state accounting information is available it is not actually used by the scheduler. This program always runs at priority 59, defeating priority mechanisms of the time-sharing class.
Side note - I was unable to write the same program in Perl. It turned out that IO::Poll module converts the poll(2) timeout value to seconds by always multiplying it to 1000.As many of you know, once you are involved in something, related factors start flying your way. Indeed, a few days after I investigated this problem, slashdot advertised the paper by Dan Tsafrir, Yoav Etsion and Dror Feitelson, describing the very same issue!
Seems like now is the time to fix this old standing problem in Solaris. I opened bug 6582502 Threads may hide behind the clock.
The CPU Caps project is now integrated in OpenSolaris and now I am working on back-porting it to S10 update. I think this is a good time to put some notes regarding its implementation details. The Implementation guide gives a good high-level overview, so here I'd like to concentrate on the bottom-up view.
Before we can penalize CPU usage of some threads we need to know how much CPU is consumed by every project. There are two main approaches available - sampling and monitoring. The difference can be illustrated using the freeway speed control example. Imagine that local police decided to crack down on the freeway speeders (in fact, this is exactly what happened in San Jose). The common method is to hide police vehicle in the bushes and wait until some unlucky schmuck races by. The chances of getting every speeder are not very high but over some period of time the method works because enough speeders would be eventually caught. Some lucky ones, though, would miss a chance meeting with a friendly policeman.
Another approach is to tag each car when it enters and exits freeway with the location and the time. Assuming that the speed is more or less constant it is easy to calculate the average speed and penalise speeding car when it exits the freeway. This method provides for much greater accuracy.
Using sampling we can periodically check what threads are running on each CPU and interpolate their CPU usage from that. For example, once every clock tick we find all threads running on a CPU and charge them 1 clock tick worth of CPU time. Some threads may have just arrived on CPU while others may be sitting there longer, but for long-running threads we should get a good enough estimation. This is the simplest approach and it was used in the initial CPU Caps prototype. The main trouble is that one tick is quite a long time on modern super-fast CPUs and a lot of thread activity may happen in the meantime.
Thread monitoring allows us to know exactly how much CPU time was consumed by a CPU. We do this by marking the time a thread boarded a CPU and left it. Since we are only interested in short-turn CPu usage (over a tick) we also need to check those running on a CPU and get their on-CPU time as well.
Solaris kindly provides us a convenient tool for such purposes, called micro-state accounting. It uses very accurate nanosecond-granularity timestamps whenever thread changes its states. The CPU Caps code uses this facility to calculate CPU usage of each thread. This is done by the mstate_thread_onproc_time() routine:
mstate_thread_onproc_time(kthread_t *t)
{
hrtime_t aggr_time;
hrtime_t now;
hrtime_t state_start;
struct mstate *ms;
klwp_t *lwp;
int mstate;
/* Ignore kernel threads */
if ((lwp = ttolwp(t)) == NULL)
return (0);
/* Get the current thread state */
mstate = t->t_mstate;
ms = &lwp->lwp_mstate;
/* time when thread entered this state */
state_start = ms->ms_state_start;
/* Thread's user + system + trap time */
aggr_time = ms->ms_acct[LMS_USER] +
ms->ms_acct[LMS_SYSTEM] + ms->ms_acct[LMS_TRAP];
/* current time */
now = gethrtime_unscaled();
/*
* NOTE: gethrtime_unscaled on X86 taken on different CPUs is
* inconsistent, so it is possible that now < state_start.
*/
if ((mstate == LMS_USER || mstate == LMS_SYSTEM ||
mstate == LMS_TRAP) && (now > state_start)) {
/* Add time spent on CPU in the current state */
aggr_time += now - state_start;
}
scalehrtime(&aggr_time);
return (aggr_time);
}
This function returns the time spent on CPU by user-land threads since their
birth. The t->t_lwp->lwp_mstate.ms_acct array contains aggregate time spent
by thread in each of the possible states:
LMS_USER - running in user mode
LMS_SYSTEM - running in system call or page fault
LMS_TRAP - running in other trap
LMS_TFAULT - asleep in user text page fault
LMS_DFAULT - asleep in user data page fault
LMS_KFAULT - asleep in kernel page fault
LMS_USER_LOCK - asleep waiting for user-mode lock
LMS_SLEEP - asleep for any other reason
LMS_WAIT_CPU - waiting for CPU (latency)
LMS_STOPPED - stopped (/proc, jobcontrol, lwp_suspend)
The function above is the foundation of the thread accounting done by CPU
caps. The CPU-caps specific monitoring is implemented by each scheduling class
which supports CPU caps. For each thread scheduling classes keep a little
caps_charge_adjust
function via
the cpucaps_charge.
The caps_charge_adjust function calculates the time spent on CPU
since a thread was last checked and updates its total on-CPU time. We will
take a closer look at it next time.
[ Technorati: Solaris ]
( Jun 19 2007, 06:16:28 PM PDT ) PermalinkCaps revised What's new with CPU caps - observability! [Read More] ( Nov 21 2006, 06:25:03 PM PST ) Permalink
Using ZFS to save space on laptop Why ZFS is useful on laptops? It saves disk space! [Read More] ( Sep 14 2006, 06:41:11 PM PDT ) Permalink
Now that the NUMA observability tools are integrated, I made some progress on the CPU caps. I was mostly scribbling documentation and produced a few useful pages:
I also managed to do some actual coding and fixed the nasty locking issue which could leave threads left on the wait queues forever when caps are disabled. I still have a few more issues to attack.
In the process I filed a few new resource management bugs/RFEs:
Exchanging hats and wearing a CPU cap. CPU Caps update [Read More] ( Aug 21 2006, 05:07:26 PM PDT ) Permalink Comments [1]
Solaris 10 and MIB counters Solaris 10 provides a couple of new ways to look at various networking-related MIB2 counters.
One is via kstat(1M) command. Try the following on Solaris 10 system:
$ kstat -c mib2
module: icmp instance: 0
name: rawip class: mib2
crtime 58.54669218
inCksumErrs 0
inDatagrams 204641
inErrors 0
outDatagrams 19
outErrors 0
snaptime 274094.194688545
...
module: sctp instance: 0
name: sctp class: mib2
crtime 58.446876953
sctpAborted 0
sctpActiveEstab 0
sctpChecksumError 0
sctpCurrEstab 0
...
module: udp instance: 0
name: udp class: mib2
crtime 58.522156463
entry6Size 48
entrySize 20
inDatagrams 546726
inErrors 0
outDatagrams 470496
outErrors 0
snaptime 274094.224208608
(Note the new SCTP counters useful if you use the new SCTP protocol).
You can get just IP counters by asking
$ kstat -m ip -c mib2
Or you can watch as they grow over time:
$ kstat -p 'ip::icmp:inMsgs' 1
ip:0:icmp:inMsgs 155322
ip:0:icmp:inMsgs 155322
ip:0:icmp:inMsgs 155322
ip:0:icmp:inMsgs 155324
ip:0:icmp:inMsgs 155326
...
This mechanism is more efficient than
netstat -s command.
Due to the way netstat is working, running netstat too often may actually hurt your system performance.
Now if you are really interested why some of these counters are incremented, DTrace
mib provider may be quite useful.
For example, what processes write to the network?
$ dtrace -n 'mib:ip::tcpOutSegs{@[execname]=count()}'
dtrace: description 'mib:ip::tcpOutSegs' matched 1 probe
^C
nfs4cbd 1
firefox-bin 1
ftp 4
sched 6
( Sep 24 2004, 07:59:08 PM PDT ) Permalink Comments [1]
Kernel TODO lists (a.k.a. task queues) Solaris 10 comes with a new interface for kernel developers: task queues. This interface allows you to execute some job later in the context of completely different thread. Why would you need to do that? For example, you driver has to wait for some resource to become available but you can't wait (may be driver code is executing in interrupt context or is holding some lock). In this case you can wait and complete processing in a different context where it is safe.
The interface is fully described in the taskq(9f) man page. Here is a snippet:
#include <sys/sunddi.h>
ddi_taskq_t *ddi_taskq_create(dev_info_t *dip, const char
*name, int nthreads, pri_t pri, uint_t cflags);
void ddi_taskq_destroy(ddi_taskq_t *tq);
int ddi_taskq_dispatch(ddi_taskq_t *tq, void (*func)(void*), void *arg, uint_t dflags);
void ddi_taskq_wait(ddi_taskq_t *tq);
void ddi_taskq_suspend(ddi_taskq_t *tq);
boolean_t ddi_taskq_suspended(ddi_taskq_t *tq);
void ddi_taskq_resume(ddi_taskq_t *tq);
For example, your driver may create a task queue in its attach(9E)
routine:
static ddi_taskq_t xxx_taskq;
int xxx_attach(dev_info_t *dip, ddi_attach_cmd_t cmd)
{
switch (cmd) {
case DDI_ATTACH:
xxx_taskq = ddi_taskq_create(dip, "xxx_useful_taskq", 1, TASKQ_DEFAULTPRI, 0);
...
break;
}
Then, somewhere in the guts of the driver code you may schedule a remote taskq:
typedef struct my_data {
kmutex_t drv_lock;
kcondvar_t drv_cv;
boolean_t drv_ready;
} my_data_t;
void xxx_complete_request(void *data)
{
my_data_t *arg = (my_data_t *)data;
mutex_enter(&arg->drv_lock);
while (!arg->drv_ready) {
cv_wait(&arg->drv_cv, &arg->drv_lock);
}
xxx_complete_request_processing(arg);
}
void xxx_intr(void *data)
{
...
if (ddi_taskq_dispatch(xxx_taskq, xxx_complete_request, data, DDI_NOSLEEP) != DDI_SUCCESS) {
recover_from_memory_shortage();
}
}
As a result, the driver interrupt routine will execute xxx_complete_request() at some later time in
a context of completely separate thread.
Note that if some task blocks waiting for something, it will block all tasks queued behind it,
so you should be careful to avoid deadlocks.
Also note that if you create a task queue with a single thread, it will guarantee that all tasks are processed sequentially in order posted. Otherwise an order is not guaranteed. How can you use it in your drier? Hint: it is a DDI-compliant (and DR-friendly) way to create threads needed by your driver code. ( Sep 08 2004, 07:13:06 PM PDT ) Permalink
Testing and Debugging. I am always interested in how engineers come to their final products. This may be an interesting case study of a problem I encountered recently.
I am reimplementing a driver responsible for AF_UNIX sockets. It should solve serious scalability problems with the current "tl" driver.
To test it I wrote a special DTrace script that traces all entry points to the driver and provides some useful information. When something goes wrong I can look at the trace and figure out what is happening.
To test the driver I've written some simple tests in C which perform basic operations with sockets: bind(), listen(), connect(), accept. Everything works fine. Then I looked at Perl Test::More module and decided to write some simple tests in Perl. Here is the full text of the test:
use strict;
use Fcntl;
use Socket;
use File::Temp;
use Test::More tests => 5;
my $type = SOCK_STREAM;
my $name = "SOCK_STREAM";
my $socket;
my $listener;
my $acceptor;
my $peer;
my $pname;
my $res;
my $sname = tmpnam();
socket(Server, AF_UNIX, SOCK_STREAM, AF_UNSPEC) or die ("socket: $!");
socket (Client, AF_UNIX, SOCK_STREAM, AF_UNSPEC) or die ("socket: $!");
socket (Acceptor, AF_UNIX, SOCK_STREAM, AF_UNSPEC) or die ("socket: $!");
unlink ($sname);
my $uaddr = sockaddr_un($sname);
$res = bind(Server, $uaddr) or die ("bind: $!");
ok($res, "bind");
$res = listen(Server, 1) or die ("listen: $!");
ok($res, "listen");
$res = connect (Client, $uaddr) or die ("connect: $!");
ok($res, "connect");
accept (Acceptor, Server) or die ("accept: $!");
unlink ($sname);
Suddenly the test failed on accept() with ESPIPE (illegal seek) error!
The C equivalent of the test worked fine. So I got two machines - one without
my driver and one with it and run the test with the monitoring D scripts.
Spending quite some time looking at the results, I didn't find any differences.
So I run truss and saw that accept actually succeeded, but for some reason Perl
version failed.Playing with truss(1) options I produced a huge file tracing calls to libc and libsocket. Everything looked fine with accept itself, but then, for some reason Perl version started to do completely different things. At that point I needed some other victim, so I headed to Bart and we started to look at these two truss outputs together. Bart spotted that the "good" run did malloc() of 17 bytes after accept while the bad one used malloc for one byte. This quickly lead us to conclusion that the problem is in the peer address returned in the accept(3SOCKET) call:
int accept(int s, struct sockaddr *addr, socklen_t *addrlen);
Indeed, a short C program immediately showed that returned addrlen was zero!Going back to the driver source everything seems normal. Some additional DTrace hacking showed exact message received by socket layer that should contain the peer address. Dumping the mblk in DTace shows that instead of options it contains a pointer! Another look at the source code reveals the problem:
t_scalar_t olen = creq->OPT_length;
t_scalar_t ooff = creq->DST_offset;
Here is the bug! Instead of using OPT_offset it uses DST_offset.
It took quite a while to figure this one out.Here is a snippet from the D that helped to track the problem:
fbt:tl:tl_conn_req:entry
{
this->q = args[0];
self->trace = 1;
this->mp = args[1];
this->connreq = (struct T_conn_req *)(this->mp)->b_rptr;
printf ("prim=%d, len=%d, off=%d, olen=%d, ooff=%d",
this->connreq->PRIM_type,
this->connreq->DEST_length,
this->connreq->DEST_offset,
this->connreq->OPT_length,
this->connreq->OPT_offset);
}
fbt::putnext:entry
/self->trace &&
args[1]->b_datap->db_type == 1 &&
*(int*)(args[1]->b_rptr) == 11/
{
self->trace = 0;
this->mp = args[1];
this->tci = (struct T_conn_ind*)this->mp->b_rptr;
this->opt = (struct T_opthdr *)(this->mp->b_rptr + this->tci->OPT_offset);
printf("conn_ind: mp=%p, sz=%d, ol=%d, oo=%d ",
args[1],
msgsize(this->mp),
this->tci->OPT_length,
this->tci->OPT_offset);
printf("olemn=%d, olev=%d, on=%d, os=%d",
this->opt->len,
this->opt->level,
this->opt->name,
this->opt->status);
}
I think it is a good example of how the process works inside. It really helps to use different tools at hand
(and extra pair of eyes!).
While playing with Perl I discovered that it uses fdopen() for all perl File objects, so it is impossible to open more then 256 open sockets (actually 253 if we account for STDIN, STDOUT and STDERR). Here C is still more powerful then Perl.
( Jul 23 2004, 08:24:32 PM PDT ) Permalink