Weblog

All | CMT | General | NUMA | OpenSolaris | Perl | Photo | Programmers Desk | STREAMS
20070719 Thursday July 19, 2007

Hiding behind the clock

Hiding behind the clock

Recently I was investigating a very interesting CPU Caps test failure. One of the tests was failing when it was running early during zone boot. It turned out to be a very interesting bug, indeed. For once, it was a great exercise in using DTrace for solving complicated problem and it also exposed a generic weakness of Solaris scheduler. The bug is 6577453 Java CPU hogs can escape CPU Caps enforcement by sleeping a lot.

In a nutshell, a program may behave in such a way that it is never seen on CPU by clock() thread while it uses a noticeable chunk of CPU resources. CPU Caps were accurate in charging its CPU time (see the description of CPU Caps accounting mechanism), the policing was done by clock(). As a result some threads enjoyed unlimited access to CPU resources while blocking everyone else. To test the fix I wrote a little nasty program that demonstrates the problem.


#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <poll.h>
#include <sys/time.h>

#define NSEC_IN_MSEC (NANOSEC / MILLISEC)
#define DELTA (9 * NSEC_IN_MSEC)

int f(int x)
{
	return (x+1);
}

int main(int argc, char *argv[])
{
	for (;;) {
		int i;
		int y = 0;
		hrtime_t t1, t2;
		int rc = poll(NULL, 0, 1);

		t1 = gethrtime();
		t2 = t1 + DELTA;

		while (gethrtime() < t2) {
			y += f(i);
		}
	}
}

The program consumes 50% of a CPU and is never seen by a clock() thread. It starts after the clock tick and goes to sleep right before the next one. Running it demonstrates the general problem with Solaris scheduling class implementation - while all the micro-state accounting information is available it is not actually used by the scheduler. This program always runs at priority 59, defeating priority mechanisms of the time-sharing class.

Side note - I was unable to write the same program in Perl. It turned out that IO::Poll module converts the poll(2) timeout value to seconds by always multiplying it to 1000.
As many of you know, once you are involved in something, related factors start flying your way. Indeed, a few days after I investigated this problem, slashdot advertised the paper by Dan Tsafrir, Yoav Etsion and Dror Feitelson, describing the very same issue!

Seems like now is the time to fix this old standing problem in Solaris. I opened bug 6582502 Threads may hide behind the clock.


Technorati Tag:
Technorati Tag:
Technorati Tag:
Technorati Tag:

( Jul 19 2007, 04:40:11 PM PDT ) Permalink

20070619 Tuesday June 19, 2007

CPU Caps: thread accounting

CPU Caps: thread accounting

The CPU Caps project is now integrated in OpenSolaris and now I am working on back-porting it to S10 update. I think this is a good time to put some notes regarding its implementation details. The Implementation guide gives a good high-level overview, so here I'd like to concentrate on the bottom-up view.

Before we can penalize CPU usage of some threads we need to know how much CPU is consumed by every project. There are two main approaches available - sampling and monitoring. The difference can be illustrated using the freeway speed control example. Imagine that local police decided to crack down on the freeway speeders (in fact, this is exactly what happened in San Jose). The common method is to hide police vehicle in the bushes and wait until some unlucky schmuck races by. The chances of getting every speeder are not very high but over some period of time the method works because enough speeders would be eventually caught. Some lucky ones, though, would miss a chance meeting with a friendly policeman.

Another approach is to tag each car when it enters and exits freeway with the location and the time. Assuming that the speed is more or less constant it is easy to calculate the average speed and penalise speeding car when it exits the freeway. This method provides for much greater accuracy.

Using sampling we can periodically check what threads are running on each CPU and interpolate their CPU usage from that. For example, once every clock tick we find all threads running on a CPU and charge them 1 clock tick worth of CPU time. Some threads may have just arrived on CPU while others may be sitting there longer, but for long-running threads we should get a good enough estimation. This is the simplest approach and it was used in the initial CPU Caps prototype. The main trouble is that one tick is quite a long time on modern super-fast CPUs and a lot of thread activity may happen in the meantime.

Thread monitoring allows us to know exactly how much CPU time was consumed by a CPU. We do this by marking the time a thread boarded a CPU and left it. Since we are only interested in short-turn CPu usage (over a tick) we also need to check those running on a CPU and get their on-CPU time as well.

Solaris kindly provides us a convenient tool for such purposes, called micro-state accounting. It uses very accurate nanosecond-granularity timestamps whenever thread changes its states. The CPU Caps code uses this facility to calculate CPU usage of each thread. This is done by the mstate_thread_onproc_time() routine:


mstate_thread_onproc_time(kthread_t *t)
{
	hrtime_t aggr_time;
	hrtime_t now;
	hrtime_t state_start;
	struct mstate *ms;
	klwp_t *lwp;
	int	mstate;

        /* Ignore kernel threads */
	if ((lwp = ttolwp(t)) == NULL)
		return (0);

        /* Get the current thread state */
	mstate = t->t_mstate;
	ms = &lwp->lwp_mstate;
        /* time when thread entered this state */
	state_start = ms->ms_state_start;

        /* Thread's user + system + trap time */
	aggr_time = ms->ms_acct[LMS_USER] +
	    ms->ms_acct[LMS_SYSTEM] + ms->ms_acct[LMS_TRAP];

        /* current time */
	now = gethrtime_unscaled();

	/*
	 * NOTE: gethrtime_unscaled on X86 taken on different CPUs is
	 * inconsistent, so it is possible that now < state_start.
	 */
	if ((mstate == LMS_USER || mstate == LMS_SYSTEM ||
		mstate == LMS_TRAP) && (now > state_start)) {
                        /* Add time spent on CPU in the current state */
			aggr_time += now - state_start;
	}

	scalehrtime(&aggr_time);
	return (aggr_time);
}

This function returns the time spent on CPU by user-land threads since their birth. The t->t_lwp->lwp_mstate.ms_acct array contains aggregate time spent by thread in each of the possible states:

The function above is the foundation of the thread accounting done by CPU caps. The CPU-caps specific monitoring is implemented by each scheduling class which supports CPU caps. For each thread scheduling classes keep a little structure which contains the total time spent on CPU during thread lifetime. Every time thread leaves a CPU and also every tick, scheduling class code calls caps_charge_adjust function via the cpucaps_charge. The caps_charge_adjust function calculates the time spent on CPU since a thread was last checked and updates its total on-CPU time. We will take a closer look at it next time.


[ Technorati: ]

( Jun 19 2007, 06:16:28 PM PDT ) Permalink

20061121 Tuesday November 21, 2006

Caps revised What's new with CPU caps - observability! [Read More] ( Nov 21 2006, 06:25:03 PM PST ) Permalink

20060914 Thursday September 14, 2006

Using ZFS to save space on laptop Why ZFS is useful on laptops? It saves disk space! [Read More] ( Sep 14 2006, 06:41:11 PM PDT ) Permalink

20060911 Monday September 11, 2006

CPU Caps update

Now that the NUMA observability tools are integrated, I made some progress on the CPU caps. I was mostly scribbling documentation and produced a few useful pages:

I also managed to do some actual coding and fixed the nasty locking issue which could leave threads left on the wait queues forever when caps are disabled. I still have a few more issues to attack.

In the process I filed a few new resource management bugs/RFEs:

6466380
Project resource set callbacks are needlessly called on every fork()
6468003
prctl should support the notion of default and infinity
6468451
Errors from setting resource controls should propagate to the caller
( Sep 11 2006, 05:39:53 PM PDT ) Permalink Comments [1]

20060821 Monday August 21, 2006

Exchanging hats and wearing a CPU cap. CPU Caps update [Read More] ( Aug 21 2006, 05:07:26 PM PDT ) Permalink Comments [1]

20040924 Friday September 24, 2004

Solaris 10 and MIB counters Solaris 10 provides a couple of new ways to look at various networking-related MIB2 counters.

One is via kstat(1M) command. Try the following on Solaris 10 system:


$ kstat -c mib2

module: icmp                            instance: 0     
name:   rawip                           class:    mib2                          
        crtime                          58.54669218
        inCksumErrs                     0
        inDatagrams                     204641
        inErrors                        0
        outDatagrams                    19
        outErrors                       0
        snaptime                        274094.194688545
...

module: sctp                            instance: 0     
name:   sctp                            class:    mib2                          
        crtime                          58.446876953
        sctpAborted                     0
        sctpActiveEstab                 0
        sctpChecksumError               0
        sctpCurrEstab                   0

...
module: udp                             instance: 0     
name:   udp                             class:    mib2                          
        crtime                          58.522156463
        entry6Size                      48
        entrySize                       20
        inDatagrams                     546726
        inErrors                        0
        outDatagrams                    470496
        outErrors                       0
        snaptime                        274094.224208608


(Note the new SCTP counters useful if you use the new SCTP protocol).

You can get just IP counters by asking


$ kstat -m ip -c mib2

Or you can watch as they grow over time:


$ kstat -p 'ip::icmp:inMsgs'  1
ip:0:icmp:inMsgs        155322

ip:0:icmp:inMsgs        155322

ip:0:icmp:inMsgs        155322

ip:0:icmp:inMsgs        155324

ip:0:icmp:inMsgs        155326
...

This mechanism is more efficient than netstat -s command. Due to the way netstat is working, running netstat too often may actually hurt your system performance. Now if you are really interested why some of these counters are incremented, DTrace mib provider may be quite useful. For example, what processes write to the network?

$ dtrace -n 'mib:ip::tcpOutSegs{@[execname]=count()}'
dtrace: description 'mib:ip::tcpOutSegs' matched 1 probe
^C
  nfs4cbd                                                           1
  firefox-bin                                                       1
  ftp                                                               4
  sched                                                             6

( Sep 24 2004, 07:59:08 PM PDT ) Permalink Comments [1]

20040908 Wednesday September 08, 2004

Kernel TODO lists (a.k.a. task queues) Solaris 10 comes with a new interface for kernel developers: task queues. This interface allows you to execute some job later in the context of completely different thread. Why would you need to do that? For example, you driver has to wait for some resource to become available but you can't wait (may be driver code is executing in interrupt context or is holding some lock). In this case you can wait and complete processing in a different context where it is safe.

The interface is fully described in the taskq(9f) man page. Here is a snippet:


     #include <sys/sunddi.h>

     ddi_taskq_t *ddi_taskq_create(dev_info_t  *dip,  const  char
     *name, int nthreads, pri_t  pri, uint_t  cflags);

     void ddi_taskq_destroy(ddi_taskq_t *tq);

     int ddi_taskq_dispatch(ddi_taskq_t *tq, void  (*func)(void*), void *arg, uint_t dflags);

     void ddi_taskq_wait(ddi_taskq_t *tq);

     void ddi_taskq_suspend(ddi_taskq_t *tq);

     boolean_t ddi_taskq_suspended(ddi_taskq_t *tq);

     void ddi_taskq_resume(ddi_taskq_t *tq);

For example, your driver may create a task queue in its attach(9E) routine:

static ddi_taskq_t xxx_taskq;

int xxx_attach(dev_info_t *dip, ddi_attach_cmd_t cmd)
{
        switch (cmd) {
        case DDI_ATTACH:
                xxx_taskq = ddi_taskq_create(dip, "xxx_useful_taskq", 1, TASKQ_DEFAULTPRI, 0);
                ...
                break;
    
}

Then, somewhere in the guts of the driver code you may schedule a remote taskq:

typedef struct my_data {
        kmutex_t drv_lock;
        kcondvar_t drv_cv;
        boolean_t drv_ready;
} my_data_t;

void xxx_complete_request(void *data)
{
        my_data_t *arg = (my_data_t *)data;
        mutex_enter(&arg->drv_lock);
        while (!arg->drv_ready) {
              cv_wait(&arg->drv_cv, &arg->drv_lock);
        }
        xxx_complete_request_processing(arg);
}

void xxx_intr(void *data)
{
        ...
        if (ddi_taskq_dispatch(xxx_taskq, xxx_complete_request, data, DDI_NOSLEEP) != DDI_SUCCESS) {
                recover_from_memory_shortage();
        }
}


As a result, the driver interrupt routine will execute xxx_complete_request() at some later time in a context of completely separate thread. Note that if some task blocks waiting for something, it will block all tasks queued behind it, so you should be careful to avoid deadlocks.

Also note that if you create a task queue with a single thread, it will guarantee that all tasks are processed sequentially in order posted. Otherwise an order is not guaranteed. How can you use it in your drier? Hint: it is a DDI-compliant (and DR-friendly) way to create threads needed by your driver code. ( Sep 08 2004, 07:13:06 PM PDT ) Permalink

20040723 Friday July 23, 2004

Testing and Debugging. I am always interested in how engineers come to their final products. This may be an interesting case study of a problem I encountered recently.

I am reimplementing a driver responsible for AF_UNIX sockets. It should solve serious scalability problems with the current "tl" driver.

To test it I wrote a special DTrace script that traces all entry points to the driver and provides some useful information. When something goes wrong I can look at the trace and figure out what is happening.

To test the driver I've written some simple tests in C which perform basic operations with sockets: bind(), listen(), connect(), accept. Everything works fine. Then I looked at Perl Test::More module and decided to write some simple tests in Perl. Here is the full text of the test:

use strict;
use Fcntl;
use Socket;
use File::Temp;
use Test::More tests => 5;
my $type = SOCK_STREAM;
my $name = "SOCK_STREAM";

my $socket;
my $listener;
my $acceptor;
my $peer;
my $pname;
my $res;

my $sname = tmpnam();

socket(Server, AF_UNIX, SOCK_STREAM, AF_UNSPEC) or die ("socket: $!");
socket (Client, AF_UNIX, SOCK_STREAM, AF_UNSPEC) or die ("socket: $!");
socket (Acceptor, AF_UNIX, SOCK_STREAM, AF_UNSPEC) or die ("socket: $!");

unlink ($sname);

my $uaddr = sockaddr_un($sname);
$res = bind(Server, $uaddr) or die ("bind: $!");
ok($res, "bind");
$res = listen(Server, 1) or die ("listen: $!");
ok($res, "listen");
$res = connect (Client, $uaddr) or die ("connect: $!");
ok($res, "connect");
accept (Acceptor, Server) or die ("accept: $!");

unlink ($sname);
Suddenly the test failed on accept() with ESPIPE (illegal seek) error! The C equivalent of the test worked fine. So I got two machines - one without my driver and one with it and run the test with the monitoring D scripts. Spending quite some time looking at the results, I didn't find any differences. So I run truss and saw that accept actually succeeded, but for some reason Perl version failed.

Playing with truss(1) options I produced a huge file tracing calls to libc and libsocket. Everything looked fine with accept itself, but then, for some reason Perl version started to do completely different things. At that point I needed some other victim, so I headed to Bart and we started to look at these two truss outputs together. Bart spotted that the "good" run did malloc() of 17 bytes after accept while the bad one used malloc for one byte. This quickly lead us to conclusion that the problem is in the peer address returned in the accept(3SOCKET) call:

      int accept(int s, struct sockaddr *addr, socklen_t *addrlen);
Indeed, a short C program immediately showed that returned addrlen was zero!

Going back to the driver source everything seems normal. Some additional DTrace hacking showed exact message received by socket layer that should contain the peer address. Dumping the mblk in DTace shows that instead of options it contains a pointer! Another look at the source code reveals the problem:

        t_scalar_t      olen = creq->OPT_length;
        t_scalar_t      ooff = creq->DST_offset;
Here is the bug! Instead of using OPT_offset it uses DST_offset. It took quite a while to figure this one out.Here is a snippet from the D that helped to track the problem:
fbt:tl:tl_conn_req:entry
{
        this->q = args[0];
        self->trace = 1;
        this->mp = args[1];
        this->connreq = (struct T_conn_req *)(this->mp)->b_rptr;
        printf ("prim=%d, len=%d, off=%d, olen=%d, ooff=%d",
            this->connreq->PRIM_type, 
            this->connreq->DEST_length, 
            this->connreq->DEST_offset, 
            this->connreq->OPT_length,
            this->connreq->OPT_offset);
}

fbt::putnext:entry
/self->trace &&
 args[1]->b_datap->db_type == 1 &&
*(int*)(args[1]->b_rptr) == 11/
{
        self->trace = 0;
        this->mp = args[1];
        this->tci = (struct T_conn_ind*)this->mp->b_rptr;
        this->opt = (struct T_opthdr *)(this->mp->b_rptr + this->tci->OPT_offset);
        printf("conn_ind: mp=%p, sz=%d, ol=%d, oo=%d ",
            args[1],
            msgsize(this->mp),
            this->tci->OPT_length,
            this->tci->OPT_offset);
        printf("olemn=%d, olev=%d, on=%d, os=%d",
            this->opt->len,
            this->opt->level,
            this->opt->name,
            this->opt->status);
}

I think it is a good example of how the process works inside. It really helps to use different tools at hand (and extra pair of eyes!).

Side Note

While playing with Perl I discovered that it uses fdopen() for all perl File objects, so it is impossible to open more then 256 open sockets (actually 253 if we account for STDIN, STDOUT and STDERR). Here C is still more powerful then Perl.

( Jul 23 2004, 08:24:32 PM PDT ) Permalink

Calendar

RSS Feeds

Search

Links

Navigation

Referers