Doing the Jitter Bug
One of the nice things about finally having
OpenSolaris
properly launched is that I can share Solaris source code with everyone
when talking
about Solaris. Since Solaris
is the code, not being able to
show people the code has
been like having a discussion about various aspects of the flavors of
wine without actually tasting any - possible perhaps, but a rather dry
(heh)
and uninteresting narrative....
I was happily (as I don't get to write nearly enough code) whacking
away on a revised version of malloc
to fix some longstanding performance and scalabilty issues when one of
our marketing folks darted in
asking about telcos, real time, and latency bubbles [I'll write about
our malloc findings later on]. After getting the relevant email trail
and a test
program from the customer contact, it turned out that indeed the
customer was seeing a real problem - 20 millisecond timers was
apparently firing up to 60 microseconds late occasionally on his
Opteron box, but no problem was seen on
SPARC machines. What's weirder, this problem appeared sometimes, but
not always
- any given instance of the test
program might display the problem, or it might not.
Hmmm. The customer's test program would carefully store
information about intervals between timer firings that exceeded 40
usecs + 20 msecs and report them every 20 seconds or so... and
when the problem appeared, it was happening once a second...
Like most performance problems I run into, this looked like a job
for DTrace.
As Bryan has
pointed out many times, DTrace is best at quickly evaluating
various hypotheses...so I needed some testable ideas as to what might
be going on. Well, what could make the program's timers fire late?
- Other threads or interrupts - perhaps these are interfering with
our program somehow?
- Other timers firing on this cpu. Maybe some other system
timers are causing problems on CPU 1?
Well, hypothesis testing time. First of all, the test program
clearly
created a processor set containing CPU 1 and placed itself as the only
thread in that set. It even disabled interrupts on
that CPU. A quick check with the following script:
on-cpu
/curcpu->cpu_id == 1/
{
trace(curthread);
}
verified that there was only a single thread (the expected one) running
on CPU 1. Hmmm. interstat verified that CPU 1 wasn't
getting any device interrupts, either. So what's
happening here? I had some experience with time keeping on
Solaris, but not enough to know offhand where to look
for this problem. Confused? DTrace will help! A quick
change to the above script lets us use the
enqueue probe in the sched provider to discover exactly how our process
is made runnable when the timer expires:
sched:::enqueue
/curcpu->cpu_id == 1/
{
stack(10);
}
The output is pretty straightforward - lots of entries that look like
this:
1 2561 setbackdq:enqueue
TS`ts_setrun+0xf1
genunix`cv_unsleep+0x6e
genunix`setrun_locked+0x6d
genunix`eat_signal+0x89
genunix`sigtoproc+0x420
genunix`sigaddqa+0x4a
genunix`timer_fire+0xb9
genunix`clock_highres_fire+0x3f
genunix`cyclic_softint+0xb2
unix`cbe_low_level+0x14
This gave me a place to get started....
timer_fire is clearly important,
so let's instrument that and see when it fires:
#pragma D option quiet
BEGIN
{
start = timestamp;
last = timestamp;
}
fbt::timer_fire:entry
{
printf("%d: interval is %d usecs late\n",
(timestamp - start)/1000,
(int) (((timestamp - last) - 20000000))/1000);
last = timestamp;
}
This script just prints a line with a timestamp in microseconds since
we started, with the number of microseconds late (more than the 20
msec) interval that was desired.
Interestingly enough, the output looks like this:
13822657: interval is 0 usecs late
13842656: interval is 0 usecs late
13862657: interval is 0 usecs late
13882657: interval is 0 usecs late
13902657: interval is 0 usecs late
13922599: interval is -57 usecs late
13942657: interval is 57 usecs late
13962657: interval is 0 usecs late
13982657: interval is 0 usecs late
14002657: interval is 0 usecs late
Whoops! Turns out our test program was leading us down the garden
path; the problem is that one timer fire appears to be
early, and the next one makes up the difference; the net error is 0
over time. Reading the comments in
cyclic.c,
this isn't
that surprising but does give us a place to start looking. At
this point, I noticed that I can suspend the test program and
still get the same intermittant late behavior. How often are we late?
#pragma D option quiet
BEGIN
{
start = timestamp;
last = timestamp;
}
fbt::timer_fire:entry
/(int) (((timestamp - last) - 20000000))/1000 != 0/
{
printf("%d: interval is %d usecs late\n",
(timestamp - start)/1000,
(int) (((timestamp - last) - 20000000))/1000);
}
fbt::timer_fire:entry
{
last = timestamp;
}
5974796: interval is -57 usecs late
5994854: interval is 56 usecs late
6974796: interval is -57 usecs late
6994854: interval is 56 usecs late
Interesting - once a second.
Reading the
cyclic.c
source code we see that cyclic_fire handles all the interesting timer
stuff. Let's trace cyclic_fire in addition to timer_fire:
2248600: interval is 0 usecs late
2263098: cyclic_fire
2268539: cyclic_fire
2268543: interval is -57 usecs late
2288597: cyclic_fire
2288600: interval is 57 usecs late
So did we misprogram the clock, or is something else going on? On line 929 ofcyclic.c, we see that we reprogram the interrupt source through a function pointer. Those are a pain. Dropping into mdb:
# mdb -k
Loading modules: [ unix krtld genunix specfs dtrace ufs ip sctp usba
s1394 nca fctl lofs nfs audiosup random sppp ipc ptm logindmux ]
> ::cpuinfo
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
0 fffffffffbc1ede0 1b 3 0 59 no no t-0 ffffffff8584e220 mdb
1 ffffffff825cb800 b 0 0 -1 no no t-516001 fffffe8000275c80 (idle)
> ffffffff825cb800::print cpu_t cpu_cyclic
cpu_cyclic = 0xffffffff80f45600
> 0xffffffff80f45600::print cyc_cpu_t cyp_backend
cyp_backend = 0xffffffff803d65a8
> 0xffffffff803d65a8::print cyc_backend_t
{
cyb_configure = cbe_configure
cyb_unconfigure = 0
cyb_enable = cbe_enable
cyb_disable = cbe_disable
cyb_reprogram = cbe_reprogram
cyb_softint = cbe_softint
cyb_set_level = cbe_set_level
cyb_restore_level = cbe_restore_level
cyb_xcall = cbe_xcall
cyb_suspend = 0
cyb_resume = 0
cyb_arg = 0xffffffff825cb800
}
>
So we need to trace calls to
cbe_reprogram:
#pragma D option quiet
BEGIN
{
start = timestamp;
last = timestamp;
}
cbe_reprogram:entry
/curcpu->cpu_id == 1/
{
firetime = (arg1 - start)/1000;
printf("%d: cbe_reprogram: set timer to fire at %d\n",
(timestamp - start)/1000, firetime);
}
cyclic_fire:entry
/curcpu->cpu_id == 1/
{
now = (timestamp - start)/1000;
printf("%d: cyclic_fire: %d usecs late\n",
now, now - firetime );
}
fbt::timer_fire:entry
/curcpu->cpu_id == 1/
{
printf("%d: interval is %d usecs late\n",
(timestamp - start)/1000,
(int) (((timestamp - last) - 20000000))/1000);
last = timestamp;
}
3839612: cyclic_fire: 80 usecs late
3839614: cbe_reprogram: set timer to fire at 3859532
3839615: interval is 0 usecs late
3859613: cyclic_fire: 81 usecs late
3859614: cbe_reprogram: set timer to fire at 3879532
3859616: interval is 0 usecs late
3879612: cyclic_fire: 80 usecs late
3879614: cbe_reprogram: set timer to fire at 3894055
3879615: interval is 0 usecs late
3894114: cyclic_fire: 59 usecs late
3894115: cbe_reprogram: set timer to fire at 3899532
3899555: cyclic_fire: 23 usecs late
3899556: cbe_reprogram: set timer to fire at 3919532
3899558: interval is -57 usecs late
3919613: cyclic_fire: 81 usecs late
3919614: cbe_reprogram: set timer to fire at 3939532
3919616: interval is 57 usecs late
3939612: cyclic_fire: 80 usecs late
3939614: cbe_reprogram: set timer to fire at 3959532
3939616: interval is 0 usecs late
Ahaha! Turns out we're late all of the time; normally 80
usecs
when there's a full 20 msec between cyclic firings, but less when
the deadman cyclic has fired recently. Computing error as a
percentage of sleep time, we see that our error is a relatively
constant .4% of the interval slept; it's the varying time bewteen
firings (due to multiple cyclics with different periods) that's
responsible for the jitter. Ok, so what is
cbe_reprogram
doing?
Rats, calling through another function pointer. Ok, this time
we'll
use DTrace to figure out where we're going rather than mdb:
cbe_reprogram:entry
/curcpu->cpu_id == 1/
{
self->trace = 1;
}
fbt:::entry,
fbt:::return
/self->trace/
{
}
cbe_reprogram:return
/curcpu->cpu_id == 1/
{
self->trace = 0;
}
1 -> cbe_reprogram
1 -> apic_timer_reprogram
1 -> gethrtime
1 -> tsc_gethrtime
1 <- tsc_gethrtime
1 <- gethrtime
1 <- apic_timer_reprogram
1 <- cbe_reprogram
Ok, now let's look at the source for
apic_timer_reprogram
ticks = (uint_t)((time - now) / apic_nsec_per_tick);
This is introducing a .4 % error... I wonder what the value
of apic_nsecs_per_tick really is?
# echo 'apic_nsec_per_tick/D' | mdb -k
apic_nsec_per_tick:
apic_nsec_per_tick: 10
That's it. We're suffering from an imprecise conversion between
nsecs and apic ticks; the appropriate value cannot be represented
as an integer with this conversion accurately enough to keep excessive
jitter from appearing. Since the deadman cyclic runs every second,
if we want to keep worst case jitter due to quantization errors
down to 1 usec we'll need 1 ppm resolution on the conversion
factor.
This also explained why sometimes the bug isn't visible - if the
a 20 msec cyclic always occurs just before the deadman
cyclic fires, there is almost no change in the amount of time
between cyclic firings.
At this point I filed a bug,
6266961.
The fix involves rewriting the
conversion to use a different factor and redoing the calibration of
the apic timer to use the kernel's idea of hi-res time rather than the
pit timer.
/*
* deal with resolution issues by converting between ticks and
* nanoseconds using 64 bit math.
* use multiplication cause it's faster; divide by power of 2
* for speed.
*/
static int64_t apic_ticks_per_128Mnsecs; /* ticks/128*1024*1024 nsecs */
#define APIC_TICKS_TO_NSECS(ticks) (((long long)(ticks) * 134217728LL)/\
apic_ticks_per_128Mnsecs)
#define APIC_NSECS_TO_TICKS(nsecs) ((long long)(nsecs) * \
apic_ticks_per_128Mnsecs / 134217728LL)
After I get this finished
up, it's back to malloc... more about that later
on.
Technorati Tag:
OpenSolaris
Technorati Tag:
Solaris
Technorati Tag:
DTrace
Technorati Tag:
mdb