Trantorian Gazette

Spread-Spectrum EMI and the system clock.

Monday Mar 24, 2008

I was recently pulled into a problem with system clocks. Because I am the "go to guy" for NTP at Sun, I often get involved with system clock problems because they are often only noticed with the customer starts running NTP

Anyway, in this particular case, a customer was evaluating some systems and noticed that the computer clock was seemed to be jumping around quite a bit. The account team set up a couple of test systems and were able to see the same problem. That's when I was called in.

The first thing to do in this kind of situation is to get NTP out of the picture. NTP can only correct clock drifts up to a certain point, and beyond that you start getting some nasty interactions that can obscure the real issues. So, we turned xntpd off and started running "ntpdate -s -q" in a cron job every minute.

This revealed something interesting. The system was drifting to a two second offset in about half an hour and then jumping back to zero offset. This is a typical symptom when you are not running NTP. There is a battery backed hardware TOD (time of day) clock built into almost all modern systems which is used to set the time at boot up. It is also used to double check the system clock. When the system clock and the TOD clock differ by 2 seconds or more, one is set to match the other. Which happens depends on other variables, but in the set up I described, the system clock is almost always reset to match the TOD clock.

So, in this case what was happening was clear. The system clock drifted by 2 seconds each half hour and then was reset by the TOD clock. If you checked less often than once each half hour, it would appear to jump around randomly.

So why was the system losing time? We calculated this out, and it works out to about a 0.125% error rate. This rang a bell with one of the other engineers I was working with. It turns out that the system clock is modulated by 0.25%. I had never heard of this before. It is called "spread-spectrum clock".

Here's the deal. Systems throw off a lot of EMI (electrical magnetic interference) and there are regulations by the FCC as to exactly how much EMI you are allowed to have. One reason that computers throw off this EMI is that they are run using a digital clock signal that synchronizes all of the system components. All of the parts are in sync at exactly the same frequency, so they throw off a lot of EMI at exactly that frequency, with lesser amounts at the harmonics.

The FCC regulations determine the peak EMI at a particular frequency allowed. So, to lower the peak EMI, the idea is to slightly modulate the system frequency, so that the EMI thrown off is spread over a range of frequencies. This lowers the peak while keeping the total energy released constant. At first I thought it sounded like cheating, but I looked it up and the FCC is totally okay with this.

So, in this case, the system frequency is modulated downward by 0.25% three thousand times a second. I guess that because it was modulated so quickly (3KHz) by such a small amount (0.25%) the designers figured that nothing in the software would be affected. The only thing is, when you are counting by nanoseconds, those errors can add up. The average frequency is of course 0.125% slower than the rated frequency which as I said above works out to an error of 2500 part per million. NTP can correct errors of up to 500 PPM, so NTP was way out of its league. We normally require a maximum error of 2 seconds per day on Sun systems, which is an accuracy of 0.0023%. Makes you appreciate how finely tuned these things are.

So, as a result, the firmware is being changed to report the average frequency to the system, not the peak frequency. See? Even a small error can add up

Like this post? del.icio.us | furl | slashdot | technorati | digg

OpenSolaris NTP Project heating up.

Wednesday Apr 25, 2007

There is finally light at the end of the tunnel for the NTP project at OpenSolaris.org. We are going to be releasing test bits soon, and we will need testers when that happens. In addition, we need people to help work on the man pages. The man pages from the NTP open source project are a good starting point, but they need some work to make them more Solaris specific. We also need SMF experts to help design and implement the SMF manifest and methods. And last but not least, we need people to help work on the PSARC documents needed to file the case.

Since this is a open project, we are going to leverage Google docs to help with the collaboration. If you want to help, just send me your email address and I will add you to the document access on Google docs. My email is (explanded to avoid spam) brian dot utterback at sun dot com. And please join the discussion at ntp dash discuss at opensolaris dot org. The project page is at http://opensolaris.org/os/project/ntp/. See you there!

Like this post? del.icio.us | furl | slashdot | technorati | digg

Time Travel 10 milliseconds

Tuesday Apr 24, 2007

I was recently involved with a maddening bug (CR 6540867). The problem was that the gettimeofday system call would on occasion show a time 10 milliseconds in the past. This only affected the amd64 architecture and while running multiple CPUs. This turned out to be an exercise in frustration for me, but also a major learning experience.

I had a test case that would run gettimeofday calls in a tight loop, and check if one ever returned a value prior to one already returned. I used truss to see what system call was being used. Surprise! The gettimeofday call doesn't use any system call, the truss just sat there.

Okay, remembering the rule of "use the source, Luke", I used the OpenGrok source browser at OpenSolaris.org to start to get a handle on just how gettimeofday worked. Drilling down in the the code in libc, I found that on amd64, it is an assembly code line that looks like this:
SYSFASTTRAP(GETHRESTIME)
So where is the corresponding implementation in the kernel for the trap GETHRESTIME? I plugged that into OpenGrok and got another surprise; putting it into the "definition" search got no results, the "symbol" search got the libc implementations, and putting it into "full" search got 110 hits, but none of them was the one I wanted. For full searchs opengrok is case insensitive, so I was getting all the hits for the routine "gethrestime". So where is it?

Turns out that the kernel uses the symbol "T_GETHRESTIME" instead. I found that symbol in the intel architecture subdirectory of the kernel. To make a long story short, what actually gets called goes through a convoluted maze of function pointers and such, but the gist of it is that the trap calls through the trap table to get_hrestime which calls gethrestime which calls the function pointed to by gethrestimef which happens to be pc_gethrestime.

Once in pc_gethrestime, it reads the timestamp set at the last tick, adds the time since the last tick to that and returns it. There is a possible race here, since the timestamp of the last tick is set in the hres_clock routine which could run at the same time on another CPU. If you read the timestamp of the last tick, and then the number of cycles since the last tick, but hres_tick ran in between, you would see the time go backward one tick, i.e. 10 milliseconds. Hmm...sound familiar?

The thing is that there is a lock around the timestamp and the routine that gets the cycles since the last tick so that they are guaranteed to be consistent. The code looks like this:

void pc_gethrestime(timestruc_t *tp)
{
        int lock_prev;
        timestruc_t now;
        int nslt;               /* nsec since last tick */
loop:
        lock_prev = hres_lock;
        now = hrestime;
        nslt = (int)(gethrtime() - hres_last_tick);

 lots of stuff that normalizes the timestamp (elided)

        if ((hres_lock & ~1) != lock_prev)
                goto loop;
Now it took a lot of thinking to see how this lock works. The idea is that the lowest byte of hres_lock is 0 when it is unlocked and -1 (8 one bits) when it is locked. When hres_lock grabs the lock, it uses the atomic amd64 instruction xchgb, which exchanges the value in the byte with -1. If the value returned is 0, then it has the lock. If the value was already -1, then it was already locked and hres_lock needs to go back and try again. When it wants to unlock the lock, it does an atomic increment to it. This changes the 8 one bits to all zero and adds one to the value in the upper three bytes. So when pc_gethrestime stores the value of the lock before it starts and then compares it to the value afterward except the last bit, if the lock was locked before it started, or is locked now or was locked at anytime in between, then the test fails and it goes back and does it all again.

So, try as I might, I just could not come up with a scenario where we could get the inconsistent values. I was banging my head against the wall on this. Finally, I had to go away for a week, and I passed this to my colleague Sudheer Abdul-Salam to work on, and he finally came up with the answer that was eluding me.

Before I tell you the answer, I want to explain something. I have a background writing software, and really take the source code to heart. To me, the source code is the ultimate authority. On the other hand Sudheer is a kernel guy. To a kernel guy, the core dump is the ultimate authority. So Sudheer didn't just look at the source, he looked at the core. And what he found was that for the lines:

        lock_prev = hres_lock;
        now = hrestime;
the compiler was actually producing
pc_gethrestime+0x10: movq  0x3eaee1(%rip),%rax	<--hrestime>
pc_gethrestime+0x17: movq  0x3eaee2(%rip),%rdx	<--hrestime+8>
pc_gethrestime+0x1e: movl  0x3eaefc(%rip),%ebx	<--hres_lock>
That is, the code was reading the hretime timestamp outside of the block locked by hres_lock. When the tick was updated at just that point, the time jumped backward. Well, if you can't trust the compiler, who can you trust?

I am told that the problem is that hres_lock and hrestime should be declared "volatile" and then the compiler would not have reordered the loads. Doh!

Hat's off to Sudheer for figuring this one out and preserving my sanity. Remember that you can go right to the source, but it is the core that is core. I just hope I don't get an issue that involves taking the cash (cache) to the (memory) bank. The only thing more core than core is what's in the cache.

[1] Comments
Like this post? del.icio.us | furl | slashdot | technorati | digg