Darryl Gove's blog

Friday Jul 11, 2008

Reading the %tick counter

It's often necessary to time the duration of a piece of code. To do this I often use gethrtime(), which is typically a quick call. Obviously, the faster the call, the more accurate I can expect my timing to be. Sometimes the call to gethrtime() takes too long (or is too costly, because it's made so frequently). The next thing to try is reading the %tick register.

The %tick register is a 64-bit register that gets incremented on every cycle. Since it's stored on the processor, reading the value of the register is a low-cost operation. The only complexity is that, being a 64-bit value, it needs special handling in 32-bit code, where a 64-bit return value from a function is passed in the %o0 (upper bits) and %o1 (lower bits) registers.

The inline template to read the %tick register in 64-bit code is very simple:

.inline tk,0
   rd %tick,%o0
.end

The 32-bit version requires two more instructions to get the return value into the appropriate registers:

.inline tk,0
   rd %tick,%o2
   srlx %o2,32,%o0
   sra %o2,0,%o1
.end
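As a rough illustration of what the 32-bit template has to do, the split into upper and lower halves can be modelled in portable C. This is only a sketch: the helper names tick_upper, tick_lower and tick_join are made up for this example, and on SPARC the real work is done by the srlx and sra instructions in the template above.

```c
#include <stdint.h>

/* Upper 32 bits of a 64-bit tick value: the half returned in %o0. */
static uint32_t tick_upper(uint64_t tick)
{
    return (uint32_t)(tick >> 32);
}

/* Lower 32 bits of a 64-bit tick value: the half returned in %o1. */
static uint32_t tick_lower(uint64_t tick)
{
    return (uint32_t)tick;
}

/* Rebuild the 64-bit value from the two halves, as the caller must do. */
static uint64_t tick_join(uint32_t upper, uint32_t lower)
{
    return ((uint64_t)upper << 32) | lower;
}
```

Splitting a value with tick_upper and tick_lower and then passing the halves to tick_join gives back the original 64-bit value, which is the round trip a 32-bit caller of tk() performs.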

Here's an example program that uses the %tick register to print the current value plus an estimate of the cost of reading the %tick register:

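#include <stdio.h>

/* Provided by the inline template in tk.il; returns the %tick value. */
long long tk();

int main()
{
  /* Note: C does not specify the order in which the two tk() calls in
     -tk()+tk() are evaluated; this relies on the first being read first. */
  printf("value=%llu duration=%llu\n",tk(),-tk()+tk());
  return 0;
}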

The compile line is:

$ cc -O tk.c tk.il

The output should be something like:

$ a.out
value=4974674895272956 duration=32

This indicates that 32 cycles (in this instance) elapsed between the two reads of the %tick register. Looking at the disassembly, there are certainly a number of cycles of overhead:

        10bbc:  95 41 00 00  rd         %tick, %o2
        10bc0:  91 32 b0 20  srlx       %o2, 32, %o0
        10bc4:  93 3a a0 00  sra        %o2, 0, %o1
        10bc8:  97 32 60 00  srl        %o1, 0, %o3 
        10bcc:  99 2a 30 20  sllx       %o0, 32, %o4
        10bd0:  88 12 c0 0c  or         %o3, %o4, %g4
        10bd4:  c8 73 a0 60  stx        %g4, [%sp + 96]
        10bd8:  95 41 00 00  rd         %tick, %o2
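A single measurement like the 32 cycles above can vary from run to run, so one way to get a steadier estimate of the read cost is to take the minimum over several back-to-back reads. The sketch below is not part of the original example: min_read_cost and fake_counter are hypothetical names, and fake_counter merely stands in for tk() so that the code compiles on a machine without a %tick register.

```c
#include <stdint.h>

/* Any function that returns a monotonically increasing cycle count. */
typedef uint64_t (*counter_fn)(void);

/* Stand-in for tk(), for illustration only: advances 8 "cycles" per read. */
static uint64_t fake_counter(void)
{
    static uint64_t ticks = 0;
    ticks += 8;
    return ticks;
}

/* Smallest difference seen between two back-to-back reads of the counter.
   Returns UINT64_MAX if samples is zero or negative. */
static uint64_t min_read_cost(counter_fn read_counter, int samples)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < samples; i++) {
        uint64_t start = read_counter();
        uint64_t end = read_counter();
        uint64_t delta = end - start;
        if (delta < best) {
            best = delta;
        }
    }
    return best;
}
```

On a SPARC system the idea would be to wrap tk() in a function with the counter_fn signature and pass that in place of fake_counter.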

The overhead of splitting and recombining the 64-bit value can be reduced by treating the %tick register as a 32-bit read, effectively ignoring the upper bits. For very short duration code this is probably acceptable, but it is unsuitable for longer running code blocks. With this (inelegant) hack, the following code is generated:

        10644:  91 41 00 00  rd         %tick, %o0
        10648:  b0 07 62 7c  add        %i5, 636, %i0
        1064c:  b8 10 00 08  mov        %o0, %i4
        10650:  91 41 00 00  rd         %tick, %o0

This version usually returns a value of 8 cycles on the same platform.
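To see why the 32-bit shortcut is fine for short intervals but not for long ones, the arithmetic can be modelled in portable C. This is a sketch with hypothetical helper names (elapsed32 and elapsed64); it only mimics keeping the lower 32 bits of each %tick sample and does not read the register itself.

```c
#include <stdint.h>

/* Elapsed cycles using the full 64-bit tick values. */
static uint64_t elapsed64(uint64_t start, uint64_t end)
{
    return end - start;
}

/* Elapsed cycles using only the lower 32 bits of each sample.
   Unsigned subtraction wraps modulo 2^32, so the result is correct
   as long as fewer than 2^32 cycles passed between the two reads. */
static uint32_t elapsed32(uint64_t start, uint64_t end)
{
    return (uint32_t)end - (uint32_t)start;
}
```

For two samples 32 cycles apart, both helpers report 32 even if the lower 32 bits wrap around between the reads. For samples 2^32 + 8 cycles apart, elapsed32 reports only 8, which is why the shortcut is unsuitable for longer running code blocks.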
