Darryl Gove's blog

Thursday Apr 19, 2007

Locating DTLB misses using the Performance Analyzer

DTLB misses typically appear in the Performance Analyzer as load instructions with significant user time attributed to them. The following code strides through memory in steps of 8192 bytes (1024 doubles), and so encounters many DTLB misses:

#include <stdlib.h>
int main(void)
{
  double *a;
  double total=0;
  int i;
  int j;
  a=(double*)calloc(10*1024*1024+10001,sizeof(double)); /* calloc(count, size) */
  for (i=0;i<10000;i++)
   for(j=0;j<10*1024*1024;j+=1024) /* stride: 1024 doubles = 8192 bytes */
    total+=a[j+i];
  return 0;
}

A profile can be gathered as follows:

cc -g -O -xbinopt=prepare -o tlb tlb.c
collect tlb

Viewing the profile for the main loop using er_print produces the following snippet:

   Excl.     
   User CPU  
    sec. 
...
   0.230                [11]    10c70:  prefetch    [%i5 + %i1], #n_reads
   0.                   [11]    10c74:  faddd       %f4, %f2, %f8
   3.693                [11]    10c78:  ldd         [%i5], %f12
## 7.685                [10]    10c7c:  add         %i5, %i3, %o3
   0.                   [11]    10c80:  faddd       %f8, %f0, %f10
   4.183                [11]    10c84:  ldd         [%o3], %f2
## 6.935                [10]    10c88:  add         %o3, %i3, %o1
   0.                   [11]    10c8c:  faddd       %f10, %f12, %f4
   0.                   [10]    10c90:  inc         3072, %i4
   4.123                [11]    10c94:  ldd         [%o1], %f0
## 7.065                [10]    10c98:  cmp         %i4, %i0
   0.                   [10]    10c9c:  ble,pt      %icc,0x10c70
   0.                   [10]    10ca0:  add         %o1, %i3, %i5

This can be compared with the situation where mpss.so.1 has been preloaded to enable the application to get large pages:

   0.                   [11]    10c70:  prefetch    [%i5 + %i1], #n_reads
   0.                   [11]    10c74:  faddd       %f4, %f2, %f8
   0.                   [11]    10c78:  ldd         [%i5], %f12
## 7.445                [10]    10c7c:  add         %i5, %i3, %o3
   0.                   [11]    10c80:  faddd       %f8, %f0, %f10
   0.220                [11]    10c84:  ldd         [%o3], %f2
## 6.955                [10]    10c88:  add         %o3, %i3, %o1
   0.                   [11]    10c8c:  faddd       %f10, %f12, %f4
   0.                   [10]    10c90:  inc         3072, %i4
   0.340                [11]    10c94:  ldd         [%o1], %f0
   0.                   [10]    10c98:  cmp         %i4, %i0
   0.                   [10]    10c9c:  ble,pt      %icc,0x10c70
   0.                   [10]    10ca0:  add         %o1, %i3, %i5

The difference between the two profiles is that, without large pages, user time is attributed directly to the load instructions. Normally the time for a load shows up on the instruction after it, as it does in the large-page profile.

It is possible to confirm that these are DTLB misses using the Performance Analyzer's ability to profile an application using the hardware performance counters:

collect -h DTLB_miss tlb
...
   Excl.      
   DTLB_miss  
   Events   
...
          0             [11]    10c70:  prefetch    [%i5 + %i1], #n_reads
          0             [11]    10c74:  faddd       %f4, %f2, %f8
   30000090             [11]    10c78:  ldd         [%i5], %f12
          0             [10]    10c7c:  add         %i5, %i3, %o3
          0             [11]    10c80:  faddd       %f8, %f0, %f10
## 42000126             [11]    10c84:  ldd         [%o3], %f2
          0             [10]    10c88:  add         %o3, %i3, %o1
          0             [11]    10c8c:  faddd       %f10, %f12, %f4
          0             [10]    10c90:  inc         3072, %i4
   30000090             [11]    10c94:  ldd         [%o1], %f0
          0             [10]    10c98:  cmp         %i4, %i0
          0             [10]    10c9c:  ble,pt      %icc,0x10c70
          0             [10]    10ca0:  add         %o1, %i3, %i5

The events are reported on the load instructions that are causing the DTLB misses.
