Darryl Gove's blog
Locating DTLB misses using the Performance Analyzer
DTLB misses typically appear in the Performance Analyzer as loads with significant user time. The following code strides through memory in steps of 8192 bytes (1024 doubles), and so encounters many DTLB misses:
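#include <stdlib.h>
int main(void)
{
  double *a;
  double total=0;
  int i;
  int j;
  a=(double*)calloc(10*1024*1024+10001,sizeof(double)); /* zero-filled, with room for the j+i offset */
  for (i=0;i<10000;i++)
    for(j=0;j<10*1024*1024;j+=1024) /* 1024 doubles = 8192 bytes per step */
      total+=a[j+i];
  return 0;
}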
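To see why this access pattern is hard on the DTLB, it helps to count how many distinct pages one pass of the inner loop touches. The following is a minimal sketch rather than part of the original test program: the function name pages_touched is hypothetical, and it assumes the array starts on a page boundary. The page sizes used in the note below (8 KB base pages and 4 MB large pages) are also assumptions about the target system.

```c
#include <assert.h>
#include <stddef.h>

/* Count the distinct pages touched by one pass of a strided loop
 * (j = 0, stride, 2*stride, ... while j < n_elems) for a given page size.
 * Assumption: the array begins on a page boundary. */
size_t pages_touched(size_t n_elems, size_t stride_elems,
                     size_t elem_bytes, size_t page_bytes)
{
    assert(stride_elems > 0 && page_bytes > 0);

    size_t count = 0;
    size_t last_page = (size_t)-1;   /* sentinel: no page seen yet */

    for (size_t j = 0; j < n_elems; j += stride_elems) {
        size_t page = (j * elem_bytes) / page_bytes;
        /* Offsets only increase, so a page change means a new page. */
        if (page != last_page) {
            count++;
            last_page = page;
        }
    }
    return count;
}
```

Calling pages_touched(10*1024*1024, 1024, sizeof(double), 8192) gives 10240, so under the 8 KB assumption every load in a pass lands on a different page. With an assumed 4 MB page size the same call gives 20, few enough that the translations are much more likely to stay resident in the TLB.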
A profile can be gathered as follows:
cc -g -O -xbinopt=prepare -o tlb tlb.c
collect tlb
Viewing the profile for the main loop using er_print produces the following snippet:
Excl.
User CPU
sec.
...
0.230 [11] 10c70: prefetch [%i5 + %i1], #n_reads
0. [11] 10c74: faddd %f4, %f2, %f8
3.693 [11] 10c78: ldd [%i5], %f12
## 7.685 [10] 10c7c: add %i5, %i3, %o3
0. [11] 10c80: faddd %f8, %f0, %f10
4.183 [11] 10c84: ldd [%o3], %f2
## 6.935 [10] 10c88: add %o3, %i3, %o1
0. [11] 10c8c: faddd %f10, %f12, %f4
0. [10] 10c90: inc 3072, %i4
4.123 [11] 10c94: ldd [%o1], %f0
## 7.065 [10] 10c98: cmp %i4, %i0
0. [10] 10c9c: ble,pt %icc,0x10c70
0. [10] 10ca0: add %o1, %i3, %i5
This can be compared with the situation where mpss.so.1 has been preloaded to enable the application to get large pages:
0. [11] 10c70: prefetch [%i5 + %i1], #n_reads
0. [11] 10c74: faddd %f4, %f2, %f8
0. [11] 10c78: ldd [%i5], %f12
## 7.445 [10] 10c7c: add %i5, %i3, %o3
0. [11] 10c80: faddd %f8, %f0, %f10
0.220 [11] 10c84: ldd [%o3], %f2
## 6.955 [10] 10c88: add %o3, %i3, %o1
0. [11] 10c8c: faddd %f10, %f12, %f4
0. [10] 10c90: inc 3072, %i4
0.340 [11] 10c94: ldd [%o1], %f0
0. [10] 10c98: cmp %i4, %i0
0. [10] 10c9c: ble,pt %icc,0x10c70
0. [10] 10ca0: add %o1, %i3, %i5
The difference between the two profiles is that, without large pages, significant user time is attributed directly to the load instructions themselves, and not only to the instruction after each load, which is where time normally appears.
To confirm that these are DTLB misses, use the Performance Analyzer's ability to profile an application with the hardware performance counters:
collect -h DTLB_miss tlb
...
Excl.
DTLB_miss
Events
...
0 [11] 10c70: prefetch [%i5 + %i1], #n_reads
0 [11] 10c74: faddd %f4, %f2, %f8
30000090 [11] 10c78: ldd [%i5], %f12
0 [10] 10c7c: add %i5, %i3, %o3
0 [11] 10c80: faddd %f8, %f0, %f10
## 42000126 [11] 10c84: ldd [%o3], %f2
0 [10] 10c88: add %o3, %i3, %o1
0 [11] 10c8c: faddd %f10, %f12, %f4
0 [10] 10c90: inc 3072, %i4
30000090 [11] 10c94: ldd [%o1], %f0
0 [10] 10c98: cmp %i4, %i0
0 [10] 10c9c: ble,pt %icc,0x10c70
0 [10] 10ca0: add %o1, %i3, %i5
The events are reported on the load instructions that are causing the DTLB misses.
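As a rough sanity check, the event counts above can be compared with the number of loads the source loop nest performs. This is a sketch under stated assumptions: the function names total_loads and reported_misses are hypothetical, the load count is derived from the C source rather than measured, and counts gathered through hardware counter profiling are estimates rather than exact totals.

```c
#include <assert.h>

/* Loads performed by the source loop nest: one per inner-loop iteration.
 * Derived from the C source, not from a measurement. */
long long total_loads(long long outer_iters, long long n_elems,
                      long long stride_elems)
{
    assert(stride_elems > 0);
    /* Inner loop runs for j = 0, stride, ... while j < n_elems. */
    long long inner_iters = (n_elems + stride_elems - 1) / stride_elems;
    return outer_iters * inner_iters;
}

/* Sum of the DTLB_miss events reported against the three ldd instructions
 * in the listing above. */
long long reported_misses(void)
{
    return 30000090LL + 42000126LL + 30000090LL;
}
```

With total_loads(10000, 10*1024*1024, 1024) giving 102400000 loads and reported_misses() giving 102000306 events, roughly 99.6% of the loads are reported as missing the DTLB, which is consistent with almost every load touching a page whose translation is no longer in the TLB.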
Posted at 09:00AM Apr 19, 2007 by Darryl Gove in Sun


