TLB entries on the chip are a precious resource, and some applications incur a lot of TLB misses (which are expensive in performance terms), either because of the way they access their dataset or because of poor coding. One way to avoid this cost is to compile the application to use large pages, but sometimes this does not seem to help and you still pay the TLB cost. You compile a program to use large pages (-xpagesize=64K or 4M) but still end up with a gazillion 8K pages. What exactly is happening behind the scenes?
The reason for this behavior is that the operating system needs an address aligned to the page size the program is requesting in order to create these large pages. For example, if you run pmap on a simple hello executable, here is what it looks like:
00010000 8K r-x-- /home/gv131871/a.out
00020000 8K rwx-- /home/gv131871/a.out
00022000 56K rwx-- [ heap ]
FF280000 832K r-x-- /lib/libc.so.1
FF350000 32K r-x-- /lib/libc.so.1
FF368000 32K rwx-- /lib/libc.so.1
FF370000 8K rwx-- /lib/libc.so.1
FF380000 8K rwx-- [ anon ]
FF390000 8K r-x-- /platform/sun4v/lib/libc_psr.so.1
FF3A0000 24K rwx-- [ anon ]
FF3B0000 128K r-x-- /lib/ld.so.1
FF3D0000 56K r-x-- /lib/ld.so.1
FF3EE000 8K rwx-- /lib/ld.so.1
FF3F0000 8K rwx-- /lib/ld.so.1
FFBF0000 64K rw--- [ stack ]
Notice that the heap starts at hex address 0x00022000 (an 8K-aligned address), right after the data section. So the O/S keeps giving 8K pages to the program until it encounters the first address that is aligned to a 4MB boundary (i.e. 0x00400000). So how do you avoid this behavior? One way is to tell the link-editor (ld) to map various sections of the program to your preferred boundary. The /usr/lib/ld directory already has some map files that can help you achieve this.
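The alignment arithmetic can be sketched in a few lines of Python. This is only an illustration of the address math, not of anything the kernel actually runs; the helper name align_up is made up for this example, and the page count assumes the whole range below the boundary ends up on 8K pages.

```python
PAGE_8K = 8 * 1024
PAGE_4M = 4 * 1024 * 1024

def align_up(addr, alignment):
    """Round addr up to the next multiple of alignment (a power of two)."""
    return (addr + alignment - 1) & ~(alignment - 1)

heap_start = 0x00022000  # heap address taken from the pmap listing above

# First address at or after the heap start that sits on a 4MB boundary.
first_4m = align_up(heap_start, PAGE_4M)

# Number of 8K pages between the heap start and that boundary,
# assuming every page in that range is an 8K page.
small_pages = (first_4m - heap_start) // PAGE_8K

print(f"first 4M-aligned address: 0x{first_4m:08X}")
print(f"8K pages below that boundary: {small_pages}")
```

For the listing above this gives 0x00400000 and 495 pages, so a heap that starts at 0x00022000 can grow by almost 4MB before a 4MB page is even possible.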
For example, let us link our simple hello with the map file map.bssalign and run pmap again:
cc -M /usr/lib/ld/map.bssalign -xpagesize=4M hello.c
00010000 8K r-x-- /home/gv131871/a.out
00020000 8K rwx-- /home/gv131871/a.out
00400000 4096K rwx-- [ heap ]
FF280000 832K r-x-- /lib/libc.so.1
FF350000 32K r-x-- /lib/libc.so.1
FF368000 32K rwx-- /lib/libc.so.1
FF370000 8K rwx-- /lib/libc.so.1
FF380000 8K rwx-- [ anon ]
FF390000 8K r-x-- /platform/sun4v/lib/libc_psr.so.1
FF3A0000 24K rwx-- [ anon ]
FF3B0000 128K r-x-- /lib/ld.so.1
FF3D0000 56K r-x-- /lib/ld.so.1
FF3EE000 8K rwx-- /lib/ld.so.1
FF3F0000 8K rwx-- /lib/ld.so.1
FF800000 4096K rw--- [ stack ]
Notice that the heap now starts on a 4MB boundary and is allocated in 4MB pages, which should reduce the pressure on the DTLB. You can also align your text and stack sections using these map files. For more information, check out the /usr/lib/ld directory.
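If you want to check a listing like this mechanically, here is a rough Python sketch that parses pmap-style lines and flags the mappings that have room for at least one aligned 4MB page. The function names are made up for this example, and it only checks address arithmetic. Which page size the kernel actually assigns is its own decision, and pmap -s will report the page size per mapping.

```python
PAGE_4M = 4 * 1024 * 1024

def parse_pmap_line(line):
    """Split 'ADDR SIZEK PERMS NAME' into (start, size_in_bytes, name)."""
    addr, size, _perms, name = line.split(None, 3)
    return int(addr, 16), int(size.rstrip("K")) * 1024, name.strip()

def can_hold_large_page(start, size, page=PAGE_4M):
    """True if the mapping contains at least one fully aligned large page."""
    first = (start + page - 1) & ~(page - 1)  # first aligned address in range
    return first + page <= start + size

# A few lines copied from the two pmap listings in this post.
listing = """\
00020000 8K rwx-- /home/gv131871/a.out
00022000 56K rwx-- [ heap ]
00400000 4096K rwx-- [ heap ]
FF800000 4096K rw--- [ stack ]
"""

for line in listing.splitlines():
    start, size, name = parse_pmap_line(line)
    if can_hold_large_page(start, size):
        verdict = "room for a 4M page"
    else:
        verdict = "too small or misaligned for 4M"
    print(f"{start:08X} {name}: {verdict}")
```

Only the heap at 00400000 and the stack at FF800000 pass the check, which matches what the map file was supposed to achieve.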
Here is my own simple mapfile. Seriously, this is all you need to have inside the mapfile:
more ./map.align
bss = A0x400000;
text = A0x20000;
If you link this in, the text section will start at virtual address 0x20000 instead of the standard 0x10000. Not very useful, but it demonstrates that it works. You get my point.
Posted by Geetha Vallabhaneni @ 11:27 AM PST