Musings on realtime The jel's weblog

Thursday Jul 26, 2007

As previously noted, an important aspect of real-time performance is avoidance of page faults. In a previous blog, I talked about various memory locking facilities in Solaris and how they worked. The primary focus at that point was on program data. Now let's look at the program stack and considerations in a multithreaded program.

Some Solaris Mechanics
The first stack in a program is built by the kernel when the program is started. Every program gets the first stack at the same location - 0x8048000 on 32-bit Intel and 32-bit AMD, 0xFFC00000 on 32-bit SPARC and 0xFFFFFFFF80000000 on 64-bit SPARC.

There are some cases where the highest usable position in the stack may vary. The sun4v, and currently only that platform, subtracts a pseudo-random skew (additive) from the top of the stack so as to minimize cache line conflicts.

This location is the maximum address of the stack. The stack actually grows downward from this point. The original stack is first populated with several familiar items: the value of argc, the argv array, the environment variable array and one that may not be so familiar, the aux vector array. These typically consume less than one page of memory (even on Intel).

The layout (on 32-bit Intel) when the program starts running is:

 Address     Contents             Notes
 0x8048000   top of stack         segv if this address is dereferenced
 0x8047FFC   NULL
             aux vector strings   each null terminated
             env vector strings   each null terminated
             argv vector strings  each null terminated
             NULL                 4 bytes
             env vector array     each entry points at a string in the env vector strings; this is what envp points at if used in main()
             NULL                 4 bytes
             arg vector array     each entry points at a string in the argv vector strings; this is what argv points at if used in main()
             argc                 4 bytes for integer
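Both pointer arrays in this layout end with a NULL entry, which is how code can find their length without a separate count. A minimal sketch of walking such an array follows; vector_length is a hypothetical helper name used only for illustration.

```c
#include <stddef.h>

/* Count the entries in a NULL-terminated pointer vector such as argv or envp. */
size_t vector_length(char *const *vec)
{
    size_t n = 0;

    while (vec[n] != NULL)
        n++;
    return n;
}
```

Called as vector_length(argv) inside main(), this should return the same value as argc.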

 

When the first of these values (argc) was copied to the new program's stack, a page fault occurred and a page of memory was attached to the program's stack. All of the other memory between 0x8045FFF and the very beginning of the program's address space was left unmapped.

The system does enforce a limit on the size of the initial stack segment, however. The space available for the initial program stack on 32-bit Intel is what lies between 0x8047FFF and 0x7800000, or 8683520 bytes. The page containing 0x77FFFFF is unmapped and will cause a segmentation fault if dereferenced.
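The 8683520 figure can be checked by hand. The sketch below does the subtraction using the two addresses given in the text; the 4 KB page size is an assumption (it is the usual base page size on 32-bit Intel), and the function names are hypothetical.

```c
#include <stdint.h>

#define STACK_TOP          UINT32_C(0x8048000) /* top of stack on 32-bit Intel (from the text) */
#define STACK_LIMIT        UINT32_C(0x7800000) /* lowest usable stack address (from the text) */
#define ASSUMED_PAGE_SIZE  UINT32_C(4096)      /* assumption: 4 KB pages */

/* Bytes available to the initial stack segment. */
uint32_t initial_stack_bytes(void)
{
    return STACK_TOP - STACK_LIMIT;
}

/* The same space expressed in pages. */
uint32_t initial_stack_pages(void)
{
    return initial_stack_bytes() / ASSUMED_PAGE_SIZE;
}
```

Under these assumptions the space works out to 0x848000 bytes, or 2120 pages.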

The next page that is attached comes from the actions of the runtime loader when the program starts to run. It uses the stack for temporary variable storage, and if it pushes something onto the stack which causes the stack to move off the current page, a page fault will occur and another page will be mapped into the address space. The Solaris kernel handles primary stack pages specially when it sees a page fault. It maps the pages and forces them to be physically attached to the process' address space. All pages between the current end of the stack space and the new end are brought in and attached, whether it is one page or 2000 pages. None of the pages are locked down unless mlock() has been called on the address range for the pages (or mlockall() has been called).

truss(1) note: this behavior has impacts if one uses truss(1) with the -mfltpage option. Only the page in which the fault address is contained will be reported by truss(1). Any other pages brought in will not be reported as faulted pages.

As a result, when program code starts to run (i.e., when main() is invoked), there are typically two to four pages of memory attached (faulted in) to the stack but not locked down. How much stack space the program has attached at some arbitrary time after that depends strictly on the path the code has followed through the program and on how data accesses have occurred. If, for example, a function is called which has a 200KB array allocated on the stack and the first byte of the array is touched, all 200KB will be faulted in and attached. When the function returns, the array is no longer valid but the pages attached to the stack for that array still remain.

All of this has described how the primary stack (the one originally created by the kernel) works. Stacks created for other threads within a process are not as neatly handled. Default thread stacks are created by mmapping anonymous memory. Faults within the stack result in one page being mapped into the process' address space. If there are pages which are not mapped between the previous stack boundary and the just faulted page, they are not mapped in and faulted as they would be for the primary stack.

Each default stack has a size of 1 MB.  The threads library understands that mmap(2) will ensure that a page of memory at the beginning and end of each mapped area is left unmapped so as to cause a segmentation violation if the stack is overflowed or underflowed. It subtracts two pages from the 1 MB size and maps the remaining amount.
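As a quick check of that arithmetic, here is a small sketch. The helper name is hypothetical, and the 4 KB page size used in the note below is an assumption for x86.

```c
#include <stddef.h>

/* Usable size of a default thread stack once one page at each end is
 * left to the unmapped red zones. Returns 0 if the request is too small. */
size_t usable_thread_stack(size_t requested, size_t page_size)
{
    if (requested < 2 * page_size)
        return 0;
    return requested - 2 * page_size;
}
```

With a 1 MB request and 4 KB pages this gives 1040384 bytes (0xfe000), or 254 pages.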

For administrative reasons, thread stacks are *not* backed by swap when they are created.  This is one of the very few exceptions made to the requirement that every object using anonymous memory must be backed by swap space when created. The thread stacks are allocated using the MAP_NORESERVE flag. As pages within the stack are touched, an attempt is made to allocate swap space to support the page. If the attempt fails, the program will be terminated.
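For illustration, a mapping of this kind can be requested directly with mmap(2). This is a sketch only: the text confirms MAP_NORESERVE, but the other flags and the protection bits shown here are assumptions about a typical anonymous mapping, not a copy of what the threads library does. The name map_noreserve is hypothetical.

```c
#define _DEFAULT_SOURCE   /* exposes MAP_ANON and MAP_NORESERVE on glibc */
#include <stddef.h>
#include <sys/mman.h>

/* Map `size` bytes of private anonymous memory without reserving swap
 * up front. Returns NULL on failure. */
void *map_noreserve(size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANON | MAP_NORESERVE, -1, 0);

    return p == MAP_FAILED ? NULL : p;
}
```

Swap for a page in such a mapping is sought only when that page is first touched, which is why a failure shows up at the time of the touch rather than at the time of the mmap call.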

Given the behavior of mmap(2) in allocating the red zones before and after the mapped area, explicitly setting a guard size of more than 1 page in the thread creation attributes only wastes space. A guard size of 1 page or less is effectively ignored by the threads library since it takes advantage of mmap's red zone behavior. Unless there is a compelling reason to set a larger guard size, the default should be left as it is.

Thread stacks allocated within the library will be reused once a thread has exited. The library keeps a maximum of 10 in reserve. If more than this number exist, the space consumed by excess stacks is given back to the process by unmapping the space. The number can be overridden with an environment variable. It is possible to receive a stack with all pages faulted in or a stack with very few pages faulted in.

Both the size of a thread stack and the memory the system should use for it may be specified. The default size of 1 MB is often too big, and side effects of how mmap allocates space within an address space make it more efficient to allocate thread stacks in the program and assign them directly. This is especially true if one expects to have a large number of threads.

Specifying an area of storage for the thread stack allows the programmer to control the behavior of the stack. The address range can be locked down and the pages prefaulted so as to avoid taking a page fault during time-critical processing or, in a low memory situation, having the process terminated. See pthread_attr_setstackaddr(3THR) for details on how to set a stack address and see pthread_attr_setstacksize(3THR) for details on setting the stack's size. The threads library uses the given size. It does not know how the specified memory was created and so does not reduce its size as it does for internally created stacks.

Making it Easier
Prior to Solaris 10, figuring out stack sizes and checking to see whether one would overflow a stack depended on knowledge of Solaris internals combined with proc(4) hackery as well as knowing how the compiler allocated variables on the stack.

New interfaces available in Solaris 10 greatly simplify this.

stack_inbounds(3C) will tell you whether an address is within the current stack boundary
stack_getbounds(3C) will tell you the current stack boundaries for the current thread
stack_setbounds(3C) allows one to set the stack boundaries for the current thread.

stack_setbounds(3C) is used by the threads library to provide stack information for each lwp/thread to the kernel and is best not used by programmers.
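To make the idea of a bounds check concrete, here is a sketch of the range test such an interface implies. It is not the implementation of stack_inbounds(3C). It assumes the bounds are held in a stack_t with ss_sp as the lowest address and ss_size as the length, which matches the addr, ss_size and end values in the program output shown in this post. The name addr_in_stack is hypothetical.

```c
#define _XOPEN_SOURCE 700   /* exposes stack_t in <signal.h> on glibc */
#include <signal.h>
#include <stdint.h>

/* Return 1 if addr lies in [ss_sp, ss_sp + ss_size), otherwise 0.
 * Assumption: ss_sp is the lowest address of the stack. */
int addr_in_stack(const stack_t *s, uintptr_t addr)
{
    uintptr_t lo = (uintptr_t)s->ss_sp;

    return addr >= lo && addr - lo < s->ss_size;
}
```

With the main stack bounds from the output (base 0x7800000, size 0x848000), 0x8047FFC is inside the stack, while 0x8048000 and 0x77FFFFF are not.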

Let's put what we've talked about together and see what happens. Listing 1 contains a program that gathers information about the main program stack, followed by a default thread stack, followed by a smaller stack allocated within the program and faulted in. The check_stack function uses mincore(2) to determine how many pages of a segment are in memory. As can be seen, when we start there are very few pages faulted into memory for the main program stack. We then call into the getsinfo() function in the program and use the stack_getbounds(3C) library function to get the boundary information for the current stack. That information is displayed and then we explicitly touch the very end of the stack. Once we've done that, we examine the residency of the pages of the stack. As can be seen, they're now all present.

main stack
check_stack:
        npages: 4 res: 4 nres: 0
        addr: 0x7800000 end: 0x8048000
        ss_size: 0x848000 8683520
        flags: 0x0
check_stack:
        npages: 2120 res: 2120 nres: 0

default thread stack
check_stack:
        npages: 254 res: 1 nres: 253
        addr: 0xcfd23000 end: 0xcfe21000
        ss_size: 0xfe000 1040384
        flags: 0x0
check_stack:
        npages: 254 res: 2 nres: 252
check_stack:
        npages: 254 res: 254 nres: 0

allocated thread stack
check_stack:
        npages: 17 res: 17 nres: 0
        addr: 0xcfcfd000 end: 0xcfd0e000
        ss_size: 0x11000 69632
        flags: 0x0
check_stack:
        npages: 17 res: 17 nres: 0
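
A helper along the lines of check_stack might look like the following. This is a sketch, not the code from Listing 1: count_resident is a hypothetical name, and it only counts pages rather than printing them. It relies on mincore(2) setting the low bit of each vector entry for a resident page.

```c
#define _DEFAULT_SOURCE   /* exposes mincore() on glibc */
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Count resident pages in [addr, addr + len). addr must be page aligned.
 * Stores the total page count in *npages and returns the number of
 * resident pages, or -1 on error. */
long count_resident(void *addr, size_t len, size_t *npages)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t n = (len + page - 1) / page;
    unsigned char *vec = malloc(n);
    long res = 0;

    if (vec == NULL)
        return -1;
    /* The vector argument is char * on Solaris and unsigned char * on
     * Linux; a void * converts implicitly to either in C. */
    if (mincore(addr, len, (void *)vec) != 0) {
        free(vec);
        return -1;
    }
    for (size_t i = 0; i < n; i++)
        if (vec[i] & 1)
            res++;
    free(vec);
    *npages = n;
    return res;
}
```

Printing npages, the return value, and their difference would give the three numbers (npages, res, nres) seen in the output above.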

Recommendations
From this discussion, several recommendations can be made for avoiding page faults on program or thread stacks during time-critical processing.

For the primary stack:

  1. Use stack_getbounds() to obtain boundary information
  2. Use mlockall(MCL_CURRENT|MCL_FUTURE) or mlock(the stack address range)
  3. Touch the lowest address in the stack. All of the stack will now be in memory and locked down.

For default thread stacks:

  1. The first thing that must be done in the newly created thread is to call stack_getbounds() to determine the stack information
  2. Next, use mlockall(MCL_CURRENT|MCL_FUTURE) or mlock(the stack address range). If mlockall() has been previously called within the process, this is not needed.
  3. After that, dereference a byte within each page in the stack. All of the stack will now be in memory and locked down.

It is not possible to determine the address of the stack which the thread will use before it starts running. LD_PRELOAD will not work since the stack is allocated using a private mmap() function. All of this work must be done the first time the thread runs.

For thread stacks allocated beforehand (recommended):

  1. Allocate memory for the stack
  2. Use mlockall(MCL_CURRENT|MCL_FUTURE) or mlock(the stack address range)
  3. Touch all pages in the stack. All of the stack will now be in memory and locked down.
  4. Set the fields in the thread creation attributes structure to point at the newly allocated stack (pthread_attr_setstackaddr(3THR)) and the stack size (pthread_attr_setstacksize(3THR)).
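
The four steps for a preallocated stack can be sketched as follows. Several things here are assumptions made for illustration: the stack memory comes from an anonymous mmap, the helper names (prepare_stack, start_on_stack, mark_done) are hypothetical, and the attributes are set with pthread_attr_setstack(), the POSIX call that sets the stack address and size together, rather than with the two separate calls named above. Locking is reported through a flag instead of being treated as fatal, since mlock() can fail without sufficient privilege.

```c
#define _DEFAULT_SOURCE   /* exposes MAP_ANON on glibc */
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Steps 1-3: allocate a stack, try to lock it, and touch every page.
 * *locked is set to 1 if mlock() succeeded, 0 if it did not. */
void *prepare_stack(size_t size, int *locked)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    volatile char *p;
    void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANON, -1, 0);

    if (mem == MAP_FAILED)
        return NULL;
    *locked = (mlock(mem, size) == 0);
    p = mem;
    for (size_t off = 0; off < size; off += page)
        p[off] = 0;               /* fault each page in */
    return mem;
}

/* Step 4: point the thread attributes at the prepared stack and start
 * the thread. Returns 0 on success or an error number. */
int start_on_stack(pthread_t *tid, void *stack, size_t size,
                   void *(*fn)(void *), void *arg)
{
    pthread_attr_t attr;
    int rc = pthread_attr_init(&attr);

    if (rc != 0)
        return rc;
    rc = pthread_attr_setstack(&attr, stack, size);
    if (rc == 0)
        rc = pthread_create(tid, &attr, fn, arg);
    pthread_attr_destroy(&attr);
    return rc;
}

/* Example thread body: record that the thread ran. */
void *mark_done(void *arg)
{
    *(int *)arg = 1;
    return NULL;
}
```

A caller would check the locked flag after prepare_stack() and decide whether an unlocked stack is acceptable before starting time-critical work on it.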