Musings on realtime The jel's weblog

Wednesday Aug 15, 2007

I've been producing a fair amount of source code recently and have become convinced that using two compilers during the process is well worth it and that the normal way that I set up conditionals is flawed.

Two compilers
I've been a Studio fan for all of my time at Sun (through all of the name changes), especially since that was all that the Solaris group used until OpenSolaris was released. Two recent events have made me decide that compiling and testing with gcc in addition to Studio is worth the effort.

While I was working on the code for the stack blog, I decided to try gcc and it flagged something that Studio did not. One of the interfaces I used was umem_cache_create whose first argument is a char *. I was building it from a structure which contained something such as this:

        struct foo {
                ...
                char *name;
                ...
        };


initialized this way:

        struct foo names[] = {
                { ..., "32 KB stack cache",... },
        };

gcc caught the possibility that a compiler might choose to put "32 KB stack cache" into a read-only section of the program and that a potentially read-only datum was being passed to a routine that did not guarantee to preserve the data. As I understand the standards, umem_cache_create() should use "const char *" to indicate that the data passed in won't be changed.

The previous reason is a bit esoteric but the example that really turned my head was when I was overflowing a buffer. How I did it doesn't really matter but a buffer on the stack was overflowed. With Studio's data allocations, I never saw any side effects from doing so. Everything seemed to work just fine with my test cases. When I compiled with gcc, it laid out the data differently and I was able to see that something was going wrong, find the problem and fix it. gcc never told me directly that I was overflowing a buffer but the different data layouts did allow me to see the problem.

Whether you prefer gcc or the Studio compilers, it's worth your time to compile and do testing with the other compiler. You may find things that you never suspected in that code you thought was working perfectly.

In gmake, it's quite trivial to set up your make file to use a different compiler. I use something such as this:

CC = cc
OBJECTS = stacks.o
CFLAGS = -g -xO3 -K pic -I. $(XCFLAGS)
LDFLAGS = -znodefs -G -o $(REALLIB) -h $(LNTARGET) -lumem
ifeq ($(CC),gcc)
        CFLAGS=-g -O2 -fpic -std=c99 -pedantic -Wall
        LDFLAGS=-shared -fpic -o $(REALLIB) -lumem -lc
endif


This uses Studio by default but I can say gmake CC=gcc and build with gcc.

I haven't figured out how to do this with make yet.

Conditionals

I've always structured my conditional tests of the form:

        if (rv == 0) ...

I've now been bitten a few too many times by

        if (rv = 0) ...

which compiles just fine but has interesting results.

I've now adopted a style I first saw in the Windows world:

        if (0x0 == rv) ...

If you mess this up by forgetting one of the '=' signs, the compiler now flags the error for you. gcc's version:

        error: invalid lvalue in assignment

It looks a bit strange if you've had the classic K&R style wired into your fingertips but it does make life much, much easier.

I do see that gcc with a -Wall will delicately tell you

        suggest parentheses around assignment used as truth value

which may help to avoid the issue, although I much prefer lint's message:

        warning: assignment operator "=" found where "==" was expected

Studio just lets you do it and gcc without the magic flag to catch it will let you do it as well. lint flags the error but it seems not to be used as much during development and test as it should. I certainly don't run it every time I compile.


Monday Aug 06, 2007

My most recent encomium discussed the subject of stacks and how they were used for programs and threads.

As a follow on to that, I'd like to present a library to allocate thread stacks that avails itself of the facilities of libumem(3LIB).

There are several nice features of libumem discussed in the man page. The one we are directly using is umem_caches. The advantage of using these routines is when the data allocated requires various initialization steps before it can be used. Examples would be creation of a mutex or other synchronization primitive, initialization of fields within a structure or, in our case, setting permissions on pages within the allocated buffer.  If these initialization efforts recur frequently, it can be more effective to use a umem_cache to initialize a data buffer once, use the buffer, return it to its original state (i.e., as initialized) and then to free it back to a cache for that structure.  The next element of code that needs to use such a buffer can request an allocation and if a previously initialized buffer is available, can use that buffer without having to bear the burden of redoing the initialization. Our sample library uses umem_cache_create(3MALLOC), umem_cache_alloc(3MALLOC) and umem_cache_free(3MALLOC).

The library allocates page-aligned buffers in sizes that are multiples of the system page size. For our purposes, the sizes start at 32 KB and move in a power-of-two progression to 2 MB (32 KB -> 64 KB -> 128 KB -> 256 KB -> 512 KB -> 1 MB -> 2 MB). Only buffers of one of those sizes can be allocated. Each buffer is bracketed by a page of memory with PROT_NONE access permissions so that any reference to an address in those pages (stack underflow or overflow) will cause a segmentation violation. Each buffer is fully faulted when first allocated so that if an mlockall(3C) including the MCL_FUTURE flag was previously used, the buffer will be locked into memory before first use by the thread.

libumem is useful in that it does the caching for us instead of requiring that we write code to manage the caches.

Three interfaces are offered:

  • void *stack_alloc(size_t size)
  • int stack_free(void *address, size)
  • uint_t *fetch_stack_sizes(void) - returns an array of integers with supported page sizes. The end of the array is a zero valued element. The array may be freed when use of it is ended

The complete source for the library can be found in this (modified) tar file.

umem debugging
I run every program with these environment variables set (on Intel/AMD):

UMEM_DEBUG=audit=64,guard
LD_PRELOAD_32=/lib/libumem.so.1
LD_PRELOAD_64=/lib/amd64/libumem.so.1

I even add these at the end of /etc/profile on the system.

This uses the facilities of libumem to find usage errors with memory allocations in programs. One can also use gcore to grab a core file of a running program and then use mdb's ::findleaks options to find memory leaks within programs. The '64' in the audit entry of UMEM_DEBUG indicates how many stack frames libumem should be prepared to save when tracking allocations and where they are made. It's a depressing exercise to set up firefox or thunderbird to run under the control of these variables and then to examine the core files with mdb.

In addition to finding memory leaks, libumem is good for finding coding errors. Consider this code:

        alloc_t *na;
        na = (alloc_t *)malloc(sizeof(na));

Note: an earlier version of this posting used "*na" in the malloc call - hence the comments.

alloc_t happens to be a struct with a size of 8 bytes but I'm allocating space for a pointer to it, whose size is only 4 bytes. The code, however, treats the allocated space as a struct and sets 8 bytes worth of data, writing beyond the end of what I allocated. When I free this allocated address, libumem forcibly tells me (via a core dump) that I've gone beyond the end of the data that I allocated. This helped point me to an error that might otherwise have taken a lot of time to find (if it ever did show up).

The corrected code is:

        alloc_t *na;
        na = (alloc_t *)malloc(sizeof(alloc_t));

Seriously consider running all of your programs with these flags while developing and testing, and even in deployment.



Thursday Jul 26, 2007

As previously noted, an important aspect of real-time performance is avoidance of page faults. In a previous blog, I talked about various memory locking facilities in Solaris and how they worked. The primary focus at that point was on program data. Now let's look at the program stack and considerations in a multithreaded program.

Some Solaris Mechanics
The first stack in a program is built by the kernel when the program is started. Every program gets the first stack at the same location - 0x8048000 on 32-bit Intel and 32-bit AMD, 0xFFC00000 on 32-bit SPARC and 0xFFFFFFFF80000000 on 64-bit SPARC.

There are some cases where the highest usable position in the stack may vary. The sun4v, and currently only that platform, subtracts a pseudo-random skew (additive) from the top of the stack so as to minimize cache line conflicts.

This location is the maximum address of the stack. The stack actually grows downward from this point. The original stack is first populated with several familiar items: the value of argc, the argv array, the environment variable array and one that may not be so familiar, the aux vector array. These typically consume less than one page of memory (even on Intel).

The layout (on 32-bit Intel) when the program starts running is:

 0x8048000            top of stack - segv if this address dereferenced
 0x8047FFC            NULL
 aux vector strings   each null terminated
 env vector strings   each null terminated
 argv vector strings  each null terminated
 NULL                 4 bytes
 env vector array     each entry pointing at a string in the env vector list following it. This is what envp points at if used in main().
 NULL                 4 bytes
 arg vector array     each entry pointing at a string in the argv vector list following it. This is what argv points at if used in main().
 argc                 4 bytes for integer

 

When the first of these values (argc) was copied to the new program's stack, a page fault occurred and a page of memory was attached to the program's stack. All of the other memory between 0x8045FFF and the very beginning of the program's address space was left unmapped.

The system does enforce a limit on the size of the initial stack segment, however. The space available for the initial program stack on 32-bit Intel lies between 0x8047FFF and 0x7800000, or 8683520 bytes. The page containing 0x77FFFFF is unmapped and will cause a segmentation fault if dereferenced.

The next page that is attached comes from the actions of the run time loader when the program starts to run. It uses the stack for temporary variable storage and if it pushes something onto the stack which causes the stack to move off the current page, a page fault will occur and another page will be mapped into the address space. The Solaris kernel handles primary stack pages specially when it sees a page fault. It maps the pages and forces them to be physically attached to the process' address space. All pages between the current end of the stack space and the new end are brought in and attached - whether it is one page or 2000 pages. None of the pages are locked down unless mlock() has been called on the address range for the pages (or mlockall()).

truss(1) note: this behavior has impacts if one uses truss(1) with the -mfltpage option. Only the page in which the fault address is contained will be reported by truss(1). Any other pages brought in will not be reported as faulted pages.

As a result, when program code starts to run (i.e., when main() is invoked), there are typically two to four pages of memory attached (faulted in) to the stack but not locked down. How much stack space the program has attached at some arbitrary time after that is strictly related to the path followed through the program by the code and how data accesses have occurred. If for example, a function is called which has a 200KB array allocated on the stack and the first byte of the array is touched, all 200KB will be faulted and attached. When the function returns, the array is now invalid but the pages attached to the stack for that array still remain.

All of this has described how the primary stack (the one originally created by the kernel) works. Stacks created for other threads within a process are not as neatly handled. Default thread stacks are created by mmapping anonymous memory. Faults within the stack result in one page being mapped into the process' address space. If there are pages which are not mapped between the previous stack boundary and the just-faulted page, they are not mapped in and faulted as they would be for the primary stack.

Each default stack has a size of 1 MB.  The threads library understands that mmap(2) will ensure that a page of memory at the beginning and end of each mapped area is left unmapped so as to cause a segmentation violation if the stack is overflowed or underflowed. It subtracts two pages from the 1 MB size and maps the remaining amount.

For administrative reasons, thread stacks are *not* backed by swap when they are created.  This is one of the very few exceptions made to the requirement that every object using anonymous memory must be backed by swap space when created. The thread stacks are allocated using the MAP_NORESERVE flag. As pages within the stack are touched, an attempt is made to allocate swap space to support the page. If the attempt fails, the program will be terminated.

Given the behavior of mmap(2) in allocating the red zones before and after the mapped area, explicitly setting a guard size of more than 1 page in the thread creation attributes only wastes space. A guard size of 1 page or less is effectively ignored by the thread library since it takes advantage of mmap's redzone behavior. Unless there is a compelling reason to set a larger guardsize, the default should be left as it is.

Thread stacks allocated within the library will be reused once a thread has exited. The library keeps a maximum of 10 in reserve. If more than this number exist, the space consumed by excess stacks is given back to the process by unmapping the space. The number can be overridden with an environment variable. It is possible to receive a stack with all pages faulted in or a stack with very few pages faulted in.

Both a size for the stack as well as a chunk of space to be used by the system for the stack may be specified. The default size of 1 MB is often too big and there are side effects to how mmap allocates space within an address space that make allocating thread stacks and assigning them directly more efficient. This is especially true if one expects to have a large number of threads.

Specifying an area of storage for the thread stack allows the programmer to control the behavior of the stack. The address range can be locked down and the pages prefaulted so as to avoid taking a page fault during time-critical processing or, in a low-memory situation, having the process terminated. See pthread_attr_setstackaddr(3THR) for details on how to set a stack address and see pthread_attr_setstacksize(3THR) for details on setting the stack's size. The threads library uses the given size. It does not know how the specified memory was created and so does not reduce its size as it does for internally created stacks.

Making it Easier
Prior to Solaris 10, figuring out stack sizes and checking to see whether one would overflow a stack depended on knowledge of Solaris internals combined with proc(4) hackery as well as knowing how the compiler allocated variables on the stack.

New interfaces available in Solaris 10 greatly simplify this.

stack_inbounds(3C) will tell you whether an address is within the current stack boundary
stack_getbounds(3C) will tell you the current stack boundaries for the current thread
stack_setbounds(3C) allows one to set the stack boundaries for the current thread.

stack_setbounds(3C) is used by the threads library to provide stack information for each lwp/thread to the kernel and is best not used by programmers.

Let's put what we've talked about together and see what happens. Listing 1 contains a program that gathers information about the main program stack, followed by a default thread stack followed by a smaller stack allocated within the program and faulted in. The check_stack function uses mincore(2) to determine how many pages of a segment are in memory. As can be seen, when we start there are very few pages faulted into memory for the main program stack. We then call into the getsinfo() function in the program and use the stack_getbounds(3C) system call to get the boundary information for the current stack. That information is displayed and then we explicitly touch the very end of the stack. Once we've done that, we examine the residency of the pages of the stack. As can be seen, they're now all present.

main stack
check_stack:
        npages: 4 res: 4 nres: 0
        addr: 0x7800000 end: 0x8048000
        ss_size: 0x848000 8683520
        flags: 0x0
check_stack:
        npages: 2120 res: 2120 nres: 0

default thread stack
check_stack:
        npages: 254 res: 1 nres: 253
        addr: 0xcfd23000 end: 0xcfe21000
        ss_size: 0xfe000 1040384
        flags: 0x0
check_stack:
        npages: 254 res: 2 nres: 252
check_stack:
        npages: 254 res: 254 nres: 0

allocated thread stack
check_stack:
        npages: 17 res: 17 nres: 0
        addr: 0xcfcfd000 end: 0xcfd0e000
        ss_size: 0x11000 69632
        flags: 0x0
check_stack:
        npages: 17 res: 17 nres: 0

Recommendations
From this discussion, several recommendations can be made for eliminating taking page faults on program or thread stacks during time-critical processing.

For the primary stack:

  1. Use stack_getbounds() to obtain boundary information
  2. Use mlockall(MCL_CURRENT|MCL_FUTURE) or mlock(the stack address range)
  3. Touch the lowest address in the stack. All of the stack will now be in memory and locked down.

For default thread stacks:

  1. The first thing that must be done in the newly created thread is to call stack_getbounds() to determine the stack information
  2. Next, use mlockall(MCL_CURRENT|MCL_FUTURE) or mlock(the stack address range). If mlockall() has been previously called within the process, this is not needed.
  3. After that, dereference a byte within each page in the stack. All of the stack will now be in memory and locked down.

It is not possible to determine the address of the stack which the thread will use before it starts running. LD_PRELOAD will not work since the stack is allocated using a private mmap() function. All of this work must be done the first time the thread runs.

For thread stacks allocated beforehand (recommended):

  1. Allocate memory for the stack
  2. Use mlockall(MCL_CURRENT|MCL_FUTURE) or mlock(the stack address range)
  3. Touch all pages in the stack. All of the stack will now be in memory and locked down.
  4. Set the fields in the thread creation attributes structure to point at the newly allocated stack (pthread_attr_setstackaddr(3THR)) and the stack size (pthread_attr_setstacksize(3THR)).

Friday Jun 01, 2007

Completed 517+ million iterations. There are about 520 points between 13 microseconds and 32 microseconds. There are roughly 180K points between 7 microseconds and 32 microseconds.

Latency results
 

Thursday May 31, 2007




Wednesday May 30, 2007

One of the best kept secrets about Solaris is its real-time capabilities. The foundation for the real-time capabilities is the initial design of Solaris. The designers “wanted the kernel to be capable of bounded dispatch latency for real-time threads which requires absolute control over scheduling, requiring preemption at almost any point in the kernel, and elimination of unbounded priority inversions wherever possible.” [Eykholt 92]

Several key features in the original Solaris design have been enhanced over the years and additional functionality has been added to Solaris to make its time-critical processing capabilities more compelling.

Original Design Features

Interrupts as threads

Most UNIX-like operating systems have for many years dealt with the issue of synchronizing data between interrupts and regular kernel processing by ensuring that interrupts simply did not happen while the shared data was manipulated. This was accomplished either by totally disabling interrupts or by ensuring that interrupts could only occur for devices whose interrupt priority is above that of the device whose data is being changed. The side effect of this is either to inhibit interrupts for all devices or to inhibit them for all devices below a certain priority level. The act of enabling/disabling interrupts is also an expensive operation.

Every entity that executes code in Solaris, whether as a part of a user process or running solely in the kernel, is represented by a thread. In prior versions of SunOS, interrupts captured the currently running process on a CPU and executed using that process' data structures. In Solaris 2, interrupts are converted into threads with their own data structures. Interrupts can block on kernel synchronization primitives. This makes it possible to protect data structures in the Solaris kernel with synchronization primitives rather than by raising and lowering interrupt priority levels.

Fully preemptible kernel

Many UNIX and UNIX-like operating systems use the idea of preemption points. When a kernel entity executing code (whether for the kernel or on behalf of a user process) encounters a preemption point, it looks around to see if there is another entity waiting to run and if so, causes preemption to occur. The main issue with this approach is what happens between the preemption points? A kernel entity may still block for an indefinite time waiting for a resource. The approach often taken is to add even more preemption points but this has diminishing returns as more and more time is spent passing through the points. There are also limits as to where preemption points may be placed. 

Solaris is fully preemptible for processes in the real-time class. If an RT process becomes runnable, it will immediately be placed on a CPU if its priority is higher than that of the thread running on that CPU.

Real-Time scheduling

A key design feature of Solaris 2 was that “the scheduling of tasks in the kernel should be deterministic. ... the scheduler should provide priority-based scheduling for user tasks, so that the time-critical application developer has control of the scheduling behavior of the system; the kernel should provide bounded dispatch latency, so that time-critical user tasks are not subjected to unexpected and undesirable delays; the kernel should be free from unbounded priority inversions.” [Khanna 92]

Solaris offers a variety of schedulers using the System V scheduling class interfaces. These interfaces allow each schedulable entity within the kernel to have a scheduler appropriate to its needs. Entities needing real-time response latencies can use the real-time scheduling class which offers two options: round-robin or FIFO scheduling.

Round-robin scheduling assigns each priority level a time quantum. Entities in that priority level run until that time quantum expires. If another entity of the same priority is ready to run, it will run until its quantum expires. If no other entity is ready to run at that priority, the current entity will continue to run.

FIFO scheduling means that there is no time quantum assigned and that the entity will run until it voluntarily gives up the CPU or an entity with a higher priority becomes ready to run. Two entities with the same priority will run one after the other, each getting the CPU only when the other voluntarily gives it up.

Entities in the real-time scheduling class are immediately scheduled to run if they become runnable and are of higher priority than what is currently running. Entities not in the real-time class will suffer delays when they become runnable even if they are of higher priority than what is running.

Priority Inheriting Synchronization primitives

The primary synchronization primitive in the Solaris kernel is the mutex. Mutexes are a lightweight mechanism favored over spin locks. They are adaptive in that if the entity owning the mutex is not currently running, the entity attempting to gain the mutex is put to sleep, as there is no chance of gaining the mutex while the current owner is not running. Mutexes (and other primitives) offer the possibility of “priority inversion.” A typical priority inversion occurs when a high-priority entity desires a mutex owned by a low-priority entity. The low-priority entity becomes runnable, and by inference would then release the mutex, but is blocked from running by another entity whose priority is less than that of the high-priority entity but greater than that of the low-priority entity. A recent, well-known instance occurred in the first Mars Rover, where a housekeeping task on the Rover caused an inversion that inhibited functioning of the Rover's guidance software.

So as to avoid this problem, Solaris mutexes implement the basic priority inheritance protocol as described in [Sha 90]. When the high level entity blocks, all of the entities blocking it are given the high level entity's priority. When they cease to block the thread, their priorities revert to their previous level.

Key Features Added

Processor Sets

Processor sets provide the ability to select one or more CPUs and isolate them from the rest of the system in terms of what entities are scheduled to run on them. Once a set is created, all processes currently running on the CPUs are moved to other CPUs. Processes with the appropriate privilege may add themselves to the set and be scheduled to run on CPUs within the set. It is also possible to force any interrupts currently being delivered to CPUs within the set to move to other CPUs, thus isolating processes running on CPUs in the set from the effect of interrupt handlers disrupting processing. Processes and/or threads within the process may be bound to individual CPUs within the set.

POSIX Compliance

Solaris has almost full support for what was once known as POSIX 1003.1b (“realtime”) and full support for POSIX 1003.1c (“threads”). These are now part of the IEEE 1003.1, 2004 Edition standard. Support of requisite real-time features is indicated by support of various option groups (e.g., Thread Priority Inheritance). These options are also grouped into larger bundles, of which four exist relating to real-time and threads. These are “Real-time”, “Advanced Real-time”, “Real-time Threads” and “Advanced Real-time Threads”. Solaris satisfies all but one option group within the “Real-time” group. That group is prioritized I/O and it is not clear that any operating system that does I/O request queuing would satisfy it. Solaris fully satisfies the “Real-time Threads” option. Support for “Advanced Real-time Threads” is missing one option group while “Advanced Real-time” support is missing a number of option groups.

Note that POSIX compliance only indicates that a set of interfaces and behaviors exists. It says nothing about how capable an OS is of meeting response time requirements.

Fixed Priority Scheduler

While not truly a real-time feature, Solaris does offer a fixed priority scheduler to aid in processing. Unlike the normal time sharing scheduling class, the fixed priority class does not modify a process' priority as it runs. The priority changes only if the process asks for the change and has the appropriate privilege for the change. This provides an option for those processes that were placed in the real-time class only because of the need for fixed priorities.

REFERENCES

[Eykholt 92]

J.R. Eykholt, S.R. Kleiman, S. Barton, R. Faulkner, A. Shivalingiah, M. Smith, D. Stein, J. Voll, M. Weeks, D. Williams, Beyond Multiprocessing ... Multithreading the SunOS Kernel, Proceedings USENIX Summer 1992.

[Khanna 92]

S. Khanna, M. Sebree, J. Zolnowsky, Realtime Scheduling in SunOS 5.0, USENIX, Winter 1992.

[Kleiman 95]

S. Kleiman, J. Eykholt, Interrupts As Threads, ACM SIGOPS Operating Systems Review, Vol. 21, Issue 2, April, 1995, pp 21-26.

[Sha 90]

L. Sha, R. Rajkumar, J.P. Lehoczky, Priority Inheritance Protocols: An Approach to Real-Time Synchronization, IEEE Transactions on Computers, Vol. 39, No. 9, September, 1990.

Tuesday May 29, 2007

The benchmark numbers cited in my earlier entry were generated with a tool called latstat. It was originally written by Bryan Cantrill and has been extensively updated by me. latstat is composed of two parts: a kernel pseudo-driver (latstat) and a user level program (ulatstat). The kernel driver is responsible for creating a timer that fires periodically on an interval provided by the program. The interrupt generated by the timer is what is known in Solaris as a "high-level" interrupt (IPL 14). One of the very few things you can do from a high-level interrupt is to schedule a lower level ("soft") interrupt (this is at IPL 6) which is what the pseudo driver does. The soft interrupt then fires and wakes up ulatstat, which is blocked in an ioctl call in the driver, via cv_signal() (see condvar(9F)). ulatstat then starts running and immediately returns to user land. Timestamps are captured along the way with gethrtime() and stored in a buffer in the kernel for all but the time when ulatstat returned to user land. That is kept in a separate buffer in userland and is matched with the kernel buffer when aggregating is done.

After 5000 iterations, the buffer is read from the kernel, the data is aggregated via a quantization scheme and the cycle starts all over again. The timer interval to be used in each iteration is randomly selected from a set of 24 prime numbers in the neighborhood of 1000. The selected value (e.g., 997) is used as the number of microseconds in each timer interval.

Here is an attempt at a pictorial representation of what's happening:

flow of processing 

1) ulatstat is dynamically linked to the system libraries:

        libm.so.2 =>     /lib/libm.so.2
        libpthread.so.1 =>       /lib/libpthread.so.1
        librt.so.1 =>    /lib/librt.so.1
        libcpc.so.1 =>   /usr/lib/libcpc.so.1
        libthread.so.1 =>        /lib/libthread.so.1
        libkstat.so.1 =>         /lib/libkstat.so.1
        liblgrp.so.1 =>  /usr/lib/liblgrp.so.1
        libc.so.1 =>     /lib/libc.so.1
        libpctx.so.1 =>  /usr/lib/libpctx.so.1
        libnvpair.so.1 =>        /lib/libnvpair.so.1
        libproc.so.1 =>  /lib/libproc.so.1
        libnsl.so.1 =>   /lib/libnsl.so.1
        librtld_db.so.1 =>       /lib/librtld_db.so.1
        libelf.so.1 =>   /lib/libelf.so.1
        libctf.so.1 =>   /lib/libctf.so.1
        libmp.so.2 =>    /lib/libmp.so.2
        libmd.so.1 =>    /lib/libmd.so.1
        libscf.so.1 =>   /lib/libscf.so.1
        libuutil.so.1 =>         /lib/libuutil.so.1
        libgen.so.1 =>   /lib/libgen.so.1

2) It finds a CPU on the system to run on and creates a single CPU processor set and drives interrupts away from that set. It then binds itself to that set. On systems with NUMA characteristics, it tries to find a CPU in the "best" processor group and use that. Otherwise it uses the first CPU it finds.

3) It runs in the realtime scheduling class at a priority specified by the user or a default value midway through the priority ranges in the class. In this test, it ran at priority 5 in the realtime scheduling class. Priority 59 would have been the highest priority. The scheduling behavior is SCHED_FIFO.

4) It does an mlockall(MCL_CURRENT|MCL_FUTURE) to lock down all current memory and future memory as well. It takes one pass through the code to ensure that all library calls, variables and the stack segment have been faulted in. All buffers are memset() to zero before use (which ensures faulting). There is no memory allocation during time critical processing.

5) Data is aggregated in ulatstat until 500000 points have been accumulated. At this point a gnuplot control file and data point file are written out. Data in the graphs is plotted with gnuplot. As more points are accumulated, the file is rewritten with the new summary data reflecting the increased number of data points.

6) This test was not run with the LD_BIND_NOW environment variable set. The processing in item 4 is sufficient in this case to ensure that necessary data paths and data structures have been faulted in.

7) The load on the system was composed of two main parts, structured so that the CPUs not used by the realtime test were as busy as possible during the run.

A) a Solaris full nightly build (non-debug and debug) with all packages being built and full checks. This is almost identical to what the gatekeepers in the Solaris group do every night. The only difference is that the executables generated are not signed. Each build cycle takes roughly 7 hours on this system. A while[1] script keeps the builds going on and on and on...

B) Since there are periods when CPUs are somewhat idle during the build process, a secondary load generation process was used. This was the hackbench program used by OSDL for some of their scheduler testing. It compiles and runs on Solaris without change. The hackbench program was run with the argument '240', which is the number of processes created and communicating through pipes. These programs do no disk or network IO. The hackbench program was run in the fixed priority scheduling class at priority 0 and with a maximum priority of 0. The effect of this is that its processes do not run on a CPU if anything other than the idle loop wishes to run. hackbench 240 on an idle system runs in about 32.9 seconds. Typical times during the build are on the order of 44 seconds. As is the case with the nightly builds, a script keeps the tests running and running and running...

C) For more grins, mpstat 60 was running via an rlogin from a remote system.

8) This is running on Build 65 of Nevada bfu'ed to a slightly later set of bits. The HW is an x4200 with 3986 MB of memory (an odd number but that's what prtconf says) and 4 disk drives. Only the main disk drive (the one / is on) was used for this test.
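Items 3 and 4 above (realtime priority, then locking and pre-faulting memory) can be sketched roughly as follows. This is not the code from the test program; it is a minimal illustration using the portable POSIX calls sched_setscheduler() and mlockall(). The function names midway_priority, enter_realtime and lock_and_prefault are made up for this sketch, and the real program may well use Solaris-specific interfaces instead.

```c
#include <assert.h>
#include <sched.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical helper: default priority midway through the class range. */
static int
midway_priority(int lo, int hi)
{
        assert(lo <= hi);
        return (lo + (hi - lo) / 2);
}

/*
 * Hypothetical helper: request SCHED_FIFO at the given priority, or at
 * the midpoint of the range when prio is negative.
 * Returns 0 on success, -1 on failure (usually missing privilege).
 */
static int
enter_realtime(int prio)
{
        struct sched_param sp;
        int lo = sched_get_priority_min(SCHED_FIFO);
        int hi = sched_get_priority_max(SCHED_FIFO);

        if (lo == -1 || hi == -1)
                return (-1);
        memset(&sp, 0, sizeof (sp));
        sp.sched_priority = (prio < 0) ? midway_priority(lo, hi) : prio;
        return (sched_setscheduler(0, SCHED_FIFO, &sp) == -1 ? -1 : 0);
}

/*
 * Hypothetical helper: lock current and future mappings, then zero the
 * working buffer so that every page of it has been touched before the
 * time critical loop starts. Returns the result of mlockall(); the
 * buffer is zeroed either way.
 */
static int
lock_and_prefault(void *buf, size_t len)
{
        int rc = mlockall(MCL_CURRENT | MCL_FUTURE);

        memset(buf, 0, len);
        return (rc);
}
```

Both calls normally need privilege, so a caller should check the return values and decide whether running without realtime priority or locked memory still makes sense for the measurement.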

Just how good is Solaris realtime? Pretty good. I've been running some tests on a 4 way Opteron box. Running an updated latstat, I decided to copy what the Linux folks have been doing in their world and add a "bit of load" to the mix.

A sample mpstat:

        CPU minf mjf xcal intr ithr  csw icsw migr  smtx srw  syscl usr sys wt idl
          0    0   0    0 1662  830 1659    0    0     0   0    830   0   0  0 100
          1 7636   4  310 1657  245 4759 1776  248 25182  26 126552  33  67  0   0
          2 9271   3  156 1387   56 4998 1785  372 27408  23 153037  27  73  0   0
          3 8816   3  170 1412    1 4854 1788  540 27938  37 150544  27  73  0   0


# uptime

  1:45am  up 4 day(s), 21:49,  3 users,  load average: 3499.27, 3795.80, 3734.16

This is with a full nightly build of Solaris coupled with a background job that throws 240 processes on the system talking through pipes. The background jobs run at priority 0, so they only run when CPU cycles are available. The realtime task runs on CPU 0 in a single-CPU processor set with interrupts driven away, and waits for a high-level timer interrupt. It takes measurements of its progress from receiving the interrupt until it returns to userland.

The results are in this postscript file. The best sheet to look at is the last one. It is simply the delta between the time the interrupt was to be delivered and the time the application was running in userspace. This graph is from over 320 million points. The worst case is about 30 microseconds, and roughly 300 points out of the 320 million fall in the range of 13 microseconds to 30 microseconds. The fun part is to understand what's going on in that space. This run will continue for several more days.
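To make the measurement on the last sheet concrete, here is a minimal sketch of the arithmetic: one sample is the wakeup time in userspace minus the time the timer was due, and the summary keeps the worst case plus a count of samples at or above a threshold such as 13 microseconds. This is not latstat's code. The names latency_ns, lat_summary and lat_record are hypothetical, and the sketch assumes both timestamps come from the same clock.

```c
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC    1000000000LL

/* One sample: actual wakeup time minus scheduled expiry, in nanoseconds. */
static int64_t
latency_ns(const struct timespec *due, const struct timespec *woke)
{
        return ((int64_t)(woke->tv_sec - due->tv_sec) * NSEC_PER_SEC +
            (woke->tv_nsec - due->tv_nsec));
}

/* Hypothetical running summary of all samples seen so far. */
struct lat_summary {
        uint64_t count;         /* total samples */
        uint64_t tail;          /* samples at or above the tail threshold */
        int64_t worst;          /* largest sample seen, in nanoseconds */
};

/* Fold one sample into the summary; no allocation, no system calls. */
static void
lat_record(struct lat_summary *s, int64_t ns, int64_t tail_ns)
{
        s->count++;
        if (ns >= tail_ns)
                s->tail++;
        if (ns > s->worst)
                s->worst = ns;
}
```

In a real run the woke timestamp would be taken with something like clock_gettime() as the first thing the task does after waking. Keeping the per-sample work down to a few integer operations means the bookkeeping itself adds very little to the numbers being measured.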

The system is a 4 CPU x4200 with 4 disks and 3968 MB of physical memory. Everything is done locally. There is some network traffic from mpstat displays and other things.

More later including details as to the benchmark and what each of the other sheets mean.