The benchmark numbers cited in my earlier entry were generated with a tool called latstat. It was originally written by Bryan Cantrill and has been extensively updated by me. latstat is composed of two parts: a kernel pseudo-driver (latstat) and a user-level program (ulatstat). The kernel driver creates a timer that fires periodically at an interval provided by the program. The interrupt generated by the timer is what is known in Solaris as a "high-level" interrupt (IPL 14). One of the very few things you can do from a high-level interrupt is schedule a lower-level ("soft") interrupt, here at IPL 6, which is what the pseudo-driver does. The soft interrupt then fires and calls cv_signal() (see condvar(9F)) to wake up ulatstat, which is blocked in an ioctl call in the driver. ulatstat then starts running and immediately returns to userland. Timestamps are captured along the way with gethrtime() and stored in a buffer in the kernel, except for the time at which ulatstat returns to userland. That timestamp is kept in a separate buffer in userland and is matched with the kernel buffer when the data is aggregated.
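The matching of the kernel buffer with the userland buffer can be sketched roughly as follows. This is not the latstat source: the text does not say exactly which points along the path are timestamped, so the three kernel-side fields, the struct names, and match_samples() are all hypothetical. The only things taken from the description are that gethrtime() values (nanoseconds) are recorded in the kernel, that the return-to-userland time lives in a separate buffer, and that the two are paired up at aggregation time.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: one record per timer firing, filled in by the driver.
 * All values are nanosecond timestamps as returned by gethrtime(). */
typedef struct {
    int64_t hi_intr_ns;   /* high-level timer interrupt fired */
    int64_t soft_intr_ns; /* soft interrupt handler ran */
    int64_t wakeup_ns;    /* ulatstat running again inside the ioctl */
} kern_sample_t;

/* Hypothetical result: the latency of each leg of the path. */
typedef struct {
    int64_t hi_to_soft_ns;     /* high-level interrupt to soft interrupt */
    int64_t soft_to_wakeup_ns; /* soft interrupt to thread running */
    int64_t wakeup_to_user_ns; /* thread running to back in userland */
    int64_t total_ns;          /* timer firing to back in userland */
} latency_t;

/* Pair kernel record i with userland timestamp i and compute the deltas. */
void match_samples(const kern_sample_t *kern, const int64_t *user_ns,
                   size_t n, latency_t *out)
{
    for (size_t i = 0; i < n; i++) {
        out[i].hi_to_soft_ns = kern[i].soft_intr_ns - kern[i].hi_intr_ns;
        out[i].soft_to_wakeup_ns = kern[i].wakeup_ns - kern[i].soft_intr_ns;
        out[i].wakeup_to_user_ns = user_ns[i] - kern[i].wakeup_ns;
        out[i].total_ns = user_ns[i] - kern[i].hi_intr_ns;
    }
}
```

Pairing by index works only if both buffers are filled once per timer firing and in the same order, which is an assumption of this sketch.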
After 5000 iterations, the buffer is read from the kernel, the data is aggregated via a quantization scheme and the cycle starts all over again. The timer interval to be used in each iteration is randomly selected from a set of 24 prime numbers in the neighborhood of 1000. The selected value (e.g., 997) is used as the number of microseconds in each timer interval.
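To make the cycle above concrete, here is a rough C sketch of its two pieces: choosing a timer interval from a set of primes near 1000, and binning latencies into buckets. It is not taken from latstat. The text does not list the 24 primes or describe the quantization scheme, so the sketch assumes the 24 primes closest to 1000 and power-of-two buckets (in the style of DTrace's quantize()). All function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define NUM_INTERVALS 24 /* size of the prime set, from the text */

/* Trial-division primality test; fine for values around 1000. */
static int is_prime(unsigned n)
{
    if (n < 2)
        return 0;
    for (unsigned d = 2; d * d <= n; d++)
        if (n % d == 0)
            return 0;
    return 1;
}

/* Assumption: collect the `count` primes closest to `center`,
 * scanning outward one step at a time. */
void primes_near(unsigned center, unsigned *out, size_t count)
{
    size_t found = 0;

    for (unsigned d = 0; found < count; d++) {
        if (d <= center && is_prime(center - d))
            out[found++] = center - d;
        if (found < count && d > 0 && is_prime(center + d))
            out[found++] = center + d;
    }
}

/* Pick one interval (in microseconds) at random from the set. */
unsigned pick_interval_us(const unsigned *primes, size_t count)
{
    return primes[(size_t)rand() % count];
}

/* Power-of-two bucket index: bucket i holds 2^i <= v < 2^(i+1).
 * Values 0 and 1 both land in bucket 0. */
unsigned quantize_bucket(uint64_t v)
{
    unsigned i = 0;

    while (v > 1) {
        v >>= 1;
        i++;
    }
    return i;
}

/* Add n latency samples to a 64-bucket histogram. */
void aggregate(const uint64_t *lat, size_t n, uint64_t hist[64])
{
    for (size_t i = 0; i < n; i++)
        hist[quantize_bucket(lat[i])]++;
}
```

Prime intervals are presumably used so that the timer does not stay in step with other periodic activity on the system, though the text does not say so. With power-of-two buckets, a 64-entry histogram covers the full range of a 64-bit nanosecond value, so the aggregate stays the same size however many samples are folded into it.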
Here is an attempt at a pictorial representation of what's happening:
1) ulatstat is dynamically linked to the system libraries:
    libm.so.2 => /lib/libm.so.2
    libpthread.so.1 => /lib/libpthread.so.1
    librt.so.1 => /lib/librt.so.1
    libcpc.so.1 => /usr/lib/libcpc.so.1
    libthread.so.1 => /lib/libthread.so.1
    libkstat.so.1 => /lib/libkstat.so.1
    liblgrp.so.1 => /usr/lib/liblgrp.so.1
    libc.so.1 => /lib/libc.so.1
    libpctx.so.1 => /usr/lib/libpctx.so.1
    libnvpair.so.1 => /lib/libnvpair.so.1
    libproc.so.1 => /lib/libproc.so.1
    libnsl.so.1 => /lib/libnsl.so.1
    librtld_db.so.1 => /lib/librtld_db.so.1
    libelf.so.1 => /lib/libelf.so.1
    libctf.so.1 => /lib/libctf.so.1
    libmp.so.2 => /lib/libmp.so.2
    libmd.so.1 => /lib/libmd.so.1
    libscf.so.1 => /lib/libscf.so.1
    libuutil.so.1 => /lib/libuutil.so.1
    libgen.so.1 => /lib/libgen.so.1
2) It finds a CPU on the system to run on, creates a single-CPU processor set containing it, and drives interrupts away from that set. It then binds itself to that set. On systems with NUMA characteristics, it tries to find a CPU in the "best" processor group and uses that. Otherwise it uses the first CPU it finds.
3) It runs in the realtime scheduling class at a priority specified by the user, or at a default value midway through the class's priority range. In this test, it ran at priority 5 in the realtime scheduling class; priority 59 would have been the highest. The scheduling behavior is SCHED_FIFO.
4) It calls mlockall(MCL_CURRENT|MCL_FUTURE) to lock down all current and future memory. It takes one pass through the code to ensure that all library calls, variables, and the stack segment have been faulted in. All buffers are memset() to zero before use (which ensures faulting). There is no memory allocation during time-critical processing.
5) Data is aggregated in ulatstat until 500000 points have been accumulated. At this point a gnuplot control file and data point file are written out. Data in the graphs is plotted with gnuplot. As more points are accumulated, the file is rewritten with the new summary data reflecting the increased number of data points.
6) This test was not run with the LD_BIND_NOW environment variable set. The processing in item 4 is sufficient in this case to ensure that necessary data paths and data structures have been faulted in.
7) The load on the system was composed of two parts, structured to ensure that CPUs not used by the real-time testing were as busy as possible during the testing.
A) A full Solaris nightly build (non-debug and debug) with all packages being built and full checks. This is almost identical to what the gatekeepers in the Solaris group do every night. The only difference is that the executables generated are not signed. Each build cycle takes roughly 7 hours on this system. A while[1] script keeps the builds going on and on and on...
B) Since there are periods when CPUs are somewhat idle during the build process, a secondary load generator was used. This was the hackbench program used by OSDL for some of their scheduler testing. It compiles and runs on Solaris without change. hackbench was run with the argument '240', which is the number of processes created and communicating through pipes. These processes do no disk or network I/O. hackbench was run in the fixed-priority scheduling class at priority 0 and with a maximum priority of 0. The effect of this is that its processes do not run on a CPU if anything other than the idle loop wishes to run there. hackbench 240 on an idle system runs in about 32.9 seconds. Typical times during the build are on the order of 44 seconds. As is the case with the nightly builds, a script keeps the tests running and running and running...
C) For more grins, mpstat 60 was running via an rlogin from a remote system.
8) This was run on Build 65 of Nevada, bfu'ed to a slightly later set of bits. The hardware is an x4200 with 3986 MB of memory (an odd number, but that's what prtconf says) and 4 disk drives. Only the main disk drive (the one / is on) was used for this test.
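The process setup described in items 3 and 4 (realtime priority, memory lock-down, and pre-faulting of buffers) can be sketched with standard POSIX calls. This is a minimal sketch, not the ulatstat source: it assumes the realtime class is entered through sched_setscheduler() with SCHED_FIFO, whereas the tool may well use a different interface, and the helper names are hypothetical.

```c
#include <sched.h>
#include <string.h>
#include <sys/mman.h>

/* Default priority: midway through the range [lo, hi] of the class. */
int default_rt_priority(int lo, int hi)
{
    return lo + (hi - lo) / 2;
}

/* Switch the calling process to SCHED_FIFO.
 * prio < 0 means "use the midpoint default".
 * Returns 0 on success, -1 on failure (errno set); usually needs privileges. */
int enter_realtime(int prio)
{
    struct sched_param sp;
    int lo = sched_get_priority_min(SCHED_FIFO);
    int hi = sched_get_priority_max(SCHED_FIFO);

    if (lo == -1 || hi == -1)
        return -1;
    memset(&sp, 0, sizeof(sp));
    sp.sched_priority = (prio < 0) ? default_rt_priority(lo, hi) : prio;
    return sched_setscheduler(0, SCHED_FIFO, &sp);
}

/* Lock all pages mapped now and all pages mapped later.
 * Returns 0 on success, -1 on failure (errno set). */
int lock_memory(void)
{
    return mlockall(MCL_CURRENT | MCL_FUTURE);
}

/* Zero a buffer before use so that every page in it has been touched. */
void prefault(void *buf, size_t len)
{
    memset(buf, 0, len);
}
```

A caller would run enter_realtime(), then lock_memory(), then prefault() on each buffer before entering the timed loop. The first two calls typically fail without sufficient privileges, so their return values need checking. The priority range is whatever the platform reports for SCHED_FIFO, so the midpoint default differs from system to system.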
Posted by c0t0d0s0.org on May 30, 2007 at 05:20 AM PDT