Tuesday August 23, 2005
Here I discuss a new set of tools that give developers and users insight into the NUMA characteristics of their systems and applications, along with explicit control knobs to affect the NUMA-related properties of their applications.
While software developers are very busy developing more and more "features", hardware designers are very successful in sustaining Moore's Law and offsetting the impact of these features. Two orthogonal ways to improve system performance are
While both approaches are quite successful, increasing the number of CPUs proved to be the only way to dramatically improve the overall system performance.
Combining many individual CPUs together to form a computer system requires special hardware to coordinate these CPUs and other system components (memory, caches, I/O, etc). The coordination with the memory subsystem is especially important because
As a result, without careful design the fastest CPU in the world may spend its days waiting for data from memory. The problem is usually solved by introducing various levels of hardware caches which keep popular memory areas closer to the data-hungry CPU circuits.
Another problem in multiprocessor systems is sharing the memory bus that connects memory banks to the CPUs. Imagine a wide highway connecting several major cities. It may be a breeze to go from one city to another at off-peak times, but everyone knows what happens at rush hour when everyone hurries to their destination. The fast highway becomes a slow-crawling mess. The same thing may happen in a traditional Symmetric Multiprocessing (SMP) system with many CPUs sharing the memory bus. The usual way to solve this problem is to partition the system into smaller nodes where only the inter-node traffic requires access to the shared interconnect while data access within the node uses a private bus. This architecture is very similar to a big network where hubs provide fast access within a LAN and routers connect LANs together while containing all local traffic within the LAN. The major property of such partitioned systems is that accessing data within the node is noticeably faster than accessing data outside the node. Such partitioned systems got the name NUMA, which stands for Non-Uniform Memory Access.
Remember that CPUs try to get most of their data from caches rather than directly from memory, so special hardware components are added to ensure that two CPUs on different nodes see the same piece of data even if the data is in the L1 caches of these CPUs. In technical terms this hardware provides cache coherence, and most NUMA systems include such cache-coherence protocols, so they are sometimes called ccNUMA systems.
Until recently NUMA systems were primarily designed by big computer vendors developing really powerful computers. The list includes, but is not limited to
But recently AMD introduced the NUMA Opteron and Athlon64 platform, which brings NUMA technology right to my desktop, thanks to its HyperTransport design. As a result, NUMA-awareness became a mainstream concern of OS design.
As you can see from the description above, data movement within a node is much more efficient than inter-node data movement (that's why the whole thing is called Non-Uniform). Since applications usually don't care about low-level hardware architectures, it is up to the OS to make sure that application data stays close enough to its CPU. NUMA-aware OSes try hard to keep application data close to the CPUs that massage this data.
The terminology in this area is not very well established and while some implementations use the term node, Solaris uses the concept of locality group or "lgroup" to refer to the collection of nearby resources. Locality groups form a hierarchy with each level representing the degree of "closeness" of resources. I recommend that you take your time to read a nice overview of NUMA implementation on Solaris which contains many explanations of Solaris-specific implementation details.
While the OS usually does a good job of keeping data and CPUs close to each other, sometimes its good intentions go wrong and the result is quite the opposite: almost all data accesses for some application become remote, severely affecting its performance. This is often the case in scientific software when one thread allocates and prepares a large chunk of memory and then other threads use it extensively. The OS doesn't know that the thread that allocated the memory is not the one that is going to use it, so it often keeps the memory close to the allocating thread instead of the thread that actually uses it. Of course, ideally the OS should detect such an imbalance and automatically rearrange things, but in practice this problem is very hard. In such cases application developers or users can help the OS correct the situation by providing additional hints. To do this effectively, users need to understand the NUMA properties of their systems and get some insight into how the OS has placed threads and their memory across CPU and memory boards. To correct things, users also need tools to modify such placement. Software developers can use the Locality Group API.
The rest of this discussion will focus on such tools for the Solaris operating environment. I assume that you have checked the NUMA Observability tools page on the OpenSolaris site and downloaded the tools. We will discuss how to use
lgrpinfo(1): tool for displaying lgroup hierarchy
plgrp(1): proc tool for observing and affecting lgroup affinities
pmadvise(1): proc tool for applying advice with madvise(3C)
pmap(1) extensions: option to display lgroup containing physical memory backing given virtual address in specified process
Solaris::Lgrp: Perl module which gives full access to lgroup API to Perl scripts
Suppose that you are not happy with the performance of your favorite application
on Solaris and want to improve it by playing with the memory placement
optimization knobs. The first thing you would want to know is whether all this
discussion is relevant for you in the first place i.e. whether your system has
NUMA properties. Running lgrpinfo without any arguments answers
this question:
str2: $ lgrpinfo
lgroup 0 (root):
children: none
CPUs: 0 2
Memory: installed 512 Mb, allocated 275 Mb, free 237 Mb
Lgroup resources: 0 (CPU); 0 (memory)
This machine has only one (root) lgroup and is a UMA system.
Not very interesting for our discussion. This is a 2-CPU Ultra 60 system. Let's take a look at another system:
sark: $ lgrpinfo
lgroup 0 (root):
children: 1 2 3
CPUs: 0-3 8-11 16-19
Memory: installed 24576 Mb, allocated 4608 Mb, free 19968 Mb
lgroup 1 (leaf):
children: none, parent: 0
CPUs: 0-3
Memory: installed 8192 Mb, allocated 4247 Mb, free 3945 Mb
lgroup 2 (leaf):
children: none, parent: 0
CPUs: 8-11
Memory: installed 8192 Mb, allocated 194 Mb, free 7998 Mb
lgroup 3 (leaf):
children: none, parent: 0
CPUs: 16-19
Memory: installed 8192 Mb, allocated 167 Mb, free 8025 Mb
This is a NUMA system with three nodes, each having 8 Gb of memory and 4 CPUs. Now let us take a look at a 4-CPU AMD Opteron box:
gears: $ lgrpinfo
lgroup 0 (root):
children: 3 4 6 8
CPUs: 0-3
Memory: installed 3647 Mb, allocated 394 Mb, free 3253 Mb
Lgroup resources: 1 2 5 7 (CPU); 1 2 (memory)
lgroup 1 (leaf):
children: none, parent: 3
CPU: 0
Memory: installed 2048 Mb, allocated 289 Mb, free 1759 Mb
Lgroup resources: 1 (CPU); 1 (memory)
lgroup 2 (leaf):
children: none, parent: 4
CPU: 1
Memory: installed 1599 Mb, allocated 104 Mb, free 1495 Mb
Lgroup resources: 2 (CPU); 2 (memory)
lgroup 3 (intermediate):
children: 1, parent: 0
CPUs: 0-2
Memory: installed 3647 Mb, allocated 394 Mb, free 3253 Mb
Lgroup resources: 1 2 5 (CPU); 1 2 (memory)
lgroup 4 (intermediate):
children: 2, parent: 0
CPUs: 0 1 3
Memory: installed 3647 Mb, allocated 394 Mb, free 3253 Mb
Lgroup resources: 1 2 7 (CPU); 1 2 (memory)
lgroup 5 (leaf):
children: none, parent: 6
CPU: 2
Lgroup resources: 5 (CPU);
lgroup 6 (intermediate):
children: 5, parent: 0
CPUs: 0 2 3
Memory: installed 2048 Mb, allocated 289 Mb, free 1759 Mb
Lgroup resources: 1 5 7 (CPU); 1 (memory)
lgroup 7 (leaf):
children: none, parent: 8
CPU: 3
Lgroup resources: 7 (CPU);
lgroup 8 (intermediate):
children: 7, parent: 0
CPUs: 1-3
Memory: installed 1599 Mb, allocated 104 Mb, free 1495 Mb
Lgroup resources: 2 5 7 (CPU); 2 (memory)
We can see that it has 4 leaf nodes, 4 intermediate nodes and a root node. Pretty much what we would expect from a 4-way AMD Opteron. Adding some options we can add some eye candy to the output:
gears: $ lgrpinfo -Ta
0
|-- 3
|   CPUs: 0-2
|   Lgroup resources: 1 2 5 (CPU); 1 2 (memory)
|   `-- 1
|       CPU: 0
|       Memory: installed 2048 Mb, allocated 291 Mb, free 1757 Mb
|-- 4
|   CPUs: 0 1 3
|   Lgroup resources: 1 2 7 (CPU); 1 2 (memory)
|   `-- 2
|       CPU: 1
|       Memory: installed 1599 Mb, allocated 103 Mb, free 1496 Mb
|-- 6
|   CPUs: 0 2 3
|   Lgroup resources: 1 5 7 (CPU); 1 (memory)
|   `-- 5
|       CPU: 2
`-- 8
    CPUs: 1-3
    Lgroup resources: 2 5 7 (CPU); 2 (memory)
    `-- 7
        CPU: 3
This output shows the lgroup hierarchy in a more obvious way. We can immediately see that this system is rather interesting: CPU 0 has 2 Gb of local memory, CPU 1 has 1.5 Gb of local memory and CPUs 2 and 3 have no local memory at all. Suppose that an application is running on CPU 2 and needs a page of memory. Since there is no local memory available, the system will walk up the hierarchy and try to allocate memory from lgroup 6. Here is the description of lgroup 6:
|-- 6
|   CPUs: 0 2 3
|   Lgroup resources: 1 5 7 (CPU); 1 (memory)
|   `-- 5
|       CPU: 2
We can see that it will try to get memory from lgroup 1, which corresponds to the node containing CPU 0. Similarly, an application homed to CPU 3 will usually get its memory from lgroup 2, which corresponds to the node with CPU 1. This all makes perfect sense if you look at the AMD Opteron topology:
2----3
|    |
|    |
0----1
This picture shows that CPU 2 is closest to CPUs 0 and 3 and CPU 3 is closest to 2 and 1.
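The walk up the hierarchy described above can be sketched in a few lines of C. This is only a toy model, not the Solaris allocator: the tables are copied by hand from the lgrpinfo output for this box, and the names lgrp_parent, lgrp_mem, first_lgrp_with_memory and lowest_memory_resource are made up for the illustration.

```c
#include <assert.h>

#define NLGRP 9
#define NONE (-1)

/* Parent of each lgroup (index = lgroup id), copied from lgrpinfo. */
static const int lgrp_parent[NLGRP] = { NONE, 3, 4, 0, 0, 6, 0, 8, 0 };

/* Bitmask of the memory resources listed for each lgroup. */
static const unsigned lgrp_mem[NLGRP] = {
    (1u << 1) | (1u << 2),  /* lgroup 0 (root) */
    (1u << 1),              /* lgroup 1: CPU 0, 2048 Mb */
    (1u << 2),              /* lgroup 2: CPU 1, 1599 Mb */
    (1u << 1) | (1u << 2),  /* lgroup 3 */
    (1u << 1) | (1u << 2),  /* lgroup 4 */
    0,                      /* lgroup 5: CPU 2, no memory */
    (1u << 1),              /* lgroup 6 */
    0,                      /* lgroup 7: CPU 3, no memory */
    (1u << 2)               /* lgroup 8 */
};

/* Start at a leaf and climb until some lgroup has memory resources. */
int
first_lgrp_with_memory(int lgrp)
{
    assert(lgrp >= 0 && lgrp < NLGRP);
    while (lgrp != NONE && lgrp_mem[lgrp] == 0)
        lgrp = lgrp_parent[lgrp];
    return (lgrp);
}

/* Return the lowest-numbered memory resource of an lgroup, or NONE. */
int
lowest_memory_resource(int lgrp)
{
    int id;

    if (lgrp == NONE)
        return (NONE);
    assert(lgrp >= 0 && lgrp < NLGRP);
    for (id = 0; id < NLGRP; id++)
        if (lgrp_mem[lgrp] & (1u << id))
            return (id);
    return (NONE);
}
```

Starting from lgroup 5 (CPU 2) the walk stops at lgroup 6, whose only memory resource is lgroup 1; starting from lgroup 7 (CPU 3) it stops at lgroup 8 and finds lgroup 2.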
These topology considerations mean that if a thread is homed to CPU 0 and the load on this CPU becomes too high, the OS will try to migrate the thread to the closest CPU, in this case either 2 or 1. This way the OS tries to keep migrated threads close to their home memory. We can see this from the lgrpinfo output above:
|-- 3
|   CPUs: 0-2
|   Lgroup resources: 1 2 5 (CPU); 1 2 (memory)
|   `-- 1
|       CPU: 0
|       Memory: installed 2048 Mb, allocated 291 Mb, free 1757 Mb
CPU 0 belongs to lgroup 1, which has lgroup 3 as its parent. Lgroup 3 has lgroups 1, 2 and 5 as its CPU resources. Lgroup 1 contains CPU 0 itself, lgroup 2 contains CPU 1 and lgroup 5 contains CPU 2.
Armed with this information we can predict that it is best to run a memory-intensive application on CPU 0 or CPU 1 (the latter is good enough if the working set fits in the 1.5 Gb available on CPU 1). A CPU-intensive application without much memory needs may do well on CPUs 2 and 3.
The lgrpinfo command has a useful -l option which shows the latency for memory access between different lgroups. Here is an example from the same 4-CPU AMD Opteron machine:
gears: $ lgrpinfo -l
Lgroup latencies:
      0    1    2    3    4    5    6    7    8
0   135  135  135  135  135    -  135    -  135
1    99   66   99   99   99    -   66    -   99
2    99   99   66   99   99    -   99    -   66
3   135   99  135   99  135    -   99    -  135
4   135  135   99  135   99    -  135    -   99
5   135   99  135  135  135    -   99    -  135
6   135  135  135  135  135    -   99    -  135
7   135  135   99  135  135    -  135    -   99
8   135  135  135  135  135    -  135    -   99
We can see that remote access takes about 1.5 times longer than local access (99 versus 66 in the table). The absolute values in the table are not very interesting, but the relative values provide a good approximation of the access times. This data is collected by the system during boot.
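To make the relative values concrete, here is the arithmetic as a tiny C helper. It only restates numbers that appear in the table; the names LAT_LOCAL and relative_latency_pct are mine and the result is truncated to a whole percent.

```c
#include <assert.h>

/* Smallest value in the lgrpinfo -l table: local memory access. */
enum { LAT_LOCAL = 66 };

/* Express a latency from the table as a percentage of the local latency. */
int
relative_latency_pct(int latency)
{
    assert(latency > 0);
    return ((latency * 100) / LAT_LOCAL);
}
```

With the values from the table, 99 comes out as 150 percent of the local latency and 135, the largest value reported, as roughly twice the local latency.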
When the system creates a new thread, it examines the load averages on each CPU and assigns the thread a home lgroup (which is almost always a leaf lgroup). The scheduler tries to run the thread in its home lgroup or as close to it as possible. A thread will migrate from its home lgroup if there is too much imbalance between CPU loads. Note that a thread that is bound to a specific CPU will always run on this CPU. A thread bound to a processor set will run in its home lgroup or on the closest CPU within the processor set. You can determine the home lgroup of a process or a thread by calling the lgrp_home(3LGRP) function within an application. Or you can determine the home lgroup of every thread in a process using the plgrp utility. Here, for example, we figure out that sendmail(1M) runs in lgroup 7:
# plgrp -G `pgrep sendmail`
7
# lgrpinfo 7
lgroup 7 (leaf):
children: none, parent: 8
CPU: 3
Lgroup resources: 7 (CPU);
We can see that sendmail is running on CPU 3 which doesn't have any attached memory. If we are concerned about its performance, we can manually move it close to its memory (remember from our previous discussion that the memory is allocated from lgroup 2):
# plgrp -S 2 `pgrep sendmail`
# plgrp -G `pgrep sendmail`
2
What if a process has several threads? For example, for
automountd(1M) we get:
# plgrp -G `pgrep automountd`
7
2
5
We can get information about specific threads:
# plgrp -vG `pgrep automountd`
100473/1: 7
100473/2: 2
100473/4899: 5
Here 100473 is the pid of the process and the numbers 1, 2 and 4899 are thread IDs. They are useful because we can move each thread individually to a new home:
# plgrp -S 1 100473/1
# plgrp -vG `pgrep automountd`
100473/1: 1
100473/2: 2
Or move all of them at once:
# plgrp -S 1 `pgrep automountd`
# plgrp -vG `pgrep automountd`
100473/1: 1
100473/2: 1
When we move a process or a thread to a new home, all its belongings (memory) stay at its old home. Only new memory allocations use the memory from the new home. The first question we may ask is what the current memory allocation looks like. Grabbing the modified version of the pmap(1) command from the toolkit, we can easily determine where each memory segment is allocated, using the new -L option:
$ plgrp -G `pgrep emacs`
2
$ pmap -L `pgrep emacs`
07FEB000 372K rwx-- 2 [ stack ]
08050000 16K r-x-- 2 /opt/csw/bin/emacs-21.3
08054000 20K r-x-- 1 /opt/csw/bin/emacs-21.3
0838E000 1652K rwx-- 2 /opt/csw/bin/emacs-21.3
0852B000 2900K rwx-- 2 [ heap ]
08800000 3124K rwx-- 2 [ heap ]
BF960000 4K rwx-- 2 [ anon ]
BF970000 24K r-x-- 1 /lib/libm.so.2
BF976000 4K r-x-- 2 /lib/libm.so.2
...
BF9DE000 4K rwxs- 2 [ anon ]
BF9F0000 12K rwx-- 2 [ anon ]
BF9F4000 8K rwx-- 2 [ anon ]
BFA00000 4K rwx-- 2 [ anon ]
BFA10000 4K rwx-- 2 [ anon ]
BFA20000 128K r-x-- 1 /lib/libc.so.1
BFA42000 28K r-x-- 1 /lib/libc.so.1
BFA49000 16K r-x-- 2 /lib/libc.so.1
...
We can see that most of the text segment and all of the heap are allocated from the home lgroup. The segments for the library code (libm and libc) are spread around: the library code is shared and its allocation depends on which process allocated it first.
How can we control the memory placement? Application writers can use the madvise(3C) library call to control it from the application itself. It is especially useful for applications which allocate and initialize memory in one thread and use it in another. The user can also use the madv.so.1 library to apply memory advice before starting the application. For an already running application a developer can use the pmadvise(1) tool from the toolkit. The following example shows how we can move Emacs to a new home and advise it to take along the memory it uses:
$ plgrp -S 1 `pgrep emacs`
$ pmadvise -o heap=access_lwp,stack=access_lwp `pgrep emacs`
... Play with Emacs a bit ...
$ pmap -L `pgrep emacs` | egrep '(heap|stack)'
08017000 188K rwx-- 1 [ stack ]
08046000 8K rwx-- 2 [ stack ]
0852B000 32K rwx-- 1 [ heap ]
08534000 76K rwx-- 1 [ heap ]
08549000 4K rwx-- 1 [ heap ]
0854B000 20K rwx-- 1 [ heap ]
08551000 16K rwx-- 1 [ heap ]
...
Almost all the stack and all the heap migrated to a new home! We successfully completed the move with just a few commands!
Of course, we can't talk about Solaris without mentioning the almighty DTrace. We have all the tools to control long-running applications, but how can we enforce the policy automatically whenever an application starts? DTrace can solve this problem, of course! Let us create the following D program and call it papply.d:
#!/usr/sbin/dtrace -qws
/* We use -w to allow destructive actions */

/*
 * When the process reaches main(), stop it, re-home and restart again.
 */
pid$target::main:entry
{
    stop();
    system("plgrp -S 1 %d", $target);
    system("pmadvise -o heap=access_lwp,stack=access_lwp %d", $target);
    system("prun %d", $target);
}
Now we can do things like
$ papply.d -c emacs
Unfortunately, this doesn't work since plgrp can't deal with a process which is already traced, although pmap and pmadvise can if the -F option is supplied. The next version of plgrp will include a -F option which will fix the problem. Stay tuned!
In my earlier entry I discussed the issues with flow control. This time I want to look at a module which actually looks at the data being passed and converts all output to upper case letters. Basically, it does the same thing as the tr a-z A-Z command, but using STREAMS. The work is done by the upmod_upcase() function, which examines all M_DATA messages in the mblk chain and converts every character to upper case:
#define islower(x) (((unsigned)(x) >= 'a') && ((unsigned)(x) <= 'z'))
#define isupper(x) (((unsigned)(x) >= 'A') && ((unsigned)(x) <= 'Z'))
#define toupper(x) (isupper(x) ? (x) : (unsigned)(x) - 'a' + 'A')

/*
 * Convert all ASCII chars in data blocks to upper case
 */
static mblk_t *
upmod_upcase(mblk_t *passed_mp)
{
    mblk_t *mp = passed_mp;

    for (; mp != NULL; mp = mp->b_cont) {
        if ((DB_TYPE(mp) == M_DATA) && (MBLKL(mp) > 0)) {
            unsigned char *p;

            for (p = mp->b_rptr; p < mp->b_wptr; p++)
                if (islower(*p))
                    *p = toupper(*p);
        }
    }
    return (passed_mp);
}
The DB_TYPE(mp) macro simply returns the mp->b_datap->db_type value and MBLKL(mp) is the amount of data between the read and write pointers. These macros, together with some other useful definitions, are defined in the sys/strsun.h file:
#define DB_BASE(mp)     ((mp)->b_datap->db_base)
#define DB_LIM(mp)      ((mp)->b_datap->db_lim)
#define DB_REF(mp)      ((mp)->b_datap->db_ref)
#define DB_TYPE(mp)     ((mp)->b_datap->db_type)
#define MBLKL(mp)       ((mp)->b_wptr - (mp)->b_rptr)
#define MBLKSIZE(mp)    ((mp)->b_datap->db_lim - (mp)->b_datap->db_base)
#define MBLKHEAD(mp)    ((mp)->b_rptr - (mp)->b_datap->db_base)
#define MBLKTAIL(mp)    ((mp)->b_datap->db_lim - (mp)->b_wptr)
#define MBLKIN(mp, off, len) (((off) <= MBLKL(mp)) && \
    (((mp)->b_rptr + (off) + (len)) <= (mp)->b_wptr))
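If you want to get a feel for what these macros compute without loading anything into the kernel, a user-space mock is enough. The sketch below is an assumption-laden toy: mock_dblk_t, mock_mblk_t and demo_mblk() are hypothetical stand-ins carrying only the fields the macros touch, and the macro bodies are copied from the listing above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel data block: buffer limits only. */
typedef struct mock_dblk {
    unsigned char *db_base;     /* first byte of the buffer */
    unsigned char *db_lim;      /* one past the last byte of the buffer */
} mock_dblk_t;

/* Hypothetical stand-in for the kernel message block. */
typedef struct mock_mblk {
    struct mock_mblk *b_cont;   /* next block of the same message */
    unsigned char *b_rptr;      /* first unread byte */
    unsigned char *b_wptr;      /* first unwritten byte */
    mock_dblk_t *b_datap;       /* buffer description */
} mock_mblk_t;

#define MBLKL(mp)       ((mp)->b_wptr - (mp)->b_rptr)
#define MBLKSIZE(mp)    ((mp)->b_datap->db_lim - (mp)->b_datap->db_base)
#define MBLKHEAD(mp)    ((mp)->b_rptr - (mp)->b_datap->db_base)
#define MBLKTAIL(mp)    ((mp)->b_datap->db_lim - (mp)->b_wptr)
#define MBLKIN(mp, off, len) (((off) <= MBLKL(mp)) && \
    (((mp)->b_rptr + (off) + (len)) <= (mp)->b_wptr))

static unsigned char demo_buf[64];
static mock_dblk_t demo_db;
static mock_mblk_t demo_mb;

/* Build a 64-byte block with 8 bytes of headroom and 16 bytes of data. */
mock_mblk_t *
demo_mblk(void)
{
    demo_db.db_base = demo_buf;
    demo_db.db_lim = demo_buf + sizeof (demo_buf);
    demo_mb.b_cont = NULL;
    demo_mb.b_datap = &demo_db;
    demo_mb.b_rptr = demo_buf + 8;
    demo_mb.b_wptr = demo_mb.b_rptr + 16;
    return (&demo_mb);
}
```

For this block MBLKL is 16, MBLKSIZE is 64, MBLKHEAD is 8 and MBLKTAIL is 40, and MBLKIN(mp, 4, 8) holds while MBLKIN(mp, 12, 8) does not, because the requested range would run past b_wptr.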
Now we can modify the read-side put procedure to call
upmod_upcase for each mblock seen on input.
static void
upmodrput(queue_t *q, mblk_t *mp)
{
upmodput(q, upmod_upcase(mp));
}
Here is the full example:
/*
 * This example demonstrates a minimum STREAMS module that honors flow control.
 * It converts all data bytes on the read side to the upper case.
 */

/*
 * Required include files.
 * NOTE: the header names were lost from this listing. Only sys/strsun.h is
 * named in the text; the rest of the set below is an assumption based on the
 * interfaces the module uses.
 */
#include <sys/types.h>
#include <sys/stream.h>
#include <sys/strsun.h>
#include <sys/conf.h>
#include <sys/modctl.h>
#include <sys/ddi.h>
#include <sys/sunddi.h>

/*
 * Function prototypes.
 */
static int upmodopen(queue_t *, dev_t *, int, int, cred_t *);
static int upmodclose(queue_t *);
static void upmodput(queue_t *, mblk_t *);
static void upmodrput(queue_t *, mblk_t *);
static void upmodsrv(queue_t *);
static mblk_t *upmod_upcase(mblk_t *mp);

/*
 * Module linkage data
 */
static struct module_info upmod_minfo = {
    2,          /* mi_idnum */
    "upmod",    /* mi_idname */
    0,          /* mi_minpsz */
    INFPSZ,     /* mi_maxpsz */
    4096,       /* mi_hiwat */
    512         /* mi_lowat */
};

static struct qinit upmod_rinit = {
    (int (*)())upmodput,    /* qi_putp */
    (int (*)())upmodsrv,    /* qi_srvp */
    upmodopen,              /* qi_qopen */
    upmodclose,             /* qi_qclose */
    NULL,                   /* qi_qadmin */
    &upmod_minfo,           /* qi_minfo */
};

static struct qinit upmod_winit = {
    (int (*)())upmodrput,   /* qi_putp */
    (int (*)())upmodsrv,    /* qi_srvp */
    NULL,                   /* qi_qopen */
    NULL,                   /* qi_qclose */
    NULL,                   /* qi_qadmin */
    &upmod_minfo,           /* qi_minfo */
};

static struct streamtab upmod_info = {
    &upmod_rinit,   /* st_rdinit */
    &upmod_winit,   /* st_wrinit */
};

static struct fmodsw fsw = {
    "upmod",
    &upmod_info,
    D_MP | D_MTPERQ
};

/*
 * Module linkage information for the kernel.
 */
struct mod_ops mod_strmodops;

static struct modlstrmod modlstrmod = {
    &mod_strmodops, "Example up-through module 1.0", &fsw
};

static struct modlinkage modlinkage = {
    MODREV_1, (void *)&modlstrmod, NULL
};

/*
 * Standard module entry points.
 */
int
_init(void)
{
    return (mod_install(&modlinkage));
}

int
_fini(void)
{
    return (mod_remove(&modlinkage));
}

int
_info(struct modinfo *modinfop)
{
    return (mod_info(&modlinkage, modinfop));
}

/*
 * Actual module code.
 */

/*
 * STREAMS entry points.
 */

/* ARGSUSED */
static int
upmodopen(queue_t *rq, dev_t *dev, int oflag, int sflag, cred_t *crp)
{
    if (sflag != MODOPEN)
        return (EINVAL);

    /* Prevent duplicate opens */
    if (rq->q_ptr != NULL)
        return (0);

    rq->q_ptr = WR(rq)->q_ptr = (void *)1;
    qprocson(rq);
    /*
     * At this point module is linked in the STREAM and can send/receive
     * messages. Its put/service procedures may execute at any time.
     */
    return (0);
}

static int
upmodclose(queue_t *rq)
{
    qprocsoff(rq);
    rq->q_ptr = WR(rq)->q_ptr = NULL;
    /*
     * At this point module is disconnected from the STREAM and can
     * no longer receive messages. Its put or service procedures are not
     * running.
     */
    return (0);
}

/*
 * Support routines.
 */

/* Put procedure */
static void
upmodput(queue_t *q, mblk_t *mp)
{
    /*
     * If the message is a high-priority message or there is no flow control
     * and there are no messages in the queue already, pass it forward,
     * otherwise queue.
     */
    if (queclass(mp) == QPCTL ||
        ((q->q_first == NULL) && canputnext(q)))
        putnext(q, mp);
    else
        (void) putq(q, mp);
}

/*
 * Support routines.
 */
static void
upmodrput(queue_t *q, mblk_t *mp)
{
    upmodput(q, upmod_upcase(mp));
}

/* Read/write side service routine */
static void
upmodsrv(queue_t *q)
{
    mblk_t *mp;

    /*
     * Get messages from the service queue and pass them forward until flow
     * controlled.
     */
    while ((mp = getq(q)) != NULL) {
        if (canputnext(q)) {
            putnext(q, mp);
        } else {
            (void) putbq(q, mp);
            break;
        }
    }
}

#ifndef islower
#define islower(x) (((unsigned)(x) >= 'a') && ((unsigned)(x) <= 'z'))
#endif

#ifndef isupper
#define isupper(x) (((unsigned)(x) >= 'A') && ((unsigned)(x) <= 'Z'))
#endif

#ifndef toupper
#define toupper(x) (isupper(x) ? (x) : (unsigned)(x) - 'a' + 'A')
#endif

/*
 * Convert all ASCII chars in data blocks to upper case
 */
static mblk_t *
upmod_upcase(mblk_t *passed_mp)
{
    mblk_t *mp = passed_mp;

    for (; mp != NULL; mp = mp->b_cont) {
        if ((DB_TYPE(mp) == M_DATA) && (MBLKL(mp) > 0)) {
            uchar_t *p;

            for (p = mp->b_rptr; p < mp->b_wptr; p++)
                if (islower(*p))
                    *p = toupper(*p);
        }
    }
    return (passed_mp);
}
Let us save it in file upmod.c, and compile it:
$ /usr/sfw/bin/gcc -c -m64 -o upmod.o -I/usr/include \
-O -D_KERNEL -D_SYSCALL32 -D_SYSCALL32_IMPL upmod.c
$ ld -r -o upmod upmod.o
Now we can install it. For example, on a SPARC system:
$ su
# cp upmod /kernel/strmod/sparcv9
# exit
$ strchg -h upmod
$ UPTIME
4:53PM UP 35 DAY(S), 4:38, 2 USERS, LOAD AVERAGE: 0.01, 0.01, 0.00
USER TTY LOGIN@ IDLE JCPU PCPU WHAT
USER PTS/1 8AUG0510DAYS 8 BASH
USER PTS/2 4:11PM 12 1 W
$ STRCHG -P
$ w
4:55pm up 35 day(s), 4:39, 2 users, load average: 0.00, 0.01, 0.00
User tty login@ idle JCPU PCPU what
user pts/1 8Aug0510days 8 bash
user pts/2 4:11pm 12 1 w
The output reminds me of the old Russian-made ES 1045 mainframes (Soviet clones of the IBM 360/370 series), which could only output in all caps. Interestingly, in those days Russian-made computers usually used capital letters for English and lowercase letters for Russian. This was a precursor of the KOI8-R encoding.
I want to understand my system
About a year ago I joined the NUMA project which was providing support for hierarchical locality groups (HLS). Solaris already understood two-level memory hierarchies, where a piece of memory can be local or remote to a certain CPU. The project elaborated on this concept and brought more fine-grained distance information. Now the scheduler and the VM subsystem could discover which memory or CPU is the closest, which is a bit further away, which is further still, and so on. This is especially important for AMD Opteron machines with their HyperTransport architecture. For example, on a 4-way AMD system it may take up to two hops for a CPU to get to memory on another node.
This whole notion of locality is expressed in an abstraction called a locality group, or lgroup for short. An lgroup is just a set of resources (memory and CPUs) that are not too far from each other. Lgroups form a hierarchy, with leaves containing end-node resources and the root lgroup at the top containing all the system resources. Jonathan wrote a very good introduction explaining all the details, and it has some very cool pictures, too. I strongly recommend reading it if you are interested in the way Solaris deals with NUMA challenges.
I came from different towns and villages of Solaris and didn't know anything about these locality groups. Some of you know the feeling when you join a completely new project and everyone speaks a foreign language. To get a handle on the terminology and the basic abstractions I started by improving the little MDB support available for displaying lgroups. It basically allows you to have a quick look at all lgroups, and the output looks like this (on a 4-CPU AMD box):
> ::lgrp
LGRPID             ADDR           PARENT PLATHAND #CPU CPUS
     0 fffffffffbc1ccc0                0  DEFAULT    0
     1 fffffffffbc0a2a0 fffffffffbc0a380        0    1 0
     2 fffffffffbc0a310 fffffffffbc0a3f0        1    1 1
     3 fffffffffbc0a380 fffffffffbc1ccc0     NULL    0
     4 fffffffffbc0a3f0 fffffffffbc1ccc0     NULL    0
     5 fffffffffbc0a460 fffffffffbc0a4d0        2    1 2
     6 fffffffffbc0a4d0 fffffffffbc1ccc0     NULL    0
     7 fffffffffbc0a540 fffffffffbc0a5b0        3    1 3
     8 fffffffffbc0a5b0 fffffffffbc1ccc0     NULL    0
You can get a bit more detail:
{
lgrp_id = 0
lgrp_latency = 0x86
lgrp_plathand = 0xbabecafe
lgrp_parent = 0
lgrp_reserved1 = 0
lgrp_childcnt = 0x4
lgrp_children = 0x158
lgrp_leaves = 0xa6
lgrp_set = [ 0xa6, 0x6 ]
lgrp_mnodes = 0x3
lgrp_nmnodes = 0x2
lgrp_reserved2 = 0
lgrp_cpu = 0
lgrp_cpucnt = 0
lgrp_chipcnt = 0
lgrp_chips = 0
lgrp_kstat = 0
}
{
lgrp_id = 0x1
lgrp_latency = 0x42
lgrp_plathand = 0
lgrp_parent = lgrp_space+0xe0
lgrp_reserved1 = 0
lgrp_childcnt = 0
lgrp_children = 0
lgrp_leaves = 0x2
lgrp_set = [ 0x2, 0x2 ]
lgrp_mnodes = 0x1
lgrp_nmnodes = 0x1
lgrp_reserved2 = 0
lgrp_cpu = cpus
lgrp_cpucnt = 0x1
lgrp_chipcnt = 0x1
lgrp_chips = cpu0_chip
lgrp_kstat = 0xffffffff839d35e0
}
...
You get the idea. It is quite useful for engineers who have to debug this stuff, but a bit obscure for most other people. So I was wondering how I could see these lgroups in a simple manner without resorting to the kernel debugger. Luckily Jonathan created a very useful library which provides all the information needed at the user level. The only thing missing was an actual application displaying this information in human-readable form. I had enough motivation to write such a program, but I really wanted a simple tool to play with since I didn't know what it was going to display, and Perl was a good tool for playing with prototypes. The only missing part was the glue code needed to access the C library from a Perl application. This was a perfect chance to dive into the underworld of XS and XSUB, and I described my experience before.
It turned out that the glue code in this case was relatively easy, once I figured out how to get around some of the h2xs roadblocks. After that the Perl part was really easy: the first prototype was about a page of Perl code. For example, the following piece of code produced the full list of lgroups in the system, or in any subtree:
sub lgrp_lgrps($;$)
{
    my $cookie = shift;
    my $root = shift;
    $root = lgrp_root($cookie) unless defined $root;
    return unless defined $root;
    my @children = lgrp_children($cookie, $root);
    my @result;

    # Concatenate root with subtrees for every children.
    # Every subtree is obtained by calling lgrp_lgrps recursively with each of
    # the children as the argument.
    @result = @children ?
        ($root, map {lgrp_lgrps($cookie, $_)} @children) :
        ($root);
    return (wantarray ? @result : scalar @result);
}
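For readers who don't speak Perl, here is the same root-then-subtrees recursion as a C sketch. It does not call the real lgroup library: the nchildren and children tables are filled in by hand from the PARENT column of the ::lgrp output above, and collect_lgrps is a name invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NLGRP    9
#define MAXCHILD 4

/* Number of children of each lgroup (index = lgroup id). */
static const int nchildren[NLGRP] = { 4, 0, 0, 1, 1, 0, 1, 0, 1 };

/* Children of each lgroup; only the first nchildren[] entries are valid. */
static const int children[NLGRP][MAXCHILD] = {
    { 3, 4, 6, 8 }, { 0 }, { 0 }, { 1 }, { 2 }, { 0 }, { 5 }, { 0 }, { 7 }
};

/*
 * Write the root followed by every subtree into out[], never writing more
 * than cap entries. Returns the number of lgroup ids written.
 */
size_t
collect_lgrps(int root, int *out, size_t cap)
{
    size_t n = 0;
    int i;

    if (cap == 0)
        return (0);
    out[n++] = root;
    for (i = 0; i < nchildren[root]; i++)
        n += collect_lgrps(children[root][i], out + n, cap - n);
    return (n);
}
```

Called with the root lgroup 0 this yields 0 3 1 4 2 6 5 8 7; called with lgroup 6 it yields just 6 5, which corresponds to the "any subtree" case of the Perl version.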
Once I had the glue code and the initial prototype, it was relatively easy to write a small program that displayed the system lgroup hierarchy in a nice form. Here is a different look at the same kind of hierarchy as you saw above:
$ lgrpinfo -Ta
0
|-- 5
|   CPUs: 0-2
|   Lgroup resources: 1 2 3 (CPU); 1 2 3 (memory)
|   `-- 1
|       CPU: 0
|       Memory: installed 2048 Mb, allocated 182 Mb, free 1866 Mb
|-- 6
|   CPUs: 0 1 3
|   Lgroup resources: 1 2 4 (CPU); 1 2 4 (memory)
|   `-- 2
|       CPU: 1
|       Memory: installed 1599 Mb, allocated 25 Mb, free 1574 Mb
|-- 7
|   CPUs: 0 2 3
|   Lgroup resources: 1 3 4 (CPU); 1 3 4 (memory)
|   `-- 3
|       CPU: 2
|       Memory: installed 2048 Mb, allocated 131 Mb, free 1917 Mb
`-- 8
    CPUs: 1-3
    Lgroup resources: 2 3 4 (CPU); 2 3 4 (memory)
    `-- 4
        CPU: 3
        Memory: installed 2048 Mb, allocated 26 Mb, free 2022 Mb
It turned out that the Perl glue code had another application - it could be used to write useful tests in Perl for the HLS implementation. Here is a simple example:
######################################################################
# Each lgrp other than root should have a single parent and
# root should have no parents.
$fail = 0;
foreach my $l (lgrp_lgrps($c)) {
next if $l == $root;
my (@parents) = $c->parents($l) or
diag("lgrp_parents: $!");
my $nparents = @parents;
$fail++ unless $nparents == 1;
}
is($fail, 0, 'All non-root lgrps should have single parent');
@parents = $c->parents($root);
ok(!@parents, 'root should have no parents');
Once the tool was ready, Steve Lau wrote a set of tests for it (also in Perl).
And now you can play with it as well: the Perl module and the resulting lgrpinfo command are now available on the OpenSolaris web site. They are also available through CPAN.
STREAMS flow-control implementation
In my previous blog entry I discussed how to write a very simple STREAMS module that participates in the STREAMS flow control. It had two bugs in it - one intentional and one unintentional. Both were spotted by Yu Xiangning in the comments. The unintentional bug was in the setting of the flow control high and low water marks. This blog goes into more detail of the STREAMS flow control and discusses the actual implementation in Solaris.
STREAMS have a simple flow-control mechanism that is voluntary by design. Participating modules and drivers ask the next queue whether it wishes to accept more messages by calling canputnext(9f) and if the next queue is "full" (it has more data than is specified in the module high-water mark) the module enqueues the data with putq(9f) or putbq(9f). The putq() and putbq() functions place the message on the module's queue and arrange a service procedure to be run some time later. If the service procedure returns without processing all messages on its queue it will not be called again unless it is either enabled explicitly by qenable(9f) or implicitly when the amount of data queued in the next module drops below low-water mark.
All modules participating in the flow control must have a service routine. The flow control operates between the two nearest queues in a stream that have service procedures. A detailed description of the flow control is contained in the Solaris STREAMS Programming Guide. The excellent UNIX System V Network Programming contains a very good description of the flow control in section 9.2:
A stream is said to be flow-controlled when its queues become full. When the number of bytes of data in the message on a queue becomes greater than the queue's high-water mark, the queue is considered full. Flow control is an advisory state where the processing element passing messages to the full queue stops sending messages and places them on its own queue. This way, flow control can propagate from one end of the stream to the other.
At the stream head, when a process tries to write to a stream whose topmost write queue with a service procedure is full, the process goes to sleep until the number of bytes of data stored in the queue reaches the queue's low-water mark. At this point the queue is no longer flow controlled. Note the distinction between the queue being full and being flow-controlled. The queue is only full as long as the amount of data it contains is over its high-water mark, but the queue remains flow controlled after the amount of data falls below the high-water mark. Of course, if the high and low-water marks are set to the same value, then there is no such distinction.
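The difference between "full" and "flow-controlled" is easier to see in a toy model. The sketch below follows the quoted description only, not the Solaris sources: toy_queue_t and the toy_* functions are hypothetical names, and treating "greater than the high-water mark" and "at or below the low-water mark" as the exact comparisons is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified model of one queue's flow-control state. */
typedef struct toy_queue {
    size_t count;           /* bytes currently queued */
    size_t hiwat;           /* high-water mark */
    size_t lowat;           /* low-water mark */
    int flow_controlled;    /* set when full, cleared at the low-water mark */
} toy_queue_t;

void
toy_init(toy_queue_t *q, size_t hiwat, size_t lowat)
{
    q->count = 0;
    q->hiwat = hiwat;
    q->lowat = lowat;
    q->flow_controlled = 0;
}

/* The queue is full only while it holds more than the high-water mark. */
int
toy_is_full(const toy_queue_t *q)
{
    return (q->count > q->hiwat);
}

/* Writers are refused for as long as the queue stays flow-controlled. */
int
toy_can_accept(const toy_queue_t *q)
{
    return (!q->flow_controlled);
}

/* Queue nbytes; becoming full switches flow control on. */
void
toy_enqueue(toy_queue_t *q, size_t nbytes)
{
    q->count += nbytes;
    if (toy_is_full(q))
        q->flow_controlled = 1;
}

/* Drain nbytes; flow control is lifted only at the low-water mark. */
void
toy_dequeue(toy_queue_t *q, size_t nbytes)
{
    assert(nbytes <= q->count);
    q->count -= nbytes;
    if (q->count <= q->lowat)
        q->flow_controlled = 0;
}
```

With marks of 4096 and 512, queueing 4200 bytes makes the queue both full and flow-controlled. Draining it to 3200 bytes ends the full state, but writers are still refused until the count drops to 512 or less.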
The canputnext()
function is pretty simple. It finds the next queue with a service procedure and
checks whether it has the QFULL flag set. If QFULL is not
set, it returns 1; if it is set, it sets the QWANTW flag and returns
0. QWANTW records that another queue wants to place messages
here, so that queue should be back-enabled once this queue drains:
int
canputnext(queue_t *q)
{
    /* get next module forward with a service queue */
    q = q->q_next->q_nfsrv;
    if (!(q->q_flag & QFULL)) {
        return (1);
    } else {
        q->q_flag |= QWANTW;
        return (0);
    }
}
The putq()
function puts a message on a module's or driver's queue. The message is placed
after any other messages of the same priority, and flow control parameters are
updated. If QNOENB is not set and the service procedure is waiting for data
(QWANTR), the service routine is enabled:
/*
* Put a message on a queue.
*
* Messages are enqueued on a priority basis. The priority classes
* are HIGH PRIORITY (type >= QPCTL), PRIORITY (type < QPCTL && band > 0),
* and B_NORMAL (type < QPCTL && band == 0).
*
* Add appropriate weighted data block sizes to queue count.
* If queue hits high water mark then set QFULL flag.
*
* If QNOENAB is not set (putq is allowed to enable the queue),
* enable the queue only if the message is PRIORITY,
* or the QWANTR flag is set (indicating that the service procedure
* is ready to read the queue. This implies that a service
* procedure must NEVER put a high priority message back on its own
* queue, as this would result in an infinite loop (!).
*/
int
putq(queue_t *q, mblk_t *bp)
{
    mblk_t *tmp;
    int bytecnt = 0, mblkcnt = 0;

    /*
     * If queue is empty, add the message and initialize the pointers.
     * Otherwise, adjust message pointers and queue pointers.
     */
    if (!q->q_first) {
        bp->b_next = NULL;
        bp->b_prev = NULL;
        q->q_first = bp;
        q->q_last = bp;
    } else {
        tmp = q->q_last;
        bp->b_next = NULL;
        bp->b_prev = tmp;
        tmp->b_next = bp;
        q->q_last = bp;
    }

    /* Get message byte count for q_count accounting */
    for (tmp = bp; tmp; tmp = tmp->b_cont) {
        bytecnt += (tmp->b_wptr - tmp->b_rptr);
        mblkcnt++;
    }
    q->q_count += bytecnt;
    q->q_mblkcnt += mblkcnt;
    if ((q->q_count >= q->q_hiwat) ||
        (q->q_mblkcnt >= q->q_hiwat)) {
        q->q_flag |= QFULL;
    }

    /* Don't enable the queue that was noenable(9f)-ed */
    if ((canenable(q) && (q->q_flag & QWANTR)))
        qenable(q);
    return (1);
}
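The three priority classes named in the putq() comment can be written down as a small classification function. This is a user-space sketch: SK_QPCTL mirrors the value of QPCTL (0x80), while the enum and the function name are hypothetical and exist only for this illustration.

```c
#include <assert.h>

#define SK_QPCTL 0x80   /* same value as QPCTL; types at or above are high priority */

/* Hypothetical names for the three classes from the putq() comment. */
enum msg_class { CLASS_NORMAL, CLASS_PRIORITY, CLASS_HIGH_PRIORITY };

/* Classify a message by its type and priority band. */
static enum msg_class
classify(unsigned char type, unsigned char band)
{
    if (type >= SK_QPCTL)
        return (CLASS_HIGH_PRIORITY);
    return (band > 0 ? CLASS_PRIORITY : CLASS_NORMAL);
}
```

Note that the band is looked at only for ordinary message types: a high-priority message is high priority regardless of its band.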
The getq()
function fetches and returns the first message from the queue. As a side effect
it may trigger back-enabling of the queues which were previously flow
controlled because this queue had too much data. The back-enabling process will
arrange for the service routine of a previously flow-controlled module to be
called. That routine will call getq() on its own queue, which may cause
further back-enabling, propagating the release of the flow control backwards
down the stream.
/*
* Get a message off head of queue
*
* If queue has no buffers then mark queue
* with QWANTR. (queue wants to be read by
* someone when data becomes available)
*
* If there is something to take off then do so.
* If queue falls below hi water mark turn off QFULL
* flag. Decrement weighted count of queue.
* Also turn off QWANTR because queue is being read.
*
* The queue count is maintained on a per-band basis.
* Priority band 0 (normal messages) uses q_count,
* q_lowat, etc.
*
* If queue count is below the lo water mark and QWANTW
* is set, enable the closest backq which has a service
* procedure and turn off the QWANTW flag.
*
* A note on the use of q_count and q_mblkcnt:
* q_count is the traditional byte count for messages that
* have been put on a queue. Documentation tells us that
* we shouldn't rely on that count, but some drivers/modules
* do. What was needed, however, is a mechanism to prevent
* runaway streams from consuming all of the resources,
* and particularly be able to flow control zero-length
* messages. q_mblkcnt is used for this purpose. It
* counts the number of mblk's that are being put on
* the queue. The intention here, is that each mblk should
* contain one byte of data and, for the purpose of
* flow-control, logically does. A queue will become
* full when EITHER of these values (q_count and q_mblkcnt)
* reach the highwater mark. It will clear when BOTH
* of them drop below the highwater mark. And it will
* backenable when BOTH of them drop below the lowwater
* mark.
* With this algorithm, a driver/module might be able
* to find a reasonably accurate q_count, and the
* framework can still try and limit resource usage.
*/
mblk_t *
getq(queue_t *q)
{
    mblk_t *bp;
    int band = 0;

    bp = getq_noenab(q);
    if (bp != NULL)
        band = bp->b_band;
    qbackenable(q, band);
    return (bp);
}
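The comment's point about q_count and q_mblkcnt is worth a tiny experiment: because the same high-water mark is applied to both counters, a flood of zero-length messages still fills the queue. The sketch below is a user-space model; struct acct and acct_enqueue() are names I made up for it.

```c
#include <assert.h>
#include <stddef.h>

struct acct {
    size_t count;       /* bytes queued (plays the role of q_count) */
    size_t mblkcnt;     /* blocks queued (plays the role of q_mblkcnt) */
    size_t hiwat;       /* one high-water mark applied to both counters */
};

/* Account for one enqueued message; returns 1 if the queue is now full. */
static int
acct_enqueue(struct acct *a, size_t bytes, size_t blocks)
{
    assert(blocks > 0);     /* every message has at least one block */
    a->count += bytes;
    a->mblkcnt += blocks;
    return (a->count >= a->hiwat || a->mblkcnt >= a->hiwat);
}
```

With hiwat set to 3, the third zero-length message makes the queue full even though the byte count is still 0.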
The getq_noenab()
function is internal to the STREAMS framework. It does the actual job of
fetching the message but doesn't deal with back-enabling the STREAM.
/*
* Like getq() but does not backenable. The caller must call qbackenable()
* after it is done with accessing the queue.
*/
mblk_t *
getq_noenab(queue_t *q)
{
    mblk_t *bp;
    mblk_t *tmp;
    int bytecnt = 0, mblkcnt = 0;

    if ((bp = q->q_first) == 0) {
        q->q_flag |= QWANTR;
    } else {
        if ((q->q_first = bp->b_next) == NULL)
            q->q_last = NULL;
        else
            q->q_first->b_prev = NULL;

        /* Get message byte count for q_count accounting */
        for (tmp = bp; tmp; tmp = tmp->b_cont) {
            bytecnt += (tmp->b_wptr - tmp->b_rptr);
            mblkcnt++;
        }
        q->q_count -= bytecnt;
        q->q_mblkcnt -= mblkcnt;
        if ((q->q_count < q->q_hiwat) &&
            (q->q_mblkcnt < q->q_hiwat)) {
            q->q_flag &= ~QFULL;
        }
        q->q_flag &= ~QWANTR;
        bp->b_next = NULL;
        bp->b_prev = NULL;
    }
    return (bp);
}
The
qbackenable() function is another STREAMS internal function.
It checks whether queue back-enabling is required and, if so, calls
backenable(), which does the actual back-enabling.
/*
* Determine if a backenable is needed after removing a message in the
* specified band.
*/
void
qbackenable(queue_t *q, int band)
{
    int backenab = 0;

    if (band == 0 && (q->q_flag & QWANTW) == 0)
        return;

    if (band == 0) {
        if (q->q_lowat == 0 || (q->q_count < q->q_lowat &&
            q->q_mblkcnt < q->q_lowat)) {
            backenab = q->q_flag & QWANTW;
        }
    } else {
        ...
    }

    if (backenab & QWANTW) {
        q->q_flag &= ~QWANTW;
        backenable(q, band);
    }
}
The
backenable() function is a STREAMS internal function that finds the
nearest back queue with a service procedure and enables it.
Enabling means arranging for the service routine to be run sometime in
the future. It is handled by the qenable_locked()
function, which is beyond the scope of this blog.
/*
 * enable first back queue with svc procedure.
 * Use pri == -1 to avoid the setqback
 */
void
backenable(queue_t *q, int pri)
{
    queue_t *nq;

    /* find nearest back queue with service proc */
    for (nq = backq(q); nq && !nq->q_qinfo->qi_srvp; nq = backq(nq))
        ;

    if (nq) {
        if (pri != -1)
            setqback(nq, pri);
        qenable_locked(nq);
    }
}
Grown up Do-Nothing STREAMS Module
In my earlier entry I played with a simple STREAMS module that does nothing useful, but just passes messages back and forth. Now I want to extend this to a respectable STREAMS module that fully participates in the STREAMS flow control. This means that in addition to the open/close entry points the module should define read and write put procedures and a service procedure. The previous module was called "nullmod"; this module will be called "passmod".
Let us start with the put procedure. It can be as simple as
void
passmodput(queue_t *q, mblk_t *mp)
{
    putnext(q, mp);
}
What we now want to do is to check that the next module in the STREAM can accept
our message. We do this by calling canputnext(9f) and using
putq(9f) if canputnext() fails:
void
passmodput(queue_t *q, mblk_t *mp)
{
    if (canputnext(q)) {
        putnext(q, mp);
    } else {
        (void) putq(q, mp);
    }
}
Here is the service routine. It gets the messages one by one and passes them down the STREAM, observing the flow control:
/* Read/write side service procedure. */
static void
passmodsrv(queue_t *q)
{
    mblk_t *mp;

    /*
     * Get messages from the service queue and pass them forward until flow
     * controlled.
     */
    while ((mp = getq(q)) != NULL) {
        if (canputnext(q)) {
            putnext(q, mp);
        } else {
            (void) putbq(q, mp);
            break;
        }
    }
}
Now, what happens if by the time we enter the put procedure there are already some messages enqueued? We definitely do not want to call putnext() on the new message since it may arrive before the earlier messages and violate the message ordering in the STREAM. To defend against this problem we revise the put procedure a bit:
void
passmodput(queue_t *q, mblk_t *mp)
{
    if ((q->q_first == NULL) && canputnext(q)) {
        putnext(q, mp);
    } else {
        (void) putq(q, mp);
    }
}
Now if there are any messages already enqueued we will continue enqueueing all
new messages. This code is very straightforward, but a bit naive. The
complication comes from high-priority messages (which can be sent by passing the
RS_HIPRI flag to the putmsg(2) function). When you
call putq() with a high-priority message, the STREAMS framework
immediately enables the queue and schedules its service procedure. If the
service procedure cannot pass the message on and puts it back on the queue, the
queue is enabled again, which causes an infinite loop. So we should be a bit
more careful and always pass high-priority messages forward. This means that we
don't need to enqueue them in the first place, so we can rewrite the put
procedure again to fix the problem:
void
passmodput(queue_t *q, mblk_t *mp)
{
    /*
     * If the message is a high-priority message or there is no flow control
     * and there are no messages in the queue already, pass it forward,
     * otherwise enqueue. High priority message should be always passed
     * forward.
     */
    if (queclass(mp) == QPCTL ||
        ((q->q_first == NULL) && canputnext(q)))
        putnext(q, mp);
    else
        (void) putq(q, mp);
}
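The message ordering problem that motivated the q_first check can be reproduced in user space. The following is only a simulation sketch: struct sim and the sim_* functions are hypothetical stand-ins for the queue, the put procedure and the service procedure, and the can_put field plays the role of canputnext().

```c
#include <assert.h>
#include <stddef.h>

#define SIM_MAX 8

struct sim {
    int can_put;            /* stand-in for canputnext() */
    int queued[SIM_MAX];    /* messages held on our own queue */
    size_t nqueued;
    int out[SIM_MAX];       /* messages delivered downstream, in order */
    size_t nout;
};

/* Put procedure; check_queue selects the revised, order-preserving test. */
static void
sim_put(struct sim *s, int msg, int check_queue)
{
    int queue_empty = (s->nqueued == 0);

    if ((!check_queue || queue_empty) && s->can_put) {
        assert(s->nout < SIM_MAX);
        s->out[s->nout++] = msg;
    } else {
        assert(s->nqueued < SIM_MAX);
        s->queued[s->nqueued++] = msg;
    }
}

/* Service procedure: drain the queue front to back while allowed. */
static void
sim_srv(struct sim *s)
{
    size_t i = 0;

    while (i < s->nqueued && s->can_put) {
        assert(s->nout < SIM_MAX);
        s->out[s->nout++] = s->queued[i++];
    }
    /* Move any messages that could not be sent to the front. */
    for (size_t j = i; j < s->nqueued; j++)
        s->queued[j - i] = s->queued[j];
    s->nqueued -= i;
}

/* Message 1 arrives while blocked, message 2 after the block lifts. */
static void
sim_run(struct sim *s, int check_queue)
{
    *s = (struct sim){ 0 };
    sim_put(s, 1, check_queue);     /* downstream full: queued */
    s->can_put = 1;                 /* flow control released */
    sim_put(s, 2, check_queue);     /* arrives before the service runs */
    sim_srv(s);
}
```

Running sim_run() with check_queue set to 0 delivers message 2 ahead of message 1; with check_queue set to 1 the original order is kept.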
Now we have all the components to construct a fully-functioning STREAMS module which correctly implements flow control. The full code is below.
NOTE: The code below contains a subtle bug. Try to find it before I explain the bug in the next blog entry.
/*
 * This example demonstrates a minimum STREAMS module that honors flow control.
 */

/*
 * Required include files.
 * The header names did not survive in this listing; the set below is an
 * assumption that covers the types and functions used by the code.
 */
#include <sys/types.h>
#include <sys/errno.h>
#include <sys/stream.h>
#include <sys/conf.h>
#include <sys/modctl.h>
#include <sys/ddi.h>
#include <sys/sunddi.h>

/*
 * Function prototypes.
 */
static int passmodopen(queue_t *, dev_t *, int, int, cred_t *);
static int passmodclose(queue_t *);
static void passmodput(queue_t *, mblk_t *);
static void passmodsrv(queue_t *);

/*
 * Module linkage data
 */
static struct module_info passmod_minfo = {
    2,              /* mi_idnum */
    "passmod",      /* mi_idname */
    0,              /* mi_minpsz */
    INFPSZ,         /* mi_maxpsz */
    0,              /* mi_hiwat */
    0               /* mi_lowat */
};

static struct qinit passmod_rinit = {
    (int (*)())passmodput,  /* qi_putp */
    (int (*)())passmodsrv,  /* qi_srvp */
    passmodopen,            /* qi_qopen */
    passmodclose,           /* qi_qclose */
    NULL,                   /* qi_qadmin */
    &passmod_minfo,         /* qi_minfo */
};

static struct qinit passmod_winit = {
    (int (*)())passmodput,  /* qi_putp */
    (int (*)())passmodsrv,  /* qi_srvp */
    NULL,                   /* qi_qopen */
    NULL,                   /* qi_qclose */
    NULL,                   /* qi_qadmin */
    &passmod_minfo,         /* qi_minfo */
};

static struct streamtab passmod_info = {
    &passmod_rinit,         /* st_rdinit */
    &passmod_winit,         /* st_wrinit */
};

static struct fmodsw fsw = {
    "passmod",
    &passmod_info,
    D_MP
};

/*
 * Module linkage information for the kernel.
 */
struct mod_ops mod_strmodops;

static struct modlstrmod modlstrmod = {
    &mod_strmodops, "Example pass-through module 1.0", &fsw
};

static struct modlinkage modlinkage = {
    MODREV_1, (void *)&modlstrmod, NULL
};

/*
 * Standard module entry points.
 */
int
_init(void)
{
    return (mod_install(&modlinkage));
}

int
_fini(void)
{
    return (mod_remove(&modlinkage));
}

int
_info(struct modinfo *modinfop)
{
    return (mod_info(&modlinkage, modinfop));
}

/*
 * Actual module code.
 */

/*
 * STREAMS entry points.
 */

/* ARGSUSED */
static int
passmodopen(queue_t *rq, dev_t *dev, int oflag, int sflag, cred_t *crp)
{
    if (sflag != MODOPEN)
        return (EINVAL);

    /* Prevent duplicate opens */
    if (rq->q_ptr != NULL)
        return (0);

    rq->q_ptr = WR(rq)->q_ptr = (void *)1;
    qprocson(rq);
    /*
     * At this point module is linked in the STREAM and can send/receive
     * messages. Its put/service procedures may execute at any time.
     */
    return (0);
}

static int
passmodclose(queue_t *rq)
{
    qprocsoff(rq);
    rq->q_ptr = WR(rq)->q_ptr = NULL;
    /*
     * At this point module is disconnected from the STREAM and can
     * no longer receive messages. Its put or service procedures are not
     * running.
     */
    return (0);
}

/*
 * Support routines.
 */

/* Read/write side put procedure. */
static void
passmodput(queue_t *q, mblk_t *mp)
{
    /*
     * If the message is a high-priority message or there is no flow control
     * and there are no messages in the queue already, pass it forward,
     * otherwise enqueue. High priority message should be always passed
     * forward.
     */
    if (queclass(mp) == QPCTL ||
        ((q->q_first == NULL) && canputnext(q)))
        putnext(q, mp);
    else
        (void) putq(q, mp);
}

/* Read/write side service procedure. */
static void
passmodsrv(queue_t *q)
{
    mblk_t *mp;

    /*
     * Get messages from the service queue and pass them forward until flow
     * controlled.
     */
    while ((mp = getq(q)) != NULL) {
        if (canputnext(q)) {
            putnext(q, mp);
        } else {
            (void) putbq(q, mp);
            break;
        }
    }
}
It is common for a kernel programmer to postpone processing of some tasks and delegate their execution to another kernel thread. There may be several reasons for doing this:
In all these cases the programmer, in essence, needs to execute a piece of code (a task) in a different context, where context usually means another kernel thread with a different set of locks held and, possibly, a different priority.
Until the introduction of task queues in Solaris 8 there was no generic OS facility for such an in-kernel context change. Every subsystem used its own ad-hoc mechanisms, usually utilizing "worker threads" together with a list of jobs to give them. The task queues interface abstracts the common code out of these mechanisms and provides a simple way of scheduling asynchronous tasks.
A task queue consists of a list of tasks, together with one or more threads to service the list. If a task queue has a single service thread, all tasks are guaranteed to execute in the order they were dispatched. Otherwise they can be executed in any order. Note that since tasks are placed on a list, the execution of one task should not depend on the execution of another task, or a deadlock may occur.
Kernel users should use the documented DDI interface for all taskq operations. These interfaces are defined in the usr/src/uts/common/sys/sunddi.h header file. The exported interface consists of the following functions:
Every taskq created in the system keeps a set of kstat counters associated with it. Try running the following command on your system:
$ kstat -c taskq
module: unix instance: 0
name: ata_nexus_enum_tq class: taskq
crtime 53.877907833
executed 0
maxtasks 0
nactive 1
nalloc 0
priority 60
snaptime 258059.249256749
tasks 0
threads 1
totaltime 0
module: unix instance: 0
name: callout_taskq class: taskq
crtime 0
executed 13956358
maxtasks 4
nactive 4
nalloc 0
priority 99
snaptime 258059.24981709
tasks 13956358
threads 2
totaltime 120247890619
...
The kstat information above includes:
You can use the power of the kstat command to observe how some counter increases over time:
$ kstat -p unix:0:callout_taskq:tasks 1 5
unix:0:callout_taskq:tasks 13994642
unix:0:callout_taskq:tasks 13994711
unix:0:callout_taskq:tasks 13994784
unix:0:callout_taskq:tasks 13994855
unix:0:callout_taskq:tasks 13994926
...
The taskq implementation also provides several useful SDT probes. All the probes described below have two arguments: the taskq pointer and the pointer to the taskq_ent_t structure. The latter can be used to extract the function and the argument in a D script.
Developers can use these probes to collect precise timing information about individual task queues and individual tasks being executed through them. For example, the following script prints, every 10 seconds, which functions were scheduled via task queues:
#!/usr/sbin/dtrace -qs

sdt:genunix::taskq-enqueue
{
    this->tq = (taskq_t *)arg0;
    this->tqe = (taskq_ent_t *)arg1;
    @[this->tq->tq_name,
        this->tq->tq_instance,
        this->tqe->tqent_func] = count();
}

tick-10s
{
    printa("%s(%d): %a called %@d times\n", @);
    trunc(@);
}
Running this on my desktop produced the following output [1]:
callout_taskq(1): genunix`callout_execute called 51 times
callout_taskq(0): genunix`callout_execute called 701 times
kmem_taskq(0): genunix`kmem_update_timeout called 1 times
kmem_taskq(0): genunix`kmem_hash_rescale called 4 times
callout_taskq(1): genunix`callout_execute called 40 times
USB_hid_81_pipehndl_tq_1(14): usba`hcdi_cb_thread called 256 times
callout_taskq(0): genunix`callout_execute called 702 times
kmem_taskq(0): genunix`kmem_update_timeout called 1 times
kmem_taskq(0): genunix`kmem_hash_rescale called 4 times
callout_taskq(1): genunix`callout_execute called 28 times
USB_hid_81_pipehndl_tq_1(14): usba`hcdi_cb_thread called 228 times
callout_taskq(0): genunix`callout_execute called 706 times
callout_taskq(1): genunix`callout_execute called 24 times
USB_hid_81_pipehndl_tq_1(14): usba`hcdi_cb_thread called 141 times
callout_taskq(0): genunix`callout_execute called 708 times
Suppose that two friends, Bob and Alice, are standing in the cafeteria line with Alice behind Bob. The cashier checks Bob's tray and it turns out that Bob doesn't have enough money, so he wants to borrow from Alice. But Alice is not sure whether she has enough cash until she knows the cost of her lunch. This is a typical deadlock situation: neither Bob nor Alice can make any forward progress, because each is waiting for the other. The same kind of deadlock may occur if two tasks A and B are placed on a queue which is served by a single thread when there is a resource dependency between A and B. One way to prevent such a deadlock is to guarantee that A and B are processed by two different threads, so that when A stalls for B the thread processing A will block until B makes enough progress and can provide the needed resource to A.
Dynamic task queues provide exactly such a deadlock-free way of scheduling potentially dependent tasks on the same queue. They guarantee that every task is processed by a separate thread. Since the number of tasks that can be scheduled at the same time is not known in advance, dynamic task queues maintain a dynamic thread pool that grows when the workload increases and shrinks when the workload dries up.
Dynamic task queues can not (yet) be used via the DDI interfaces. Some kernel
subsystems use the internal taskq calls directly to create and use
dynamic task queues. The system also maintains one shared dynamic task queue
called system_taskq. It can be used by specifying
system_taskq as the taskq argument to the
taskq_dispatch() function. It is really a good idea to also add
"TQ_NOSLEEP | TQ_NOQUEUE" to the flags when using
system_taskq.
Each taskq is implemented as a list of tasks protected by a per-taskq lock. One or more worker threads take tasks one by one and execute them by calling f(a) and then sleep, waiting for new entries. A taskq created with a single servicing thread has an important property: it guarantees that all its tasks are executed in the order they are scheduled. When a task queue is created with several servicing threads, task execution order is not predictable.
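The "list of tasks plus a worker calling f(a)" idea fits in a few lines of user-space C. This is a single-threaded sketch without any locking, so it models only the list and the ordering guarantee of a single servicing thread. The tq_model_* names, struct note and record() are hypothetical and are not part of the kernel taskq interface.

```c
#include <assert.h>
#include <stddef.h>

typedef void (task_func_t)(void *);

struct task {
    task_func_t *func;
    void *arg;
};

#define TQ_MODEL_MAX 16

/* A fixed-size FIFO of pending tasks; slots are not reused in this sketch. */
struct tq_model {
    struct task tasks[TQ_MODEL_MAX];
    size_t head;    /* index of the next task to run */
    size_t tail;    /* index of the next free slot */
};

/* Append a task; returns 0 when the list is full, 1 otherwise. */
static int
tq_model_dispatch(struct tq_model *tq, task_func_t *func, void *arg)
{
    assert(func != NULL);
    if (tq->tail == TQ_MODEL_MAX)
        return (0);
    tq->tasks[tq->tail].func = func;
    tq->tasks[tq->tail].arg = arg;
    tq->tail++;
    return (1);
}

/* What one worker does: take tasks in order and call f(a) for each. */
static size_t
tq_model_drain(struct tq_model *tq)
{
    size_t ran = 0;

    while (tq->head < tq->tail) {
        struct task *t = &tq->tasks[tq->head++];
        t->func(t->arg);
        ran++;
    }
    return (ran);
}

/* Example task argument: an id plus a shared log that makes order visible. */
struct note {
    int id;
    int *seen;
    size_t *nseen;
};

/* Example task: record our id in the shared log. */
static void
record(void *arg)
{
    struct note *n = arg;

    n->seen[(*n->nseen)++] = n->id;
}
```

Dispatching record() three times with ids 1, 2 and 3 and then draining the queue leaves the log in dispatch order, which is the property a single-threaded taskq guarantees.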
If you want to look at the actual implementation you need to look at the following files:
The first taskq implementation was done by Jeff Bonwick for Solaris 8. It was successfully used to replace many calls to the low-level thread_create() function. I added Dynamic Task Queues in Solaris 9 and used them to completely re-implement the STREAMS scheduler. In Solaris 10 I added DDI interfaces for task queues and also added kstat counters and DTrace probes.
[1] For curious minds: callout_taskq is used to handle system timers. As an exercise for your DTrace skills, try to figure out which timers are actually firing on each CPU. Hint: use the callout-start SDT probe, which has a pointer to the callout_t structure as its sole argument.
OpenSolaris includes a very powerful and modular debugger, MDB, that is an invaluable tool for analyzing crash dumps and live systems. It comes with a comprehensive manual which is a great read. Here is some brief information for brave souls who have to debug STREAMS-related issues.
The most useful MDB dcmd for most module and driver developers is
::queue. Typing ::help queue will show the full usage
information:
NAME
queue - filter and display STREAM queue
SYNOPSIS
addr ::queue [-q|v] [-m mod] [-f flag] [-F flag] [-s syncq_addr]
DESCRIPTION
Print queue information for a given queue pointer.
Without the address of a "queue_t" structure given, print information about all
queues in the "queue_cache".
Options:
-v: be verbose - print symbolic flags falues
-q: be quiet - print queue pointer only
-f flag: print only queues with flag set
-F flag: print only queues with flag NOT set
-m modname: print only queues with specified module name
-s syncq_addr: print only queues which use specified syncq
Available conversions:
q2rdq: given a queue addr print read queue pointer
q2wrq: given a queue addr print write queue pointer
q2otherq: given a queue addr print other queue pointer
q2syncq: given a queue addr print syncq pointer (::help syncq)
q2stream: given a queue addr print its stream pointer
(see ::help stream and ::help stdata)
To walk q_next pointer of the queue use
queue_addr::walk qnext
ATTRIBUTES
Target: kvm
Module: genunix
Interface Stability: Unstable
If you just type ::queue, MDB will print information about all
open queue instances [1] in the
system. For example, on my desktop:
> ::queue
ADDR MODULE FLAGS NBLK
d0353008 fifostrrhead 044032 0 00000000
d0353180 ip 204032 0 00000000
d03532f8 consms 202032 0 00000000
d0353470 strrhead 044032 0 00000000
d03535e8 tcp 204032 0 00000000
d0353760 strrhead 044032 0 00000000
d03538d8 conskbd 20c032 0 00000000
d0353a50 wc 242032 0 00000000
d0353bc8 ip 204032 0 00000000
d0353d40 strrhead 044032 0 00000000
d033f010 usbms 002032 0 00000000
d033f188 hid 201032 0 00000000
d033f300 consms 242032 0 00000000
d033f478 rts 108832 0 00000000
...
What you see in the output is the actual read queue address for each open module or instance, the module name, the queue flags, the number of messages on the module queue and the pointer to the first message in the queue. You can translate hex flag values into symbolic names by adding the -v option:
> ::queue -v
ADDR MODULE FLAGS NBLK
d0353008 fifostrrhead 044032 0 00000000
|
+--> QWANTR Someone wants to read Q
QREADR This is the reader (first) Q
QUSE This queue in use (allocation)
QMTSAFE stream module is MT-safe
QEND last queue in stream
d0353180 ip 204032 0 00000000
|
+--> QWANTR Someone wants to read Q
QREADR This is the reader (first) Q
QUSE This queue in use (allocation)
QMTSAFE stream module is MT-safe
QISDRV the Queue is attached to a driver
...
You can ask about a specific queue by providing its address to the
::queue dcmd:
> d0353180::queue -v
ADDR MODULE FLAGS NBLK
d0353180 ip 204032 0 00000000
|
+--> QWANTR Someone wants to read Q
QREADR This is the reader (first) Q
QUSE This queue in use (allocation)
QMTSAFE stream module is MT-safe
QISDRV the Queue is attached to a driver
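What the -v option does is plain bit decoding, which can be sketched in user-space C. The bit values in the table are inferred from the hex flags and symbolic names in the listings in this entry, so treat them as an assumption rather than a copy of the kernel header; decode_qflags() is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Bit values inferred from the ::queue -v listings shown in this entry. */
static const struct {
    unsigned bit;
    const char *name;
} qflag_names[] = {
    { 0x000002, "QWANTR" },
    { 0x000004, "QWANTW" },
    { 0x000008, "QFULL" },
    { 0x000010, "QREADR" },
    { 0x000020, "QUSE" },
    { 0x004000, "QMTSAFE" },
    { 0x040000, "QEND" },
    { 0x200000, "QISDRV" },
};

/*
 * Print the names of the known bits set in flags, one per line, and
 * return the bits that this table does not know about.
 */
static unsigned
decode_qflags(unsigned flags, FILE *out)
{
    size_t i;

    assert(out != NULL);
    for (i = 0; i < sizeof (qflag_names) / sizeof (qflag_names[0]); i++) {
        if (flags & qflag_names[i].bit) {
            fprintf(out, "%s\n", qflag_names[i].name);
            flags &= ~qflag_names[i].bit;
        }
    }
    return (flags);
}
```

Calling decode_qflags(0x204032, stdout) names the same five flags that ::queue -v shows for the ip queue and returns 0, meaning that no unknown bits were left over.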
What if you want to find all open instances of a specific module? This is easy
with the -m flag. For example, to find all open instances of IP you
can type
> ::queue -m ip
d0353180
d0353bc8
d197e5f8
d2263020
...
Notice that when you use any of the filtering flags, the
::queue dcmd prints only the addresses of matching queues. You can
still get more detailed information by piping the output to the
::queue dcmd:
> ::queue -m ip | ::queue
ADDR MODULE FLAGS NBLK
d0353180 ip 204032 0 00000000
d0353bc8 ip 204032 0 00000000
d197e5f8 ip 244032 0 00000000
d2263020 ip 204032 0 00000000
...
You can use filtering options to find all queues with a specific flag set:
> ::queue -f QISDRV|::queue
ADDR MODULE FLAGS NBLK
d0353180 ip 204032 0 00000000
d03532f8 consms 202032 0 00000000
d03535e8 tcp 204032 0 00000000
d03538d8 conskbd 20c032 0 00000000
d0353a50 wc 242032 0 00000000
...
This prints information about all driver queues. What if you want information about modules and not drivers? Easy:
> ::queue -F QISDRV|::queue
ADDR MODULE FLAGS NBLK
d0353008 fifostrrhead 044032 0 00000000
d0353470 strrhead 044032 0 00000000
d0353760 strrhead 044032 0 00000000
d0353d40 strrhead 044032 0 00000000
d033f010 usbms 002032 0 00000000
d033f478 rts 108832 0 00000000
...
Similarly you can find all queues which are flow controlled. Here is an example from a real core dump:
> ::queue -f QFULL|::queue
ADDR MODULE FLAGS NBLK
ffffffff83e9d000 strrhead 04403c 64 ffffffff953c0600
ffffffff82548018 strrhead 04403c 331 ffffffff949c2680
ffffffff85f4ba48 timod 00083c 7 ffffffffbfa04140
ffffffff83acf2e8 timod 00083c 3 ffffffff84659f00
A careful reader will notice that this prints information only about flow
controlled read side queues. How about write side queues? Here the
little ::q2wrq dcmd comes in handy:
> ::queue -q | ::q2wrq | ::queue -f QFULL|::queue -v
ADDR MODULE FLAGS NBLK
ffffffff83cb0388 tl 24402c 205 fffffe80e3017040
|
+--> QWANTW Someone wants to write Q
QFULL Q is considered full
QUSE This queue in use (allocation)
QMTSAFE stream module is MT-safe
QEND last queue in stream
QISDRV the Queue is attached to a driver
The ::q2wrq dcmd simply maps the read queue pointer to the write
queue pointer. The ::q2rdq dcmd performs the opposite mapping, and the
::q2otherq dcmd maps the read queue pointer to the write queue pointer
and vice versa.
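These conversions are cheap because the two queues of a pair sit next to each other in memory: the read queue comes first and is marked with QREADR. Here is a user-space sketch of the three mappings under that assumption. The sk_* names are hypothetical, and SK_QREADR uses the bit value inferred from the flag listings in this entry.

```c
#include <assert.h>
#include <stddef.h>

#define SK_QREADR 0x10  /* read-side marker; value inferred from the listings */

struct sk_queue {
    unsigned flag;
};

/* A pair of queues: element 0 is the read side, element 1 the write side. */
struct sk_queue_pair {
    struct sk_queue q[2];
};

/* Mark the first queue of the pair as the read side. */
static void
sk_pair_init(struct sk_queue_pair *p)
{
    assert(p != NULL);
    p->q[0].flag = SK_QREADR;
    p->q[1].flag = 0;
}

/* Either queue of a pair maps to the read queue. */
static struct sk_queue *
sk_q2rdq(struct sk_queue *q)
{
    return ((q->flag & SK_QREADR) ? q : q - 1);
}

/* Either queue of a pair maps to the write queue. */
static struct sk_queue *
sk_q2wrq(struct sk_queue *q)
{
    return ((q->flag & SK_QREADR) ? q + 1 : q);
}

/* Map a queue to its partner. */
static struct sk_queue *
sk_q2otherq(struct sk_queue *q)
{
    return ((q->flag & SK_QREADR) ? q + 1 : q - 1);
}
```

The address pairs printed by ::stream further down fit this picture: each write queue address is a fixed distance above its read queue address.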
Another useful conversion is the ::q2stream dcmd, which maps a queue
pointer to the stream head pointer. You can use the nice ::stream
dcmd to display the whole stream graphically:
> ffffffff83acf2e8::q2stream|::stream
+-----------------------+-----------------------+
| 0xffffffff82548110 | 0xffffffff82548018 |
| strwhead | strrhead |
| | |
| cnt = 0t0 | cnt = 0t57420 |
| flg = 0x00004022 | flg = 0x0004403c |
+-----------------------+-----------------------+
| ^
v |
+-----------------------+-----------------------+
| 0xffffffff83acf3e0 | 0xffffffff83acf2e8 |
| timod | timod |
| | |
| cnt = 0t0 | cnt = 0t528 |
| flg = 0x00000822 | flg = 0x0000083c |
+-----------------------+-----------------------+
| ^
v |
+-----------------------+-----------------------+
| 0xffffffff83acf670 | 0xffffffff83acf578 |
| udp | udp |
| | |
| cnt = 0t0 | cnt = 0t0 |
| flg = 0x00000822 | flg = 0x00000832 |
+-----------------------+-----------------------+
| ^
v |
+-----------------------+-----------------------+
| 0xffffffff82a86128 | 0xffffffff82a86030 |
| ip | ip |
| | |
| cnt = 0t0 | cnt = 0t0 |
| flg = 0x00244022 | flg = 0x00204032 |
+-----------------------+-----------------------+
You can see all read and write queues of the stream and their state. The read side stream head is not very happy - it has 57420 bytes hanging around and blocking another 528 bytes in timod.
Another way to look at the stream is to walk the read or write side using the qnext walker:
> 0xffffffff82548110::walk qnext|::queue
ADDR MODULE FLAGS NBLK
ffffffff82548110 strwhead 004022 0 0000000000000000
ffffffff83acf3e0 timod 000822 0 0000000000000000
ffffffff83acf670 udp 000822 0 0000000000000000
ffffffff82a86128 ip 244022 0 0000000000000000
> 0xffffffff82a86030::walk qnext|::queue
ADDR MODULE FLAGS NBLK
ffffffff82a86030 ip 204032 0 0000000000000000
ffffffff83acf578 udp 000832 0 0000000000000000
ffffffff83acf2e8 timod 00083c 3 ffffffff84659f00
ffffffff82548018 strrhead 04403c 331 ffffffff949c2680
A careful reader will notice that the ::stream dcmd displays how
many bytes are in the queue while ::queue displays
how many messages are there. Do we want to know what these messages are?
Here comes the next useful dcmd: ::mblk. Typing
::help mblk will show you all the gory details:
NAME
mblk - print an mblk
SYNOPSIS
addr ::mblk [-q|v] [-f|F flag] [-t|T type] [-l|L|B len] [-d dbaddr]
DESCRIPTION
Print mblock information for a given mblk pointer.
Without the address, print information about all mblocks.
Fields printed:
ADDR: mblk address
FL: Flags
TYPE: Type of corresponding dblock
LEN: Data length as b_wptr - b_rptr
BLEN: Dblock space as db_lim - db_base
RPTR: Read pointer
DBLK: Dblock pointer
Options:
-v: be verbose - print symbolic flags falues
-q: be quiet - print mblk pointer only
-d dbaddr: print mblks with specified dblk address
-f flag: print only mblks with flag set
-F flag: print only mblks with flag NOT set
-t type: print only mblks of specified db_type
-T type: print only mblks other then the specified db_type
-l len: tprint only mblks with MBLKL == len
-L len: print only mblks with MBLKL <= len
-G len: print only mblks with MBLKL >= len
-b len: print only mblks with db_lim - db_base == len
ATTRIBUTES
Target: kvm
Module: genunix
Interface Stability: Unstable
It is easy to look at a specific message:
> ffffffff949c2680::mblk
ADDR FL TYPE LEN BLEN RPTR DBLK
ffffffff949c2680 0 proto 56 80 ffffffff88f145b0 ffffffff88f14540
This output shows us that this is an M_PROTO message with 56 bytes of
information starting at address ffffffff88f145b0 [2]. We know that messages like to hang out together.
How do we print the whole b_cont chain? Like this:
> ffffffff949c2680::walk b_cont|::mblk
ADDR FL TYPE LEN BLEN RPTR DBLK
ffffffff949c2680 0 proto 56 80 ffffffff88f145b0 ffffffff88f14540
ffffffff943b5240 0 data 108 144 ffffffff9868878c ffffffff98688700
Now we can see both the M_PROTO and the attached M_DATA messages, and we can see
that UDP is sending 108 bytes of data upstream.
Similarly, the b_next walker will follow the b_next
chain:
> ffffffff84659f00::walk b_next|::mblk
ADDR FL TYPE LEN BLEN RPTR DBLK
ffffffff84659f00 0 proto 56 80 ffffffff94257c70 ffffffff94257c00
fffffe80e3078340 0 proto 56 80 ffffffff94257bb0 ffffffff94257b40
ffffffff84658a40 0 proto 56 80 ffffffff942575b0 ffffffff94257540
And, of course, we can combine both in the pipeline:
> ffffffff84659f00::walk b_next|::walk b_cont|::mblk
ADDR FL TYPE LEN BLEN RPTR DBLK
ffffffff84659f00 0 proto 56 80 ffffffff94257c70 ffffffff94257c00
ffffffff84659300 0 data 120 208 ffffffff82fc144c ffffffff82fc13c0
fffffe80e3078340 0 proto 56 80 ffffffff94257bb0 ffffffff94257b40
ffffffff846599c0 0 data 120 208 ffffffff8465120c ffffffff84651180
ffffffff84658a40 0 proto 56 80 ffffffff942575b0 ffffffff94257540
ffffffff84658b00 0 data 120 208 ffffffff84651d4c ffffffff84651cc0
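The combined pipeline is just a nested loop over the two kinds of links. The sketch below models it in user space; struct sk_mblk and walk_queue() are hypothetical names, and the len field stands in for b_wptr - b_rptr.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for mblk_t with only the linkage and data length. */
struct sk_mblk {
    struct sk_mblk *b_next;     /* next message on the queue */
    struct sk_mblk *b_cont;     /* next block of this message */
    size_t len;                 /* stands in for b_wptr - b_rptr */
};

/* Visit every block of every message; return total bytes, count blocks. */
static size_t
walk_queue(const struct sk_mblk *first, size_t *nblocks)
{
    size_t total = 0;

    assert(nblocks != NULL);
    *nblocks = 0;
    for (const struct sk_mblk *mp = first; mp != NULL; mp = mp->b_next) {
        for (const struct sk_mblk *bp = mp; bp != NULL; bp = bp->b_cont) {
            total += bp->len;
            (*nblocks)++;
        }
    }
    return (total);
}
```

Feeding it the three messages listed above (a 56-byte proto block followed by a 120-byte data block each) gives 3 * (56 + 120) = 528 bytes in 6 blocks, which agrees with the cnt = 0t528 that ::stream reported for the timod read queue.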
If you are really curious, you can see all allocated
messages by simply typing ::mblk:
> ::mblk
ADDR FL TYPE LEN BLEN RPTR DBLK
d6d0e000 0 proto 24 64 cfedaf00 cfedaec0
d6d0e020 0 data 0 8192 d377d400 d23d7c00
d6d0e040 0 data 3 64 dc14d580 dc14d540
d6d0e060 0 proto 156 320 cfedbd00 cfedbcc0
d6d0e080 0 data 0 64 d366a9c4 d366a980
d6d0e0a0 0 data 80 320 cfedbe80 cfedbe40
d6d0e0c0 0 proto 24 64 cfe5e0c0 cfe5e080
d6d0e0e0 0 data 115 320 cfe490e7 cfe49080
As with the ::queue dcmd you can filter by the message type. For
example, you can look at M_DATA messages [3] only:
> ::mblk -t M_DATA|::mblk
ADDR FL TYPE LEN BLEN RPTR DBLK
d6d0e020 0 data 0 8192 d377d400 d23d7c00
d6d0e040 0 data 16 64 dc14d580 dc14d540
d6d0e080 0 data 0 64 d366a9c4 d366a980
d6d0e0a0 0 data 80 320 cfedbe80 cfedbe40
d6d0e0e0 0 data 115 320 cfe490e7 cfe49080
d6d0e140 0 data 0 64 d77c7480 d77c7440
Or you can find all 0-bytes messages:
> ::mblk -l 0|::mblk
ADDR FL TYPE LEN BLEN RPTR DBLK
d6d0e020 0 data 0 8192 d377d400 d23d7c00
d6d0e040 0 data 0 64 dc1ab714 dc1ab6c0
d6d0e080 0 data 0 64 d366a9c4 d366a980
d6d0e140 0 data 0 64 d77c7480 d77c7440
...
In my opinion these examples show why MDB is really cool for kernel debugging. It can present a huge amount of information in an informative way. Want to know how many 1-byte messages are floating around your system?
> ::mblk -l 0t1 ! wc -l
51
[1] The ::queue dcmd will print all read queues.
[2] The UDP module speaks a protocol called TPI, so the M_PROTO message is one of the TPI primitives. The first part of the primitive is always its ID; we can get it by typing
> ffffffff88f145b0/D
0xffffffff88f145b0: 20
Looking at usr/src/uts/common/sys/tihdr.h
we see that this is the T_UNITDATA_IND message (makes sense). We can print the
whole message now:
> ffffffff88f145b0::print 'struct T_unitdata_ind'
{
PRIM_type = 0x14
SRC_length = 0x10
SRC_offset = 0x14
OPT_length = 0x14
OPT_offset = 0x24
}
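The same decoding can be sketched in plain C. The structure below simply copies the field names from the ::print output; the 32-bit field width, the toy_ names and the bounds checks are assumptions made for this illustration and are not taken from tihdr.h.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_T_UNITDATA_IND 20   /* the ID printed by the /D command above */

/* Field names as printed by ::print; int32_t is an assumption. */
struct toy_unitdata_ind {
    int32_t PRIM_type;
    int32_t SRC_length;
    int32_t SRC_offset;
    int32_t OPT_length;
    int32_t OPT_offset;
};

/* Read the primitive ID from the start of a message; -1 if too short. */
static int32_t
toy_prim_type(const unsigned char *rptr, size_t len)
{
    int32_t prim;

    if (len < sizeof (prim))
        return (-1);
    memcpy(&prim, rptr, sizeof (prim));
    return (prim);
}

/* Copy out the header and check that both areas lie inside the message. */
static int
toy_decode_unitdata_ind(const unsigned char *rptr, size_t len,
    struct toy_unitdata_ind *out)
{
    if (len < sizeof (*out) ||
        toy_prim_type(rptr, len) != TOY_T_UNITDATA_IND)
        return (-1);
    memcpy(out, rptr, sizeof (*out));
    if (out->SRC_length < 0 || out->SRC_offset < 0 ||
        out->OPT_length < 0 || out->OPT_offset < 0)
        return (-1);
    if ((size_t)out->SRC_offset + (size_t)out->SRC_length > len ||
        (size_t)out->OPT_offset + (size_t)out->OPT_length > len)
        return (-1);
    return (0);
}
```

With the values printed above, the source address occupies bytes 0x14 to 0x23 and the options occupy bytes 0x24 to 0x37, which adds up to 56 bytes, the LEN shown for the proto messages in the earlier output.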
3 Underground hackers special:
> ::mblk -t M_DATA|::print mblk_t b_rptr|/s
Jun 15 09:57:16 lpr[13390]: [ID 575460 FACILITY_AND_PRIORITY] net_response(18)
NOTICE: NFS4 FACT SHEET:
Action: NR_CLIENTID
NFS4 error: NFS4ERR_STALE_CLIENTID
Suspected server reboot.
Jun 15 09:57:16 lpr[13390]: [ID 626751 FACILITY_AND_PRIORITY] net_write(18, 0x8045478, 12)
NOTICE: [NFS4][Server: onnv.sfbay][Mntpt: /ws/onnv-gate/usr]NFS server onnv.sfbay not responding; still trying
·gp0Jun 15 09:57:16 lpr[13390]: [ID 458697 FACILITY_AND_PRIORITY] net_open(bulka.SFBay, 5)
NOTICE: [NFS4][Server: onnv.sfbay][Mntpt: /ws/onnv-gate/usr]NFS server onnv.sfbay ok
...
Prints the content of all the data messages. I can see eyes flashing in the darkness!
Now that OpenSolaris is finally a reality, engineers can really start talking about the interesting stuff. We spend most of our time dealing with code and it is quite difficult to talk about what you do without being able to provide examples. Now is the time for real technical blogging instead of hand-waving.
I spent quite a lot of my time at Sun hacking STREAMS internals - things that almost no one ever notices, except for some unlucky folks who run into some nasty issues. Here I want to continue the sewer tour started by Bryan Cantrill and take interested visitors to some of the STREAMS sewers.
Our tour will start with a few lines of code in the putnext() function. In the original implementation it was just a simple macro calling the put procedure of the next module in the STREAM:
#define putnext(q, mp) ((*(q)->q_next->q_qinfo->qi_putp)((q)->q_next, (mp)))
In Solaris it evolved into rather complicated code1. Its inner workings deserve quite a few separate blog entries, but for now we will take a side tour to a small piece of code that may catch your attention:
/*
* If there are writers or exclusive waiters, there is not much
* we can do. Place the message on the syncq and schedule a
* background thread to drain it.
*
* Also if we are approaching end of stack, fill the syncq and
* switch processing to a background thread - see comments on
* top.
*/
if ((flags & (SQ_STAYAWAY|SQ_EXCL|SQ_EVENTS)) ||
(sq->sq_needexcl != 0) || PUT_STACK_NOTENOUGH()) {
/*
* NOTE: qfill_syncq will need QLOCK. It is safe to drop
* SQLOCK because positive sq_count keeps the syncq from
* closing.
*/
mutex_exit(SQLOCK(sq));
qfill_syncq(sq, qp, mp);
/*
* NOTE: after the call to qfill_syncq() qp may be
* closed, both qp and sq should not be referenced at
* this point.
*
* This ASSERT is located here to prevent stack frame
* consumption in the DEBUG code.2
*/
ASSERT(sqciplock == NULL);
return;
}
The code uses the notion of a syncq, which is a synchronization abstraction used by STREAMS. Whenever a module calls the putnext() function, the code checks various conditions and, if everything seems all right, simply calls the put procedure of the next module in the STREAM. If something doesn't seem right (e.g. the target module is busy processing other messages), the message is placed on a special queue in the syncq and the framework arranges for a different kernel thread to pass the enqueued message to the next module when the module is ready3. The important observation is that this kernel thread has its own fresh stack. The highlighted code above hijacks this normal STREAMS mechanism to protect against a rather nasty problem - kernel stack overflow.
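The hand-off described above can be modelled in ordinary user-land C. This is a toy under loudly stated assumptions: toy_syncq_t, toy_put() and toy_drain() are hypothetical names, a fixed array stands in for the syncq message list, and a direct call to toy_drain() stands in for the background kernel thread. It is not the Solaris implementation.

```c
#include <stddef.h>

#define TOY_MAXMSG 16

/* Hypothetical stand-in for a syncq: a busy flag plus a FIFO of messages. */
typedef struct toy_syncq {
    int    busy;                  /* nonzero: target cannot be entered now */
    int    pending[TOY_MAXMSG];   /* deferred messages, oldest first */
    size_t npending;
    int    delivered[TOY_MAXMSG]; /* what the "put procedure" has seen */
    size_t ndelivered;
} toy_syncq_t;

/* The "put procedure" of the next module: it just records the message. */
static void
toy_deliver(toy_syncq_t *sq, int msg)
{
    if (sq->ndelivered < TOY_MAXMSG)
        sq->delivered[sq->ndelivered++] = msg;
}

/*
 * Pass a message on. Returns 1 if it was delivered directly, 0 if it was
 * deferred and -1 if the deferral queue is full. A message is also deferred
 * when older messages are still pending, so the order is preserved.
 */
static int
toy_put(toy_syncq_t *sq, int msg)
{
    if (sq->busy || sq->npending > 0) {
        if (sq->npending == TOY_MAXMSG)
            return (-1);
        sq->pending[sq->npending++] = msg;
        return (0);
    }
    toy_deliver(sq, msg);
    return (1);
}

/* What the background thread would do: deliver deferred messages in order. */
static void
toy_drain(toy_syncq_t *sq)
{
    size_t i;

    sq->busy = 0;
    for (i = 0; i < sq->npending; i++)
        toy_deliver(sq, sq->pending[i]);
    sq->npending = 0;
}
```

The point of the model is only the control flow: the caller of toy_put() returns immediately in the deferred case, and the delivery happens later from a different context.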
Note that STREAMS is a very flexible framework that allows many modules to be linked together in a chain (or even something more complex like a tree or U-pipe). Every new Solaris release includes some new STREAMS modules and drivers that often play in concert to provide some exciting new functionality. The downside is that chained calls of the modules' put procedures create really long call chains on the stack and sometimes cause kernel panics due to stack overflow.
Soon after the Solaris 9 release I saw quite a few similar panics, which happened because the whole kernel thread stack was consumed by a pile of STREAMS modules. Here is a typical example:
402760b0 allocb+0x180()
40276110 ip_wput_frag_copyhdr+0x14()
40276170 ip_wput_frag+0x124()
40276290 ip_wput_ire+0x1a98()
402763e8 ire_send+0x218()
40276448 ire_add_then_send+0x304()
402764b0 ip_newroute+0xb78()
40276600 ire_send+0x198()
40276660 ire_add_then_send+0x304()
402766c8 ip_wput_nondata+0x874()
40276748 putnext+0x400()
402767a8 ar_query_reply+0x150()
40276808 ar_entry_query+0x154()
40276878 ar_rput+0x144()
402768e8 putnext+0x400()
40276948 ip_newroute+0x1760()
40276a98 putnext+0x400()
40276af8 udp_wput+0x63c()
40276b80 putnext+0x400()
40276be0 putnext+0x400()
40276c40 strput+0x57c()
40276d68 kstrputmsg+0x3bc()
40276de8 tli_send+0x20()
40276e48 t_ksndudata+0x284()
40276ea8 clnt_clts_kcallit_addr+0x3dc()
40276f90 clnt_clts_kcallit+0x38()
40277000 rfscall+0x46c()
402770b8 rfs3call+0x68()
40277130 nfs3write+0x11c()
402772b0 nfs3_bio+0x334()
40277310 nfs3_rdwrlbn+0xf0()
40277380 nfs3_sync_putapage+0x34()
402773e0 nfs3_putapage+0x364()
40277470 pvn_vplist_dirty+0x424()
40277568 nfs_putpages+0x174()
402775e0 nfs3_putpage+0xa8()
40277640 nfs_purge_caches+0x98()
402776a0 nfs_cache_check+0xf8()
40277700 nfs3_getattr_otw+0x12c()
40277840 nfs3_validate_caches+0x11c()
40277908 nfs3_getpage+0x44()
40277990 segvn_fault+0xa74()
40277a90 as_fault+0x4c4()
40277b40 pagefault+0x40()
40277ba8 trap+0xd90()
40277c80 utl0+0x4c()
Although each module uses just a small amount of stack, the numbers quickly add up. It is like going to the grocery store to buy a whole bunch of small items. Each one is really cheap, but you end up paying a lot at the counter.
At the time I was fixing some bugs in the syncq implementation and spent a lot of time fooling around with putnext() and its friends, so I spotted the possibility of using the existing ability of putnext() to delegate work to a new kernel thread to avoid such panics. The idea was really simple: in addition to the usual work handout in case the perimeter is busy, perform the same handout when we are too close to blowing the stack. So I added the highlighted code above together with the definition of PUT_STACK_NOTENOUGH:
#define PUT_STACK_NEEDED 5000
#define PUT_STACK_NOTENOUGH() \
(((STACK_BIAS + (uintptr_t)getfp() - \
(uintptr_t)curthread->t_stkbase) < put_stack_needed))
The value of PUT_STACK_NEEDED was chosen experimentally. It is impossible to predict how much stack will be used in the future. For example, a simple call to allocb() may, in some unlucky circumstances, trigger a long chain of calls through the kmem and vmem memory allocation layers. So the value of PUT_STACK_NEEDED was chosen to prevent common panics that we saw during the PIT runs 4.
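The arithmetic behind the check can be tried out in user-land with a small sketch. The names below (toy_stack_needed, toy_stack_left(), toy_stack_notenough()) are hypothetical, the frame address and the lowest stack address are passed in explicitly instead of coming from getfp() and the thread structure, and the stack is assumed to grow toward lower addresses.

```c
#include <stdint.h>

/* Hypothetical tunable; 5000 is the value quoted in the text. */
static long toy_stack_needed = 5000;

/*
 * Bytes left between the current frame address and the lowest usable
 * stack address, assuming the stack grows toward lower addresses.
 * The result is negative if the frame is already below the stack.
 */
static long
toy_stack_left(uintptr_t frame, uintptr_t stack_low)
{
    return ((long)(intptr_t)(frame - stack_low));
}

/* Nonzero when less than toy_stack_needed bytes of stack remain. */
static int
toy_stack_notenough(uintptr_t frame, uintptr_t stack_low)
{
    return (toy_stack_left(frame, stack_low) < toy_stack_needed);
}
```

With a threshold of 5000 bytes, a 16K stack leaves 16384 - 5000 = 11384 bytes for ordinary calls before the hand-off starts, while an 8K stack would leave only 8192 - 5000 = 3192 bytes.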
I ran the fix past the real kernel experts - Bryan Cantrill, who had fixed a tricky kernel stack overflow problem before5, and Mike Shapiro. Mike suggested making PUT_STACK_NOTENOUGH a generic kernel function for others to use, while Bryan objected. We had a heated meeting in Mike's office and, as a result, the following comment appeared at the top of the putnext.c file:
/*
 * Streams with many modules may create long chains of calls via putnext() which
 * may exhaust stack space. When putnext detects that the stack space left is
 * too small (less then PUT_STACK_NEEDED), the call chain is broken and
 * further processing is delegated to the background thread via call to
 * putnext_tail(). Unfortunately there is no generic solution with fixed stack
 * size, and putnext() is recursive function, so this hack is a necessary evil.
 *
 * The redzone value is chosen dependent on the default stack size which is 8K
 * on 32-bit kernels and on x86 and 16K on 64-bit kernels. The values are chosen
 * empirically. For 64-bit kernels it is 5000 and for 32-bit kernels it is 2500.
 * Experiments showed that 2500 is not enough for 64-bit kernels and 2048 is not
 * enough for 32-bit.
 *
 * The redzone value is a tuneable rather then a constant to allow adjustments
 * in the field.
 *
 * The check in PUT_STACK_NOTENOUGH is taken from segkp_map_red() function. It
 * is possible to define it as a generic function exported by seg_kp, but
 *
 * a) It may sound like an open invitation to use the facility indiscriminately.
 * b) It adds extra function call in putnext path.
 *
 * We keep a global counter `put_stack_notenough' which keeps track how many
 * times the stack switching hack was used.
 */
The hack was integrated6 early in Solaris 10 and backported to S9 updates. Later in the S10 timeframe another engineer made the real fix for the stack overflow problem - increased the stack size on 64-bit kernels (see bug 4922366). It is worth mentioning a comment made by Jeff Bonwick in the evaluation of this bug:
Yep. Grow the kernel stack. Memory is cheap. Panics are expensive.
I've been through several kernel stack crises before. They always unfold the same way. Some particular workload goes too deep. We prune a few stack frames to fix the offending code path. Then another one comes up. And another. (Right about now someone suggests that instead of growing the stack for every thread, we should finally bite the bullet and make the kernel stack growable. I dig up my mail archive from the last time we contemplated this, and explain why it's much harder than it sounds.) Eventually the panic rate becomes so high that we have to act. Nobody can figure out how to make dynamic stack growth work reliably, so after one more jurassic outage we accept physics and grow the stack.
So my proposal is that this time, we dispense with all the hand-wringing and just grow the damn thing.
As often happens, the stack protection hack uncovered another long-standing and interesting bug7 in the Solaris qfill_syncq() function, but this is a subject for another blog entry...
1 The added complexity comes from the multi-threaded nature of the kernel. While one thread tries to access q_next, another thread may change it at the same time, which leads to chaos. Solaris STREAMS provides a rich set of synchronization mechanisms, called STREAMS perimeters, which simplify the life of module and driver writers at the cost of internal complexity of the implementation.
2 Modern compilers are very smart and can optimize a function call immediately followed by a return statement so that the callee reuses the stack frame of the caller. This is called tail-call optimization. It saves stack space, but obscures debugging and sometimes even confuses DTrace. The problem is that the back-trace you see in the stack trace does not accurately represent the actual calling sequence in the presence of tail calls. To simplify debugging, the seemingly useless ASSERT prevents such tail-call optimization on DEBUG kernels, while keeping all the performance benefits in production builds.
3 The STREAMS framework is very careful to pass down all messages in the order received.
4 It turned out that the stack barrier value of 5000 was too aggressive for 8K stacks on 32-bit sparc systems, so it was adjusted to 2500 on 32-bit platforms.
5 Bryan fixed bug 1259818 back in Solaris 2.6.
6 This was fixed as bug 4525533.
7 Don't spend your time trying to find the bug in qfill_syncq() function - it was fixed before S10 was released.
Playing with STREAMS modules from User-Land
In my previous STREAMS blog entry we discussed how to construct a do-nothing STREAMS module. Now we will come back to user-land and see how we can play with STREAMS modules. We will learn how to list the modules in a STREAM, push a module onto it and pop a module off it.
Suppose that you have an open file descriptor fd and would like to know what STREAMS modules and drivers live behind the scene in the kernel in the STREAM representing the file.
The article by Rajesh Ramchandani on the Sun Developer Network provides an excellent example with the full source code of the printmod() function. It uses the I_LIST ioctl, which returns the list of modules as an array of struct str_mlist entries.
The next thing we are going to try is pushing our new module onto the STREAM. The following simple function should do the trick:
int pushmod(int fd, char *modname)
{
int rc;
if ((rc = ioctl(fd, I_PUSH, modname)) < 0) {
perror("I_PUSH");
fprintf(stderr, "pushmod(%d, %s) failed\n", fd, modname);
}
return rc;
}
The function takes a file descriptor and a module name and pushes the module on top of the STREAM. It returns 0 on success and -1 on failure. We can extend it a bit to push a whole list of modules. Suppose that the module list is a string with commas separating module names:
/* Push list of modules separated by commas */
int pushlist(int fd, char *s)
{
char *comma = strchr(s, ',');
int rc = 0;
if (comma == NULL)
return (pushmod(fd, s));
*comma = '\0';
if (((rc = pushmod(fd, s)) >= 0) && *(comma+1) != '\0') {
*comma = ',';
rc = pushlist(fd, comma + 1);
}
return (rc);
}
The following example demonstrates how it can be used in practice:
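#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <stropts.h>
#include <string.h>	/* strchr() */

/* pushmod() and pushlist() from above go here */

int main(int argc, char *argv[])
{
	/* Nothing to push */
	if (argc == 1)
		return (0);
	/* Push the modules onto the STREAM behind stdin */
	if (pushlist(0, argv[1]) < 0)
		exit(1);
	exit(0);
}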
We can save this program as pushmod.c and try it (assuming that you have installed the nullmod module from the previous example):
$ cc pushmod.c -o pushmod
$ strconf
ttcompat
ldterm
ptem
pts
$ ./pushmod nullmod,nullmod,nullmod
$ strconf
nullmod
nullmod
nullmod
ttcompat
ldterm
ptem
pts
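The recursive pushlist() above temporarily modifies its argument in place. As an alternative, here is a sketch of an iterative walker that leaves the list untouched and hands each name to a callback. Everything in it (foreach_name(), name_fn_t, seen_t, record_name(), the 64-byte name limit) is hypothetical and made up for this illustration; unlike pushlist(), it silently skips empty names.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type: return a negative value to stop the walk. */
typedef int (*name_fn_t)(const char *name, void *arg);

/*
 * Call fn() for each comma-separated name in list, left to right.
 * The list itself is not modified. Returns 0 on success, or the first
 * negative value returned by fn(), or -1 if a name is too long.
 */
static int
foreach_name(const char *list, name_fn_t fn, void *arg)
{
    char name[64];

    while (*list != '\0') {
        size_t len = strcspn(list, ",");
        int rc;

        if (len >= sizeof (name))
            return (-1);
        if (len > 0) {
            memcpy(name, list, len);
            name[len] = '\0';
            if ((rc = fn(name, arg)) < 0)
                return (rc);
        }
        list += len;
        if (*list == ',')
            list++;
    }
    return (0);
}

/* Example callback state: the names seen so far and a name to reject. */
typedef struct seen {
    char        names[8][64];
    int         count;
    const char *fail_on;    /* name to reject, or NULL */
} seen_t;

/* Example callback: record the name, failing on the chosen one. */
static int
record_name(const char *name, void *arg)
{
    seen_t *s = arg;

    if (s->fail_on != NULL && strcmp(name, s->fail_on) == 0)
        return (-1);
    if (s->count == 8)
        return (-1);
    strcpy(s->names[s->count++], name);
    return (0);
}
```

On Solaris the callback would be a thin wrapper around the pushmod() function shown earlier, with the file descriptor passed through arg.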
Finally, you may remove the module from the top of the STREAM using a simple call:
if ((rc = ioctl(fd, I_POP, 0)) < 0) {
	perror("I_POP");
}
Now you know how to manipulate the content of the STREAM. You may want to play with it a bit and see what happens if you insert and remove some interesting modules.
NOTE: It is quite likely that your terminal window will become unusable as a result of your experiments. Many modules assume a certain context and are designed to play in concert with others, so your terminal may misbehave if its STREAM is incorrectly configured.
Open Solaris is Good for Linux!
There is quite a lot of discussion of the various reasons and motivations for Sun to open source its crown jewel - the Solaris operating system. Quite a few people, both inside and outside Sun, believe that the move will help Sun as a company, but here I'd like to explore why opening Solaris is good for everyone else - for Linux, for FreeBSD and for the computer community in general.
In my opinion, the real value of OpenSolaris is the opening of a vast amount of knowledge about the design of very complex computer systems. To really appreciate the value of this knowledge it helps to think about the way humans learn and understand the meaning of things.
The following quote is from the paper by Marvin Minsky:
Castles In The Air.
The secret of what something means lies in the ways that it connects to all the other things we know. The more such links, the more a thing will mean to us. The joke comes when someone looks for the "real" meaning of anything. For, if something had just one meaning, that is, if it were only connected to just one other thing, then it would scarcely "mean" at all!
That's why I think we shouldn't program our machines that way, with clear and simple logic definitions. A machine programmed that way might never "really" understand anything -- any more than a person would. Rich, multiply-connected networks provide enough different ways to use knowledge that when one way doesn't work, you can try to figure out why. When there are many meanings in a network, you can turn things around in your mind and look at them from different perspectives; when you get stuck, you can try another view. That's what we mean by thinking!
That's why I dislike logic, and prefer to work with webs of circular definitions. Each gives meaning to the rest. There's nothing wrong with liking several different tunes, each one the more because it contrasts with the others. There's nothing wrong with ropes - or knots, or woven cloth - in which each strand helps hold the other strands together - or apart! There's nothing very wrong, in this strange sense, with having all one's mind a castle in the air!
To summarize: of course no computer could understand anything real -- or even what a number is - if forced to single ways to deal with them. But neither could a child or philosopher. So such concerns are not about computers at all, but about our foolish quest for meanings that stand by themselves, outside any context. Our questions about thinking machines should really be questions about our own minds.
In Minsky's terminology, the Solaris source is an extremely rich body of interwoven knowledge about the design of state-of-the-art computer systems. This body of knowledge was produced (and packed in the form of C code) in the course of many years of Solaris development by many extremely competent engineers. For a long time this knowledge was available only to a small community of engineers, and soon it will be available to everyone curious enough to tap it.
I am not suggesting that the knowledge embedded in the source code of other operating systems is any "better" or "worse" than that embedded in the Solaris code. It was created by different people with different backgrounds, different objectives, different environments and different customer bases. It is just different. And, together, all of these provide an even richer web of knowledge that is much more useful than each individual part, because they represent quite different dimensions.
So why is opening up a bunch of source code really important to Linux (or FreeBSD, or any other software project)? Because someone who takes the time and effort to read and understand even small parts of this embedded knowledge will almost certainly get new insights into whatever projects he or she is currently working on or thinking about. Even if the developer does not reuse a single line of the code, he or she will definitely gain a better understanding of his or her own area of expertise. Not to mention the trivial fact that the CDDL license allows developers to build their software directly on the Solaris source. Consider, for example, the "open-sourcing" of a small part of the Solaris design - the Slab Allocator, made by Jeff Bonwick in the form of the USENIX paper and the follow-up paper. Was it useful to Linux and other software projects? Even if we ignore the fact that a slab allocator based on these papers is now the standard Linux kernel memory allocator, I am pretty sure that just reading these two papers was a very useful journey for the reader. And be assured that the person who invented the Slab Allocator has more to say - and, indeed, says a lot - in C.
Another, more recent, example is DTrace, which is already available. It may not be immediately obvious why DTrace is good for Linux, but consider how much more effort is now being put into creating an adequate Linux tracing facility that could compete with DTrace! Even if not a single line of DTrace source finds its way into a Linux distribution, it will definitely serve as a "proof of existence" and an inspiration. And, as the Linux tracing facilities improve under the influence of DTrace, DTrace itself will improve to stay a relevant tool.
As a result of such cross-influence of ideas, the whole body of available software improves in quality and coverage, and everyone wins!
And for this reason it makes sense for many computer engineers, students and just curious minds around to set aside some time to read and understand some parts of the OpenSolaris source code. And those who think that something embeds the whole meaning should, probably, reread the Minsky paper.
Technorati Tags: OpenSolaris, Solaris.
( May 21 2005, 01:09:09 AM PDT )
Ever wondered what it would take to write a minimal STREAMS module that can be correctly installed on a Solaris system? Here is an example of such a module (which we will call nullmod) with some explanations.
Every STREAMS module should define the following entry points:
Since our module will only pass messages back and forth it will use putnext(9f)
as both write and read side put procedure and will not have any service
procedures. So we need to define only open/close functions which we will call nullmodopen() and nullmodclose().
/*
* Module open routine.
* Mark the module as "opened" and link it to
* the STREAM.
*/
static int
nullmodopen(queue_t *rq, dev_t *dev,
int oflag, int sflag, cred_t *crp)
{
if (sflag != MODOPEN)
return (EINVAL);
/*
* Prevent duplicate opens.
* The q_ptr is reserved for module private use
*/
if (rq->q_ptr != NULL)
return (0);
/* Mark the module as "opened" */
rq->q_ptr = WR(rq)->q_ptr = (void *)1;
/* Link the module into the STREAM */
qprocson(rq);
/*
* At this point module is linked in the STREAM
* and can send/receive messages.
* Its put/service procedures may execute at any time.
*/
return (0);
}
/*
* Module close routine.
* Disconnect the module from the STREAM.
*/
static int
nullmodclose(queue_t *rq)
{
/* Disconnect the module from the STREAM */
qprocsoff(rq);
/*
* At this point module is disconnected from the STREAM and can
* no longer receive messages. Its put or service procedures are not
* running.
*/
rq->q_ptr = WR(rq)->q_ptr = NULL;
return (0);
}
This is pretty much the only code we need to write. The rest is the glue code that should be present in any module. Here is the description of this glue code.
Every source file should start with comments, so we shall start the nullmod.c with the appropriate comment:
/*
 * Nullmod: the minimal functioning STREAMS module.
 *
 * Copyright ..... (place your favorite one here).
 *
 */
We will need certain system include files:
/*
 * Required include files.
 */
#include <sys/types.h>
#include <sys/conf.h>
#include <sys/cred.h>
#include <sys/ddi.h>
#include <sys/modctl.h>
As we discussed before, our module will define only two functions -
nullmodopen() and nullmodclose():
/*
 * Function prototypes.
 */
static int nullmodopen(queue_t *, dev_t *, int, int, cred_t *);
static int nullmodclose(queue_t *);
Every STREAMS kernel module should have a corresponding module_info(9S) structure:
static struct module_info nullmod_minfo = {
1, /* mi_idnum */
"nullmod", /* mi_idname */
0, /* mi_minpsz */
INFPSZ, /* mi_maxpsz */
0, /* mi_hiwat */
0 /* mi_lowat */
};
Also it should have qinit(9S) structures for both the read and write sides:
static struct qinit nullmod_rinit = {
(int (*)())putnext, /* qi_putp */
NULL, /* qi_srvp */
nullmodopen, /* qi_qopen */
nullmodclose, /* qi_qclose */
NULL, /* qi_qadmin */
&nullmod_minfo, /* qi_minfo */
};
static struct qinit nullmod_winit = {
(int (*)())putnext, /* qi_putp */
NULL, /* qi_srvp */
NULL, /* qi_qopen */
NULL, /* qi_qclose */
NULL, /* qi_qadmin */
&nullmod_minfo, /* qi_minfo */
};
The streamtab(9S) structure links both together:
static struct streamtab nullmod_info = {
&nullmod_rinit, /* st_rdinit */
&nullmod_winit, /* st_wrinit */
};
The fmodsw(9S) structure describes the module to the operating system, providing the pointer to the streamtab structure above:
static struct fmodsw fsw = {
"nullmod", /* module name */
&nullmod_info, /* streams information */
D_MP /* module flags - multithreaded module */
};
Now we need to provide the linkage information for the module:
/*
* Module linkage information for the kernel.
*/
extern struct mod_ops mod_strmodops;
static struct modlstrmod modlstrmod = {
&mod_strmodops, "Example pass-through module 1.0", &fsw
};
static struct modlinkage modlinkage = {
MODREV_1, (void *)&modlstrmod, NULL
};
Every loadable kernel module should also provide _init, _fini and _info entry points.
_init(9E) initializes a loadable module. It is called before any other routine in a loadable module. Most modules do not require any specific initialization and can just call mod_install(9F). _fini(9E) prepares a loadable module for unloading. It is called when the system wants to unload a module. In most cases it can just call mod_remove(9F). _info(9E) returns information about a loadable module. In most cases _info(9E) just returns the value returned by mod_info(9F).
/*
* Standard module entry points.
*/
int _init(void)
{
return (mod_install(&modlinkage));
}
int _fini(void)
{
return (mod_remove(&modlinkage));
}
int _info(struct modinfo *modinfop)
{
return (mod_info(&modlinkage, modinfop));
}
Now we need to put our definitions for nullmodopen() and nullmodclose() above
and we are done! Our ``do-nothing'' module is ready.
Let us save the module in the file nullmod.c. To compile it we need a C compiler which can generate code for the native kernel mode - 32-bit for x86 and 64-bit for SPARC or AMD64 platforms. If we are using Sun compilers we can use the following command to produce the module object on SPARC:
cc -c -D_KERNEL -D_SYSCALL32 -D_SYSCALL32_IMPL -xarch=v9 nullmod.c
Or with gcc
gcc -c -D_KERNEL -D_SYSCALL32 -D_SYSCALL32_IMPL -m64 nullmod.c
After that we need to use ld with -r option to produce the final binary:
ld -r -o nullmod nullmod.o
Now we need to copy our module to /usr/kernel/strmod/sparcv9 (you need to be a super-user for this):
# cp nullmod /usr/kernel/strmod/sparcv9
!!WARNING!!: you are going to install the kernel module. It is possible that the system will panic if there is some problem in the module. Please make sure that no one else is using the system and you will not upset anyone (including yourself) if the system panics as a result of your experiments. End of WARNING.
Now we can load the module:
# modload nullmod
And verify that it is loaded:
# modinfo | grep nullmod
147 7bbbde30 320 - 1 nullmod (Example pass-through module 1.0)
We can also insert our module onto STDIN for our shell:
# strchg -h nullmod
and verify that it is present:
# strconf
nullmod
ttcompat
ldterm
ptem
pts
When we are done, we can remove the module from our shell STDIN:
# strchg -p
# strconf
ttcompat
ldterm
ptem
pts
You can also unload the module (it is usually not required as Solaris unloads unused modules itself):
# modunload -i 147
# modinfo|grep nullmod
#
Now you know how to
Why might this module be of any use? I can think of several reasons:
Not bad for a simple do-nothing program!
( May 13 2005, 09:06:15 PM PDT )
Converting C arrays to Perl arrays in XS: Continuation
In the previous example we created a Perl function intlist() which returned a
list of integers starting from 0.
Here is a simple example:
$ perl -Mblib -MXS::Test \
-e 'print join(" ", intlist 5), "\n"';
0 1 2 3 4
What happens if we call the intlist function in scalar context?
$ perl -Mblib -MXS::Test \
-e 'print scalar intlist 5, "\n"';
4
It prints the last value! Perl does the same thing when a list is evaluated in scalar context:
$ perl -e 'print scalar (0,1,2,3,4), "\n"'
4
The reason is that Perl treats the parenthesized expression as a use of the comma operator, which
returns its last value. Now what if we want to return something different when
the function is called in scalar context? For example, we may want to return the
total number of elements or some other value. In Perl we can use the wantarray()
function for that purpose. There is a similar way in XS code.
Let us rewrite the intlist XS definition:
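void
intlist(size)
	int size;
    PREINIT:
	int *a;
    PPCODE:
	/* Scalar context: return the element count, or undef if negative */
	if (GIMME_V == G_SCALAR) {
		if (size >= 0)
			XSRETURN_IV(size);
		else
			XSRETURN_UNDEF;
	}
	if (size > 0 && New(0, a, size, int) != NULL) {
		int i;
		intlist(size, a);
		EXTEND(SP, size);
		/* The elements are integers, so create integer SVs */
		for (i = 0; i < size; i++)
			PUSHs(sv_2mortal(newSViv(a[i])));
		/* The values were copied into new SVs; free the C array */
		Safefree(a);
	}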
The expression GIMME_V == G_SCALAR is true when the function is called in
scalar context. In this case we return the size specified as an immediate return
value using the XSRETURN_IV macro or return undef value if size is
negative. Now, if we repeat the example above, we get what we expect:
$ perl -Mblib -MXS::Test \
-e 'print scalar intlist 5, "\n"';
5
Or, if we specify a negative value
$ perl -Mblib -MXS::Test -MData::Dumper \
-e 'print Dumper(scalar intlist -5), "\n"';
$VAR1 = undef;
Technorati Tag: Perl
( Apr 08 2005, 08:20:46 PM PDT )
Converting C arrays to Perl lists
Here is a short example of XS dealing with C arrays. I assume that you followed
the directions in the previous entry and have a populated ext/XS-Test directory.
Suppose that we have the following C function:
void intlist(int size, int *output)
{
	int i;

	/* Fill the array with 0 .. size - 1 */
	for (i = 0; i < size; i++)
		output[i] = i;
}
and we would like to get access to it from Perl. We will start by creating xs_test.c file together with our XS::Test module, containing the function above. We will also add
void intlist(int, int *);
line to the "xs_test.h" file.
Now we can write XS glue code to provide an interface with the C function. What
we will do there is get every entry from the C array and push it onto the Perl
stack and return it as a Perl list. We need to use the PPCODE directive to do
this:
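void
intlist(size)
	int size;
    PREINIT:
	int *a;
    PPCODE:
	if (size > 0 && New(0, a, size, int) != NULL) {
		int i;
		(void)intlist(size, a);
		/* Create enough space on the stack */
		EXTEND(SP, size);
		/* The elements are integers, so create integer SVs */
		for (i = 0; i < size; i++)
			PUSHs(sv_2mortal(newSViv(a[i])));
		/* The values were copied into new SVs; free the C array */
		Safefree(a);
	}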
If the size argument is zero or negative, or if we can't allocate memory, the function will return an empty list to the caller.
To test the function let's add a few tests to the XS-Test.t file:
my @l = intlist 5;
ok(eq_array(\@l, [0, 1, 2, 3, 4]), 'Correct intlist array') or
diag("invalid result of intlist 5: \@l = @l");
my @l1 = intlist(-1);
my $l1 = @l1;
is($l1, 0, 'Invalid list is empty') or
diag("The returned list is @l1");
We are using the Test::More(1) eq_array function to compare two arrays.
Everything works fine so far. We can run
$ make test
in the module directory and all tests pass.
Technorati Tag: Perl
( Mar 22 2005, 06:31:09 PM PST )
Pitfalls of the Perl XS or what to do when things do not work as advertised
I was working on providing Perl interface to the liblgrp(3LIB) library so
that simple scripts can be written to understand NUMA topology of a given
machine. I started by reading pretty good and extensive documentation explaining
how to write Perl extensions: perlxstut(1), perlxs(1), perlguts(1) and h2xs(1).
While these documents explain things pretty well I stumbled across various things that didn't work as advertised, so I decided to do a walk through a simple example and show what can go wrong and how to fix it.
The Perl XS documentation suggests using h2xs as a starting point. Let's follow
its recommendation. The h2xs documentation provides a simple example for
extension based on .h and .c files. We will assume that the header file is
called xs_test.h and we want the Perl module to be named XS::Test. Let's
follow the h2xs(1) man page and start with a dummy run of h2xs:
$ h2xs -Afn XS::Test
Writing ext/XS-Test/ppport.h
Writing ext/XS-Test/lib/XS/Test.pm
Writing ext/XS-Test/Test.xs
Writing ext/XS-Test/Makefile.PL
Writing ext/XS-Test/README
Writing ext/XS-Test/t/XS-Test.t
Writing ext/XS-Test/Changes
Writing ext/XS-Test/MANIFEST
Let's remove ppport.h and use long option names for readability:
$ h2xs --skip-ppport --omit-autoload --force --name=XS::Test
Writing ext/XS-Test/lib/XS/Test.pm
Writing ext/XS-Test/Test.xs
Writing ext/XS-Test/Makefile.PL
Writing ext/XS-Test/README
Writing ext/XS-Test/t/XS-Test.t
Writing ext/XS-Test/Changes
Writing ext/XS-Test/MANIFEST
So far so good.
Now let's populate the directory with an empty header file
$ touch ext/XS-Test/xs_test.h
and create the actual extension. The man page suggests running
h2xs -Oxan Ext::Ension interface_simple.h
Note that -x requires the C::Scan(1) package. Let us assume that it is installed and continue by replacing short options with long ones and changing the module name:
$ h2xs --skip-ppport --overwrite-ok --autogen-xsubs \
--name=XS::Test xs_test.h
Can't find --overwrite-ok.h in . /usr/include /usr/sfw/include
...
/usr/include ext/XS/Test
Hmm... Looking at the h2xs source we see that the actual option name is
--overwrite_ok instead of --overwrite-ok as advertised in the man page. So
let's retry using the right option name:
$ h2xs --skip-ppport --overwrite_ok --autogen-xsubs \
--name=XS::Test xs_test.h
Can't find xs_test.h in . /usr/include /usr/sfw/include /opt/sfw/include
...
/opt/GNU/include /usr/include ext/XS/Test
Obviously it is looking in the wrong place: the directory should be ext/XS-Test
instead of ext/XS/Test. This is another h2xs bug.
Let's fix both bugs and create our own copy of h2xs in ~/bin/h2xs.fixed:
$ diff /usr/perl5/bin/h2xs ~/bin/h2xs.fixed
573c573
< 'overwrite_ok|O' => \$opt_O,
---
> 'overwrite-ok|O' => \$opt_O,
787c787
< (my $epath = $module) =~ s,::,/,g;
---
> (my $epath = $module) =~ s,::,-,g;
And now try again:
$ ~/bin/h2xs.fixed --skip-ppport --overwrite-ok --autogen-xsubs \
--name=XS::Test xs_test.h
Overwriting existing ext/XS-Test!!!
Scanning typemaps...
Scanning /usr/perl5/5.8.4/lib/ExtUtils/typemap
Scanning xs_test.h for functions...
Scanning xs_test.h for typedefs...
Writing ext/XS-Test/lib/XS/Test.pm
Writing ext/XS-Test/Test.xs
Writing ext/XS-Test/fallback/const-c.inc
Writing ext/XS-Test/fallback/const-xs.inc
Writing ext/XS-Test/Makefile.PL
Writing ext/XS-Test/README
Writing ext/XS-Test/t/XS-Test.t
Writing ext/XS-Test/Changes
Writing ext/XS-Test/MANIFEST
Wow, now we should have a working extension that does nothing useful!
Let's move on and populate the header file with a few simple constants:
$ cat >ext/XS-Test/xs_test.h
#define XST_DEFINE 1
typedef enum xst_enum {
XST_ENUM_1,
XST_ENUM_2,
} xst_enum_t;
typedef enum xst_enum_val {
XST_ENUM_VAL_1 = 1,
XST_ENUM_VAL_2 = 2,
} xst_enum_val_t;
^D
Repeating the exercise we get
$ ~/bin/h2xs.fixed --skip-ppport --overwrite-ok --autogen-xsubs \
--name=XS::Test xs_test.h
Overwriting existing ext/XS-Test!!!
Scanning typemaps...
Scanning /usr/perl5/5.8.4/lib/ExtUtils/typemap
Scanning xs_test.h for functions...
Scanning xs_test.h for typedefs...
Use of uninitialized value in exists at
/home/akolb/bin/h2xs.fixed line 1025.
Writing ext/XS-Test/lib/XS/Test.pm
Writing ext/XS-Test/Test.xs
Writing ext/XS-Test/fallback/const-c.inc
Writing ext/XS-Test/fallback/const-xs.inc
Writing ext/XS-Test/Makefile.PL
Files "ext/XS-Test/fallback/const-c.inc" and
"ext/XS-Test/const-c.inc" differ.
It appears that the code in ext/XS-Test/Makefile.PL
does not autogenerate the files
ext/XS-Test/const-c.inc and
ext/XS-Test/const-xs.inc correctly.
Please report the circumstances of this bug in h2xs version 1.23 using the perlbug script.
Writing ext/XS-Test/README
Writing ext/XS-Test/t/XS-Test.t
Writing ext/XS-Test/Changes
Writing ext/XS-Test/MANIFEST
Something went wrong and we need to start digging around. Running
$ grep XST ext/XS-Test/* ext/XS-Test/*/* ext/XS-Test/*/*/*
we note that XST_ENUM_1 and XST_ENUM_2 are not present in the generated
code. It turns out that there is another bug in h2xs that incorrectly parses
enums. Fixing it gives
870c868
< my ($key, $declared_val) = $item =~ /(\w*)\s*=\s*(.*)/;
---
> my ($key, $declared_val) = $item =~ /(\w+)\s*[=,]*\s*(.*)/;
After we fix it, retry, and grep again, both missing constants appear together with the other constants.
Hurrah! Now our simple extension module understands all our constants. Or does it? Let's make sure that everything works:
$ cd ext/XS-Test
$ perl Makefile.PL
Checking if your kit is complete...
Looks good
Writing Makefile for XS::Test
$ make test
...
Running Mkbootstrap for XS::Test ()
chmod 644 Test.bs
rm -f blib/arch/auto/XS/Test/Test.so
LD_RUN_PATH="" cc -G Test.o xs_test.o -o blib/arch/auto/XS/Test/Test.so
chmod 755 blib/arch/auto/XS/Test/Test.so
cp Test.bs blib/arch/auto/XS/Test/Test.bs
chmod 644 blib/arch/auto/XS/Test/Test.bs
PERL_DL_NONLAZY=1 perl "-MExtUtils::Command::MM" "-e" \
"test_harness(0, 'blib/lib', 'blib/arch')" t/*.t
t/XS-Test....ok
All tests successful.
Files=1, Tests=2, 0 wallclock secs ( 0.03 cusr + 0.00 csys = 0.03 CPU)
That's promising! Now let's add simple tests to verify that we can actually access all constants from Perl. In t/XS-Test.t, let's add tests that verify that the constants have the expected values:
$ diff XS-Test.t.~1~ XS-Test.t
8c8
< use Test::More tests => 2;
---
> use Test::More tests => 7;
30a31,35
> is(XST_DEFINE, 1);
> is(XST_ENUM_1, 0);
> is(XST_ENUM_2, 1);
> is(XST_ENUM_VAL_1, 1);
> is(XST_ENUM_VAL_2, 2);
Repeating the mantra:
$ make test
PERL_DL_NONLAZY=1 perl "-MExtUtils::Command::MM" "-e" \
"test_harness(0, 'blib/lib', 'blib/arch')" t/*.t
t/XS-Test....ok 2/7
Your vendor has not defined XS::Test macro XST_ENUM_1, used at t/XS-Test.t line 32
# Looks like you planned 7 tests but only ran 3.
# Looks like your test died just after 3.
t/XS-Test....dubious
Test returned status 255 (wstat 65280, 0xff00)
DIED. FAILED tests 4-7
Failed 4/7 tests, 42.86% okay
Failed Test Stat Wstat Total Fail Failed List of Failed
----------------------------------------------------------
t/XS-Test.t 255 65280 7 8 114.29% 4-7
Failed 1/1 test scripts, 0.00% okay. 4/7 subtests failed, 42.86% okay.
*** Error code 2
make: Fatal error: Command failed for target `test_dynamic'
Not so good! To figure out what's going on, let's take a look at the const-c.inc file:
...
switch (name[9]) {
case '1':
if (memEQ(name, "XST_ENUM_", 9)) {
/* 1 */
#ifdef XST_ENUM_1
*iv_return = XST_ENUM_1;
return PERL_constant_ISIV;
#else
return PERL_constant_NOTDEF;
#endif
That explains the problem. Of course, there are no #defines for enum constants! We found another bug in h2xs. Instead of fixing it, let's take a look at the ExtUtils::Constant module. Looking at the "macro" key in the C_constant section we see:
#if defined (foo)
#if !defined (bar)
...
#endif
#endif
to be used to determine if a constant is to be defined.
A "macro" 1 signals that the constant is always defined, so the "#if"/"#endif" test is omitted.
So let's tweak Makefile.PL a bit:
$ diff Makefile.PL.~1~ Makefile.PL
22,23c22,28
< my @names = (qw( XST_DEFINE XST_ENUM_1 XST_ENUM_2
< XST_ENUM_VAL_1 XST_ENUM_VAL_2));
---
> my @names = (qw(XST_DEFINE),
> # Enums should be specified as references with macro set to 1.
> {name=>"XST_ENUM_1", macro=>"1"},
> {name=>"XST_ENUM_2", macro=>"1"},
> {name=>"XST_ENUM_VAL_1",macro=>"1"},
> {name=>"XST_ENUM_VAL_2",macro=>"1"});
And retry our test:
$ perl Makefile.PL
Writing Makefile for XS::Test
$ make
...
Running Mkbootstrap for XS::Test ()
chmod 644 Test.bs
rm -f blib/arch/auto/XS/Test/Test.so
LD_RUN_PATH="" cc -G Test.o xs_test.o -o blib/arch/auto/XS/Test/Test.so
chmod 755 blib/arch/auto/XS/Test/Test.so
cp Test.bs blib/arch/auto/XS/Test/Test.bs
chmod 644 blib/arch/auto/XS/Test/Test.bs
Manifying blib/man3/XS::Test.3
$ make test
PERL_DL_NONLAZY=1 perl "-MExtUtils::Command::MM" \
"-e" "test_harness(0, 'blib/lib', 'blib/arch')" t/*.t
t/XS-Test....ok
All tests successful.
Files=1, Tests=7, 0 wallclock secs ( 0.03 cusr + 0.00 csys = 0.03 CPU)
Finally it works correctly, and we have successfully created an extension that exports constants! Phew... It seems that h2xs should be smart enough to handle enums as advertised, but we found a workaround.
There is an alternative approach for our module that doesn't require autoloaded
functions and auto-generated constants and instead stashes constants directly in
the module name space using the XS BOOT keyword.
Our Test.xs file is very simple at this point:
#include "EXTERN.h"
#include "perl.h"
#include "XSUB.h"
#include <xs_test.h>
#include "const-c.inc"
MODULE = XS::Test PACKAGE = XS::Test
INCLUDE: const-xs.inc
Here is the trick:
$ diff Test.xs.~1~ Test.xs
7,8d6
< #include "const-c.inc"
<
11c9,27
< INCLUDE: const-xs.inc
---
> PROTOTYPES: ENABLE
>
> #
> # Define any constants that need to be exported. By doing it this way we can
> # avoid the overhead of using the DynaLoader package, and in addition constants
> # defined using this mechanism are eligible for inlining by the perl
> # interpreter at compile time.
> #
> BOOT:
> {
> HV *stash;
>
> stash = gv_stashpv("XS::Test", TRUE);
> newCONSTSUB(stash, "XST_DEFINE", newSViv(XST_DEFINE));
> newCONSTSUB(stash, "XST_ENUM_1", newSViv(XST_ENUM_1));
> newCONSTSUB(stash, "XST_ENUM_2", newSViv(XST_ENUM_2));
> newCONSTSUB(stash, "XST_ENUM_VAL_1", newSViv(XST_ENUM_VAL_1));
> newCONSTSUB(stash, "XST_ENUM_VAL_2", newSViv(XST_ENUM_VAL_2));
> }
This allows us to simplify Makefile.PL as well by removing all uses of ExtUtils::Constant:
$ diff Makefile.PL.~1~ Makefile.PL
17,30d16
< if (eval {require ExtUtils::Constant; 1}) {
< my @names = (qw( XST_DEFINE XST_ENUM_1 XST_ENUM_2
< XST_ENUM_VAL_1 XST_ENUM_VAL_2));
< ExtUtils::Constant::WriteConstants(
< NAME => 'XS::Test',
< NAMES => \@names,
< DEFAULT_TYPE => 'IV',
< C_FILE => 'const-c.inc',
< XS_FILE => 'const-xs.inc',
< );
32,40d17
< }
< else {
< use File::Copy;
< use File::Spec;
< foreach my $file ('const-c.inc', 'const-xs.inc') {
< my $fallback = File::Spec->catfile('fallback', $file);
< copy ($fallback, $file) or die "Can't copy $fallback to $file: $!";
< }
< }
Also, there is no need for the autoload code in the Test.pm file:
$ diff lib/XS/Test.pm.~1~ lib/XS/Test.pm
9d8
< use AutoLoader;
42,64d40
< sub AUTOLOAD {
< # This AUTOLOAD is used to 'autoload' constants from the constant()
< # XS function.
<
< my $constname;
< our $AUTOLOAD;
< ($constname = $AUTOLOAD) =~ s/.*:://;
< croak "&XS::Test::constant not defined" if $constname eq 'constant';
< my ($error, $val) = constant($constname);
< if ($error) { croak $error; }
< {
< no strict 'refs';
< # Fixed between 5.005_53 and 5.005_61
< #XXX if ($] >= 5.00561) {
< #XXX *$AUTOLOAD = sub () { $val };
< #XXX }
< #XXX else {
< *$AUTOLOAD = sub { $val };
< #XXX }
< }
< goto &$AUTOLOAD;
< }
<
68,71d43
< # Preloaded methods go here.
<
< # Autoload methods go after =cut, and are processed by the autosplit program.
<
Let's verify that everything works:
$ make test
Running Mkbootstrap for XS::Test ()
chmod 644 Test.bs
rm -f blib/arch/auto/XS/Test/Test.so
LD_RUN_PATH="" cc -G Test.o xs_test.o -o blib/arch/auto/XS/Test/Test.so
chmod 755 blib/arch/auto/XS/Test/Test.so
cp Test.bs blib/arch/auto/XS/Test/Test.bs
chmod 644 blib/arch/auto/XS/Test/Test.bs
PERL_DL_NONLAZY=1 perl "-MExtUtils::Command::MM" \
"-e" "test_harness(0, 'blib/lib', 'blib/arch')" t/*.t
t/XS-Test....ok
All tests successful.
Files=1, Tests=7, 0 wallclock secs ( 0.03 cusr + 0.00 csys = 0.03 CPU)
It works! The final Test.pm looks pretty simple:
package XS::Test;
use strict;
use warnings;
require Exporter;
our @ISA = qw(Exporter);
our %EXPORT_TAGS = ( 'all' => [ qw(
XST_DEFINE
XST_ENUM_1
XST_ENUM_2
XST_ENUM_VAL_1
XST_ENUM_VAL_2
) ] );
our @EXPORT_OK = ( @{ $EXPORT_TAGS{'all'} } );
our @EXPORT = qw(
XST_DEFINE
XST_ENUM_1
XST_ENUM_2
XST_ENUM_VAL_1
XST_ENUM_VAL_2
);
our $VERSION = '0.01';
require XSLoader;
XSLoader::load('XS::Test', $VERSION);
1;
__END__
And the final Test.xs file:
#include "EXTERN.h"
#include "perl.h"
#include "XSUB.h"
#include <xs_test.h>
MODULE = XS::Test PACKAGE = XS::Test
PROTOTYPES: ENABLE
BOOT:
{
HV *stash;
stash = gv_stashpv("XS::Test", TRUE);
newCONSTSUB(stash, "XST_DEFINE", newSViv(XST_DEFINE));
newCONSTSUB(stash, "XST_ENUM_1", newSViv(XST_ENUM_1));
newCONSTSUB(stash, "XST_ENUM_2", newSViv(XST_ENUM_2));
newCONSTSUB(stash, "XST_ENUM_VAL_1", newSViv(XST_ENUM_VAL_1));
newCONSTSUB(stash, "XST_ENUM_VAL_2", newSViv(XST_ENUM_VAL_2));
}
This concludes our exercise. Now you should be able to write simple extensions
that just export constants from header files and avoid dangerous pitfalls!
Technorati Tag: Perl
( Mar 16 2005, 05:42:21 PM PST )
Grand Canyon Recycles
Entering the Grand Canyon National Park we can see an interesting sign: "Grand Canyon Recycles". Indeed, I saw some serious birds looking for tourists who didn't stock enough water...
( Oct 18 2004, 06:10:13 PM PDT )