Thursday October 11, 2007 Just integrated the CPU Caps project into S10U5. Until now it has been available only to users of OpenSolaris and SXDE; pretty soon it will also be available to regular S10 users. It is really exciting to see its applications in real life! Finally, S10 users will be able to simply say in zonecfg:
zonecfg:myzone> add capped-cpu
zonecfg:myzone>capped-cpu> set ncpus=3
zonecfg:myzone>capped-cpu> end
( Oct 11 2007, 09:03:54 PM PDT )
OpenSolaris - 1 year of opening Sun development
It seems just yesterday that Solaris went IPO. Yet, now it is celebrating its first anniversary. A good time to reflect on the effect of OpenSolaris on the internal Solaris developer community. What has really changed since Jun 14 2005?
I think the most important direct effect is that users and customers now ask direct questions about how something works, and developers can explain to them exactly what is going on. On more than one occasion I exchanged e-mails with customers, explaining the intricacies of the STREAMS framework implementation, pointing to specific snippets of code and getting questions about that code. This has a downside as well: the easy availability of code may create lots of implicit dependencies on implementation details. Hopefully developers will continue sticking to the published APIs.
During the first year the OpenSolaris site became a huge repository of technical documentation for both existing and future projects. One day we discovered that someone had decided that our server hosting all the internal documentation for the NUMA project was a test machine and completely wiped out all the content. We decided to just move all the information to the OpenSolaris site - together with prototype code and binaries.
Another major change for the internal developers is that they are not quite "internal" any more! We now routinely publish proposals, code reviews, prototypes, PSARC cases and other development by-products and are anxiously awaiting useful feedback. It seems that some areas (e.g. choosing the favorite shell as the Solaris default) are getting much more attention than the boring issues of scheduling and memory optimization, so I hope that next year these will catch up. All of us would like to see community involvement penetrate deeper into the guts of Solaris internals.
I have seen several cases where projects didn't want to go open initially and tried to follow the traditional path. Peer pressure inevitably pushed them out - for the better. I myself usually initiated internal code reviews before opening a public discussion - all of us want to avoid embarrassment :-). I think that by now we are doing much more open development. It is becoming the norm and part of the process. And part of the fun!
( Jun 15 2006, 07:13:29 PM PDT )
It is common for a kernel programmer to postpone processing of some tasks and delegate their execution to another kernel thread. There may be several reasons for doing this:
In all these cases the programmer, in essence, needs to execute a piece of code (a task) in a different context, where context usually means another kernel thread with a different set of locks held and, possibly, a different priority.
Until the introduction of task queues in Solaris 8 there was no generic OS facility for such an in-kernel context change. Every subsystem used its own ad-hoc mechanisms, usually utilizing ``worker threads'' together with a list of jobs to give them. The task queues interface abstracts the common code out of these mechanisms and provides a simple way of scheduling asynchronous tasks.
A task queue consists of a list of tasks, together with one or more threads to service the list. If a task queue has a single service thread, all tasks are guaranteed to execute in the order they were dispatched. Otherwise they can be executed in any order. Note that since tasks are placed on a list, the execution of one task should not depend on the execution of another task, or a deadlock may occur.
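To make the "list of tasks plus a servicing thread" idea concrete, here is a minimal userland sketch built on POSIX threads. It is not the Solaris implementation and does not use the kernel taskq interfaces: every name in it (tq_create, tq_dispatch, tq_destroy, demo_record) is hypothetical and chosen only for illustration. It shows a queue with a single servicing thread, which is why tasks run in dispatch order.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

/* All names below are hypothetical; this is not the Solaris taskq API. */
typedef void (*task_func_t)(void *);

typedef struct task {
    task_func_t func;
    void *arg;
    struct task *next;
} task_t;

typedef struct tq {
    pthread_mutex_t lock;   /* protects the task list */
    pthread_cond_t cv;      /* signalled when work arrives or on shutdown */
    task_t *head, *tail;    /* FIFO list of pending tasks */
    int closing;
    pthread_t worker;       /* the single servicing thread */
} tq_t;

static void *tq_worker(void *p)
{
    tq_t *tq = p;

    pthread_mutex_lock(&tq->lock);
    for (;;) {
        while (tq->head == NULL && !tq->closing)
            pthread_cond_wait(&tq->cv, &tq->lock);
        if (tq->head == NULL)
            break;                  /* closing and fully drained */
        task_t *t = tq->head;
        tq->head = t->next;
        if (tq->head == NULL)
            tq->tail = NULL;
        pthread_mutex_unlock(&tq->lock);
        t->func(t->arg);            /* run the task with the lock dropped */
        free(t);
        pthread_mutex_lock(&tq->lock);
    }
    pthread_mutex_unlock(&tq->lock);
    return NULL;
}

tq_t *tq_create(void)
{
    tq_t *tq = calloc(1, sizeof (*tq));

    if (tq == NULL)
        return NULL;
    pthread_mutex_init(&tq->lock, NULL);
    pthread_cond_init(&tq->cv, NULL);
    if (pthread_create(&tq->worker, NULL, tq_worker, tq) != 0) {
        free(tq);
        return NULL;
    }
    return tq;
}

/* Append a task to the tail of the list; returns 0 on success, -1 on failure. */
int tq_dispatch(tq_t *tq, task_func_t func, void *arg)
{
    task_t *t = malloc(sizeof (*t));

    if (t == NULL)
        return -1;
    t->func = func;
    t->arg = arg;
    t->next = NULL;
    pthread_mutex_lock(&tq->lock);
    if (tq->tail != NULL)
        tq->tail->next = t;
    else
        tq->head = t;
    tq->tail = t;
    pthread_cond_signal(&tq->cv);
    pthread_mutex_unlock(&tq->lock);
    return 0;
}

/* Let the worker drain everything already queued, then tear the queue down. */
void tq_destroy(tq_t *tq)
{
    pthread_mutex_lock(&tq->lock);
    tq->closing = 1;
    pthread_cond_signal(&tq->cv);
    pthread_mutex_unlock(&tq->lock);
    pthread_join(tq->worker, NULL);
    pthread_cond_destroy(&tq->cv);
    pthread_mutex_destroy(&tq->lock);
    free(tq);
}

/* Demo task: records its integer argument so execution order can be checked. */
int demo_log[16];
int demo_log_len;

void demo_record(void *arg)
{
    assert(demo_log_len < 16);
    demo_log[demo_log_len++] = (int)(intptr_t)arg;
}
```

A caller would create the queue, dispatch demo_record a few times with different arguments, and call tq_destroy(); afterwards demo_log holds the arguments in dispatch order. Starting several worker threads on the same list would give up that ordering, which matches the behaviour described above.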
Kernel users should use the documented DDI interface for all taskq operations. These interfaces are defined in the usr/src/uts/common/sys/sunddi.h header file. The exported interface consists of the following functions:
Every taskq created in the system keeps a set of kstat counters associated with it. Try running the following command on your system:
$ kstat -c taskq
module: unix instance: 0
name: ata_nexus_enum_tq class: taskq
crtime 53.877907833
executed 0
maxtasks 0
nactive 1
nalloc 0
priority 60
snaptime 258059.249256749
tasks 0
threads 1
totaltime 0
module: unix instance: 0
name: callout_taskq class: taskq
crtime 0
executed 13956358
maxtasks 4
nactive 4
nalloc 0
priority 99
snaptime 258059.24981709
tasks 13956358
threads 2
totaltime 120247890619
...
The kstat information above includes:
You can use the power of the kstat command to observe how some counter increases over time:
$ kstat -p unix:0:callout_taskq:tasks 1 5
unix:0:callout_taskq:tasks 13994642
unix:0:callout_taskq:tasks 13994711
unix:0:callout_taskq:tasks 13994784
unix:0:callout_taskq:tasks 13994855
unix:0:callout_taskq:tasks 13994926
...
The taskq implementation also provides several useful SDT probes: All the probes described below have two arguments: the taskq pointer and a pointer to the taskq_ent_t structure. The latter can be used to extract the function and its argument in a D script.
Developers can use these probes to collect precise timing information about individual task queues and individual tasks being executed through them. For example, the following script prints, every 10 seconds, which functions were scheduled via task queues:
#!/usr/sbin/dtrace -qs

sdt:genunix::taskq-enqueue
{
    this->tq = (taskq_t *)arg0;
    this->tqe = (taskq_ent_t *)arg1;
    @[this->tq->tq_name,
        this->tq->tq_instance,
        this->tqe->tqent_func] = count();
}

tick-10s
{
    printa("%s(%d): %a called %@d times\n", @);
    trunc(@);
}
Running this on my desktop produced the following output1:
callout_taskq(1): genunix`callout_execute called 51 times
callout_taskq(0): genunix`callout_execute called 701 times
kmem_taskq(0): genunix`kmem_update_timeout called 1 times
kmem_taskq(0): genunix`kmem_hash_rescale called 4 times
callout_taskq(1): genunix`callout_execute called 40 times
USB_hid_81_pipehndl_tq_1(14): usba`hcdi_cb_thread called 256 times
callout_taskq(0): genunix`callout_execute called 702 times
kmem_taskq(0): genunix`kmem_update_timeout called 1 times
kmem_taskq(0): genunix`kmem_hash_rescale called 4 times
callout_taskq(1): genunix`callout_execute called 28 times
USB_hid_81_pipehndl_tq_1(14): usba`hcdi_cb_thread called 228 times
callout_taskq(0): genunix`callout_execute called 706 times
callout_taskq(1): genunix`callout_execute called 24 times
USB_hid_81_pipehndl_tq_1(14): usba`hcdi_cb_thread called 141 times
callout_taskq(0): genunix`callout_execute called 708 times
Suppose that two friends, Bob and Alice, are standing in the cafeteria line with Alice standing behind Bob. The cashier checks Bob's tray and it turns out that Bob doesn't have enough money, so he wants to borrow from Alice. But Alice is not sure whether she has enough cash until she knows the cost of her lunch. This is a typical deadlock situation - neither Bob nor Alice can make any forward progress, since each is waiting for the other. The same kind of deadlock may occur if two tasks A and B are placed on a queue which is served by a single thread when there is a resource dependency between A and B. One way to prevent such a deadlock is to guarantee that A and B are processed by two different threads, so that when A stalls for B, the thread processing A will block until B makes enough progress and can provide the needed resource to A.
Dynamic task queues provide exactly such a deadlock-free way of scheduling potentially dependent tasks on the same queue. They guarantee that every task is processed by a separate thread. Since the number of tasks that can be scheduled at the same time is not known in advance, dynamic task queues maintain a dynamic thread pool that grows when the workload increases and shrinks when the workload dries up.
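The Bob and Alice situation can be sketched in userland C with POSIX threads. This is only an illustration of the principle, not how dynamic task queues are implemented: it simply starts one thread per task instead of maintaining a thread pool, and all the names (resource_t, task_a, task_b, run_dependent_tasks) are hypothetical. Task A is started first and blocks until task B supplies the resource; because B has its own thread, A eventually proceeds. Running the same two functions back to back on one thread, A first, would never return.

```c
#include <pthread.h>

/* Hypothetical shared state: the "money" that task A needs and task B supplies. */
typedef struct resource {
    pthread_mutex_t lock;
    pthread_cond_t cv;
    int available;      /* set by task B */
    int consumed;       /* set by task A once it has the resource */
} resource_t;

/* Task A: cannot finish until task B has made the resource available. */
static void *task_a(void *p)
{
    resource_t *r = p;

    pthread_mutex_lock(&r->lock);
    while (!r->available)
        pthread_cond_wait(&r->cv, &r->lock);
    r->consumed = 1;
    pthread_mutex_unlock(&r->lock);
    return NULL;
}

/* Task B: provides the resource that task A is waiting for. */
static void *task_b(void *p)
{
    resource_t *r = p;

    pthread_mutex_lock(&r->lock);
    r->available = 1;
    pthread_cond_broadcast(&r->cv);
    pthread_mutex_unlock(&r->lock);
    return NULL;
}

/*
 * Start A before B, the order that would hang a single servicing thread.
 * Returns 1 once A has obtained its resource, or -1 if the thread for A
 * could not be created.
 */
int run_dependent_tasks(void)
{
    resource_t r = { .available = 0, .consumed = 0 };
    pthread_t ta, tb;

    pthread_mutex_init(&r.lock, NULL);
    pthread_cond_init(&r.cv, NULL);
    if (pthread_create(&ta, NULL, task_a, &r) != 0) {
        pthread_cond_destroy(&r.cv);
        pthread_mutex_destroy(&r.lock);
        return -1;
    }
    if (pthread_create(&tb, NULL, task_b, &r) != 0)
        task_b(&r);     /* fall back to supplying the resource inline */
    else
        pthread_join(tb, NULL);
    pthread_join(ta, NULL);
    pthread_cond_destroy(&r.cv);
    pthread_mutex_destroy(&r.lock);
    return r.consumed;
}
```

The price of this safety is one thread per outstanding task, which is why a real facility needs a pool that grows and shrinks with the workload, as described above.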
Dynamic task queues can not (yet) be used via the DDI interfaces. Some kernel
subsystems use the internal taskq calls directly to create and use
dynamic task queues. The system also maintains one shared dynamic task queue
called system_taskq. It can be used by specifying
system_taskq as the taskq argument to the
taskq_dispatch() function. It is really a good idea to also add
"TQ_NOSLEEP | TQ_NOQUEUE" to the flags when using
system_taskq.
Each taskq is implemented as a list of tasks protected by a per-taskq lock. One or more worker threads take tasks one by one and execute them by calling f(a) and then sleep, waiting for new entries. A taskq created with a single servicing thread has an important property: it guarantees that all its tasks are executed in the order they are scheduled. When a task queue is created with several servicing threads, task execution order is not predictable.
If you want to look at the actual implementation you need to look at the following files:
The first taskq implementation was done by Jeff Bonwick for Solaris 8. It was successfully used to replace many calls to the low-level thread_create() function. I added Dynamic Task Queues in Solaris 9 and used them to completely re-implement the STREAMS scheduler. In Solaris 10 I added DDI interfaces for task queues and also added kstat counters and DTrace probes.
1 For curious minds: the callout_taskq is used to handle system timers. As an exercise in your DTrace skills, try to figure out what actual timers are firing on each CPU. Hint - use the callout-start SDT probe, which has a pointer to the callout_t structure as its sole argument.
Now that OpenSolaris is finally a reality, engineers can really start talking about the interesting stuff. We spend most of our time dealing with code, and it is quite difficult to talk about what you do without being able to provide examples. Now is the time for real technical blogging instead of hand-waving.
I spent quite a lot of my time at Sun hacking STREAMS internals - things that almost no one ever notices, except for some unlucky folks who run into some nasty issues. Here I want to continue the sewer tour, started by Bryan Cantrill, and take interested visitors to some of the STREAMS sewers.
Our tour will start with a few lines of code in the putnext() function. In the original implementation it was just a simple macro calling the put procedure of the next module in the STREAM:
#define putnext(q, mp) ((*(q)->q_next->q_qinfo->qi_putp)((q)->q_next, (mp)))
In Solaris it evolved into rather complicated code1. Its inner workings deserve quite a few separate blog entries, but for now we will take a side tour to a small piece of code that may catch your attention:
/*
 * If there are writers or exclusive waiters, there is not much
 * we can do. Place the message on the syncq and schedule a
 * background thread to drain it.
 *
 * Also if we are approaching end of stack, fill the syncq and
 * switch processing to a background thread - see comments on
 * top.
 */
if ((flags & (SQ_STAYAWAY|SQ_EXCL|SQ_EVENTS)) ||
    (sq->sq_needexcl != 0) || PUT_STACK_NOTENOUGH()) {
    /*
     * NOTE: qfill_syncq will need QLOCK. It is safe to drop
     * SQLOCK because positive sq_count keeps the syncq from
     * closing.
     */
    mutex_exit(SQLOCK(sq));
    qfill_syncq(sq, qp, mp);
    /*
     * NOTE: after the call to qfill_syncq() qp may be
     * closed, both qp and sq should not be referenced at
     * this point.
     *
     * This ASSERT is located here to prevent stack frame
     * consumption in the DEBUG code.2
     */
    ASSERT(sqciplock == NULL);
    return;
}
The code uses the notion of a syncq, which is a synchronization abstraction used by STREAMS. Whenever some module calls the putnext() function, the code checks for various conditions and, if everything seems all right, it just calls the put procedure of the next module in the STREAM. If something doesn't seem right (e.g. the target module is busy processing other messages), the message is placed on a special queue in the syncq and the framework arranges for a different kernel thread to pass the enqueued message to the next module when the module is ready3. The important observation is that this kernel thread has its own fresh stack. The PUT_STACK_NOTENOUGH() check in the code above hijacks this normal STREAMS mechanism to protect against a rather nasty problem - kernel stack overflow.
Note that STREAMS is a very flexible framework that allows many modules to be linked together in a chain (or even something more complex like a tree or U-pipe). Every new Solaris release includes some new STREAMS modules and drivers that often play in concert to provide some exciting new functionality. The downside is that the chain-calling of the modules' put procedures creates really long call chains on the stack and sometimes leads to kernel panics from stack overflow.
Soon after the Solaris 9 release I saw quite a few similar panics, happening because the entire kernel thread stack was consumed by a pile of STREAMS modules. Here is a typical example:
402760b0 allocb+0x180()
40276110 ip_wput_frag_copyhdr+0x14()
40276170 ip_wput_frag+0x124()
40276290 ip_wput_ire+0x1a98()
402763e8 ire_send+0x218()
40276448 ire_add_then_send+0x304()
402764b0 ip_newroute+0xb78()
40276600 ire_send+0x198()
40276660 ire_add_then_send+0x304()
402766c8 ip_wput_nondata+0x874()
40276748 putnext+0x400()
402767a8 ar_query_reply+0x150()
40276808 ar_entry_query+0x154()
40276878 ar_rput+0x144()
402768e8 putnext+0x400()
40276948 ip_newroute+0x1760()
40276a98 putnext+0x400()
40276af8 udp_wput+0x63c()
40276b80 putnext+0x400()
40276be0 putnext+0x400()
40276c40 strput+0x57c()
40276d68 kstrputmsg+0x3bc()
40276de8 tli_send+0x20()
40276e48 t_ksndudata+0x284()
40276ea8 clnt_clts_kcallit_addr+0x3dc()
40276f90 clnt_clts_kcallit+0x38()
40277000 rfscall+0x46c()
402770b8 rfs3call+0x68()
40277130 nfs3write+0x11c()
402772b0 nfs3_bio+0x334()
40277310 nfs3_rdwrlbn+0xf0()
40277380 nfs3_sync_putapage+0x34()
402773e0 nfs3_putapage+0x364()
40277470 pvn_vplist_dirty+0x424()
40277568 nfs_putpages+0x174()
402775e0 nfs3_putpage+0xa8()
40277640 nfs_purge_caches+0x98()
402776a0 nfs_cache_check+0xf8()
40277700 nfs3_getattr_otw+0x12c()
40277840 nfs3_validate_caches+0x11c()
40277908 nfs3_getpage+0x44()
40277990 segvn_fault+0xa74()
40277a90 as_fault+0x4c4()
40277b40 pagefault+0x40()
40277ba8 trap+0xd90()
40277c80 utl0+0x4c()
Although each module uses just a small amount of stack, the numbers quickly add up. It is like going to the grocery store to buy a whole bunch of small items. Each one is really cheap, but you end up paying a lot at the counter.
At the time I was fixing some bugs in the syncq implementation and spent a lot of time fooling around with putnext() and its friends, so I spotted the possibility of using the existing ability of putnext() to delegate work to a new kernel thread to avoid such panics. The idea was really simple: in addition to the usual work handout in case the perimeter is busy, perform the same handout when we are too close to blowing the stack. So I added the PUT_STACK_NOTENOUGH() check shown above together with its definition:
#define PUT_STACK_NEEDED 5000
#define PUT_STACK_NOTENOUGH() \
(((STACK_BIAS + (uintptr_t)getfp() - \
(uintptr_t)curthread->t_stkbase) < put_stack_needed))
The value of PUT_STACK_NEEDED was chosen experimentally. (The macro itself tests put_stack_needed, a tunable variable initialized from PUT_STACK_NEEDED, so that the value can be adjusted in the field.) It is impossible to predict how much stack will be used in the future. For example, a simple call to allocb() may, in some unlucky circumstances, trigger a long chain of calls through the kmem and vmem memory allocation layers. So the value of PUT_STACK_NEEDED was chosen to prevent common panics that we saw during the PIT runs 4.
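The mechanics of the check can be tried out in userland. The sketch below is only an analogy under stated assumptions, not the kernel code: it assumes the stack grows toward lower addresses, it measures stack usage by comparing addresses of local variables instead of calling getfp(), it pretends the stack is DEMO_STACK_SIZE bytes, and a simple loop stands in for the background thread that drains the syncq. All demo_* names and both constants are hypothetical.

```c
#include <stdint.h>

#define DEMO_STACK_SIZE   65536  /* pretend per-thread stack size (hypothetical) */
#define DEMO_STACK_NEEDED 16384  /* red zone, playing the role of PUT_STACK_NEEDED */

static uintptr_t demo_stack_top;   /* address of a local near the start of the chain */
static int demo_pending = -1;      /* parked work, -1 when nothing is parked */
int demo_stack_notenough_count;    /* how many times the call chain was broken */

/* True when fewer than DEMO_STACK_NEEDED bytes of the pretend stack remain. */
static int demo_stack_notenough(void)
{
    volatile char marker;
    intptr_t used = (intptr_t)(demo_stack_top - (uintptr_t)&marker);

    if (used < 0)
        used = 0;   /* marker sits above the reference point: nothing used yet */
    return used > DEMO_STACK_SIZE - DEMO_STACK_NEEDED;
}

/* Stand-in for putnext(): each call passes the message to the next module. */
static void demo_put(int modules_left)
{
    volatile char locals[256];      /* stand-in for a module's stack usage */

    locals[0] = (char)modules_left;
    if (modules_left == 0)
        return;
    if (demo_stack_notenough()) {
        demo_pending = modules_left;    /* park the message instead of recursing */
        demo_stack_notenough_count++;
        return;
    }
    demo_put(modules_left - 1);
    locals[255] = locals[0];        /* work after the call prevents a tail call */
}

/* Push a message through `modules` modules; returns how often the chain was broken. */
int demo_run(int modules)
{
    volatile char base;

    demo_stack_top = (uintptr_t)&base;
    demo_stack_notenough_count = 0;
    demo_pending = modules;
    while (demo_pending >= 0) {     /* plays the role of the background thread */
        int n = demo_pending;

        demo_pending = -1;
        demo_put(n);                /* resumes from a shallow stack depth */
    }
    return demo_stack_notenough_count;
}
```

A short chain such as demo_run(10) never gets near the red zone and returns 0, while demo_run(1000) needs far more than the pretend stack allows and is broken up several times; the exact count depends on the compiler and its frame layout. As in the kernel, the threshold has to leave room for whatever the deepest callee might still need, which is why it can only be picked empirically.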
I ran the fix past the real kernel experts - Bryan Cantrill, who had fixed a tricky kernel stack overflow problem before5, and Mike Shapiro. Mike suggested making PUT_STACK_NOTENOUGH a generic kernel function for others to use, while Bryan objected. We had a heated meeting in Mike's office and, as a result, the following comment appeared at the top of the putnext.c file:
/*
 * Streams with many modules may create long chains of calls via putnext() which
 * may exhaust stack space. When putnext detects that the stack space left is
 * too small (less then PUT_STACK_NEEDED), the call chain is broken and
 * further processing is delegated to the background thread via call to
 * putnext_tail(). Unfortunately there is no generic solution with fixed stack
 * size, and putnext() is recursive function, so this hack is a necessary evil.
 *
 * The redzone value is chosen dependent on the default stack size which is 8K
 * on 32-bit kernels and on x86 and 16K on 64-bit kernels. The values are chosen
 * empirically. For 64-bit kernels it is 5000 and for 32-bit kernels it is 2500.
 * Experiments showed that 2500 is not enough for 64-bit kernels and 2048 is not
 * enough for 32-bit.
 *
 * The redzone value is a tuneable rather then a constant to allow adjustments
 * in the field.
 *
 * The check in PUT_STACK_NOTENOUGH is taken from segkp_map_red() function. It
 * is possible to define it as a generic function exported by seg_kp, but
 *
 * a) It may sound like an open invitation to use the facility indiscriminately.
 * b) It adds extra function call in putnext path.
 *
 * We keep a global counter `put_stack_notenough' which keeps track how many
 * times the stack switching hack was used.
 */
The hack was integrated6 early in Solaris 10 and backported to S9 updates. Later in the S10 timeframe another engineer did the real fix for the stack overflow problem - increased the stack size on 64-bit kernels (see bug 4922366). It is worth mentioning a comment made by Jeff Bonwick in the evaluation of this bug:
Yep. Grow the kernel stack. Memory is cheap. Panics are expensive.
I've been through several kernel stack crises before. They always unfold the same way. Some particular workload goes too deep. We prune a few stack frames to fix the offending code path. Then another one comes up. And another. (Right about now someone suggests that instead of growing the stack for every thread, we should finally bite the bullet and make the kernel stack growable. I dig up my mail archive from the last time we contemplated this, and explain why it's much harder than it sounds.) Eventually the panic rate becomes so high that we have to act. Nobody can figure out how to make dynamic stack growth work reliably, so after one more jurassic outage we accept physics and grow the stack.
So my proposal is that this time, we dispense with all the hand-wringing and just grow the damn thing.
As often happens, the stack protection hack uncovered another long-standing and interesting bug7 in the Solaris qfill_syncq() function, but this is a subject for another blog entry...
1 The added complexity comes from the multi-threaded nature of the kernel. While one thread tries to access q_next, another thread may change it at the same time, which leads to chaos. Solaris STREAMS provides a rich set of synchronization mechanisms, called STREAMS perimeters, which simplify the life of module and driver writers at the cost of internal complexity of the implementation.
2 Modern compilers are very smart and can optimize a function call immediately followed by a return statement so that the callee reuses the stack frame of the caller. This is called tail-call optimization. It saves stack space, but obscures debugging and sometimes even DTrace. The problem is that the function call back-trace you see in the stack trace does not accurately represent the actual calling sequence in the presence of tail calls. To simplify debugging, the seemingly useless ASSERT prevents such tail-call optimization on DEBUG kernels, while keeping all the performance benefits in production builds.
3 The STREAMS framework is very careful to pass down all messages in the order received.
4 It turned out that the stack barrier value of 5000 was too aggressive for 8K stacks on 32-bit sparc systems, so it was adjusted to 2500 on 32-bit platforms.
5 Bryan fixed bug 1259818 back in Solaris 2.6.
6 This was fixed as bug 4525533.
7 Don't spend your time trying to find the bug in qfill_syncq() function - it was fixed before S10 was released.
Open Solaris is Good for Linux!
There is quite a lot of discussion of the various reasons and motivations for Sun to open source its crown jewel - the Solaris operating system. Quite a few people, both internally at Sun and externally, believe that the move will help Sun as a company, but here I'd like to explore why opening Solaris is good for everyone else - for Linux, for FreeBSD and for the computer community in general.
In my opinion, the real value of OpenSolaris is the opening of a vast amount of knowledge about the design of very complex computer systems. To really appreciate the value of this knowledge it helps to think about the way humans learn and understand the meaning of things.
The following quote is from the paper by Marvin Minsky:
Castles In The Air.
The secret of what something means lies in the ways that it connects to all the other things we know. The more such links, the more a thing will mean to us. The joke comes when someone looks for the "real" meaning of anything. For, if something had just one meaning, that is, if it were only connected to just one other thing, then it would scarcely "mean" at all!
That's why I think we shouldn't program our machines that way, with clear and simple logic definitions. A machine programmed that way might never "really" understand anything -- any more than a person would. Rich, multiply-connected networks provide enough different ways to use knowledge that when one way doesn't work, you can try to figure out why. When there are many meanings in a network, you can turn things around in your mind and look at them from different perspectives; when you get stuck, you can try another view. That's what we mean by thinking!
That's why I dislike logic, and prefer to work with webs of circular definitions. Each gives meaning to the rest. There's nothing wrong with liking several different tunes, each one the more because it contrasts with the others. There's nothing wrong with ropes - or knots, or woven cloth - in which each strand helps hold the other strands together - or apart! There's nothing very wrong, in this strange sense, with having all one's mind a castle in the air!
To summarize: of course no computer could understand anything real -- or even what a number is - if forced to single ways to deal with them. But neither could a child or philosopher. So such concerns are not about computers at all, but about our foolish quest for meanings that stand by themselves, outside any context. Our questions about thinking machines should really be questions about our own minds.
In Minsky's terminology, Solaris source is an extremely rich body of interwoven knowledge about the design of state-of-the-art computer systems. This body of knowledge was produced (and packed in the form of C code) in the course of many years of Solaris development by many extremely competent engineers. For a long time this knowledge was available only to a small community of engineers, and soon it will be available to everyone curious enough to tap it.
I am not suggesting that the knowledge embedded in other operating systems' source code is any "better" or "worse" than the knowledge embedded in the Solaris code. It was created by different people with different backgrounds, different objectives, different environments and different customer bases. It is just different. And, together, all of these provide an even richer web of knowledge that is much more useful than each individual part, because they represent quite different dimensions.
So why is opening up a bunch of source code really important to Linux (or FreeBSD, or any other software project)? Because someone who takes the time and effort to read and understand even small parts of this embedded knowledge will almost certainly get new insights into whatever projects he or she is currently working on or thinking about. Even if the developer does not reuse a single line of the code, he or she will definitely gain a better understanding of his or her own area of expertise. Not to mention the trivial fact that the CDDL license allows developers to directly build their software based on the Solaris source. Consider, for example, the "open-sourcing" of a small part of Solaris design - the Slab Allocator, made by Jeff Bonwick in the form of the USENIX Paper and the followup paper. Was it useful to Linux and other software projects? Even if we ignore the fact that the slab allocator based on these papers is now the standard Linux kernel memory allocator, I am pretty sure that just reading these two papers was a very useful journey for a reader. And be assured that the person who invented the Slab Allocator has more to say - and, indeed, says a lot - in C.
Another, more recent, example is DTrace, which is already available. It may not be immediately obvious why DTrace is good for Linux, but consider how much more effort is now being put into creating an adequate Linux tracing facility that could compete with DTrace! Even if not a single line of DTrace source finds its way into a Linux distribution, it will definitely serve as a "proof of existence" and an inspiration. And as the Linux tracing facility improves under the influence of DTrace, DTrace itself will improve to stay a relevant tool.
As a result of such cross-influence of ideas, the whole body of available software improves in quality and coverage, and everyone wins!
And for this reason it makes sense for many computer engineers, students and just curious minds around to set aside some time to read and understand some parts of the OpenSolaris source code. And those who think that something embeds the whole meaning should, probably, reread the Minsky paper.
( May 21 2005, 01:09:09 AM PDT )