Tuesday, June 14, 2005

Today is The Day. We are releasing the source code for OpenSolaris. Welcome! It's incredible that anyone in the world can now look at what we've been working on for many years and, more importantly, choose to participate in it. Back when I was a CS student taking an OS Internals class and studying Solaris by reading the wonderful "Operating System Concepts" (aka "the dinosaur book") by Abraham Silberschatz and "UNIX Internals: The New Frontiers" by Uresh Vahalia, I couldn't have imagined what a difference it would make to be able to just open a browser and see the actual source code. That would have been incredibly useful and educational. Unfortunately, our teacher at that time could only get us a copy of "Lions' Commentary on UNIX", the legendary underground classic that covers the complete source code of an early version of Unix, and we had to study the ancient PDP-11 architecture to understand what each line of code was actually trying to do.
Now, fast forward a few years, and you'll see how different the landscape is today. Being able to look at the source for the operating system studied in class, and even modify it in any way you want, is going to make a huge difference. Yes, I know, this has been possible with Linux or FreeBSD for a number of years, but with OpenSolaris today, we've given even more choice to developers. With all these changes happening to Solaris, many things have still managed to stay the same, like the name of the disp() routine in disp.c. Function and variable names no longer have the compiler restriction to be no more than 5 characters long, as they did more than 20 years ago on the PDP-11, but the creat(2) system call is still called creat(), and it will never be renamed to create(2). But I digress...
So, with this preamble, I'd like to give you a quick tour of one small subsystem of the Solaris kernel that I worked on during Solaris 10 development. The subsystem, one particular aspect of which I'm going to describe here today, is called "Resource Pools". With Resource Pools, various types of resources (such as CPUs, memory, etc.) can be combined together and given unique names. Solaris has had processor sets for a long time, but they were not persistent across a reboot and didn't have easy-to-use names. So we changed the administrative model (while keeping all processor set interfaces intact for compatibility reasons) by making the pools configuration persistent and by allowing different types of resources, not just CPUs, to be used in the future. One of the projects under development right now is Memory Sets. This project will allow chunks of memory to be put in pools the same way CPUs are today. So, while implementing the kernel part of the Resource Pools framework, we had to keep this in mind.
There was one particular aspect of Resource Pools that took more brain cells to figure out than the others: the problem of binding a collection of processes to a pool. Sounds simple, right? That's what I thought in the beginning too. I'd just grab pidlock (the one global lock that protects the list of PIDs) so that new processes don't get created while I'm going around modifying the existing ones, and that would be it. In reality, this had to be done in a totally different way, partly because we couldn't actually hold pidlock while changing resource pool bindings for processes. The binding operation had the following requirements, all of which had to be met:
It took us some time to figure out how we were going to satisfy all these requirements. Obviously, grabbing pidlock and cpu_lock (to prevent processor set configuration changes) wasn't an option anymore. What we came up with was much more delicate.
First, we came up with the idea of a pool barrier for processes. That is, when a process is in a critical section of code that may overlap with what the pool rebinding operation is supposed to do, we put that process inside a special barrier, which prevents it from being rebound while it is there. There are a few places in the kernel where we do that. Two functions, pool_barrier_enter() and pool_barrier_exit(), serve this purpose. They have to be called while holding the process's p_lock, and in the common non-contended case they just increment and decrement a reference count.
As you can see in pool_barrier_enter(), if the process is being rebound when we're entering the barrier, we're going to hang out there and wait until the rebinding operation finishes.
void
pool_barrier_enter(void)
{
	proc_t *p = curproc;

	ASSERT(MUTEX_HELD(&p->p_lock));
	while (p->p_poolflag & PBWAIT)
		cv_wait(&p->p_poolcv, &p->p_lock);
	p->p_poolcnt++;
}
In pool_barrier_exit(), we have to do a more complicated dance, since it is possible that a rebinding operation began while we were inside the barrier. In that case, the thread doing the rebinding is waiting for us to acknowledge that we're done. So we tell that thread to go ahead and change our binding, and then we wait until it has finished.
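To make the bookkeeping on both sides of the barrier a bit more concrete, here is a minimal single-threaded model of it. This is not the kernel code: the model_* names, the barrier_pending counter, and the value of PBWAIT are all made up for illustration, and wherever the kernel would sleep on a condition variable the model simply returns false.

```c
#include <assert.h>
#include <stdbool.h>

#define PBWAIT 0x1	/* flag name from the post; the value is an assumption */

/* Hypothetical stand-in for the pool-related fields of proc_t. */
typedef struct model_proc {
	int poolflag;	/* PBWAIT is set while a rebind targets this process */
	int poolcnt;	/* number of callers currently inside the barrier */
} model_proc_t;

/* Hypothetical: flagged processes the rebinder is still waiting for. */
int barrier_pending;

/* Returns true if the caller entered; false where the kernel would sleep. */
bool
model_barrier_enter(model_proc_t *p)
{
	if (p->poolflag & PBWAIT)
		return (false);
	p->poolcnt++;
	return (true);
}

/* Returns true if the caller may go on; false where it would have to wait. */
bool
model_barrier_exit(model_proc_t *p)
{
	assert(p->poolcnt > 0);
	p->poolcnt--;
	if (p->poolflag & PBWAIT) {
		/* A rebind began while we were inside: acknowledge it. */
		if (p->poolcnt == 0)
			barrier_pending--;
		return (false);
	}
	return (true);
}

/* Rebinder side: flag the process; it is pending if someone is inside. */
void
model_rebind_begin(model_proc_t *p)
{
	p->poolflag |= PBWAIT;
	if (p->poolcnt > 0)
		barrier_pending++;
}

/* Rebinder side: true once every flagged process has left the barrier. */
bool
model_rebind_ready(void)
{
	return (barrier_pending == 0);
}

/* Rebinder side: binding changed, let the process continue. */
void
model_rebind_finish(model_proc_t *p)
{
	p->poolflag &= ~PBWAIT;
}
```

Walking one process through enter, rebind begin, exit, and rebind finish shows the handshake: the exit acknowledges the rebinder and then has to wait, and only after the finish step can the process enter the barrier again.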
Then we needed something to protect the pools configuration itself from changing while we're doing rebinding. A mutex or a read-write lock wouldn't work, since we were dealing with possibly really long hold times (with large memory sets), and the action of acquiring this lock had to be interruptible. What we came up with was the pool_lock_intr() routine which used a combination of one condition variable, and one mutex to give us the desired behavior.
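For readers who want to play with the idea outside the kernel, here is a rough userland analogue built on POSIX threads. It is only a sketch under assumptions: ilock_t, ilock_enter_intr(), ilock_exit(), and ilock_cancel() are hypothetical names, and because pthread_cond_wait() is not interrupted by signals, a caller-supplied cancel flag stands in for the signal that would interrupt a sleeping kernel thread.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userland analogue; not the kernel's pool_lock_intr(). */
typedef struct ilock {
	pthread_mutex_t	mtx;	/* held only briefly, protects 'busy' */
	pthread_cond_t	cv;	/* waiters sleep here while 'busy' is set */
	bool		busy;	/* true for as long as an owner holds the lock */
} ilock_t;

#define	ILOCK_INITIALIZER \
	{ PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false }

/*
 * Acquire the long-term lock.  Returns 0 on success, or EINTR if
 * *cancel is (or becomes) true while the lock is busy.
 */
int
ilock_enter_intr(ilock_t *l, const bool *cancel)
{
	pthread_mutex_lock(&l->mtx);
	while (l->busy) {
		if (*cancel) {
			pthread_mutex_unlock(&l->mtx);
			return (EINTR);
		}
		pthread_cond_wait(&l->cv, &l->mtx);
	}
	l->busy = true;
	pthread_mutex_unlock(&l->mtx);
	return (0);
}

/* Release the long-term lock and wake everyone waiting for it. */
void
ilock_exit(ilock_t *l)
{
	pthread_mutex_lock(&l->mtx);
	l->busy = false;
	pthread_cond_broadcast(&l->cv);
	pthread_mutex_unlock(&l->mtx);
}

/* Ask waiters using this flag to give up: set it, then wake them. */
void
ilock_cancel(ilock_t *l, bool *cancel)
{
	pthread_mutex_lock(&l->mtx);
	*cancel = true;
	pthread_cond_broadcast(&l->cv);
	pthread_mutex_unlock(&l->mtx);
}
```

The point of the design is that the mutex is never held across the long operation. The owner only flips the busy flag under the mutex, so a waiter can always be woken up and told to back out, no matter how long the owner keeps the lock.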
Now that we have this infrastructure in place, we can look at the rebinding operation itself. The meat of the rebinding operation can be found in the pool_do_bind() routine. There's a good-sized block comment above it that describes the steps we're going to take and covers them in some detail. The first major step is to find all processes of interest and set the PBWAIT flag in p_poolflag for each one of them. Once we set these flags, we're guaranteed that these processes will not disappear, even after we drop pidlock. We save all the proc_t structure pointers in a special array so that we can safely reference these processes without holding pidlock.
The second step is to wait for the relevant processes to stop, either before they try to enter the barrier or at the exit from the barrier. One important detail here: we can't allow a debugger or /proc to stop us while we're holding pool_lock. So we use the special lwp_nostop flag for that, which will leave us stopped in post_syscall() on our way back to userland with no locks held, and then politely ask the library to retry the whole operation.
The third step is to pick up child processes that were created by fork but didn't exist during our first scan of the process list. Their parents will now be stopped at pool_barrier_exit() in cfork(). We can also now get rid of the processes that were in the middle of the pool barrier in exit(2) when we did our first scan. One interesting detail here is that since we're holding the parent processes hostage, their children cannot start running and create more children. The place for the call to pool_barrier_exit() in cfork() was carefully selected to prevent that.
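The first and third steps can be pictured with a toy process table. The sketch below is only a model under assumptions: model_proc_t, the task_id selection key, and both model_scan_* functions are hypothetical, and the locking and the waiting of the second step are left out entirely.

```c
#include <assert.h>
#include <stddef.h>

#define PBWAIT 0x1	/* flag name from the post; the value is an assumption */

/* Hypothetical, heavily simplified process table entry. */
typedef struct model_proc {
	int pid;
	int ppid;
	int task_id;	/* assumed selection key for "processes of interest" */
	int poolflag;
} model_proc_t;

/*
 * First scan: flag every process in the given task and save a pointer
 * to it.  Returns the number of pointers stored in 'saved'.
 */
size_t
model_scan_targets(model_proc_t *tbl, size_t n, int task_id,
    model_proc_t **saved, size_t cap)
{
	size_t cnt = 0;

	for (size_t i = 0; i < n && cnt < cap; i++) {
		if (tbl[i].task_id == task_id &&
		    !(tbl[i].poolflag & PBWAIT)) {
			tbl[i].poolflag |= PBWAIT;
			saved[cnt++] = &tbl[i];
		}
	}
	return (cnt);
}

/*
 * Later scan: pick up unflagged processes whose parent is already in
 * 'saved', i.e. children forked after the first scan.  Returns the new
 * number of saved pointers.
 */
size_t
model_scan_children(model_proc_t *tbl, size_t n, model_proc_t **saved,
    size_t cnt, size_t cap)
{
	for (size_t i = 0; i < n && cnt < cap; i++) {
		if (tbl[i].poolflag & PBWAIT)
			continue;
		for (size_t j = 0; j < cnt; j++) {
			if (saved[j]->pid == tbl[i].ppid) {
				tbl[i].poolflag |= PBWAIT;
				saved[cnt++] = &tbl[i];
				break;
			}
		}
	}
	return (cnt);
}
```

In the model, a process that shows up in the table only after the first scan is caught by the later scan because its parent is already flagged. That mirrors why the parents are held at the barrier exit: while they are stuck there, the set of children to pick up cannot keep growing.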
	/*
	 * pool_barrier_exit() can now be called because the child process has:
	 * - all identifying features cloned or set (p_pid, p_task, p_pool)
	 * - all resource sets associated (p_tlist->*->t_cpupart, p_as->a_mset)
	 * - any other fields set which are used in resource set binding.
	 */
	mutex_enter(&p->p_lock);
	pool_barrier_exit();
	mutex_exit(&p->p_lock);
A few lines later, the newly created child process is made runnable, and by then it would be too late.
And now... (drumroll, please)... we've reached the end of our journey and we're finally ready to perform the bind operation for our processes. This may include changing a process's processor set binding (which, by the way, we know is not going to fail) and changing its scheduling class. When we're done with all that, we insert the feeding tube into the process again and let it go do whatever it wants.
	/*
	 * Okay, we've tortured this guy enough.
	 * Let this poor process go now.
	 */
	pool_bind_wake(p);
That's pretty much it. I thought this was a good example of the kinds of problems we sometimes have to deal with in Solaris. It seems like a very simple task on the surface, but as more requirements are put on it, it forces us to come up with new and sometimes unusual ways of meeting them.
In future blog posts, I will try to cover some interesting parts of the Solaris dispatcher and its scheduling classes. There are a lot of topics there that deserve individual posts. What I like about working on scheduling problems is that they are never boring. Having DTrace in our arsenal of tools has definitely made it easier to analyze various scheduling problems, and it also helps us verify the fixes for them. I'll also need to tell you (or even warn you) about the curse of disp.c.
#ifdef DEBUG
	if (t->t_state == TS_ONPROC &&
	    t->t_disp_queue->disp_cpu == CPU &&
	    t->t_preempt == 0) {
		thread_lock(t);
		ASSERT(t->t_state != TS_ONPROC ||
		    t->t_disp_queue->disp_cpu != CPU ||
		    t->t_preempt != 0);	/* cannot migrate */
		thread_unlock_nopreempt(t);
	}
#endif	/* DEBUG */
So, if all three conditions are true, we grab the thread's lock and then assert that at least one of them is false. Any ideas on what this code is trying to do? :-)