Friday Aug 10, 2007

Ten weeks go by pretty quickly. I wish I had more time: to finish projects I'm already working on, to start new ones, to keep on having fun here, and to post an appropriate and compelling retrospective. Alas, it's not happening, as I've been told my blog gets shut down as soon as I walk out today, so: thanks everyone.

My email is dan.kuebrich at gmail.com.

Thursday Aug 09, 2007

If you ran the following code, what would you expect it to do?
	int sh = 33;
	printf("5 >> 33 = %d\n", 5 >> sh);
Reduced to looking at the last byte of an integer 5, we see that 00000101 >> 33 = 00000000. Right? Actually, it turns out, the answer is 2 (00000010). This is a consequence of how the shift operation is implemented in hardware: the shift count (in both Intel and SPARC assembly) is taken modulo the width of the data being operated on, so a 32-bit shift by 33 is really a shift by 1. Likewise, if you shift a 32-bit int by 32, you'll get no change, etc. This differs from what you'd expect in C. After all, (int)(5 / pow(2,32)) = 0, not 5. However, it turns out this would be pretty expensive for the compiler to correct for: virtually every shift would also have to be augmented by a conditional. And thus, for what I can only assume to be that reason, the C99 standard actually states that this behavior is undefined:

"The integer promotions are performed on each of the operands. The type of the result is that of the promoted left operand. If the value of the right operand is negative or is greater than or equal to the width of the promoted left operand, the behavior is undefined."
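For the curious, here is roughly what that extra conditional would look like if you wrote it yourself. This is only a sketch: shr_checked and UINT_WIDTH_BITS are names I made up for illustration, and it uses unsigned values to sidestep the separate question of shifting negative numbers.

```c
#include <limits.h>

/* Width in bits of unsigned int on this platform. */
#define UINT_WIDTH_BITS ((unsigned)(sizeof (unsigned) * CHAR_BIT))

/*
 * Hypothetical helper (not from any library): a right shift that
 * returns 0 once the count reaches the operand width, instead of
 * depending on what the hardware does with an out-of-range count.
 */
static unsigned
shr_checked(unsigned value, unsigned count)
{
	if (count >= UINT_WIDTH_BITS)
		return (0);	/* every bit has been shifted out */
	return (value >> count);
}
```

With a 32-bit unsigned int, shr_checked(5, 33) gives the 0 we naively expected, at the cost of a compare and branch on every call.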

Tuesday Jul 24, 2007

These blog posts are always somewhat delayed (as though I live in some sort of time-warped zone), and I'm currently mostly helping with a prototype of a new packaging system being put together by Stephen, Danek, and Bart. However, also in the pipeline is the .zfs/props extension I wrote about below. It has brought me a great deal of VFS-implementation flashbacks, and, more importantly, an enhanced appreciation of MDB as a debugging tool. I remember not too long ago bemoaning its assembly-level nature to my carpool-mate Colin, an ironic foreshadowing of the shift in opinion which would precipitate over the coming weeks.

The story begins dereferencing a NULL pointer. Yet unlike other NULL pointers, this was a special NULL pointer--one with an address much greater than that of any NULL pointer which came before it. While the panic stated it was NULL, it also printed out its value, which happened to be larger than zero (by at least a factor of infinity), though obviously not in valid kernel space. A little investigation brought me to this segment of trap.c:
uts/i86pc/os/trap.c:die():192
192	if (type == T_PGFLT && addr < (caddr_t)KERNELBASE) {
193		panic("BAD TRAP: type=%x (#%s %s) rp=%p addr=%p "
194		    "occurred in module \"%s\" due to %s",
195		    type, trap_mnemonic, trap_name, (void *)rp, (void *)addr,
196		    mod_containing_pc((caddr_t)rp->r_pc),
197		    addr < (caddr_t)PAGESIZE ?
198		    "a NULL pointer dereference" :
199		    "an illegal access to a user address");
200	} else
201		panic("BAD TRAP: type=%x (#%s %s) rp=%p addr=%p",
202		    type, trap_mnemonic, trap_name, (void *)rp, (void *)addr);
Pointers dereferenced in the first page are treated as NULL. OK, this makes sense: they could be indexes of a NULL array, or fields in a NULL struct, etc. However, a comparison of the pagesize to the address in question revealed that it was in fact not in the first page. The code is pretty clear, so what could be going on? There was only one option: to walk through the code in MDB and see what was happening.

I'd really like to draw out this investigation with suspense and mystery, but it's really hard to stuff that all in between firing up MDB and looking at the disassembled code (estimated time elapsed: < 5 minutes). For the gory details, see CR 6578504. SPOILERS: It turns out that the SunStudio 11 compiler, switched to between builds 23 and 24 of ONNV, cleverly "optimizes" out the ternary operator on lines 197-199, leaving only the string "a NULL pointer dereference". This only occurs on x64 compilation, and means that you will never get a report of illegally accessing a user address in kernel mode. Cool.
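For reference, the decision that the ternary on lines 197-199 is supposed to make can be pulled out into a few lines of standalone C. This is a sketch, not the kernel code: fault_reason and SKETCH_PAGESIZE are hypothetical names, and the 4096-byte page size is an assumption for illustration.

```c
#include <stdint.h>

/* Assumed page size for this sketch; the kernel uses PAGESIZE. */
#define SKETCH_PAGESIZE ((uintptr_t)4096)

/*
 * Hypothetical helper: pick the panic suffix for a faulting address
 * that is already known to be below the kernel base.
 */
static const char *
fault_reason(uintptr_t addr)
{
	return (addr < SKETCH_PAGESIZE ?
	    "a NULL pointer dereference" :
	    "an illegal access to a user address");
}
```

A correctly compiled version returns the second string for any address outside the first page, which is exactly the case that went missing.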

Friday Jun 29, 2007

Today marks the end of only the 20th day of my internship here at Sun, but it has been four weeks, and I couldn't wait any longer to use this title, so, here we are.[1] There are a couple of little projects I've been working on in the past week, and I'll have to present these things in about six weeks, so I'd better remember them. Prepare for a list appropriate for only the most private of journals:

THE WEEK IN REVIEW
  • Timewarp Zones: where'd they go? Currently, a rough prototype is finished. The time-adjustment is done at the system call layer (time, stime, etc. as well as stat, utime, etc.). Did you know that there's a specific ntp_gettime syscall? I'm pretty sure that there's nothing that uses it[2], but, if there is, it's zone-time safe. In answer to the queries from the previous post: first, if a zone is independent in time, it should remain that way. Thus, changes in actual system clock time (global zone time) result in a modification of the offsets of TWZs to preserve their independence. Second, the filesystem is stored in system time, and simply wrapped with offset time on view in the TWZ. This does preclude some truth in the filesystem: if I touch a file, then change the time, the file will have always been touched immediately prior to the current time. However, the presumed use scenario for such a zone involves setting the zone into the future, then running commands which depend only on the date being some future date (not a relative gap having been established in the filesystem), so this should be acceptable. It's a small price to pay for the reduced complexity (see: validity of NFS cache entries, etc.). If this turns out to not be the case, a good deal more work will have to be done. The third question will probably be answered as the concept of TWZs moves through its current life phase: the midlife crisis and question of whether it will continue to be developed.

  • ZFS Properties: as the TWZ project is on ice, I figured I'd try to get a taste for ZFS work.[3] As such, I took a recommendation from George Wilson and have been working on RFE 6527390, which requests the availability of ZFS properties in an NFS-accessible way. So far, it seems that the best strategy for implementing this is to extend the .zfs directory. You may know this pseudo-filesystem star as a directory located in the root of every ZFS filesystem which contains the snapshots you've saved. As envisioned, it would also become the home to a properties directory, containing a file for each ZFS property (eg. /myzfs/.zfs/props/compression) which when read would produce the value, and perhaps which when written would alter the value. There's also the issue of whether the files should produce human-readable or script-friendly output (eg. 1g vs. 1073741824). To quote an email I recently received from my mother, the mind reels with possibilities.

    I was looking to get a taste of ZFS, and so far I have gotten about that, but I've also gotten mouthfuls of Dave Powell's GFS, which, fortunately, is very palatable. It's a generic pseudo-filesystem implementation written for Solaris, and is also the subject of the third Google result for "pseudo filesystem". It greatly reduces the overhead involved in creating a pseudo-filesystem--I only have to write the vnode ops that I want to specialize (and a few callbacks), which lets me spend my time on writing the backend access functionality (don't think about those last 3 words too much), not the frontend. I can't do it enough justice here, but there is a good discussion in the author's blog.
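To make the human-readable vs. script-friendly question from the ZFS properties item concrete, here is a small sketch of the two output styles. format_size and its parsable flag are hypothetical names, this is not how ZFS itself formats values, and it only abbreviates exact multiples of 1024 to keep things simple.

```c
#include <inttypes.h>
#include <stdio.h>

/*
 * Hypothetical formatter: writes a byte count into buf either as a
 * raw number (parsable != 0) or with a k/m/g/t suffix when the value
 * is an exact multiple of the unit. Returns what snprintf returns.
 */
static int
format_size(uint64_t bytes, int parsable, char *buf, size_t len)
{
	static const char suffixes[] = { 'k', 'm', 'g', 't' };
	int idx = -1;

	if (!parsable) {
		/* Strip factors of 1024 while the value divides evenly. */
		while (bytes != 0 && bytes % 1024 == 0 && idx < 3) {
			bytes /= 1024;
			idx++;
		}
	}
	if (idx < 0)
		return (snprintf(buf, len, "%" PRIu64, bytes));
	return (snprintf(buf, len, "%" PRIu64 "%c", bytes, suffixes[idx]));
}
```

Reading a props file would then come down to choosing which of the two forms to hand back for the same stored value.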
There's been more--I wrote my first sed-using script[4], pranked Colin's ps ("Dan, why is there a process called hamsandwich that I can't kill running on my machine?"), etc--but it's about time to call it quits.


[1] If you don't think it's as exciting as being in a movie killing zombies with a shotgun, you may not be an intern here.
[2] Though it is referenced fifteen times in the ONNV source, I couldn't actually find any calls. In my find | grep zeal to find the answer to this question, I accidentally recursively created a 14 TB logfile on coupe, and ignominiously removed myself from this particular pursuit of knowledge.
[3] Some might say that I already did, with my pseudo-successful attempt to tackle 6261172 (subsequently described as "arbitrarily complex"). My belief that I could solve the problem was grounded firmly in my lack of understanding it.
[4] This is, in fact, noteworthy. sed uses the same editing syntax as ed, which is used as a test program for the operating systems class at Brown. It's one of the few binaries provided (along with cat, ls, etc.), and if you can run it, you're in good shape. However, once you run it, how do you quit? Traditionally, nobody knows. So you reboot your simulator. Fortunately, I learned today that the answer is "q," as well as several more useful ed/sed commands. Live and learn.

Friday Jun 15, 2007

Well, after spending a day sifting through Solaris timekeeping code and a few hours hacking (eg. waiting for builds to finish), I managed to get a zone that really couldn't care less about the system time. Great... what?? In all my fervor to get things done, I never stopped to think about some important semantic questions related to having processes running in disparate "times" on the same machine.

  • First, it's pretty clear that when the time is changed in a time-independent non-global zone (here abbreviated TWZ as a tribute to my love of the nickname "timewarp zone"), the system time should not be changed, but rather the stored time offset of the TWZ should be altered such that system time + offset = requested TWZ time. However, what about if the global zone time is changed? If I add two hours to the global zone time, do I want my TWZs to also be 2 hours in the future? If they're truly independent, probably not--which means going through all the TWZs and again modifying their offsets accordingly. Is there any reason to not keep them independent (aside from avoiding the overhead of zone offsets)?
  • Second, the TWZ shares a filesystem with the global zone. More specifically, the global zone can see the TWZ's files, though not vice-versa. If the TWZ is a year ahead and touches a file, what is the access time as perceived in the global zone? Clearly, the action actually happened right now in the global zone--not a year from now. But, creating this dual-time view of the filesystem is difficult. Consider the case of a TWZ being a year ahead, touching a file, then going yet another year ahead. The TWZ should see that it was created a year ago, while the global zone should see that it was created just then. What information do we need to store this? It's not enough to store the file in "real" system time and then just apply the TWZ offset--if we did that, we'd think that the file was created at the two-year offset. So it seems the only way to maintain the dual-time filesystem illusion is to store yet more information: what zone created it, and at what time offset. And we might not have to just worry about two different times: in the case of shared mounts (NFS), there could be an arbitrary number of different TWZs sharing a filesystem. Suddenly multi-time FS (MTFS?) doesn't seem very viable. If you've made it this far, I'll keep the next ramble shorter:
  • Third, and most administratively, is there any need to be able to change the time of a zone from without? Clearly it should be alterable from within; but should there be a facility (eg. zone property) for changing it from the outside? Keep in mind that this could always be solved by use of zlogin [command].
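The offset bookkeeping from the first two questions is easier to see in code. What follows is a user-space sketch only: struct twz and the twz_* functions are names invented for illustration, not anything in the kernel, and times are plain seconds.

```c
#include <time.h>

/* Hypothetical per-zone state: seconds added to system time in the zone. */
struct twz {
	time_t offset;
};

/* Time as observed from inside the zone. */
static time_t
twz_now(const struct twz *z, time_t system_now)
{
	return (system_now + z->offset);
}

/* Time set from inside the zone: system time stays put, the offset moves. */
static void
twz_set_time(struct twz *z, time_t system_now, time_t requested)
{
	z->offset = requested - system_now;
}

/* Global clock stepped by delta: shift every offset the other way. */
static void
twz_global_step(struct twz *zones, int nzones, time_t delta)
{
	int i;

	for (i = 0; i < nzones; i++)
		zones[i].offset -= delta;
}

/* A timestamp stored in system time, as it would look inside the zone. */
static time_t
twz_view_stamp(const struct twz *z, time_t stored_system_time)
{
	return (stored_system_time + z->offset);
}
```

If a zone touches a file and then jumps a further year ahead, twz_view_stamp() on that file equals twz_now(): the file looks freshly touched instead of a year old, which is the loss of information described in the second question.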

Thursday Jun 14, 2007

You've just finished setting up some new time-based system. Now you need to wait until it triggers tomorrow, next week, next month, etc. and see if it explodes. Or do you? Maybe you could just install a second copy on another machine, turn the clock forward, and debug it now. Even better: what if you didn't need a new machine, because you could just use a Solaris zone operating in its own time settings? Meet timewarp zones.

The idea:
Have a by-zone sense of time, offset from zone-specific data in time-retrieving calls. Allow this time to be set either as a zone property (think zoneadm) or through time-setting functions in the zone itself. Note that zones don't currently have permissions to change the system time, so this would alter a certain functionality: instead of EPERMing on time-setting calls, the zone-local time offset would be changed.

The implementation:
Though it seems like system calls would be a logical place to checkpoint (we really only want to change observable times), there's a slight snag: a few time commands are fast-trap syscalls, meaning that they circumvent the normal path and skip straight to machine code. Fast-traps save the overhead of full-blown kernel mode because they are certain to not block. For these, the checkpoint has to be at a deeper level, which is unfortunately at a level called by the non-fast-trap syscalls (so this could introduce double-offsetting). Also, at this level, time functions diverge to their architecture-specific implementations, meaning that there is no one "time" checkpoint. It will likely be simple enough to stick a common call into all of these to offset the time as necessary, but ultimately it might be nice to insert an extra common layer which unites all of them (still need to think about that one a little).

There's two classes of time-accessors to look out for: time of day (TOD) and high-resolution (hrestime). TOD is the information which is persistent from boot to boot--this is the kind that's backed by batteries. hrestime carries nanosecond precision and is derived from the system's high-resolution timer (though not directly). When you call stime, your kernel is setting both. When you call time, your kernel is actually only checking hrestime. A third class of time-accessor, hrtime, does exist, but it is used only for relative timing purposes and doesn't have any correlation to world time (it is, however, guaranteed to be monotonically increasing). Since nobody checks it for datetime and it shouldn't be compared to any other types of time, it will be exempt from the zone offset.
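The split between offset and non-offset clocks can be illustrated from user space with the POSIX clock_gettime() call. This is an analogy, not the kernel implementation: CLOCK_REALTIME stands in for the wall-clock time that would get the zone offset, CLOCK_MONOTONIC stands in for hrtime, and zone_offset_sec, zone_wall_time, and monotonic_ns are hypothetical names.

```c
#define _POSIX_C_SOURCE 200112L
#include <time.h>

/* Hypothetical per-zone offset, in seconds. */
static long long zone_offset_sec = 0;

/* Wall-clock time with the zone offset applied. */
static time_t
zone_wall_time(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_REALTIME, &ts);
	return (ts.tv_sec + (time_t)zone_offset_sec);
}

/* Monotonic time in nanoseconds; deliberately never offset. */
static long long
monotonic_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ((long long)ts.tv_sec * 1000000000LL + ts.tv_nsec);
}
```

Code that measures intervals with the monotonic clock keeps working no matter how far the zone's wall clock is warped, which is why exempting it seems safe.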

I'm just getting into implementation so we'll see how this goes. If there's something I'm missing in my dissection of the time subsystem, feel free to let me know--I just started looking at it about 8 hours ago.

Tuesday Jun 12, 2007

A good strategy for learning is doing, so that's what I've been up to for the past week. Sure, I've also had my nose in Solaris Internals intermittently, but there's some things that reading won't teach you. One of these is the build process, which I find not very well documented for OS. Fortunately, Dave Bustos helped me out with an overview of the bringover/workspace process and ~bustos/bin/compile, the silver bullet of compile scripts.

My first bug, "5021976 man cannot handle getcwd() failures," went fairly smoothly, and hopefully I'll get to putback it soon. It stems from the fact that man can handle relative MANPATH entries (!), which presents a problem if you can't find the current directory.

Old solution: bork
New solution: skip relative search paths only
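The new behavior boils down to a filter over the search path. The sketch below is not the actual fix to man: count_searchable is a hypothetical function, have_cwd stands in for whether getcwd() succeeded, and empty path entries are simply ignored to keep it short.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: count the colon-separated entries of a
 * MANPATH-style string that can still be searched. Absolute entries
 * always count; relative ones only count when the cwd is known.
 */
static int
count_searchable(const char *manpath, int have_cwd)
{
	const char *p = manpath;
	int n = 0;

	while (*p != '\0') {
		size_t len = strcspn(p, ":");	/* length of this entry */

		if (len > 0 && (p[0] == '/' || have_cwd))
			n++;
		p += len;
		if (*p == ':')
			p++;	/* step over the separator */
	}
	return (n);
}
```

With a path of "/usr/share/man:man:../doc/man", all three entries are usable when the current directory is known, and only the first one is when it isn't.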

I'm about a week behind in blogging at this point; hopefully it'll catch up sooner or later. Enough of this little trip down memory lane for now.

Monday Jun 11, 2007

I swear on the phone they told me it was going to be the colonel group.

This blog copyright 2007 by dank