It Must Be Time for Tea

Mike Kupfer's Weblog


Saturday October 10, 2009

SCM Mounts: Done (Almost)

I've finished the workaround for the sshd privileges issue. I ended up writing a simple setuid C program so that our PAM module could unmount the loopback filesystems. I had been using an RBAC-based approach, but that requires that the user own the mount point for each loopback mount. The more I worked on it, the more failure scenarios I ran into because of that requirement. The setuid approach had none of those issues, and it turned out to be much simpler to code than I had been expecting.

So the changes have been committed to the repository for the SCM infrastructure, and the new bits have been deployed on the backup SCM server. The only thing left is to deploy on the primary SCM server.

Unfortunately, this doesn't mean I'll now have time to finish off the OSCON trip report. Instead, I'll be focusing on a change to the way we deliver crypto binaries to ON developers.

(2009-10-10 11:35:07.0)

Friday September 25, 2009

Progress with SCM Mounts

I've been busy implementing a workaround to the sshd privileges issue that I mentioned a couple months ago. Unfortunately, this has meant some delays in my OSCON trip report, but I hope to post the next installment of that sometime in the next week.

Besides implementing the actual workaround, I've been beefing up the scripts that we use for setting up a test environment. Even though we have a backup production system that I could use for testing, it's safer to test a change of this size on a separate system. I sleep a lot better knowing that if I break something during development, I definitely won't inconvenience users.

I've also been learning about the privilege-management facilities that are available in (Open)Solaris. We had some problems finding a concise but sufficiently detailed writeup of how the mount/unmount privileges work. While the information is present in the umount(2) man page, a much clearer explanation is given in the output from "ppriv -lv".

(2009-09-25 14:32:59.0)

Saturday August 22, 2009

Still Reviewing Web Pages

I'm still reviewing web pages in preparation for the migration of opensolaris.org to XWiki. So far I've finished reviewing the ON Developer Reference and the SCM-related pages in the Tools web space.

The only thing left for (my) review is the SCM Migration project web space. Since that project is no longer active, I don't plan to look too carefully. But a sanity check does seem in order, and maybe there are some obsolete pages or attachments that we can delete.

The ON Developer Reference (DevRef) is a particularly tough case for the migration software because of its extensive use of anchored links. I had been planning to retire the XML (Docbook) source that the DevRef currently uses, and keep everything in XWiki markup, but I'm not looking forward to fixing all the cross-references. So I'm having second thoughts about that strategy.

(2009-08-22 16:00:16.0)

Sunday August 16, 2009

OpenSolaris.org Moving to XWiki

I'm afraid I haven't made any progress on my OSCON trip report this week. We've started the beta testing for migrating opensolaris.org from the current portal application to XWiki. I've been reviewing the ON Developer Reference, and it's taken more of my time than I had expected. (In fact, I'm still not done.)

If you're a community or project leader, please do take the time to review the pages that you're responsible for. Some issues will be easier to fix if they are identified before the migration. And the migration team needs user feedback to help identify which issues cause the most trouble.

If you just have a question, you can ask on the website-discuss list (at opensolaris.org). If you're sure you've found a bug, either in the migration code or in the new XWiki-based site, go to defect.opensolaris.org and file a bug under Development: product=website, component=site-wiki.

(2009-08-16 11:57:17.0)

Tuesday July 21, 2009

Unwanted Mounts

As described in the design document, source code access on opensolaris.org is done via ssh. The user doesn't invoke ssh directly. Rather, the user runs Mercurial (or Subversion), which invokes ssh using its standard processing for ssh URLs. Once connected to the server, a custom restricted shell invokes the server-side program. This is all done in a chroot environment, with loopback mounts providing access to only those repositories that the user has write access to.

The loopback mounts are created when the user logs in, and they are torn down when the source code management (SCM) operation completes. This is done by way of a custom PAM module. As part of the session's "open" processing, the module determines what repositories to grant access to, and it establishes those mount points. As part of the session's "close" processing, it removes those mount points.

We recently noticed that the loopback mounts were not getting unmounted. This causes a couple problems. One is that thousands of unused loopback mounts accumulate on the server. If nothing else, this makes life more difficult for administrators.

The lingering mounts can also lead to a denial of service problem, which we've witnessed a few times. The problem occurs if a repository is deleted and recreated while there is still a loopback mount for it. Future references to the loopback mount will fail with an error. This can interfere with the setup of a user's loopback mounts in a subsequent login, resulting in a situation where users are unable to access recently created repositories. Worse, attempts to unmount the broken loopback mount fail, and lofs doesn't support forced unmount. So the only way to recover is to reboot the server.

After the third or so instance of this, we decided to figure out why the loopback mounts were not getting unmounted. Arguments can be passed to a PAM module by putting them after the module name in /etc/pam.conf, and there's a convention to enable debugging output with the argument "debug", e.g.,

other	session requisite	pam_foo.so.1	debug
    

For this to be useful, syslogd needs to be configured to display the debug output. For example, put

auth.debug	/var/adm/auth.log
    

in syslog.conf and utter

# svcadm restart system/system-log

Once we made these two changes, we could see that the session-open routine was running normally, but it didn't look like the session-close routine was getting invoked.

This seemed awfully strange, so we enabled PAM framework debugging with

# touch /etc/pam_debug

(This, too, requires that syslogd be configured to put auth.debug output somewhere accessible.)

This showed that our session-close routine was, in fact, being invoked.

Looking more closely at the session-close routine, we noticed that it checks what user it is invoked as. If it's not invoked as uid 0, it bails out, before doing any debug logging. Moving the debug logging to come before the uid check confirmed that it was running as the user whose session was ending.

Some Googling revealed a known issue in OpenSSH (from which the Solaris SSH is derived) in which the session-close routine is called as the session's user, not uid 0.

From the comments in the OpenSSH Bugzilla, it looks like a fix is available from upstream, so we're hopeful that we just need to talk to the Sun SSH team about getting the fix into OpenSolaris. We're also looking into possible workarounds, in case the fix can't be pulled in promptly.

Update 2009-09-16

I filed a bug for this: 6869790.

The current status is that the Solaris SSH team is discussing possible fixes, but they haven't come up with a good approach yet. Just reverting the code isn't an option because it would break support for hardware acceleration. And the upstream privilege separation code is different from the code in Solaris, so they can't just use the upstream patch.

(2009-07-21 14:29:21.0)

Friday February 13, 2009

OpenSolaris and gnuserv

I installed OpenSolaris 2008.11 on my notebook (a VAIO TX) several weeks ago. I've been tweaking the environment, in preparation for the day when I move to OpenSolaris on my desktop system.

One of the issues that came up was that gnuserv would exit immediately after being started. This meant that every time I wanted an editor (e.g., for a Mercurial commit), I had to wait for a new XEmacs process to start.

I looked around for some sort of error message but couldn't find anything. I finally started XEmacs using truss -f. Looking at the truss output, I saw that gnuserv was looking in /etc/hosts and not finding an entry with the notebook's hostname ("loiosh").

I added "loiosh" to the localhost (127.0.0.1) line, and that fixed the problem.

(2009-02-13 13:49:37.0)

Sunday October 05, 2008

Printing opensolaris.org Pages with Recent Builds

Back in August I upgraded my desktop to snv_95. I sometimes print pages from opensolaris.org to read during my commute, but with snv_95 the pages came out pretty much unreadable. They looked like they had been through several fax transmissions, with blotchy, almost indecipherable characters. At the time I chalked it up to known issues with fonts and went back to running an earlier build (thank you, Live Upgrade).

I revisited the issue last week, after noticing that the headers and footers from Firefox looked okay. It was just the main text that was messed up. I checked my preferences (Content > Fonts & Colors > Advanced)--the checkbox "Allow pages to choose their own fonts" was enabled. I disabled it and tried again, and now the printed pages are legible.

(2008-10-05 14:52:26.0)

Wednesday April 09, 2008

Converting Projects to Mercurial

One of the things that we consider when deprecating components of (Open)Solaris is how users move from the old software to the new software. We've applied that principle to the SCM Migration project, so we've been working on documentation (e.g., a Mercurial cheat sheet for TeamWare users), and the updated tools work with both TeamWare and Mercurial. Also, we don't want to tie the schedules of large projects to the SCM Migration schedule or vice versa. So we need to support projects that are begun under TeamWare, but which are still under development when we're ready to move the gate from TeamWare to Mercurial. That support is provided by a new script called wx2hg.

In general, it's hard to convert a TeamWare workspace to Mercurial, at least if you want to maintain history. But ON already has a policy that putbacks should (usually) add a single delta. That is, any project-specific history will be lost anyway. That makes the job of wx2hg a lot easier.

Suppose you have a project gate--call it my-proj--that is a child of onnv-gate, the ON master gate. We already maintain a Mercurial mirror of onnv-gate, which I will call onnv-hg for now. So when you're ready to move to Mercurial, what you want is a child of onnv-hg. That child should have the same changes relative to onnv-hg that my-proj has relative to onnv-gate.

It turns out that it is pretty easy for wx2hg to do this. The wx front-end keeps track of renames and of files whose contents have changed. So wx2hg just needs to get that information from wx and apply it to a child of onnv-hg. The rest of the script is error detection and handling.
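As a rough illustration of that idea (this is not the wx2hg code, which gets its lists from wx), the Python sketch below parses rename pairs in the "rename from: / to:" text layout that putback -n uses for its rename report, and turns each pair into an hg rename command line. The function names, and the choice of hg rename as the way to replay a rename, are assumptions made for the example. The commands are only built, not run.

```python
import re

# Matches "rename from: PATH" and the continuation line "to: PATH".
_FROM = re.compile(r"^\s*rename from:\s*(\S+)\s*$")
_TO = re.compile(r"^\s*to:\s*(\S+)\s*$")


def parse_renames(report):
    """Return (old, new) path pairs found in a rename report."""
    pairs = []
    pending = None  # source path still waiting for its "to:" line
    for line in report.splitlines():
        m = _FROM.match(line)
        if m:
            pending = m.group(1)
            continue
        m = _TO.match(line)
        if m and pending is not None:
            pairs.append((pending, m.group(1)))
            pending = None
    return pairs


def hg_rename_commands(pairs):
    """Build one 'hg rename OLD NEW' argument list per pair (not executed)."""
    return [["hg", "rename", old, new] for old, new in pairs]


SAMPLE = """\
rename from: usr/src/tools/scripts/Makefile
         to: usr/src/tools/scripts/Makefile.new
rename from: usr/src/tools/scripts/sccsrm.sh
         to: deleted_files/usr/src/tools/scripts/sccsrm.sh
"""

if __name__ == "__main__":
    for cmd in hg_rename_commands(parse_renames(SAMPLE)):
        print(" ".join(cmd))
```

Keeping the parsing separate from the command construction makes it easy to check the rename list before anything touches the Mercurial child.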

Let's walk through an example.

Suppose I have a workspace that deletes all the SCCS helper scripts in usr/src/tools. And to demonstrate renames, it renames the scripts directory makefile to Makefile.new.

$ pwd
/export/kupfer/tonic/wx2hg-tests/tw.no-sccs-tools.demo
$ putback -n
...

Would put back name changes: 10

rename from: usr/src/tools/scripts/Makefile
         to: usr/src/tools/scripts/Makefile.new
rename from: usr/src/tools/scripts/sccscheck.1
         to: deleted_files/usr/src/tools/scripts/sccscheck.1
rename from: usr/src/tools/scripts/sccscheck.sh
         to: deleted_files/usr/src/tools/scripts/sccscheck.sh
rename from: usr/src/tools/scripts/sccscp.1
         to: deleted_files/usr/src/tools/scripts/sccscp.1
rename from: usr/src/tools/scripts/sccscp.sh
         to: deleted_files/usr/src/tools/scripts/sccscp.sh
rename from: usr/src/tools/scripts/sccshist.sh
         to: deleted_files/usr/src/tools/scripts/sccshist.sh
rename from: usr/src/tools/scripts/sccsmv.1
         to: deleted_files/usr/src/tools/scripts/sccsmv.1
rename from: usr/src/tools/scripts/sccsmv.sh
         to: deleted_files/usr/src/tools/scripts/sccsmv.sh
rename from: usr/src/tools/scripts/sccsrm.1
         to: deleted_files/usr/src/tools/scripts/sccsrm.1
rename from: usr/src/tools/scripts/sccsrm.sh
         to: deleted_files/usr/src/tools/scripts/sccsrm.sh

The following files are currently checked out and have been edited in workspace
"/export/kupfer/tonic/wx2hg-tests/tw.no-sccs-tools.demo":
	usr/src/tools/scripts/Makefile.new
...
No changes were put back
$ 
    

Note that although Makefile.new is checked out, it need not be.

Converting this to Mercurial is simple. If your TeamWare workspace is in a directory that you have write access to, just point wx2hg at it.

$ pwd
/export/kupfer/tonic/wx2hg-tests
$ /opt/onbld/bin/wx2hg tw.no-sccs-tools.demo
    

wx2hg first creates a Mercurial child (this step can take a few minutes). The child is created in the same directory as the TeamWare workspace, with the same name plus "-hg".

requesting all changes
adding changesets
adding manifests
adding file changes
added 6349 changesets with 91335 changes to 49774 files
44994 files updated, 0 files merged, 0 files removed, 0 files unresolved
    

wx2hg then initializes wx if you haven't already done so. If the workspace is already under wx control, it does a "wx update" to ensure it will get up-to-date information about the workspace.

Initializing wx...
...
New renamed file list:
...
New active file list:
...
Will backup wx and active files if necessary
...
wx initialization complete
    

wx2hg then checks out all the files whose contents have changed. We want to put the files into Mercurial with unexpanded SCCS keywords, and checking them out is a quick hack to help us do so.

usr/src/tools/scripts/Makefile.new already checked out
    

wx2hg then processes the rename list.

rename usr/src/tools/scripts/Makefile -> usr/src/tools/scripts/Makefile.new
rename usr/src/tools/scripts/sccscheck.1 -> deleted_files/usr/src/tools/scripts/sccscheck.1
rename usr/src/tools/scripts/sccscheck.sh -> deleted_files/usr/src/tools/scripts/sccscheck.sh
rename usr/src/tools/scripts/sccscp.1 -> deleted_files/usr/src/tools/scripts/sccscp.1
rename usr/src/tools/scripts/sccscp.sh -> deleted_files/usr/src/tools/scripts/sccscp.sh
rename usr/src/tools/scripts/sccshist.sh -> deleted_files/usr/src/tools/scripts/sccshist.sh
rename usr/src/tools/scripts/sccsmv.1 -> deleted_files/usr/src/tools/scripts/sccsmv.1
rename usr/src/tools/scripts/sccsmv.sh -> deleted_files/usr/src/tools/scripts/sccsmv.sh
rename usr/src/tools/scripts/sccsrm.1 -> deleted_files/usr/src/tools/scripts/sccsrm.1
rename usr/src/tools/scripts/sccsrm.sh -> deleted_files/usr/src/tools/scripts/sccsrm.sh
    

After the renames, it applies a patch for each modified file...

patching file usr/src/tools/scripts/Makefile.new
    

...and then you're done.

$ ls -dF *demo*
tw.no-sccs-tools.demo/		tw.no-sccs-tools.demo-hg/
    

You can verify that wx2hg transferred all your changes:

$ cd tw.no-sccs-tools.demo-hg
$ hg diff -g
diff --git a/usr/src/tools/scripts/sccscheck.1 b/deleted_files/usr/src/tools/scripts/sccscheck.1
rename from usr/src/tools/scripts/sccscheck.1
rename to deleted_files/usr/src/tools/scripts/sccscheck.1
diff --git a/usr/src/tools/scripts/sccscheck.sh b/deleted_files/usr/src/tools/scripts/sccscheck.sh
rename from usr/src/tools/scripts/sccscheck.sh
rename to deleted_files/usr/src/tools/scripts/sccscheck.sh
diff --git a/usr/src/tools/scripts/sccscp.1 b/deleted_files/usr/src/tools/scripts/sccscp.1
rename from usr/src/tools/scripts/sccscp.1
rename to deleted_files/usr/src/tools/scripts/sccscp.1
diff --git a/usr/src/tools/scripts/sccscp.sh b/deleted_files/usr/src/tools/scripts/sccscp.sh
rename from usr/src/tools/scripts/sccscp.sh
rename to deleted_files/usr/src/tools/scripts/sccscp.sh
diff --git a/usr/src/tools/scripts/sccshist.sh b/deleted_files/usr/src/tools/scripts/sccshist.sh
rename from usr/src/tools/scripts/sccshist.sh
rename to deleted_files/usr/src/tools/scripts/sccshist.sh
diff --git a/usr/src/tools/scripts/sccsmv.1 b/deleted_files/usr/src/tools/scripts/sccsmv.1
rename from usr/src/tools/scripts/sccsmv.1
rename to deleted_files/usr/src/tools/scripts/sccsmv.1
diff --git a/usr/src/tools/scripts/sccsmv.sh b/deleted_files/usr/src/tools/scripts/sccsmv.sh
rename from usr/src/tools/scripts/sccsmv.sh
rename to deleted_files/usr/src/tools/scripts/sccsmv.sh
diff --git a/usr/src/tools/scripts/sccsrm.1 b/deleted_files/usr/src/tools/scripts/sccsrm.1
rename from usr/src/tools/scripts/sccsrm.1
rename to deleted_files/usr/src/tools/scripts/sccsrm.1
diff --git a/usr/src/tools/scripts/sccsrm.sh b/deleted_files/usr/src/tools/scripts/sccsrm.sh
rename from usr/src/tools/scripts/sccsrm.sh
rename to deleted_files/usr/src/tools/scripts/sccsrm.sh
diff --git a/usr/src/tools/scripts/Makefile b/usr/src/tools/scripts/Makefile.new
rename from usr/src/tools/scripts/Makefile
rename to usr/src/tools/scripts/Makefile.new
--- a/usr/src/tools/scripts/Makefile.new
+++ b/usr/src/tools/scripts/Makefile.new
@@ -50,11 +50,6 @@ SHFILES= \
 	nightly \
 	onblddrop \
 	protocmp.terse \
-	sccscheck \
-	sccscp \
-	sccshist \
-	sccsmv \
-	sccsrm \
 	sdrop \
 	webrev \
 	ws \
$ 
    

Note that you still need to do "hg commit" to check in your new version.

All this assumes that your workspace is in sync with /ws/onnv-clone. If it isn't, you may get messages like

wx2hg: can't rename: usr/src/tools/scripts/sccscheck.1 doesn't exist.
    

or

wx2hg: usr/src/tools/scripts/Makefile.new: parent mismatch; 
  resync with /ws/onnv-clone or specify branch point with -r hg_rev.
    

Doing a bringover from /ws/onnv-clone, and resolving any conflicts, should fix things up.

You may also see a message like

Please run
  hg --cwd /export/kupfer/tonic/wx2hg-tests/tw.no-sccs-tools.demo-hg update -C
before retrying.
    

This is telling you that you can reuse the Mercurial child, but you need to reset it first. Once you've resynched with /ws/onnv-clone and run the "hg ... update..." command, you use the -t option to tell wx2hg to reuse the Mercurial child. For example,

/opt/onbld/bin/wx2hg -t tw.no-sccs-tools.demo-hg tw.no-sccs-tools.demo
    

There's more that wx2hg can do, but those features won't be needed until ON moves to Mercurial. If you get stuck using wx2hg, you can ask for help on the SCM migration team list (scm-migration-dev at opensolaris dot org).

(2008-04-09 15:27:47.0)

Friday February 15, 2008

SCM Migration: The Big Picture

When Steve Lau left Sun at the end of last September, I became the go-to guy inside Sun for the migration to Mercurial. I had thought that I had a good high-level grasp of the project. But after getting blindsided a couple times by dependencies I hadn't considered, I drew up a diagram to help me get oriented, identify stakeholders, and maybe anticipate future issues.

Here's a slightly simplified version of the original diagram from the whiteboard in my office:

Blue parallelograms indicate repositories, tan boxes are software modules, solid lines indicate data flow, and dashed lines tie users with the modules that they're using. The three red-rimmed boxes (gk tools, gate hooks, and onbld tools) are where most of the development effort is going.

The primary simplifications in this diagram are

  • The data flow from the project gate actually goes through the SCM front-end before going through the gate hooks.
  • I've omitted the consolidation's clone workspace (a nightly snapshot of the gate).
  • I've omitted the bridge between the current ON workspace in TeamWare and the Mercurial repository that is shadowing it.

Even so, this is a moderately busy diagram. There are several components to keep track of, and we need to make sure they all fit together.

Most of the work so far has been in the area of the ON build (onbld) tools, pieces of which are used by other consolidations and by the Solaris Companion project. Many of the changes are related to making the tools work with Mercurial as well as with TeamWare/SCCS. We've also had to consider the implications of moving everything outside the Sun firewall, which has meant rethinking interfaces to things like the bug database and our RTI (Request To Integrate) system.

We haven't done as much work on the gatekeeper (gk) tools, although we've started to think about design issues. Many of the design decisions boil down to this question: do we make the minimal set of changes needed to work with Mercurial, or do we make more extensive changes so that the tools can make better use of the features provided by Mercurial? In some cases we are staying with the current approach. For example, we are using separate repositories for build snapshots, rather than using branches and tags in the main gate repository. In other cases we will be changing the tools to use Mercurial features. For example, any automated post-putback processing will be driven directly by Mercurial hooks, rather than the email-based hook system that is needed with TeamWare.

Another set of interesting design decisions has centered around the use of gate hooks to enforce various style and bookkeeping rules. With the current TeamWare setup, we enforce these rules after a putback (at least for ON). The putback triggers various checks, and if your putback violates a rule, you get notified of the problem and given a short window to fix it; otherwise your putback is reverted. The gate is normally configured so that anyone (inside Sun) can put back.

While this approach worked when Solaris was closed source, we expect it not to scale for OpenSolaris, where the repository is accessible from anywhere on the Internet and both Sun employees and non-employees can have commit rights. Certain Mercurial hooks can abort a putback ("push" in Mercurial terms), so we could move all the post-putback checks to pre-transaction checks. But moving more checks means more work (e.g., testing), which means a longer time before we can move to Mercurial. So the question becomes which checks really need to happen before putback, and which ones can happen after putback. The check to ensure that a putback has an approved RTI probably needs to happen prior to the putback. The check for adherence to the C style rules can happen after the putback, at least for now.
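To make the pre-transaction idea concrete, here is a minimal Python sketch of the kind of check such a hook might run. It assumes, purely for illustration, that each changeset comment starts with a 7-digit bug ID and that the bug IDs with an approved RTI are available as a plain set; neither assumption reflects how the real gate hooks or the RTI system work. In Mercurial, a check like this would typically be attached to a pretxnchangegroup hook, where a failing result aborts the push.

```python
import re

# Assumed comment convention for this sketch: "<7-digit bug ID> <synopsis>".
_BUG_LINE = re.compile(r"^(\d{7}) \S")


def unapproved_changesets(comments, approved_bug_ids):
    """Return (index, reason) for each changeset comment that fails the check.

    comments         -- list of changeset comment strings, oldest first
    approved_bug_ids -- set of bug ID strings that have an approved RTI
    """
    problems = []
    for index, comment in enumerate(comments):
        lines = comment.splitlines()
        first_line = lines[0] if lines else ""
        match = _BUG_LINE.match(first_line)
        if match is None:
            problems.append((index, "comment does not start with a bug ID"))
        elif match.group(1) not in approved_bug_ids:
            problems.append((index, "no approved RTI for " + match.group(1)))
    return problems


def should_abort(comments, approved_bug_ids):
    """True means 'reject the push', mirroring a failing hook result."""
    return bool(unapproved_changesets(comments, approved_bug_ids))


if __name__ == "__main__":
    incoming = ["1234567 fix the frobnicator", "cleanup"]
    for index, reason in unapproved_changesets(incoming, {"1234567"}):
        print("changeset %d rejected: %s" % (index, reason))
```

Because the check is a pure function over comments and a set of IDs, it can be tested without a repository, which matters when every check that moves ahead of the putback adds testing work.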

The opensolaris.org webapp has various bits of functionality for source code management. A project leader or gatekeeper can use the webapp to create, destroy, and lock repositories, as well as to manage commit rights for the project's repositories. Unfortunately, the current set of operations is limited. For example, a gatekeeper might want to lock a repository for most users, but allow access for a specific large project. Alas, this lock granularity is not currently supported. Furthermore, all the controls are currently through a web-based interface, with no scripting hooks. Although there is currently work to improve the webapp and make it easier to change, this work is unlikely to be finished in time for us to make any changes that we expect gatekeepers to want. So we will need to think about other ways to provide the needed functionality, such as giving gatekeepers shell access to the server that hosts the repositories.

The SCM front-end gives a user access to repositories by creating a chroot environment which contains only the repositories that the user has commit privileges for. (Access to other repositories is done via the "anon" user.) If the user reports being unable to pull from, or push to, a repository, the problem could be with the SCM program itself, the SCM front-end, or some other general system service. This diagnosis typically requires shell access to the servers.

We are using Nagios to monitor the health of the servers and services on opensolaris.org. We have written a couple simple Nagios plugins to monitor the Mercurial and Subversion services. As we gain experience with the system, we could update the probes to check for specific failure scenarios.
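For readers who haven't written one, a Nagios plugin is just a program that prints a one-line status and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). The sketch below is not the code of the plugins mentioned above; it is a minimal Python illustration that runs an arbitrary probe command and maps the outcome onto those exit codes. The function names and the default timeout are assumptions.

```python
import subprocess
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def probe(command, timeout=30):
    """Run `command` (an argument list) and return (exit_code, status_line)."""
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return CRITICAL, "CRITICAL - probe timed out after %ds" % timeout
    except OSError as err:
        # The probe program itself could not be started.
        return UNKNOWN, "UNKNOWN - cannot run probe: %s" % err
    if result.returncode == 0:
        return OK, "OK - probe succeeded"
    return CRITICAL, "CRITICAL - probe exited with status %d" % result.returncode


def main(argv):
    """Plugin entry point: argv is the probe command; returns the exit code."""
    if not argv:
        print("UNKNOWN - no probe command given")
        return UNKNOWN
    code, line = probe(argv)
    print(line)
    return code


if __name__ == "__main__":
    # Demo only: probe a command that always succeeds.
    # A real plugin script would end with sys.exit(main(sys.argv[1:])).
    print(main([sys.executable, "-c", "pass"]))
```

A probe for the Mercurial service could then be something like "hg identify" run against a repository URL, on the assumption that a failure to reach the repository makes hg exit with a non-zero status.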

OpenGrok makes it into this diagram because it makes a private snapshot of each repository that it indexes, so as to provide a consistent view of the tree. We once managed to break the OpenGrok indexing of ON by trying to undo (rollback) a particular putback, so that it would vanish completely from the repository. We didn't know to roll back OpenGrok's snapshot repository as well. So the next time OpenGrok tried to pull from the Mercurial onnv-gate, it created a branch that had to be merged. This was not something OpenGrok was prepared for, so the snapshot tree was not updated. After several days, we started getting complaints from ON teams who couldn't find their recent putbacks in OpenGrok. We figured out the problem, replaced OpenGrok's snapshot repositories, and vowed not to undo/rollback any future putbacks.

So that's the "big picture" of what the SCM Migration project is working on. If you've been frustrated by how long things are taking, well, we're not happy about it, either. Our hope is that by keeping the entire picture in mind, we will not have any serious problems when we finally do move.

(2008-02-15 14:20:46.0)

Friday August 17, 2007

ksh93 Putback

April Chin put back ksh93 into the ON gate this morning. Woohoo! I'm delighted to have a modern, open-source Korn shell in OpenSolaris, and I'm looking forward to when we can (someday) retire the old Solaris ksh. Many thanks to April, Roland Mainz, and Don Cragun for all their work, as well as to everyone who participated in the project reviews and discussions.

(2007-08-17 21:13:34.0)

Friday May 18, 2007

Defeating the OpenSolaris Address Mangler

The opensolaris.org webapp includes an automatic email address mangler to make it harder for spammers to harvest email addresses. But it's not very smart, and it mangles things that aren't email addresses, like device paths and repository URLs. If you're editing an HTML page on the web site and you want to bypass the email mangler, replace "@" with an HTML character reference such as "&#64;", as in

ssh://anon&#64;hg.opensolaris.org/hg/onnv/onnv-gate
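If there are many such strings to protect, the substitution is easy to script. The following Python sketch assumes that the mangler only looks for a literal "@" and that the numeric character reference "&#64;" is an acceptable replacement; the function name is made up for the example.

```python
import html


def protect_at_signs(text):
    """Replace each literal '@' with the numeric character reference '&#64;'."""
    return text.replace("@", "&#64;")


if __name__ == "__main__":
    url = "ssh://anon@hg.opensolaris.org/hg/onnv/onnv-gate"
    protected = protect_at_signs(url)
    print(protected)
    # A browser decodes the reference back to '@' when rendering the page.
    assert html.unescape(protected) == url
```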

(2007-05-18 15:41:59.0)

Friday November 03, 2006

Testing nightly(1)

I've been making a lot of changes to nightly(1), the main ON build script. With most of the build tools, you can test your changes by adding the -t flag to your nightly options, but that doesn't work for nightly itself.

For a while, I was making $HOME/bin/nightly.new be a symbolic link to the new version, in whatever workspace it lived in. That got a bit awkward if I had more than one workspace with changes to nightly. Worse, because I had set things up so that I didn't really have to think about what I was doing, well, I wouldn't think about what I was doing--I would invoke nightly.new before invoking make to update it.

So now I manually do

$ cp usr/src/tools/proto/opt/onbld/bin/nightly ~/bin/nightly.new

It's more typing, but I've had to rerun fewer tests than I used to.



(2006-11-03 16:10:08.0)

Wednesday June 14, 2006

.o Files: Just Say No

As I mentioned in a previous entry, the ON sources that are available via opensolaris.org are a subset of the ON consolidation in the Solaris product (90% as of build 42, measured in lines of text). We've organized the ON source so that if part of a component (library, command, kernel module) is closed, the entire component is treated as closed. Closed components are in their own subtree (usr/closed), which parallels the open source tree (usr/src).

Since the OpenSolaris launch, I've gotten a few requests to support partial-source components. This is where the component is mostly open source, but it contains some code that can't be delivered as source for one reason or another. The usual proposal is to modify the OpenSolaris build to support a mix of .o and source files, similar to the way Sun and other vendors delivered their Unix kernels in the 1980s.

The appeal of supporting closed .o files is that once the infrastructure is in place, the open source for these components can be easily exposed outside Sun. And there's some precedent for this practice, such as the closed-source files that were split out from libc to libc_i18n.a, and the uuencoded .o files in the ath driver. But there are problems with this approach, and those problems are sufficiently large that we have no plans to implement it.

The most obvious problem with this approach is that it complicates the OpenSolaris build infrastructure. Mechanisms have to be implemented to identify which .o files need to be included with the closed binaries, and to restore them after a "make clean" (but only for external developers, mind you). By itself, this extra complexity wouldn't be sufficient to kill the .o approach, but it means we don't want to spend energy on it unless it's an approach we want to keep.

A second problem is that this approach blurs the separation between open and closed source. If we no longer have a system where everything in usr/src is delivered and nothing in usr/closed is delivered, we raise the risk that someone will accidentally copy closed code into an open file and forget to tag the open file as closed.

A third problem with this approach is that it makes it harder to work in the open source tree. Imagine you're working on a component that contains foo.c, bar.c, common.h, and baz.o, and you want to change a function myfunc() that's defined in foo.c. You can use nm(1) to determine that baz.o calls myfunc(), but without a copy of baz.c, you can't tell if your changes will break the code in baz.c or not.

Worse, suppose you want to change a struct that's defined in common.h. Maybe it's used for baz.o, maybe it isn't; there's no good way to tell[1]. And if it is used, all sorts of nastiness, including random memory corruption, is possible.

A fourth, more strategic, problem with this approach is that by reducing the barriers to having closed source in the Solaris product, it reduces the incentive to eliminate (or open up) the closed source.

A fifth and final problem with this approach is that it assumes a delivery model where the master workspace is internal to Sun, and the external workspace is just a mirror that is produced by filtering the main workspace. That's not the model we're after in the long term. Rather, we eventually want the external workspace to be the master.

So what about the precedents that I mentioned earlier? Let's look at libc and libc_i18n.a first. The libc library is such an important component that keeping it closed would have doomed OpenSolaris from the start. So we first tried creating a dynamic library (libc_i18n.so), where we could use spec files to enumerate and enforce the interface dependencies between libc_i18n and libc. But there was a performance hit that we couldn't figure out how to work around. So we settled for moving the code into a separate tree and doing some analysis to show that the libc_i18n code is fairly self-contained. That is, changes to private interfaces inside libc are unlikely to break the code in libc_i18n. But this isn't really a satisfactory approach. We adopted it with a great deal of reluctance, and only because of libc's importance.

The other precedent that I mentioned was the ath driver, which has some uuencoded .o files in the source tree. This approach reflects the regulatory requirements for wireless devices in (at least) the USA. Government certification is required to deploy these devices, and given the flexibility of the Atheros chip set, the Hardware Abstraction Layer (HAL) software is part of what gets certified. Change the HAL binaries, and you invalidate the certification. So even in Sun's internal source tree, the HAL files are kept as uuencoded binaries, not as source. This means that many of the issues with a general .o mechanism don't apply here. There is no special-case makefile magic for "make clean", and external developers get exactly what internal developers get.

In summary, the .o approach requires non-trivial work, it has legal risks, and it treats the external community as second-class citizens. For these reasons, we will not pursue it.

Notes

[1] We could set up the makefiles in such a way that you could tell if baz.o depends on something in common.h, but you wouldn't know exactly what.



(2006-06-14 09:00:00.0) Permalink

20051022 Saturday October 22, 2005

On the Road to Nightly Updates: Split Workspace

One of the milestones we're working towards for OpenSolaris is nightly updates. Currently we deliver updates every two weeks in the form of tarballs. Nightly updates would have some advantages for external developers, particularly if we do the deliveries using a source code management system instead of tarballs. First, it's usually easier to merge multiple sets of smaller changes than one massive set of changes. So it would be easier to synchronize project workspaces with the master sources. Second, if something breaks, it's easier to track down a problem if you know it was introduced on a particular date. That's because there are a lot fewer deltas to look through, compared to examining all the putbacks for a two-week period.

One of the requirements for nightly updates is making the delivery totally automated. The current biweekly deliveries to opensolaris.org require several staff hours each. Part of the problem is that the master workspace, /ws/onnv-gate, is organized as a single source tree that contains both open and closed source. The deliveries are done from a "split" workspace, with separate open and closed trees. But each delivery requires pulling over changes from /ws/onnv-gate[1] to the split workspace and reviewing them. New files are reviewed to see if they should go into the closed tree. Makefile changes have to be merged to ensure that the open tree still builds standalone. We can't afford to do this every day.

So one of the requirements for nightly updates is splitting /ws/onnv-gate into open and closed trees. This pushes the makefile changes and the review of new files closer to the people who are doing the bug fixing and project work. That scales better.

This raises the question: why hasn't /ws/onnv-gate been split already? Why not just put back the workspace that we're doing the biweekly deliveries from?

The short answer is that /ws/onnv-gate has requirements that our delivery workspace does not. For example, /ws/onnv-gate has to be able to build both the open and closed trees. The delivery workspace only has to build the open tree. /ws/onnv-gate has to produce both debug and non-debug builds. The delivery workspace only produces debug builds.

So one of the things I've been working on since before the OpenSolaris launch is a set of changes that splits /ws/onnv-gate while still meeting all of its requirements, such as being able to build both the open and closed trees. The changes require edits to around 600 files, and approximately 3300 files will get moved.

I've been meaning to blog about this for some time, but it's only recently that I made this a priority. I had originally figured that since the work mostly had to do with support for the closed tree, external developers wouldn't care much about it. But a recent thread on opensolaris-discuss got me thinking that since the changes are visible to external developers, it's better to share my plans, and to do it now, rather than when I put back.

More Details

With two parallel trees, one for open code (usr/src) and one for closed code (usr/closed), it seems natural to build the open tree and then build the closed tree. Unfortunately, a portion of libc must remain closed for now, and it must be built before libc can be built as a whole.[2]

Another approach might be to build the closed tree first. That won't work, because there is code in the closed tree that depends on having the headers installed from the open tree.

It might work to build both trees in stages, doing both trees for each stage before proceeding to the next stage. That is, install all the headers, build the open kernel code and then the closed kernel code, build the closed libraries and then the open libraries, etc. But this approach is fragile. As soon as one closed library requires that an open library be built first, this approach will break.

So what I've done is drive the entire build from the open tree, and let it jump into closed subdirectories as needed. For example, usr/src/lib/Makefile has

SUBDIRS= \
[...]
	../cmd/sgs/libdl	.WAIT	\
	$(CLOSED)/lib/libc_i18n		\
	libc			.WAIT	\
[...]
    

and the rule to build SUBDIRS is

$(SUBDIRS) abi: FRC
	@if [ -f $@/Makefile  ]; then \
		cd $@; pwd; $(MAKE) $(TARGET); \
	else \
		true; \
	fi
    

In this scheme, the build dependency between libc and the closed i18n code is expressed as

libc:		$(CLOSED)/lib/libc_i18n
    

If the libc_i18n sources are present, the [ -f $@/Makefile ] test will succeed, and the code will get built. If they aren't present, the test fails, /usr/bin/true is executed, and the libc_i18n target is a no-op.
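The same test can be tried outside of make. What follows is only a minimal shell sketch, not the actual ON build logic: the function name build_subdir and the scratch directory layout are made up for illustration, and the function just reports what it would do instead of running $(MAKE).

```shell
#!/bin/sh
# Hypothetical stand-in for the SUBDIRS rule: descend into a directory
# only if it contains a Makefile; otherwise do nothing and succeed.
build_subdir() {
    if [ -f "$1/Makefile" ]; then
        # The real rule runs $(MAKE) $(TARGET) here; this sketch only reports.
        (cd "$1" && echo "building in $1")
    else
        true
    fi
}

# Scratch tree (illustrative names): an "open" directory that has a
# Makefile, and a "closed" directory that is simply absent, as it would
# be in a workspace without the closed sources.
tmp=$(mktemp -d)
mkdir -p "$tmp/lib/libc"
: > "$tmp/lib/libc/Makefile"

open_out=$(build_subdir "$tmp/lib/libc")
closed_out=$(build_subdir "$tmp/closed/lib/libc_i18n")
closed_status=$?

echo "open:   $open_out"
echo "closed: '$closed_out' (exit status $closed_status)"
rm -rf "$tmp"
```

The missing directory produces no output and a zero exit status, which is why a parent make would carry on to the next subdirectory without complaint.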

For OpenSolaris builds, the closed binaries tarball will get copied into the proto area at the start of the build. Thus the fact that the closed tree is missing is harmless.

Status and the Future

The status of this work is that it's mostly implemented. We still have some cleanup to do. I expect our testing will find a few things that will need fixing prior to putback, which is currently targeted for snv_28.

Besides splitting the tree, there are a few other things that need to be done before we can have automated nightly deliveries. One item is providing better automation for generating the closed binaries tarball. Currently this is handled by building a full ON workspace, building an OpenSolaris workspace, and generating the delta between the two workspaces. This is awkward to automate because it depends on coordinating the successful builds of two workspaces, and our build scripts (nightly.sh in particular) aren't really set up for that. Of course, we could modify our scripts. But a better approach would be to modify the closed tree makefiles, so that a "make install" populates a "closed binaries" tree as well as the proto area. This would also make it easier to deliver non-debug kernel modules.
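To make the "delta between the two workspaces" idea concrete, here is a rough shell sketch of one way such a delta could be computed. It is not the script we actually use: the directory names full_proto and open_proto and the file names are placeholders, and it only finds files that are missing from the open build, not files whose contents differ.

```shell
#!/bin/sh
# Hypothetical layout: full_proto stands in for the output of a full ON
# build, open_proto for the output of an open-only build.
tmp=$(mktemp -d)
mkdir -p "$tmp/full_proto/usr/lib" "$tmp/open_proto/usr/lib"
: > "$tmp/full_proto/usr/lib/libopen.so"
: > "$tmp/full_proto/usr/lib/libclosed.so"
: > "$tmp/open_proto/usr/lib/libopen.so"

# List the files in each tree, relative to its root, in sorted order.
(cd "$tmp/full_proto" && find . -type f | sort) > "$tmp/full.list"
(cd "$tmp/open_proto" && find . -type f | sort) > "$tmp/open.list"

# comm -23 prints the lines that appear only in the first list, that is,
# files the full build produced and the open build did not.
closed_only=$(comm -23 "$tmp/full.list" "$tmp/open.list")
echo "$closed_only"
rm -rf "$tmp"
```

Even in this toy form, the approach needs both builds to have finished successfully before the comparison means anything, which is the coordination problem described above.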

Another task is to finish automating the generation of other deliverables, such as the BFU and SUNWonbld tarballs, and tie everything together so it can be run via nightly.sh.

A third task is to move the OpenSolaris downloads from the Sun Download Center (SDLC). The Studio compiler downloads need to stay on the SDLC, but the OpenSolaris sources, BFU archives, SUNWonbld binaries, and closed binaries will be moved to opensolaris.org. The reason for this move is that a certain amount of manual work is required any time we update files on the SDLC. Hosting the files on opensolaris.org gives us the freedom to implement an approach that can be 100% script-driven.

At some point we will start delivering the sources using a read-only Subversion repository.[3] I'm not sure whether we will have that in place before we move to nightly updates.

Besides nightly updates, another OpenSolaris milestone that splitting the tree facilitates is using opensolaris.org to host project work. Right now it's difficult for internal projects to move to opensolaris.org because most internal engineers don't have a clear picture of which files can be exposed externally and which must be kept confidential. After the tree has been split, the picture becomes clear: files in usr/src are okay to post on opensolaris.org, files in usr/closed must stay internal.


Notes:

[1] actually, we bringover from the snapshot workspace for the particular build that we're delivering.

[2] we'll probably need to fix this before we move the master workspace to be external.

[3] this isn't the final source code management setup. The point is to deploy something that's easier to work with than tarballs.



(2005-10-22 14:16:32.0) Permalink Comments [0]

20050811 Thursday August 11, 2005

Send In the Builds

Send in the builds
There ought to be builds
Why aren't they here?[1]

It's been a hectic couple of months since the OpenSolaris launch. Some folks have been asking why Sun isn't producing the regular biweekly builds like we said we would, and I'd like to let everyone know what's happening. First, some background:

The Solaris group puts out a numbered build every two weeks. The ON gatekeepers take a snapshot of the ON gate every other Monday evening. This gives them a week to deal with any problems found in Pre-Integration Testing (PIT). They then deliver packages built from the snapshot into the WOS. The Release Engineering staff produces ISO images from the WOS packages. These images then show up in places like Solaris Express.

What I did for the OpenSolaris Pilot, and what I planned to do after the launch (until we get a source code management system installed), was to merge the biweekly ON snapshot into the workspace that I use to generate the OpenSolaris deliveries. During the Pilot this typically took a couple days per delivery, and most of that was waiting for builds to complete. So I figured I could get a bunch of other work done after the launch while still producing the OpenSolaris deliveries at regular intervals.

Silly me.

The short explanation is that I've been extremely busy, and producing the deliveries hasn't been at the very top of my priority list. Instead, I've been working with some of the other members of the core OpenSolaris team to improve the automation of the delivery process and to make it possible for others to produce the deliveries. To make things worse, I just moved and bought a house, so I'm having to take a fair amount of vacation time right now.

Since it's been a couple weeks since the last delivery, I'm going to back-burner that other work and start integrating the crypto code that Darren Moffat and Dan McDonald have worked to make available. In parallel, Steve Lau is going to start on the build 20 delivery.

So my best guess is that the next delivery will be somewhere around 17-20 August. I'm not sure how much of the crypto code the next delivery will contain. I plan to work that out with Steve next week.

Steve has also been working on automation for the delivery process. So I'm hoping that soon after the next delivery we'll be able to get on a more regular schedule.

During the Pilot I tried to send out regular status announcements (weekly, if I remember correctly). This kept everyone informed as to when the next delivery was likely to happen. And for the folks external to Sun, it provided a useful heartbeat, showing that the Sun folks were still there and still working on things. Cyril Plisko has suggested that I resume regular status updates. Given the erratic delivery schedule since the launch, I think he has a point. But I want to also consider venue and frequency. I'd like for opensolaris-code to stay as technical as possible. opensolaris-discuss is an alternative, but it has a lot of traffic, and I don't want to add to that unless the postings are really useful. So right now I'm thinking of setting up a biweekly schedule and then only sending out mail if it looks like we'll miss the advertised delivery date.


[1] With apologies to Stephen Sondheim.



(2005-08-11 10:44:50.0) Permalink Comments [2]
