about the web, software etc. Recursion, n.: see 'Recursion'

Monday Apr 06, 2009

I recently made my first attempt to merge our project gate with then-current bits from the Nevada gate; a change I wanted to benefit from had just been put back, and I was eager to get the bits.

Here's what I should have done:

  • create a clone of the project gate (ilb-merge-clone) (we were based off of build 107 at the time)
  • reparent to onnv-clone
  • hg pull -u (this got us bits from build 111)
  • hg merge
  • hg commit
  • create webrevs vs. the project gate and vs. Nevada, inspect.
  • hg reparent to our opensolaris gate
  • hg push (to the project gate on opensolaris.org)

If I'd done it this way, all would have been fine. And I did, except for one step: before the last 'push', I did something I shouldn't have and issued an 'hg recommit'. If you're groaning now, there's probably no news for you in the rest of this piece ;-). If not, here's a little detail:

Whenever a change to the source is done (eg. a minor milestone or the fix for a bug), you can do 'hg commit' to record this in your repository. Such an action is also reversible with hg means, so it makes sense to commit fairly often. Note: commit doesn't change anything in the parent repository; rather, the change in the code is recorded in your working repository only. That way, you can accumulate several commits before pushing them to the parent repository (or "putting back" in teamware-speak).

When you compare your work with the repository you're planning to push this to, hg uses this history to find out which files actually changed and compare those that did.

Normally, when you want to push an incremental set of changes to your project gate, it's sufficient for these changes to appear as one changeset in the project history (esp if you're planning to integrate with Nevada anyway at a later point in time, where all those changes will again be collapsed into a single change-set). 'hg recommit' collapses all changes into one.

What was different in my situation: by pulling bits from onnv-clone, I'd imported Nevada's change-set history. Unless we'd been planning to integrate our project with Nevada right away (ie before (m)any more changes to Nevada), this history would have been essential for future merges.

By removing this history, I made that step very painful - I essentially declared all changes that Nevada had introduced since we'd started our project gate, and which I'd now incorporated into our gate, as "our" changes (and not as Nevada changes we just synced up with), so every future comparison with Nevada would not only show our project work as changed, but everything that had changed in Nevada since we'd created our gate.
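
To make the effect a bit more concrete, here's a toy model in Python. It is only a sketch and says nothing about how Mercurial works internally: a repository is reduced to a mapping from changeset id to the files that changeset touches, and all ids and file names are made up for illustration. The point is merely that once Nevada's changesets are collapsed into one of "ours", a comparison against Nevada can no longer tell them apart from project work.

```python
# Toy model, not Mercurial internals: a repository is a dict mapping a
# changeset id to the set of files that changeset touches.
# All ids and file names below are hypothetical.

def outgoing(local, parent):
    """Changesets present locally but unknown to the parent repository."""
    return set(local) - set(parent)

def files_differing(local, parent):
    """Files touched by outgoing changesets, i.e. what a comparison shows."""
    files = set()
    for cs in outgoing(local, parent):
        files |= local[cs]
    return files

def collapse(local, parent, new_id):
    """Replace all changesets outgoing to `parent` with one combined changeset."""
    out = outgoing(local, parent)
    combined = set()
    for cs in out:
        combined |= local[cs]
    collapsed = {cs: files for cs, files in local.items() if cs not in out}
    collapsed[new_id] = combined
    return collapsed

nevada = {"onnv_107": {"ip.c"}, "onnv_109": {"tcp.c"}, "onnv_111": {"zfs.c"}}
project_gate = {"onnv_107": {"ip.c"}, "ilb_work": {"ilb.c"}}

# After pull and merge: Nevada's changesets, ours, and the merge itself.
merged = {**nevada, **project_gate, "merge": set()}
print(sorted(files_differing(merged, nevada)))       # only our own file

# Collapsing everything outgoing to the project gate loses Nevada's identity.
recommitted = collapse(merged, project_gate, "sync_with_111")
print(sorted(files_differing(recommitted, nevada)))  # Nevada's files show up too
```

In the first comparison only the project's own file differs from Nevada; after the collapse, the files Nevada itself changed are reported as project changes as well.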

Lesson learned: just because something makes sense in one set of circumstances, it does not necessarily do so when the circumstances change - even if the change is minimal.

I finally got Nevada history back into our repository with a lot of help from Mark J Nelson (and that was not trivial!), and if you want to buy him a beer the next time you see him, you're more than welcome to :-)

Thursday Mar 12, 2009

here follows the text of an email that I just sent out to ilb-dev@opensolaris.org; if you feel inclined to respond, please do so on the same alias so the discussion can stay in one place. thx.

We need to nail down the semantics and the broad design of "ilbadm export" and "ilbadm import" (note, this diverges from the spec as it's currently posted, where we still talk about import-rules etc. This has been simplified).
In addition, there's what we have come to call "persistent configuration", which is conceptually an extension of ilbd, basically a copy of ilbd's running configuration that persists across start/stop of ilbd (and reboot of the server). (this has nothing to do with "session persistence", btw)

We have been investigating how closely these are related and what impact this has; here's some points that we seek understanding and your input on:

* import/export: the initial requirement was to have a means to take the current configuration of the load balancer ("ilbadm export <file>"), transport it to another machine by whatever means and re-apply ("ilbadm import <file>") it there.

Q1: Would people agree that this requirement makes sense?
Q1a: if yes, would you agree that the format we use here is of secondary concern, ie. private?

Since we already have a parser in place for the CLI, it was suggested to use that format for import/export (initially, anyway) to avoid having to duplicate effort.

Q1b: does the above cause concern for anyone?

* persistent config: this was initially planned to be completely invisible to users/admins, and exclusively maintained by ilbd. (As I understood it,) this configuration file would be updated with every change the admin made via ilbadm that was not explicitly designated temporary. Whenever ilbd starts, it was supposed to clear all rules from the kernel (in case it, the daemon, had died unexpectedly), re-read the persistent config and apply all that to the kernel (ie all rules).

There is also discussion of a different usage model, namely, requiring explicit admin action ("commit") to cause the running configuration to be saved in the persistent config file. (There seems to be precedent for that in the industry, but I couldn't find a reference right now)

There are tradeoffs for both models.

Q2: which of the above models (explicit commit vs implicit update) do you think is more appropriate for us?
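
For readers who prefer code to prose, the two models can be sketched roughly as follows. This is a hypothetical Python illustration only, not ilbd code: the class and method names are made up, a dict stands in for the persistent config file, and "restart" mimics the daemon clearing all rules and re-applying the persistent configuration.

```python
class ConfigStore:
    """Running configuration plus a persistent copy (hypothetical sketch).

    A dict stands in for the persistent config file.
    """

    def __init__(self, implicit_update):
        self.implicit_update = implicit_update
        self.running = {}
        self.persistent = {}

    def create_rule(self, name, rule, temporary=False):
        """Apply a rule to the running configuration."""
        self.running[name] = rule
        # Implicit model: every non-temporary change is saved right away.
        if self.implicit_update and not temporary:
            self.persistent[name] = rule

    def commit(self):
        """Explicit model: the admin saves the running configuration."""
        self.persistent = dict(self.running)

    def restart(self):
        """Daemon start: clear all rules, then re-apply the persistent config."""
        self.running = dict(self.persistent)


implicit = ConfigStore(implicit_update=True)
implicit.create_rule("r1", "rule one")
implicit.create_rule("r2", "rule two", temporary=True)
implicit.restart()
print(sorted(implicit.running))   # r1 survives, the temporary r2 does not

explicit = ConfigStore(implicit_update=False)
explicit.create_rule("r1", "rule one")
explicit.restart()
print(sorted(explicit.running))   # nothing was committed, so nothing survives
```

The tradeoff shows up directly: the implicit model never loses a change but needs a notion of "temporary", while the explicit model lets an admin experiment freely but loses uncommitted work on restart.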

The argument that "we already have a parser for CLI" also seems attractive here, so we're toying with the idea of using the same syntax we use for import/export for persistent config as well; since SMF handles ilbd restart, it would be easy to add an invocation of "ilbadm read-config" (or sth. to indicate that persistent config is meant) to the service's start method.

Q3: since persistent configuration is viewed as a private component of the ilb framework, is it acceptable from a privilege/authorisation POV to expose any interaction with it via ilbadm?

If we went with the "commit" model outlined above, a suitable subcommand could be added to ilbadm, which would presumably perform the actual update to the persistent config file (in line with the "we already have ..." argument). This would have maybe even more severe implications than Q3 indicates, as ilbadm would be *writing* persistent config.

Q4: would this cause security considerations?

TIA for your (timely ;-) thoughts.

Monday Feb 09, 2009

as I announced here last week, I put back the first wad of code for the ilb project to an opensolaris.org - hosted repository. To repeat what I said in that email: this is not much more than a code drop, and things will break/go wrong/be missing. Nevertheless, if you're still interested, you're by all means welcome to give it a try and let us know how you fare. Please direct your comments to ilb-dev@opensolaris.org.

have fun!

Friday Oct 24, 2008

I'm happy to say that today we achieved another milestone in our project: Sangeeta published the design document here. Comments to ilb-dev@opensolaris.org welcome!

Sunday Jul 13, 2008

One of the requirements that need to be fulfilled by an offering in the load balancer space is the ability to periodically check the health of its back end servers.

The health of a back end server can be defined in several ways:

  1. the ability to respond to ping
  2. the ability to perform a tcp handshake
  3. the ability for a server application (ie. http server) to respond with meaningful data to a request
  4. (your favourite method here ...)

in addition to the check, there must be an ability to report either the health status of a given server, or to report the change of status for a given server when this status changes (ie, when a server dies or comes back to life).
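
As a rough illustration of method 2 and of reporting status changes, here's a minimal Python sketch. It is not the design of the health check subsystem (that is still to be finalised); the function and class names are made up, and the demo checks a throwaway listener on the local machine rather than a real back end server.

```python
import socket


def tcp_check(host, port, timeout=2.0):
    """Method 2: healthy if a TCP handshake completes within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class StatusTracker:
    """Remembers the last known state per server and reports only changes."""

    def __init__(self):
        self.alive = {}

    def update(self, server, is_alive):
        """Record a check result; return a message only if the state changed."""
        previous = self.alive.get(server)
        self.alive[server] = is_alive
        if previous is None or previous == is_alive:
            return None
        return "came back to life" if is_alive else "died"


def demo():
    """Check a local listener, close it, and check again."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
    listener.listen(1)
    host, port = listener.getsockname()

    tracker = StatusTracker()
    events = [tracker.update("backend", tcp_check(host, port))]
    listener.close()
    events.append(tracker.update("backend", tcp_check(host, port)))
    return events


if __name__ == "__main__":
    print(demo())
```

The tracker stays silent while a server's state is unchanged, which matches the second reporting style mentioned above (report when a server dies or comes back to life).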

We examined a few open-source network monitoring tools (I think nagios was among those tools, as well as OpenNMS ... I wasn't too deeply involved in this part, so I don't know the details), but came to the conclusion that none was suited well enough for our purposes, so we decided we'd need to build our own. We still need to finalise the design, but I think I can give a basic outline of what will be required for a health check subsystem within the ILB project, as well as some of the requirements on other parts of the ilb project to accommodate HC:

  • HC will (initially) be private to ilb. 
  • we plan to implement this as a daemon, ie. hcd (health check daemon).
  • lbadm, the tool to administer ilb, will also be the only means to administer hcd.
  • hcd will not maintain any persistent state.
  • for this release, all back end servers for a lb rule will be checked by the same health check.
  • as a consequence of the above, since a server can be part of more than one rule, it must be possible to perform several checks on the same server.
  • ilb will be able to distinguish between permanent removal of a back end server (eg. by an administrator) and temporary removal of a back end server (eg. when it is unreachable over the network) from a rule.
  • hcd will implement some kind of capability to log the fact that a server has died (eg. using syslog).

I drew a crude picture of what I believe represents how hcd fits into the rest of the ilb infrastructure (so far) - I didn't spend much time on it, nor am I the born artist with electronic paint tools, so I'll ask you to excuse the craftsmanship and concentrate on the content ;-)

Monday Jun 23, 2008

I consider blogs to be "work in progress", but this entry seems to be even more so - and since it's also describing work in progress, somehow recursive :-)

One of the pieces still missing from (Open)Solaris is the capability to forward incoming IP packets to a set of (more than one) hosts from within the kernel, ie. to do load balancing.

The main benefit of an in-kernel load balancer vs. a userland-based one is the much reduced traffic of networking data ("payload") through the kernel/userland boundary. Traffic across this boundary is known to be expensive, therefore the fact that we incur less of it means that - all other things being equal - we can achieve better performance, both wrt connections per second and wrt throughput.

To address this, we recently created a prototype with very basic load balancing capabilities that we're hoping to put out on opensolaris.org once all the formalities (read: legal stuff) have been completed. You may have seen Sangeeta's email proposing this project for opensolaris: http://www.opensolaris.org/jive/thread.jspa?threadID=64639&tstart=0. We're also going to be soliciting input from people who would like to actively test this prototype.

We realise that a full product offering around a load balancer is unlikely to be achievable within the time it would make sense for us to do so, from the point of view of the addressable market, so we're going to concentrate on providing the infrastructure necessary for developers and OEMs to optimally exploit this capability we're introducing. (Plans on *when* this is going to happen, and what exactly is going to be in which delivery aren't quite finalised, so please bear with us ...)

Even before we release the code, I think I can present a short overview of what the prototype consists of. We have:
- the in-kernel forwarding engine ("ilb" = internal load balancer, which we also use as name for the whole project ...)
- the command-line utility ("lbadm").
Things like redundancy (ie. failover), backend server healthcheck etc. were not implemented for the POC.

My task was and is to define the requirements for, and then design and implement the CLI. While this sounds rather straightforward, the devil's in the detail, as usual. Here are some of the questions being asked of the CLI and of the CLI/kernel module combo, along with their answers:

  1. what does the CLI do? (that's the obvious one ;-)
    A: Administrate all ILB rules and display associated information.
  2.  what is the "unit of currency" the ilb handles?
    A: (as indicated above) a rule. A rule consists of:
      a. a set of conditions to be met by the incoming packet
      b. the destination for a packet that matches the above conditions
      c. additional information for the load balancer.
  3.  is there precedent in Solaris for similar functionality (ie, do we want to look at dladm or perhaps zfs)?
    A: the model we chose to follow is flowadm (coming with the crossbow project, not yet in Solaris) (see http://dlc.sun.com/osol/netvirt/downloads/20080310/flowadm.1m.txt), the basic structure is

        command subcommand [options] [object]

    and a subcommand always is of the form "verb-object" eg "show-flow" or, in the case of lbadm, "create-rule". The object in our case is the rule.
  4. how do we structure the CLI?
    A: for the prototype, the CLI was one monolithic, stand-alone binary.
  5. how does the CLI talk to the kernel?
    A: for communication between CLI and kernel, we created a data structure to contain all the relevant information and defined an ioctl for passing information to and fro.
  6. what about statistics?
    A: currently, the kernel maintains a basic set of kernel statistics (kstats); some of them for the whole module, some on a per-rule basis and some on a per-backend server basis. For the prototype, I created a shell script to read the data via kstat(1) and perform some mangling on them to produce vmstat(1)-like output.
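
To illustrate answers 2 and 3, here's a small Python sketch of a rule and of the "command subcommand [options] [object]" structure. It is purely illustrative and not the prototype's code (which talks to the kernel via an ioctl): the field names, the example addresses, and the simplification that anything starting with "-" is an option without an argument are all my assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    """The "unit of currency": hypothetical field names for parts a, b and c."""
    name: str
    conditions: dict                            # (a) what an incoming packet must match
    destinations: list                          # (b) where matching packets are sent
    extra: dict = field(default_factory=dict)   # (c) additional information


def split_subcommand(subcommand):
    """Split a "verb-object" subcommand such as "create-rule" into its parts."""
    verb, sep, obj = subcommand.partition("-")
    if not sep or not verb or not obj:
        raise ValueError(f"expected verb-object, got {subcommand!r}")
    return verb, obj


def parse(argv):
    """Parse: command subcommand [options] [object].

    Simplification: every argument starting with "-" counts as an option
    and options take no values.
    """
    if len(argv) < 2:
        raise ValueError("expected at least: command subcommand")
    command, subcommand, *rest = argv
    verb, object_type = split_subcommand(subcommand)
    return {
        "command": command,
        "verb": verb,
        "object_type": object_type,
        "options": [a for a in rest if a.startswith("-")],
        "operands": [a for a in rest if not a.startswith("-")],
    }


rule = Rule(
    name="rule1",
    conditions={"protocol": "tcp", "port": 80},
    destinations=["192.0.2.10", "192.0.2.11"],
)
print(parse(["lbadm", "create-rule", rule.name]))
```

Keeping the verb and the object type separate makes it easy to add further objects later (host pools, for instance) without inventing a new command structure.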


some of the additions/modifications which will be implemented by this project:

  • the CLI functionality will be split into a library and a CLI consuming the library. The purpose of this is to enable 3rd parties to make use of this infrastructure.
  • integration of statistics display into lbadm.
  • addition of failover functionality using VRRP.
  • add configuration persistence and integrate with SMF.
  • integration with ipnat configuration (see footnote 1).
  • implement some form of check for the "health" of backend servers
  • enable management of several hosts as single entities (host pools)
  • connection "stickiness"


1) so far, I've not explained one major aspect: load balancing methods and topology. Topologies known in the industry are DSR (direct server return - the load balancer never sees return traffic, or just forwards it back without any modification) and NAT (half vs. full); known methods are round-robin or various forms of connection weighting. ipfilter, which has been in Solaris for quite some time and has been available as an opensource project for much longer, has some NAT functionality. For the prototype, we implemented DSR functionality separate from ipfilter's nat functionality, and in no way integrated the administration of ipnat with lbadm.
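
For those unfamiliar with the methods mentioned in the footnote, here's a minimal Python sketch of plain round-robin and one simple form of weighting. This is a generic textbook illustration, not the algorithm used in the ilb kernel module; server names and weights are made up.

```python
import itertools


def round_robin(servers):
    """Hand out servers in turn, starting over after the last one."""
    return itertools.cycle(servers)


def weighted_round_robin(weights):
    """Simple weighting: a server with weight w appears w times per cycle.

    `weights` maps a server name to a positive integer weight.
    """
    expanded = [server for server, w in weights.items() for _ in range(w)]
    return itertools.cycle(expanded)


plain = round_robin(["a", "b", "c"])
print([next(plain) for _ in range(4)])

weighted = weighted_round_robin({"a": 2, "b": 1})
print([next(weighted) for _ in range(6)])
```

The expansion approach sends bursts to the heavier server; real implementations usually interleave the choices more evenly, but the resulting proportions are the same.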



Tuesday Apr 08, 2008

I'm happy to report that Harley Hahn's recent book, "Harley Hahn's Guide to Unix and Linux", appeared on my doorstep a few days ago.

[Image: front cover of Harley's new book]

Why is this relevant to me and to Sun?

It's relevant to me because I reviewed the whole book, chapter by chapter, as Harley wrote it. I was by no means the only person to do so, as you will see when you read the acknowledgements, but I'm proud to have been part of the effort. This is a new book following Harley's earlier "Student's Guide to Unix" and "Harley Hahn's Student Guide to Unix"; I've now done this (ie, review a book on Unix for Harley) for the third time, but still there's a lot to be learned from helping (if only a little) on such a project.

I also believe it's relevant to Sun, as I helped convince Harley to use Solaris as a test platform for his examples, besides Linux and FreeBSD (Harley was kind enough to forward me some snippets of my own emails proving this). I need to thank my colleague and former team mate Helmut vom Sondern, as he was kind enough to let Harley access a zone (and was very helpful keeping it accessible from across the Atlantic) on one of his systems to do all those tests (Antoon Huiskens also helped initially).

Initially, I got involved in this effort in 1992, when Harley sent out a message on comp.unix.questions, titled "Request for opinions about a new Unix book" (here's the complete message), which I saw and responded to. Apparently, Harley saw some merit in what I had to say, and I ended up reviewing the whole book (which then became "A Student's Guide to Unix") for him. Since then, I also reviewed the 2nd edition and various other books about the Internet for Harley, though none in as much detail as the Unix books.

I've stopped reading usenet since (mainly for the sheer volume), but it shows what can happen if you're not careful ;-)

You might like to know that the artwork on the book's cover, as shown in the picture, is from one of Harley's own paintings - he has quite a few of them at home! There's still a debate going on whether he's a better author or a better painter.

 

Thursday Feb 07, 2008

The last year, since joining the Engineering organisation at Sun, I've been working on what started as the xen project (a port of the opensource xen project to Solaris) but, for reasons I won't go into, had to be renamed to xVM, a name which has since taken on almost a life of its own and is now an umbrella for a lot of the virtualisation technology at Sun.

A few words on what xVM is and how it works: xVM is a technology for running more than one instance of an operating system (such as (Open)Solaris or Linux) on the same physical machine at the same time. To provide isolation for the running OS instances, also called (guest) domains, or guests, from one another, xVM provides a thin layer called the hypervisor which acts like a virtual machine towards the running domains. The hypervisor exposes a set of interfaces to these domains by means of which the domains can access HW and communicate with one another and the outside world. This implies that the OS in question be made aware of the fact that it's running on top of the hypervisor; this is called paravirtualisation.
There are also provisions to run an OS completely agnostic of the underlying hypervisor; in xen parlance, this is an HVM or fully virtualised guest. These require an additional emulation layer for I/O, which incurs a performance penalty.
Finally, there are paravirtualised drivers for otherwise unmodified guests to work around the cost of the aforementioned emulation layer.

The hypervisor itself is minimal; for administrative stuff (like starting guest domains) and for things like access to network cards etc, it requires a privileged guest, dom0. All other guest domains are called domUs. (There is work underway to let domUs access eg. I/O HW directly, so the above description is a bit out of date, but it's sufficient for the following discussion.)

Opensolaris.org hosts a xen community page, where you can read a lot of good stuff about this effort.

 

Coming, as I was, from the Networking group, I worked on - surprise! - networking stuff, specifically, on network communication between domains.
Currently, domUs cannot talk to one another directly (something our friends from the LDoms league have already achieved), but have to go via dom0. You can see an illustration of the data flow here (and thx to my colleague David Edmondson for letting me share this). If you're familiar with the way this looks in Linux, you'll notice some differences. These are due to vnics, technology we borrowed from the crossbow project at an early stage, another part of our overall virtualisation strategy that's coming along nicely. More below. A word on names: in Xen parlance, a communication channel consists of a frontend and a backend. The frontend lives in domU, the backend lives in dom0 (therefore xnf and xnb).

There are two distinct mechanisms that can be used for transporting data between domains: page flipping and hypervisor copying. For page flipping, the hypervisor sets aside some of its own (fairly limited) memory, which can then be used to transfer data between domains, involving repeated mappings into and out of the domains' address space. For hypervisor copy (HVcopy), OTOH, guest domains set aside their memory and indicate this memory to the hypervisor, which copies data into this memory at another domain's request (in our case, only dom0 - this means that for now HVcopy is only used for moving data into a domU, not out).
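
The difference in who provides the memory can be sketched with a toy model in Python. To be clear, this has nothing to do with the real hypervisor interfaces or the xnb/xnf code; the class names, the pool size and the buffer size are all invented, and the model only captures the accounting described above: a limited hypervisor-owned pool on one side, guest-supplied buffers on the other.

```python
class PageFlipPool:
    """Toy model: transfers consume pages from a limited hypervisor pool."""

    def __init__(self, pages):
        self.free_pages = pages

    def transfer(self, packet):
        """Use one pool page per packet; fail when the pool is exhausted."""
        if self.free_pages == 0:
            raise MemoryError("no pages left for page flipping")
        self.free_pages -= 1
        return packet

    def release(self):
        """Return a page to the pool once the receiver is done with it."""
        self.free_pages += 1


def hv_copy(packet, guest_buffer):
    """Toy model: copy the packet into memory the guest itself set aside."""
    if len(packet) > len(guest_buffer):
        raise ValueError("packet does not fit into the guest's buffer")
    guest_buffer[:len(packet)] = packet
    return len(packet)


pool = PageFlipPool(pages=2)
pool.transfer(b"one")
pool.transfer(b"two")
try:
    pool.transfer(b"three")          # pool exhausted under load
except MemoryError as err:
    print("page flipping:", err)

buffer = bytearray(16)               # memory provided by the receiving guest
copied = hv_copy(b"three", buffer)
print("HVcopy:", bytes(buffer[:copied]))
```

With the copy model, the amount of memory available for transfers grows with what the guests provide instead of being capped by the hypervisor's own pool.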

When I joined, only page flipping had been implemented for xVM, but it was found desirable to also have hypervisor copy. The reasons I can dredge up from memory are:

  1. in high-load situations, the hypervisor would run out of pages available for page flipping, which would cause tons of error messages on the console and a significant drop in NW throughput.
  2. some Windows drivers (AFAIK Windows itself has not been paravirtualised) only "speak" HVcopy, and we intend (Open)Solaris to be able to host Windows domUs as well.
  3. expected performance increase.

I implemented HVcopy for xnb and xnf, and am happy to report that we successfully addressed issues 1 and 3. We could not find Windows drivers that would interoperate with Solaris as domU (corrected:) dom0, so that is as yet unresolved.

 

As I mentioned above, the vnic code we got from the crossbow project, which we incorporated into the xVM code and which was put back to Nevada in build 75, was a very early version, and this code has seen some significant change in the crossbow project itself since. Recently, I updated the relevant parts of the xVM code (xnbo, xnbu) to the new "look and feel" of the crossbow API - this means that as of about mid-January, it is possible for the crossbow "bits" to be booted as dom0, to start domUs, and (this is what it's all about, obviously) to successfully move network traffic in and out of the guest domains. As crossbow itself is not a part of Nevada yet, this change has only taken place in the crossbow "gate" (the copy of the Nevada gate where the crossbow development happens). Please refer to the crossbow community pages for when you can download what.


Update: Nicolas Droux pointed me to a more detailed document about network virtualisation in Solaris. Thx.