SteveJay's Weblog


Wednesday June 15, 2005

InfiniBand HCA driver missing from OpenSolaris?

Yeah, that's right. Unfortunately, Sun is not yet able to open up the source code for our Solaris InfiniBand HCA driver. (One of my colleagues, Steve Rust, touches on this in his most recent blog entry.) Although we wrote all the code ourselves, we did it with access to info that we got under NDA. So we're still under obligation not to disclose anything. I sincerely hope we will soon be able to open it up too, because there is some really interesting code in there that Steve R. and I are really proud of. For now, though, I guess it is among those few OpenSolaris drivers which you can get only as a binary.

The driver itself is called tavor and it basically started out as my baby (Steve R. owns it now). After my work on the Solaris 1394 Software Framework (and a handful of aborted or "development only" projects with InfiniBand HCAs), I finally got an opportunity in early 2002 to design and implement my own driver, from the ground up. The driver was to be for the Mellanox InfiniHost MT23108 HCA device, which was going to be the central I/O component in a SPARC-based blade server platform (which we ultimately never shipped).

But although the driver started out life with a very specific purpose for a very specific (and since canceled) platform, we (the engineers) saw value from the beginning in making it work well with plug-in cards. And today, a plug-in card is still the primary mechanism for adding InfiniBand to a system.

It took about a year and a half of design/implementation/testing before it was ready for putback into Solaris (Steve Rust's blog says August 6th, 2003 and I'll trust him, since he was our 'gatekeeper' for the entire Solaris InfiniBand Framework putback). Subsequent to that putback, there were bug fixes (obviously), enhancements for x86 and AMD64 support, the userland access support, and (most recently) support for Shared Receive Queues (SRQ) and for the new Mellanox InfiniHost III Ex MT25208 HCA device.

The latter half of that work above was done by Steve R. and was done subsequent to the Solaris 10 release. (I had the "project lead" role, he did all the hard work.) But if you want to check out the fruits of Steve R.'s latest work - check out Solaris Express 04/05 for the latest 'tavor' bits.

I know we're both extremely proud of this code (and really do wish we could show it off). And it's got some really fun stuff in it: handling userland access to HCA resources (i.e. OS bypass for lower latency), extreme configurability (honestly probably too configurable), a fancy mechanism for keeping track of "Work Request Identifiers" (for which I recently received US Patent #6,901,463), and a cool queue pair number allocation/reuse scheme for which Steve R. and I have a patent pending.

But anyway, I probably sound like a tease, since the driver isn't yet available in source form. But, if you're interested in InfiniBand, there's still plenty of really excellent code in OpenSolaris to check out. (Check Steve Rust's latest blog entry "InfiniBand Support in OpenSolaris" for a good starter.)

And if you've got an InfiniBand HCA card (from any of a number of vendors - Sun, TopSpin, Mellanox, etc.), then you can see this driver attached to your hardware and you can use it. (Matter of fact, like I said, this same 'tavor' driver will also attach to and operate on the latest generation of Mellanox's PCI-Express-capable InfiniHost III Ex MT25208 card. So, if you've got a system with PCI-Express - there are a few out there and I know Sun's got plenty coming - then you can get some really kick-ass performance out of our IB stack.)

Also, if you want to read more about our Solaris IB stuff, here's a few blogs by some other colleagues of mine:

There's a ton of other engineers (dozens, literally) who've contributed to the Solaris InfiniBand Framework. But maybe they're a little shy? I can't seem to find blogs by any of them. Anyway, they should all be proud too. And I'm sure that they are happy to have you folks able to see their code now in OpenSolaris.

Shoot me a comment (below) if you've used our IB software, or if you've been poking through the code. I'm very curious to hear from folks on the other end about what they think of our work.

(2005-06-15 18:14:00.0)

Comments:

I'm not sure I understand the point of your patented work list tracking scheme. In the Linux driver, I just keep an array of work request IDs to go along with my array of actual work requests accessed by the hardware.

When I get a CQE to process, I just take the work request address, convert that back to an index in the queue, and look up the work request ID at the same index. There's no searching, and certainly keeping two parallel arrays doesn't seem patentable to me.

Posted by Roland Dreier on September 06, 2005 at 01:25 PM EDT #

Yeah, I guess we could argue about whether two parallel arrays is "novel and non-obvious to
someone skilled in the art", as they say... and I might even agree with you.  Nonetheless,
it was granted, which I was/am happy about (that and the money and plaque that Sun gives!)...
and, come on, the idea itself _is_ more than just the sum of its data structures.

Nonetheless, I am familiar with the solution you've chosen, i.e. convert back to an index
in the queue.  And it's a fine solution.  It was actually what I first coded (about 4 years
ago now).  And it's been in that vendor's HCA code for years too.  But it didn't ultimately
solve the problem that we were trying to solve at the time.

Specifically, before the IB spec was updated to clarify this ambiguity and allow for it, we
were trying to handle the following case:

    1. Take a CQ that has completions in it.

    2. Someone frees (or resets) the QP to
       which the original work request(s) had
       been posted.

    3. Now, decide to poll the CQ and determine
       which work requests had completed (on the
       now freed QP).

If we use the indexing scheme, we can't handle step 3 above.
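
For anyone following along, here is a minimal C sketch of the indexing scheme under
discussion.  Every name in it (struct wq, wq_post, wq_lookup_wrid, WQE_STRIDE, QUEUE_DEPTH)
is hypothetical and chosen for illustration; it is not taken from tavor or from the Linux
driver.  It only shows the general idea: the WQE address in the CQE is turned back into a
slot index, and the lookup has nothing to index into once the QP is gone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout; none of these names come from a real driver. */
#define WQE_STRIDE  64u   /* assumed size of one work queue entry, in bytes */
#define QUEUE_DEPTH 8u    /* assumed number of entries in the work queue */

struct wq {
    uint64_t base;                 /* address of the first WQE */
    uint64_t wrid[QUEUE_DEPTH];    /* wrid[i] belongs to the WQE in slot i */
};

/* Record the caller's work request ID when a request is posted to a slot. */
static void wq_post(struct wq *q, unsigned slot, uint64_t wrid)
{
    assert(q != NULL && slot < QUEUE_DEPTH);
    q->wrid[slot] = wrid;
}

/*
 * Map the WQE address reported in a CQE back to the work request ID.
 * Returns 0 on success, -1 if the queue is gone or the address is not ours.
 */
static int wq_lookup_wrid(const struct wq *q, uint64_t wqe_addr, uint64_t *wrid)
{
    if (q == NULL)                 /* QP already freed: nothing to index into */
        return -1;
    if (wqe_addr < q->base)
        return -1;

    uint64_t off = wqe_addr - q->base;
    if (off % WQE_STRIDE != 0 || off / WQE_STRIDE >= QUEUE_DEPTH)
        return -1;

    *wrid = q->wrid[off / WQE_STRIDE];
    return 0;
}
```

In this sketch the freed-QP case from step 3 shows up as the NULL check: the CQE still names
a WQE address, but the parallel array that held the matching ID no longer exists.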

"So?" you might say, "the spec doesn't require you to handle that".  And you'd be right.
The IB spec was eventually updated to allow the CI to simply "throw away" those completions
and return nothing, if polled.  But at the time I did this work, that was still considered
an ambiguity in the spec, and we (Sun) considered it more important to preserve
the information about the completed requests.

You are correct that the small amount of searching costs a little bit more (one reason we
don't do it in our OS-bypass uDAPL implementation), but it's never going to search very far...
and that was the tradeoff for ensuring that completions were/are faithfully reported.

We may change that kernel code someday.  Dunno.

The easy way to make the whole issue moot, of course, would be for IB HCA vendors to allow
the WRID to be packed in the WQE and returned in the CQE.  Then there's no indexing, no
searching, no special cases to handle... the QP and the CQ are treated (as they should be,
IMHO) as two separate queues, not implicitly tied together.  But the dynamics of the IB HCA
hardware market are such that essentially one vendor holds the de facto standard.  And they
decided not to do it that way.
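
As a rough illustration of that alternative, here is a small C sketch.  The entry formats
and the functions (struct wqe, struct cqe, hw_complete, cq_poll_wrid) are all made up for
this example, and hw_complete is just a software stand-in for what the hardware would have
to do.  No shipping HCA is claimed to work this way.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry formats; no shipping HCA is claimed to use these. */
struct wqe {
    uint64_t wrid;     /* caller's work request ID, packed in at post time */
    uint32_t opcode;
};

struct cqe {
    uint64_t wrid;     /* copied from the WQE by the (imagined) hardware */
    uint32_t status;
};

/* Stand-in for the hardware: completing a WQE echoes its WRID into the CQE. */
static struct cqe hw_complete(const struct wqe *w, uint32_t status)
{
    assert(w != NULL);
    struct cqe c = { .wrid = w->wrid, .status = status };
    return c;
}

/* Polling reads the WRID straight out of the CQE; no QP state is consulted. */
static uint64_t cq_poll_wrid(const struct cqe *c)
{
    assert(c != NULL);
    return c->wrid;
}
```

Since the CQE carries its own copy of the ID, the work queue (and the QP that owned it) can
be reused or torn down without affecting what a later poll reports.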

Thanks for confirming that someone actually reads this blog, though, Roland!  Cheers.

Posted by SteveJay on September 07, 2005 at 09:54 PM EDT #

Hi Steve. I will be working with a Solaris/Sun Customer that is having trouble getting SRP to work in Solaris with the Topspin Switches/FC Gateway. Do you know if SRP functionality is built into the Solaris driver? Or perhaps they need to use a topspin derived driver if available? It would be good if they can use the Solaris stack for complete supportability. These are all the details I have at this time.

Posted by James Triestman on April 28, 2006 at 10:22 AM EDT #
