Solaris x86, Device DMA, and the DDI
I'm going to start a monthly blog entry on a DDI subject of your choice...
Now that OpenSolaris is available, I can get into some decent detail
referencing kernel code when needed.
So I'm seeking a topic for July... Anyone interested, submit your DDI
topic of interest in the comments and I'll pick one for July....
For June, I'm going to talk about Solaris x86, device DMA, and the DDI..
Mostly because that's what I have been spending some of my time on lately...
I write code, not documents, so don't expect too much. I can't spell worth a
damn... Grammar is poor. Use the wrong words a lot. I'll probably jump around
a lot too :-). Hopefully, you'll still get something meaningful out of this ;-)
I'm assuming for this entry you already know a little bit about the
DDI DMA interfaces in Solaris. If not, you can look at the following
manpages for a little background...
- ddi_dma_alloc_handle(9F)
- ddi_dma_free_handle(9F)
- ddi_dma_addr_bind_handle(9F)
- ddi_dma_buf_bind_handle(9F)
- ddi_dma_unbind_handle(9F)
- ddi_dma_sync(9F)
- ddi_dma_getwin(9F)
- ddi_dma_nextcookie(9F)
- ddi_dma_attr(9S)
- ddi_dma_cookie(9S)
The implementation of these routines lives in sunddi.c, where sunddi.o
resides in genunix (/kernel/amd64/genunix). You'll see when you look at this
code that most of these routines are just simple wrappers which will
eventually end up in architecture-specific code.
Jumping ahead a little, on x86, we end up in the rootnex driver. The rootnex
driver is the x86 root nexus driver. Nexus drivers implement the bus_ops
interface in dev_ops.
Basically, drivers are hierarchical where nexus drivers can have children
which could be either other nexus drivers or leaf drivers. A leaf driver is the
last driver in a branch (i.e. can't have any children). The root nexus driver
is the root driver, similar to the root of a filesystem. Anyway, that's a
subject of another entry. For now, just trust me that we end up in rootnex
for x86 :-)
So a quick mapping of code is:
ddi_dma_nextcookie
stays in genunix...
Now let me be the first to say this is pretty rough code... This should be
changing soon, but for now, you have been warned... So now that you know where
the code is, I'm going to jump back up to a higher level...
If you're still interested, you probably already know the normal sequence is to
alloc a dma handle, bind a buffer, get your cookies (physical addresses
to DMA into), sync if reading from memory, do your DMA, etc...
When a dma handle is allocated, the rootnex driver will do some validation on
the ddi_dma_attr and pre-allocate some state for you. Nothing very exciting...
The fun stuff happens in the bind code which will be the topic for the rest
of this entry.. Instead of walking through the code, I'll walk through the
concepts... The code should be changing soon, so I don't want to spend a lot of
time on code which may not be the same by the time you read this. But first
some terminology I'll be using which doesn't always match up with other folks'
terminology. Sometimes I like to redefine things too :-).
- Scatter/Gather List (SGL) - a list of physically contiguous buffers
- Cookie - single physically contiguous buffer i.e. a SGL element.
- SGL Length - The maximum number of cookies/SGL elements the DMA engine
supports
- Copy Buffer - bounce buffer/intermediate buffer. Used as a temporary buffer
to DMA to/from when the DMA engine can't reach the physical address we are
supposed to DMA into.
Jumping to the fun stuff, the first concept in the bind, is how the buffer is
passed down to bind. It can be a kernel virtual address (KVA) w/ size, a
linked list of physical pages (without a kernel address), or an array of
physical pages (with a kernel virtual address [shadow I/O]). For each page in
the buffer, the rootnex driver has to make sure that the dma engine can reach
the physical address. There is a DMA engine low address limit and a high
address limit passed in the ddi_dma_attr during ddi_dma_alloc_handle(9F) which
the rootnex driver uses to do this.
For every page which can't be reached, the rootnex driver will use part of a
copy buffer. For these pages, the device will DMA into the copy buffer, and
not the actual buffer. The data will be copied to/from the copy buffer when the
driver calls ddi_dma_sync(9F). So the driver better make sure they have syncs
in the right place and have the direction correct! Continuing... The copy
buffer has a fixed maximum size. Each bind will get its own copy buffer if
needed. If the amount of copybuf required in a single bind is greater than the
maximum size of a copy buffer, the bind will need to be a partial bind and
will require multiple windows. This is a concept I'll talk about further down..
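To make the "syncs in the right place with the right direction" point concrete, here's a tiny user-space model of one bounced region. This is only a sketch and not the rootnex code: struct bounce, bounce_sync, and the sync_dir values are names I made up for illustration. They stand in for what ddi_dma_sync(9F) does for you with DDI_DMA_SYNC_FORDEV and DDI_DMA_SYNC_FORCPU/DDI_DMA_SYNC_FORKERNEL when a copy buffer is involved.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the DDI sync directions (not the real constants). */
enum sync_dir { SYNC_FOR_DEVICE, SYNC_FOR_CPU };

/* Toy model of one bounced region: the driver's buffer plus its copy buffer. */
struct bounce {
	unsigned char *buf;      /* the buffer the driver actually bound */
	unsigned char *copybuf;  /* where the device really does its DMA */
	size_t len;              /* size of both buffers in bytes */
};

/* Copy in whichever direction the sync asks for. */
static void
bounce_sync(struct bounce *b, size_t off, size_t len, enum sync_dir dir)
{
	assert(off <= b->len && len <= b->len - off);

	if (dir == SYNC_FOR_DEVICE) {
		/* The device is about to read: push driver data to the copy buffer. */
		memcpy(b->copybuf + off, b->buf + off, len);
	} else {
		/* The CPU is about to read: pull device data out of the copy buffer. */
		memcpy(b->buf + off, b->copybuf + off, len);
	}
}
```

If the driver skips the sync, or syncs in the wrong direction, the device and the driver each see stale data, because the copy only ever happens inside the sync.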
What happens when a linked list of physical pages w/o a KVA comes down, you
ask? Good question! Well, currently, the rootnex driver will allocate some
KVA space (vmem_alloc) without physical memory to back it up and then map
it to the physical page on the fly during sync. Not pretty. This should be
changing for the 64-bit kernel in the near future (homework: what is seg kpm).
How come the DMA engine can reach the copy buffer, but can't reach the original
DMA buffer you ask? You guys are good... Well, most DMA buffers originate
from userland or from a kernel stack which has no idea what the constraints
of the DMA engine are (and it shouldn't since there may be multiple DMA
engines with different constraints). The copy buffer is allocated from
the same underlying routines that ddi_dma_mem_alloc(9F) uses, which takes into
account the DMA engine's constraints. i.e. the copy buffer is allocated
specifically for the DMA engine we are using...
The copybuf code path got, and is still getting, a lot of usage in s10 and
above once we went to a 64-bit kernel on x86. The number of x86 machines with >
4G of memory has gone up tremendously since you can actually use the memory more
efficiently these days. OK, maybe efficiently isn't the right word, but you
get my point...;-) A lot of devices only support 32-bit DMA addresses,
so they correctly set their DMA high address to 0xFFFF.FFFF. Any physical
address above this will require a copy buffer on x86 (On SPARC, we have an
IOMMU so it doesn't have this problem, but that's a different entry).
Jumping to the side for a sec... don't confuse 64-bit DMA addresses with a
64-bit card. You may have a 32-bit/33MHz PCI card which supports 64-bit addresses
via dual address cycles (DAC), you may have a 64-bit/66MHz PCI card which only
supports 32-bit DMA addresses, or you could have a x8 PCI Express card which
only supports 32-bit DMA addresses. The speed of the card and the number of
bytes that can be transferred in a clock have nothing to do with the DMA address
width. If a device only supports a 32-bit DMA address, it will not be able to
reach memory above 4G and will require a copy buffer.
Jumping back. It gets more interesting from here. Memory organization on SMP
Opteron systems is very similar to our SPARC systems. The memory controller
is in the CPU chip (which could have multiple cores). So if I have a two
chip Opteron based system, I have at least 2 memory controllers. Solaris is
smart, and will allocate memory closest to the core you are running on. Going
back to the 2 chip Opteron system. If the system has 16G of memory, and I
am a process running on chip 2, when I allocate memory, its physical address
will be above 8G (0 - 8G is attached to chip 1). So all I/O on chip 2 will
need a copy buffer for a DMA engine with a high address limit of 0xFFFF.FFFF.
Lesson learned: if you want performance on this type of system, use a device
which supports 64-bit DMA addresses. And make sure if your device supports
64-bit DMA addresses, the driver supports 64-bit DMA addresses!
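The reachability check is easy to model outside the kernel. The sketch below is not the rootnex code. It assumes a 4K page, page-aligned start addresses, and a made-up struct dma_limits that carries just the two ddi_dma_attr(9S) fields that matter here (dma_attr_addr_lo and dma_attr_addr_hi). page_reachable and copybuf_pages are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

#define	MODEL_PAGESIZE	4096ULL	/* assumed page size for this sketch */

/* Only the two ddi_dma_attr(9S) fields this check needs. */
struct dma_limits {
	uint64_t addr_lo;	/* models dma_attr_addr_lo */
	uint64_t addr_hi;	/* models dma_attr_addr_hi */
};

/* A page is reachable when every byte of it lies inside [addr_lo, addr_hi]. */
static int
page_reachable(const struct dma_limits *l, uint64_t paddr)
{
	assert(l->addr_lo <= l->addr_hi);
	return (paddr >= l->addr_lo &&
	    paddr + (MODEL_PAGESIZE - 1) <= l->addr_hi);
}

/* Count the pages in [start, start + len) that would need copy buffer space. */
static uint64_t
copybuf_pages(const struct dma_limits *l, uint64_t start, uint64_t len)
{
	uint64_t n = 0;

	for (uint64_t pa = start; pa < start + len; pa += MODEL_PAGESIZE) {
		if (!page_reachable(l, pa))
			n++;
	}
	return (n);
}
```

With limits of 0 and 0xFFFFFFFF, a 1M buffer sitting at the 8G mark (the chip 2 case above) comes back with all 256 pages needing the copy buffer. Raise the high limit to a full 64 bits and the count drops to zero.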
OK, enough about copy buffers. Jumping back to ddi_dma_attr for a moment.
dma_attr_align is used during ddi_dma_mem_alloc(9F), don't
expect it to do anything for you in the bind. dma_attr_count_max and
dma_attr_seg limit the size of a cookie. If I have a 1M buffer which is
physically contiguous, normally I would get a sgl length of 1 and the single
cookie would be 1M in size. If I set seg or count_max to 256K-1, I would get
a sgl length of 4 or 5 (depending on if the start address was page aligned)
where each cookie would be <= 256K in size. Why do we have both seg and
count_max? Don't know...
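Here's the 1M example worked out as a small user-space model. It is a sketch under assumptions, not what the rootnex actually does (the real code builds cookies a page at a time). I'm assuming the ddi_dma_attr(9S) reading where a cookie holds at most count_max + 1 bytes and may not cross a seg boundary, and that seg has the form 2^n - 1. count_cookies is a made-up helper name.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Count the cookies a physically contiguous buffer turns into when one
 * cookie may hold at most count_max + 1 bytes and may not cross a seg
 * boundary. Simplified model; seg must be of the form 2^n - 1.
 */
static unsigned
count_cookies(uint64_t paddr, uint64_t len, uint64_t count_max, uint64_t seg)
{
	unsigned n = 0;

	assert((seg & (seg + 1)) == 0);

	while (len > 0) {
		/* Bytes left before the next seg boundary, minus one. */
		uint64_t room = seg - (paddr & seg);
		uint64_t chunk = len;

		/* Compare against chunk - 1 so the "+ 1" can never overflow. */
		if (chunk - 1 > count_max)
			chunk = count_max + 1;
		if (chunk - 1 > room)
			chunk = room + 1;

		paddr += chunk;
		len -= chunk;
		n++;
	}
	return (n);
}
```

In this model, seg = 256K-1 gives 4 cookies when the buffer starts on a 256K boundary and 5 when it doesn't (a short first cookie up to the boundary, three full ones, and a short tail). With no effective limit you get the single 1M cookie.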
OK, we finally arrive at windows... The fun stops here. Basically, a window is
supposed to be a piece of a DMA bind that fits within the DMA
constraints. i.e. if I have a bind for which the DMA engine cannot handle in a
single transfer, and the driver/device supports partial mappings, the DDI is
supposed to break it into multiple windows where each window can be handled
by the DMA engine. Again, jumping back to ddi_dma_attr. There are three
things which should require the use of multiple
windows during a bind:
- We need more copybuf space than the maximum copy buffer size allowed
- The number of cookies required to bind the buffer is greater than the
maximum number of cookies the H/W can handle (dma_attr_sgllen)
- The size of the bind is greater than the maximum transfer size of the
DMA engine (dma_attr_maxxfer)
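Those three conditions boil down to a single predicate. The sketch below models what the DDI is supposed to check, not what any particular release implements. struct bind_req, struct bind_limits, and needs_multiple_windows are all names invented for this example; only sgllen and maxxfer correspond to real ddi_dma_attr(9S) fields.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical summary of one bind request. */
struct bind_req {
	uint64_t size;		/* total bytes being bound */
	uint64_t copybuf_req;	/* bytes that must go through the copy buffer */
	unsigned ncookies;	/* cookies needed to describe the whole buffer */
};

/* The limits the three conditions compare against. */
struct bind_limits {
	uint64_t copybuf_max;	/* maximum copy buffer size for one bind */
	unsigned sgllen;	/* models dma_attr_sgllen */
	uint64_t maxxfer;	/* models dma_attr_maxxfer */
};

/* Nonzero when any one of the three conditions forces multiple windows. */
static int
needs_multiple_windows(const struct bind_req *r, const struct bind_limits *l)
{
	assert(l->sgllen >= 1);
	return (r->copybuf_req > l->copybuf_max ||
	    r->ncookies > l->sgllen ||
	    r->size > l->maxxfer);
}
```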
But, as a historical note, that's not the way it was originally implemented
on the original x86 port. At the time this was written, the only time you will
get multiple windows is when we need more copybuf space than the maximum copy
buffer size allowed. This should be fixed shortly, but you will still have
to handle how the current implementation works for the driver to operate
correctly on s10 and before (I'll explain what that behavior is shortly).
Don't worry though, once this is fixed, a driver which handles the old behavior
will still work great with the correct behavior.
Once we need multiple windows, the rootnex now has to worry about the
granularity of the device (dma_attr_granular). A device can only transfer
data in whole multiples of the granularity. e.g. if the granularity is set to
512, the size of a window must be a whole multiple of 512. So when the rootnex
gets to the end of a window, it sometimes has to subtract some data from the
current window and put it into the next window to ensure the current window
size is a multiple of granularity. This is referred to as trimming in the code.
This gets pretty complicated with the way the rootnex DMA code is currently
architected, and was the source of a fair number of bugs for which I had to put
some not so obvious hacks in there to fix..
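The arithmetic behind trimming is simple even if the real code isn't. This is just a sketch of the calculation. window_trim is a made-up name, and the actual rootnex logic also has to work out which cookie (and which copy buffer state) the trimmed bytes come from.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Given the bytes collected for the current window, return how many must be
 * pushed into the next window so that the current window ends up a whole
 * multiple of the device granularity.
 */
static uint64_t
window_trim(uint64_t win_size, uint64_t granular)
{
	assert(granular != 0);
	return (win_size % granular);
}
```

For example, with a granularity of 512 and 5000 bytes collected so far, 392 bytes get trimmed, leaving a 4608-byte window (9 x 512) and moving the remainder to the start of the next one.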
And last, but not least, what happens today in a bind when the driver supports
a partial bind and one of these two conditions is hit:
- The number of cookies required to bind the buffer is greater than the
maximum number of cookies the H/W can handle (dma_attr_sgllen)
- The size of the bind is greater than the maximum transfer size of the
DMA engine (dma_attr_maxxfer)
Tune in next week..
Sorry couldn't resist.. Well, I don't think there's an official word for it,
but I'm going to make up something as I type, because, remember, I like to do
that sort of thing ;-). We get a superwindow, where a superwindow is a window
larger than the DMA engine can handle. However, a superwindow is properly
trimmed at the conditions mentioned above. So when the driver is going
through the cookies, if the next cookie puts it over the DMA engine's sgllen
or maxxfer size, it can consider that cookie the start of the next window.
So it puts more work back on the driver writer. Of course, if you've already
written a Solaris driver for x86 which supports partial mappings, you have
probably already figured that out :-/.
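For the driver writers who haven't figured it out yet, here's roughly what that bookkeeping looks like, as a user-space model. It's a sketch under assumptions: the cookie sizes come in as a plain array rather than from ddi_dma_nextcookie(9F), every cookie is assumed to fit within maxxfer by itself, and count_transfers is a name I made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Walk the cookie sizes of one superwindow and count how many device-sized
 * transfers the driver has to issue. A new transfer starts whenever taking
 * the next cookie would exceed sgllen cookies or maxxfer bytes.
 */
static unsigned
count_transfers(const uint64_t *cookie_size, size_t ncookies,
    unsigned sgllen, uint64_t maxxfer)
{
	unsigned transfers = 0;
	unsigned in_cur = 0;	/* cookies in the current transfer */
	uint64_t bytes = 0;	/* bytes in the current transfer */

	assert(sgllen >= 1);

	for (size_t i = 0; i < ncookies; i++) {
		if (in_cur == 0 || in_cur + 1 > sgllen ||
		    bytes + cookie_size[i] > maxxfer) {
			/* This cookie is the start of the next window. */
			transfers++;
			in_cur = 0;
			bytes = 0;
		}
		in_cur++;
		bytes += cookie_size[i];
	}
	return (transfers);
}
```

Five 4K cookies with an sgllen of 2 turn into three transfers (2 + 2 + 1). The same five cookies with a generous sgllen but a maxxfer of 8K also turn into three.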
Well, that's enough for this month. I have code to finish up and putback.
Don't forget to submit your DDI topic of interest in the comments section for
next month...
MRJ