Your probably here because you searched for "x86 rootnex warning" or something like that on google. :-) If so, you came to the right place. I'm including below, a slightly edited heads up message that I sent out when I putback the bug fix related to this warning. It gives some details on the warning.
But first, a little background. Towards the end of s10 development, I fixed a couple of old x86 rootnex bugs. One of these incorrectly passed a dma bind operation instead of failing it. This fix ended up finding a few driver bugs. Since we were getting relatively close to s10 RR, and this was a hard to diagnose problem, I ended up printing a warning when we hit this condition. We found a few 3rd party drivers with sgllen related bugs in Solaris Express after this was fixed.
I heard some folks still occasionally run into this, so I figured I'd put this info out there... Here's the rootnex code in question and here's the original bug. Below is a slightly edited version of the heads-up that I sent out..
My recent putback of:
(P1) 3001685 ddi: DMA breakup routines do not match DDI specification
4926500 ddi: x86 DMA breakup with sgllen==1 can give ncookies!=1
4796610 ddivs_dmae/ddi_check_dma_handle_* 3 assertions fail due to a product bug
could cause existing buggy drivers which are thought to be functioning
correctly, to start failing.
------------------
o description of the problem - what's wrong
The x86 root nexus's ddi_dma_*_bind_handle() implementation
can wrongly return more cookies than a driver specifies is
the maximum that it can handle (when DDI_DMA_PARTIAL is not
specified). This can be a very serious problem, causing silent
data corruption. There can be a lot of corner cases depending
on memory fragmentation, making it difficult to test for.
Unfortunately, the fix for these bugs also can/has exposed
minor bugs in existing drivers, which could be hard to identify
and debug. An example is the elxl driver which specified that
it could only handle 1 cookie for the tx data buffers, but
really could handle 2 (and was getting 2 occasionally).
o does the problem happen on sparc? why not?
This problem does *not* happen on SPARC. This is a bug
in the x86 rootnex driver. This part of the code is not
shared with any SPARC nexus driver code.
o what rootnex was doing wrong
The rootnex driver would return success from a DMA bind
operation when the cookie count was larger than the maximum
the driver specified that it could handle, and the driver
specified that it couldn't handle partial mappings.
o what rootnex is now doing right
The rootnex driver fails a DMA bind operation when
the cookie count would be larger than the maximum
that a driver specifies that it can handle, and the
driver specifies that it cannot handle partial mappings.
o what drivers were doing wrong
The vast majority of drivers aren't doing anything wrong.
There may be a few which are not handling the DMA bind
operation correctly. For example, if a driver cannot
handle partial DMAs, it should be able to handle the
following # of cookies ((max possible bind size / page size) + 1).
If it cannot handle that number of cookies, it must be
able to expect and correctly handle a failed bind
operation.
o what they need to do right.
If a driver cannot handle partial DMAs, it should be able
to handle the following # of cookies
((max possible bind size / page size) + 1).
If it cannot handle that number of cookies, it must be
able to expect and correctly handle a failed bind
operation.
o does this mean a driver which was working ok before
could now broken, because a driver may have been
written to work around the rootnex bug?
No. A driver didn't need to workaround the bug. It does
mean that a driver which was working or was appearing to
work before, could now be broken. This could be because the
code path for the failing bind operation has never been
tested or the driver never expected the bind operation
to fail (i.e. the driver has a bug which they never
debugged since it never failed before unless they noticed
memory corruption).
o instructions for using flags
There are two patchables rootnex_bind_fail & rootnex_bind_warn. The
following table explains the behavior of the new bind operation
failure. The current default behavior is to fails the bind and
print one warning message per major number.
rootnex_bind_fail | rootnex_bind_warn | Results if sgllen < *ccount
| | && !DDI_DMA_PARTIAL
---------------------------------------------------------------------
0 | 0 | behaves like code today (bind succeeds, no warning message)
0 | 1 | bind succeeds, print one warning/major#
1 | 0 | fails the bind, no warning message
1 | 1 | fails the bind, print one warning/major#
To revert to the previous behavior, which is incorrectly returning
success from these ddi_dma_*_bind_handle() operations, put the
following in /etc/system then reboot.
set rootnex:rootnex_bind_fail = 0
To disable the warning message, put the following in /etc/system
then reboot.
set rootnex:rootnex_bind_warn = 0
Anyone fixing bugs found by the warning should make sure they test their
fix *without* rootnex_bind_fail = 0 and set kmem_flags = 3f to ensure they
clean up correctly after the failure (since it's likely that code path
hasn't been tested). e.g. put the following in /etc/system
set rootnex:rootnex_bind_fail = 1
set kmem_flags = 0x3f
o so you've got a driver's source code. how can you
inspect the code to check for possible errors
and how can you fix them?
First you must understand what is the maximum possible bind
size that your driver will see. This may not be trivial.
For example, the ata driver handles ~64K maximum buffers
for normal operation, but newfs goes through /dev/rdsk
which doesn't break buffers down to ~64K buffers initially.
Once you understand maximum possible bind size, make sure
sgllen is set to ((max possible bind size / page size) + 1)
and that you actually handle that many cookies. Or make
sure you handle the fail case correctly.
o can you run your fixed driver on earlier solaris releases?
Yes. A correctly written driver will run fine on both S10
and earlier solaris releases. This bug fix only fixes a
problem of not correctly identifying a failure case. If a
driver doesn't generate a failure case, or correctly
handles the failure case, there will not be any problems.
o What can I do if my system doesn't boot anymore
In the unlikely event your system doesn't boot anymore,
you can recover the system by setting a defered breakpoint
in rootnex_attach, clear the rootnex_bind_fail patchable,
then, once you've booted, add set rootnex:rootnex_bind_fail = 0
to /etc/system
An example is provide below...
Boot args:
Type b [file-name] [boot-flags] to boot with options
or i to enter boot interpreter
or to boot with defaults
<<< timeout in 5 seconds >>>
Select (b)oot or (i)nterpreter: b kmdb -d
Loading kmdb...
Welcome to kmdb
kmdb: Unable to determine terminal type: assuming `vt100'
[0]> ::bp rootnex`rootnex_attach
[0]> :c
SunOS Release 5.10 Version gate:2004-10-18 32-bit
Copyright 1983-2004 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
Loaded modules: [ ufs unix krtld genunix specfs ]
kmdb: stop at rootnex`rootnex_attach
kmdb: target stopped at:
rootnex`rootnex_attach: pushl %ebp
[0]> rootnex`rootnex_bind_fail?W 0
rootnex`rootnex_bind_fail: 0x1 = 0x0
[0]> :c
Hostname: ...
[hostname] console login: root
Password:
[CUT]
bfu'ed from /ws/on10-gate/archives/i386/nightly-nd on 2004-10-18
Sun Microsystems Inc. SunOS 5.10 s10_68 December 2004
# echo "set rootnex:rootnex_bind_fail = 0" >> /etc/system
#