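Brussels, a new OpenSolaris project
Abstract:
This article describes the background of the Brussels project, what Brussels is, and what it will deliver. If you are interested and wish to contribute, please visit http://www.opensolaris.org/os/project/brussels/ to participate.
If you have ever configured a Solaris network driver, you have almost certainly had to set some of the many parameters the driver exposes. Solaris drivers now offer a large number of tunables so that users can tailor them to their needs, but there is a downside: configuring them correctly has become a headache. Some parameters are set through /etc/system or driver.conf, others through ndd, and many are undocumented, which often leaves even experienced engineers puzzled.
We are now working on a new OpenSolaris project whose goal is to give users a consistent, clear, and manageable driver configuration system. The project is named Brussels, after the capital of Belgium. You can find the latest information here: http://www.opensolaris.org/os/project/brussels/
The project will extend dladm, the command provided by GLDv3, so that users can configure and view driver parameters through dladm. The ndd and driver.conf mechanisms will continue to exist alongside it for some time. The project will also provide an easy-to-use graphical management interface (GUI).
If you are a Solaris system administrator, support engineer, or user, you will want to know what changes and help Brussels can bring to your work and research.
Although the project is still being designed, you can join the OpenSolaris discussion mailing list at any time to follow its progress. Most importantly, your ideas will influence the project, and your suggestions will be reflected in the final code and product!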
Posted at 03:54 PM June 23, 2007 by raymond in Personal | Comments [2]
Diagnosing an ICMP checksum error in the kernel with DTrace
This post shows how to locate a checksum error in the kernel.
The problem is interesting: with jumbo frames enabled at 16 KB, the test pair could not ping each other with a particular ping payload size (2006 bytes in this case) on x64 platforms.
Checking the netstat output on the remote side showed "icmpInCksumErrs" increasing with each incoming ICMP echo request. The checksum error therefore comes from either the sending side or the receiving side.
The sending side calculates the ICMP checksum in the application, in ping.c, lines 1940 - 1946:
1940         if (family == AF_INET) {
1941                 if (!use_udp)
1942                         icp->icmp_cksum = in_cksum((ushort_t *)icp, cc);
1943
1944                 i = sendto(send_sock, (char *)out_pkt, cc, 0, whereto,
1945                     sizeof (struct sockaddr_in));
1946         }
An Ethereal capture shows that the ICMP checksum sent by the application is correct, so the problem must lie in the kernel's ICMP checksum function.
How can we prove this? We can start by inspecting the contents of incoming IP packets with the D script below:
ip_input:entry
{
        /*
         * extern void ip_input(ill_t *, ill_rx_ring_t *, mblk_t *, size_t);
         * Get the mblk of the incoming packet.
         */
        self->ip_mp = (mblk_t *)(arg2);
        /* Point self->iph at the start of the data, here the IP header. */
        self->iph = (ipha_t *)(self->ip_mp->b_rptr);
        /* We are only interested in ICMP packets, so record the protocol. */
        self->protocol = self->iph->ipha_protocol;
}

ip_cksum:entry
/self->protocol == 1/   /* ICMP only */
{
        stack();        /* just curious, not strictly needed */
}

ip_cksum:return
{
        printf("0x%x\n", arg1); /* the ip_cksum return value */
}
Note that in icmp_inbound there are two places where the ICMP checksum is calculated (ip.c).
The first is lines 1741 - 1747, which validate the checksum of the incoming ICMP packet. For a valid packet the result should be 0.
1741         /* ICMP header checksum, including checksum field, should be zero. */
1742         if (sum_valid ? (sum != 0 && sum != 0xFFFF) :
1743             IP_CSUM(mp, iph_hdr_length, 0)) {
1744                 BUMP_MIB(&icmp_mib, icmpInCksumErrs);
1745                 freemsg(first_mp);
1746                 return;
1747         }
The second is lines 2013 - 2017, which compute the checksum for the outgoing ICMP packet. The result there should be the correct, generally nonzero, ICMP checksum.
2013         /* Send out an ICMP packet */
2014         icmph->icmph_checksum = 0;
2015         icmph->icmph_checksum = IP_CSUM(mp, iph_hdr_length, 0);
2016         if (icmph->icmph_checksum == 0)
2017                 icmph->icmph_checksum = 0xFFFF;
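To make the "should be zero" rule concrete, here is a minimal user-space sketch of the ones-complement Internet checksum. It is not the kernel's ip_cksum or ip_ocsum; the function names ones_sum16 and inet_cksum are made up for this illustration. It assumes the buffer is summed as big-endian 16-bit words. If the checksum field already holds a correct value, the folded sum over the whole message is 0xffff and its complement is 0.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: ones-complement sum of a byte buffer, folded to
 * 16 bits. Bytes are combined as big-endian 16-bit words; an odd
 * trailing byte is padded with zero.
 */
static uint16_t
ones_sum16(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
	if (i < len)
		sum += (uint32_t)buf[i] << 8;
	/* Fold the carries back into the low 16 bits. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return ((uint16_t)sum);
}

/*
 * Hypothetical helper: the value a sender stores in the checksum field
 * (computed with that field set to zero). Run over a received message
 * that includes a correct checksum, it returns 0.
 */
static uint16_t
inet_cksum(const uint8_t *buf, size_t len)
{
	return ((uint16_t)~ones_sum16(buf, len));
}
```

A sender would zero the checksum field, call inet_cksum over the ICMP message, and store the result. A receiver would call it over the message as received and treat any nonzero result as a checksum error.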
Running the D script above on a healthy system, with ping traffic in the background, gives the following output:
# dtrace -s icmp.d
dtrace: script 'icmp.d' matched 5 probes
CPU     ID                    FUNCTION:NAME
  0  25208                   ip_cksum:entry
              ip`icmp_inbound+0x120
              ip`ip_proto_input+0xa62
              ip`ip_input+0x619
              dls`i_dls_link_rx_promisc+0x213
              mac`mac_rx+0x53
              e1000g`e1000g_intr+0xc4
              unix`intr_thread+0x136
  0  25209                  ip_cksum:return return: 0xffff   <--- Incoming ICMP: ~(0xffff) == 0, correct
  0  25208                   ip_cksum:entry
              ip`icmp_inbound+0x5e3
              ip`ip_proto_input+0xa62
              ip`ip_input+0x619
              dls`i_dls_link_rx_promisc+0x213
              mac`mac_rx+0x53
              e1000g`e1000g_intr+0xc4
              unix`intr_thread+0x136
  0  25209                  ip_cksum:return return: 0xce97   <--- Outgoing ICMP, not zero
Compare this with what we see on the buggy system:
bash-3.00# dtrace -s icmp.d
dtrace: script 'icmp.d' matched 5 probes
CPU     ID                    FUNCTION:NAME
  1  23017                   ip_cksum:entry
              ip`icmp_inbound+0x24a
              ip`ip_proto_input+0x479
              ip`ip_input+0x4ab
              dls`i_dls_link_rx+0x18c
              mac`mac_rx+0x45
              e1000g`e1000g_intr_work+0x176
              e1000g`e1000g_intr+0x3c
              unix`av_dispatch_autovect+0x78
              unix`intr_thread+0x50
  1  23018                  ip_cksum:return return: 0x417d   <--- should be 0xffff
...
  1  23017                   ip_cksum:entry
              ip`icmp_inbound+0x24a
              ip`ip_proto_input+0x479
              ip`ip_input+0x4ab
              dls`i_dls_link_rx+0x18c
              mac`mac_rx+0x45
              e1000g`e1000g_intr_work+0x176
              e1000g`e1000g_intr+0x3c
              unix`av_dispatch_autovect+0x78
              unix`intr_thread+0x50
  1  23018                  ip_cksum:return return: 0x60e9   <--- should be 0xffff
ip_cksum produces the wrong result when validating the checksum of incoming ICMP packets. Since ip_cksum calls ip_ocsum when verifying the IP protocol checksum, we can take a closer look at ip_ocsum to learn why this happens.
ip_ocsum in Solaris is platform dependent. For example, SPARC has ip_ocsum.s while x86 has i86_subr.s.
This accounts for the different symptoms we see on different platforms. On SPARC, the ping problem happens with a different payload size.
See the following articles on the checksum code.
Posted at 04:14 PM July 31, 2006 by raymond in Sun | Comments [9]
mutex panic: bad mutex
The direct trigger can be that a mutex is used again after it has been destroyed. For example, consider the code below:
typedef struct drv {
        kmutex_t drv_lock;
        ...
} drv_t;

static void
drv_unattach(drv_t *drvp)
{
        /* BUG: the lock is destroyed while the interrupt handler is still registered */
        mutex_destroy(&drvp->drv_lock);
        ...
        ddi_intr_remove_handler(...);
}
The code above removes the interrupt handler after calling mutex_destroy. The problem is that after the mutex is destroyed, an interrupt can still come in. If the interrupt handler uses the mutex, the "bad mutex" panic will occur. The handler should be removed before the mutex is destroyed.
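The ordering constraint can be illustrated with a toy user-space model. This is only a sketch: it does not call any DDI functions, and drv_model_t and the model_* names are invented for the illustration. Two flags stand in for "the mutex is valid" and "the handler is registered", and a simulated interrupt arrives in the window between the two teardown steps.

```c
#include <stdbool.h>

/* Hypothetical stand-in for per-instance driver state. */
typedef struct drv_model {
	bool lock_valid;	/* true between mutex_init and mutex_destroy */
	bool intr_registered;	/* true while the handler can still fire */
	int bad_mutex_hits;	/* times the handler touched a dead lock */
} drv_model_t;

static void
model_attach(drv_model_t *d)
{
	d->lock_valid = true;
	d->intr_registered = true;
	d->bad_mutex_hits = 0;
}

/* Models an interrupt arriving; the handler takes the lock if it runs. */
static void
model_interrupt(drv_model_t *d)
{
	if (!d->intr_registered)
		return;			/* handler removed: nothing runs */
	if (!d->lock_valid)
		d->bad_mutex_hits++;	/* the kernel would panic here */
}

/* Order from the post: destroy the lock first, then remove the handler. */
static int
model_detach_buggy(drv_model_t *d)
{
	d->lock_valid = false;
	model_interrupt(d);		/* an interrupt slips into the window */
	d->intr_registered = false;
	return (d->bad_mutex_hits);
}

/* Safer order: remove the handler first, then destroy the lock. */
static int
model_detach_fixed(drv_model_t *d)
{
	d->intr_registered = false;
	model_interrupt(d);		/* same interrupt, now ignored */
	d->lock_valid = false;
	return (d->bad_mutex_hits);
}
```

In the model, the buggy order records one access to a destroyed lock and the fixed order records none. In a real driver the equivalent change is to call ddi_intr_remove_handler before mutex_destroy.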
Posted at 11:41 AM May 23, 2006 by raymond in Sun | Comments [0]
mutex panic: recursive mutex enter
Real-life examples of the "recursive mutex enter" panic
Posted at 10:47 AM May 23, 2006 by raymond in Sun | Comments [2]
mutex panic
A summary of panics caused by mutex usage
Posted at 11:11 AM May 17, 2006 by raymond in Sun | Comments [0]
Don't get bitten by "\n" in Perl scripting
The Perl code below is meant to build a hash of key/value pairs read through the diamond operator, and then check whether each entry read from ANOTHER_FILE exists in the hash.
Will it work as expected?
while ( my $next_call = <> )
{
    @key_value = split /\s+/, $next_call;
    $value = pop(@key_value);
    $key = pop(@key_value);
    $drv_func_call{$key} = $value;
}
while ( my $next_line = <ANOTHER_FILE> )
{
    if ( ! $drv_func_call{$next_line} )
    {
        printf "NOT CALLED: %s\n", $next_line;
    }
}
Unfortunately, it will not. The reason is that $next_line, as read from ANOTHER_FILE, has "\n" at the end of the string, so the lookup key is the line text followed by "\n". The hash keys, by contrast, come from split and carry no trailing newline, so $drv_func_call{$next_line} never finds a value.
Perl's chomp function removes the trailing "\n" from a string, which fixes the problem.
The following is quoted from: http://perldoc.perl.org/functions/chomp.html
chomp VARIABLE
chomp( LIST )
chomp
This safer version of chop removes any trailing string that corresponds to the current value of $/ (also known as $INPUT_RECORD_SEPARATOR in the English module). It returns the total number of characters removed from all its arguments. It's often used to remove the newline from the end of an input record when you're worried that the final record may be missing its newline. When in paragraph mode ($/ = ""), it removes all trailing newlines from the string. When in slurp mode ($/ = undef) or fixed-length record mode ($/ is a reference to an integer or the like, see perlvar) chomp() won't remove anything. If VARIABLE is omitted, it chomps $_. Example:
    while (<>) {
        chomp;  # avoid \n on last field
        @array = split(/:/);
        # ...
    }
If VARIABLE is a hash, it chomps the hash's values, but not its keys.
You can actually chomp anything that's an lvalue, including an assignment:
    chomp($cwd = `pwd`);
    chomp($answer = <STDIN>);
If you chomp a list, each element is chomped, and the total number of characters removed is returned.
If the encoding pragma is in scope then the lengths returned are calculated from the length of $/ in Unicode characters, which is not always the same as the length of $/ in the native encoding.
Note that parentheses are necessary when you're chomping anything that is not a simple variable. This is because chomp $cwd = `pwd`; is interpreted as (chomp $cwd) = `pwd`;, rather than as chomp( $cwd = `pwd` ) which you might expect. Similarly, chomp $a, $b is interpreted as chomp($a), $b rather than as chomp($a, $b).
The corrected script, with chomp applied to each line read from ANOTHER_FILE:
while ( my $next_call = <> )
{
    @key_value = split /\s+/, $next_call;
    $value = pop(@key_value);
    $key = pop(@key_value);
    $drv_func_call{$key} = $value;
}
while ( my $next_line = <ANOTHER_FILE> )
{
    chomp($next_line);
    if ( ! $drv_func_call{$next_line} )
    {
        printf "NOT CALLED: %s\n", $next_line;
    }
}
Posted at 02:01 PM April 17, 2006 by raymond in Sun | Comments [0]
[Follow-up] Looking into a core dump of a deadlock
[Follow-up to the previous post]
Here we look at how that deadlock shows up in the core dump generated after it happens.
<SUT> mdb 0
/* let's find the hung dladm */
> ::ps ! grep dladm
R 12476 12268 12268 725 0 0x42004000 fffffe87705e23d0 dladm
/* see what its threads are */
> fffffe87705e23d0::ps -t
S PID PPID PGID SID UID FLAGS ADDR NAME
R 12476 12268 12268 725 0 0x42004000 fffffe87705e23d0 dladm
T 0xffffffffa69be740 <TS_SLEEP>
0xffffffffa69be740 is in state <TS_SLEEP>, let's see what it is waiting for:
> 0xffffffffa69be740::findstack -v
stack pointer for thread ffffffffa69be740: fffffe8000a2e570
[ fffffe8000a2e570 _resume_from_idle+0xf8() ]
fffffe8000a2e5b0 swtch+0x185()
fffffe8000a2e650 turnstile_block+0x80d(0, 0, ffffffff93785580,
fffffffffbc039d8, 0, 0)
fffffe8000a2e6b0 rw_enter_sleep+0x186(ffffffff93785580, 0)
fffffe8000a2e6f0 mac_rx_remove+0x32(ffffffff93785310, fffffe86d2d1e368)
fffffe8000a2e720 aggr_port_delete+0x3d(ffffffff8614ccb0)
fffffe8000a2e780 aggr_grp_rem_port+0x1e4(ffffffff85798720, ffffffff8614ccb0,
fffffe8000a2e7e4)
fffffe8000a2e800 aggr_grp_rem_ports+0x182(1, 1, ffffffff9d1faeb8)
fffffe8000a2e840 aggr_ioc_remove+0xc5(ffffffffa27164c0, 100000)
fffffe8000a2e890 aggr_ioctl+0x9b(fffffe8692c06108, ffffffffa27164c0)
fffffe8000a2e8b0 aggr_wput+0x2e(fffffe8692c06108, ffffffffa27164c0)
fffffe8000a2e920 putnext+0x246(fffffe80caeb3658, ffffffffa27164c0)
fffffe8000a2e9f0 strdoioctl+0x3bb(fffffe8692bfee68, fffffe8000a2ea68, 100003,
1, ffffffff9ec93f40, fffffe8000a2ee9c)
fffffe8000a2ed20 strioctl+0x3a73(ffffffff94512e80, 5308, 8037490, 100003, 1,
ffffffff9ec93f40, fffffe8000a2ee9c)
fffffe8000a2ed70 spec_ioctl+0x83(ffffffff94512e80, 5308, 8037490, 100003,
ffffffff9ec93f40, fffffe8000a2ee9c)
fffffe8000a2edc0 fop_ioctl+0x36(ffffffff94512e80, 5308, 8037490, 100003,
ffffffff9ec93f40, fffffe8000a2ee9c)
It is sleeping on the reader/writer lock 0xffffffff93785580. Let's look at the state of that lock:
> ffffffff93785580::rwlock
            ADDR      OWNER/COUNT FLAGS          WAITERS
ffffffff93785580        READERS=1  B011 ffffffffa69be740 (W)
                                     ||
                 WRITE_WANTED -------+|
                  HAS_WAITERS --------+
>
It already has a reader, since "READERS=1". Let's find out who that is:
> ffffffff93785580::kgrep | ::whatis
fffffe8000359b28 is in thread fffffe8000359c80's stack
fffffe8000a2e5d8 is in thread ffffffffa69be740's stack
fffffe8000a2e638 is in thread ffffffffa69be740's stack
fffffe8000a2e670 is in thread ffffffffa69be740's stack
fffffe8000a2e6a8 is in thread ffffffffa69be740's stack
ffffffffa2f3ce48 is ffffffffa2f3ce38+10, bufctl ffffffffa2f5b478 allocated from
turnstile_cache
ffffffffa69be7c0 is ffffffffa69be740+80, allocated as a thread structure
We already know that 0xffffffffa69be740 is the dladm thread. So what does the stack of 0xfffffe8000359c80 look like?
> fffffe8000359c80::findstack -v
stack pointer for thread fffffe8000359c80: fffffe80003592b0
[ fffffe80003592b0 resume_from_intr+0xbb() ]
fffffe80003592f0 swtch+0xad()
fffffe8000359390 turnstile_block+0x80d(0, 1, ffffffff85798720,
fffffffffbc039d8, 0, 0)
fffffe80003593f0 rw_enter_sleep+0x1fb(ffffffff85798720, 1)
fffffe8000359430 aggr_m_tx+0x2d(ffffffff85798720, ffffffff98778be0)
fffffe8000359450 dls_tx+0x20(ffffffff9540ca88, ffffffff98778be0)
fffffe8000359480 str_mdata_fastpath_put+0x2c(ffffffff85be6338,
ffffffff98778be0)
fffffe8000359520 tcp_send_data+0x6d7(fffffe80caedbf80, ffffffffa12eb0f8,
ffffffff98778be0)
fffffe8000359600 tcp_send+0x87b(ffffffffa12eb0f8, fffffe80caedbf80, 5b4, 28,
14, 0, fffffe80003596ac, fffffe80003596b0, fffffe80003596b4, fffffe8000359678
, 4fe574, 7fffffff)
fffffe80003596d0 tcp_wput_data+0x6db(fffffe80caedbf80, 0, 0)
fffffe8000359820 tcp_rput_data+0x2dbc(fffffe80caedbd80, ffffffffa0ecd8c0,
ffffffff843def40)
fffffe80003598b0 squeue_drain+0x212(ffffffff843def40, 4, 2fc20433e599)
fffffe8000359930 squeue_enter_chain+0x3bb(ffffffff843def40, ffffffffa66b4f60,
ffffffff98ede720, 10, 1)
fffffe8000359a00 ip_input+0x780(ffffffff85fadae8, fffffe8692bf3088,
ffffffffa66b4f60, e)
fffffe8000359ab0 i_dls_link_ether_rx+0x1ae(ffffffff93784290, fffffe8692bf3088
, ffffffffa66b4f60)
fffffe8000359b00 mac_rx+0x7a(ffffffff85798750, fffffe8692bf3088,
ffffffffa66b4f60)
fffffe8000359b70 aggr_recv_cb+0x1b9(ffffffff8614ccb0, fffffe8692bf3088,
ffffffffa66b4f60)
fffffe8000359bc0 mac_rx+0x7a(ffffffffa2112e00, fffffe8692bf3088,
ffffffffa66b4f60)
fffffe8000359c00 e1000g_intr+0xd2(fffffe813c11f000)
fffffe8000359c60 av_dispatch_autovect+0x83(19)
fffffe8000359c70 intr_thread+0x50()
>
This guy sleeps on another rwlock 0xffffffff85798720. Let's take a look:
> ffffffff85798720::rwlock
            ADDR      OWNER/COUNT FLAGS          WAITERS
ffffffff85798720 ffffffffa69be740  B101 fffffe8000359c80 (R)
                                    | | fffffe800016dc80 (R)
                 WRITE_LOCKED ------+ |
                  HAS_WAITERS --------+
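The FLAGS column in the two ::rwlock outputs is a small bit field printed in binary. As a reading aid, here is a sketch that decodes such a string. The bit positions are inferred from the legends mdb prints above (HAS_WAITERS as the lowest bit, then WRITE_WANTED, then WRITE_LOCKED) and should be treated as an assumption. The FLAG_* names and parse_rw_flags are made up for this example.

```c
/* Bit positions as inferred from the ::rwlock legends (assumption). */
#define	FLAG_HAS_WAITERS	0x1
#define	FLAG_WRITE_WANTED	0x2
#define	FLAG_WRITE_LOCKED	0x4

/*
 * Hypothetical helper: parse a "B011"-style binary flag string as
 * printed in the FLAGS column. Returns the flag bits, or -1 if the
 * string is not 'B' followed by one or more binary digits.
 */
static int
parse_rw_flags(const char *s)
{
	int v = 0;

	if (s == 0 || s[0] != 'B' || s[1] == '\0')
		return (-1);
	for (s++; *s != '\0'; s++) {
		if (*s != '0' && *s != '1')
			return (-1);
		v = (v << 1) | (*s - '0');
	}
	return (v);
}
```

Under these assumptions, "B011" decodes to HAS_WAITERS plus WRITE_WANTED (a writer is queued behind a reader), and "B101" decodes to HAS_WAITERS plus WRITE_LOCKED (a writer holds the lock and others are queued).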
The current thread (the e1000g receive path) is trying to take this lock as a reader, but it is already held as a writer by 0xffffffffa69be740 (the dladm thread, as noted before).
Meanwhile, thread 0xffffffffa69be740 is blocked on lock 0xffffffff93785580, which it is trying to acquire as RW_WRITER. In the disassembly below, the second argument to rw_enter is zeroed with xorl %esi,%esi, matching the 0 passed to rw_enter_sleep in its stack above:
mac_rx_remove: pushq %rbp
mac_rx_remove+1: movq %rsp,%rbp
mac_rx_remove+4: subq $0x18,%rsp
mac_rx_remove+8: pushq %r12
mac_rx_remove+0xa: pushq %r13
mac_rx_remove+0xc: pushq %r14
mac_rx_remove+0xe: movq %rdi,-0x8(%rbp)
mac_rx_remove+0x12: movq %rdi,%r14
mac_rx_remove+0x15: movq %rsi,-0x10(%rbp)
mac_rx_remove+0x19: movq %rsi,%r12
mac_rx_remove+0x1c: movq %r14,%r13
mac_rx_remove+0x1f: addq $0x270,%r13
mac_rx_remove+0x26: movq %r13,%rdi
mac_rx_remove+0x29: xorl %esi,%esi
mac_rx_remove+0x2b: xorl %eax,%eax
mac_rx_remove+0x2d: call -0x38746f <rw_enter> /* the first "rw_enter" comparing with source code */
mac_rx_remove+0x32: movq %r14,%r8
But mac_rx already holds it as RW_READER (note the movl $0x1,%esi before the call to rw_enter):
> mac_rx::dis
mac_rx: pushq %rbp
mac_rx+1: movq %rsp,%rbp
mac_rx+4: subq $0x20,%rsp
mac_rx+8: pushq %r12
mac_rx+0xa: pushq %r13
mac_rx+0xc: pushq %r14
mac_rx+0xe: pushq %r15
mac_rx+0x10: movq %rdi,-0x8(%rbp)
mac_rx+0x14: movq %rsi,-0x10(%rbp)
mac_rx+0x18: movq %rsi,%r15
mac_rx+0x1b: movq %rdx,-0x18(%rbp)
mac_rx+0x1f: movq %rdx,%r13
mac_rx+0x22: movq 0x10(%rdi),%r14
mac_rx+0x26: movq %r14,%r12
mac_rx+0x29: addq $0x270,%r12
mac_rx+0x30: movq %r12,%rdi
mac_rx+0x33: movl $0x1,%esi
mac_rx+0x38: xorl %eax,%eax
mac_rx+0x3a: call -0x3879f3 <rw_enter>
mac_rx+0x3f: movq 0x278(%r14),%r14
mac_rx+0x46: testq %r14,%r14
mac_rx+0x49: je +0x45 <mac_rx+0x8e>
mac_rx+0x4b: movq (%r14),%r8
mac_rx+0x4e: testq %r8,%r8
mac_rx+0x51: je +0x11 <mac_rx+0x62>
mac_rx+0x53: movq %r13,%rdi
mac_rx+0x56: xorl %eax,%eax
mac_rx+0x58: call -0x1a7153 <copymsgchain>
mac_rx+0x5d: movq %rax,%rdx
mac_rx+0x60: jmp +0x5 <mac_rx+0x65>
mac_rx+0x62: movq %r13,%rdx
mac_rx+0x65: testq %rdx,%rdx
mac_rx+0x68: je +0x12 <mac_rx+0x7a>
mac_rx+0x6a: movq 0x8(%r14),%r8
mac_rx+0x6e: movq 0x10(%r14),%rdi
mac_rx+0x72: movq %r15,%rsi
mac_rx+0x75: xorl %eax,%eax
mac_rx+0x77: call *%r8
So this is what the deadlock looks like in a core dump.
Technorati Tags: mdb OpenSolaris
Posted at 05:42 PM April 12, 2006 by raymond in Sun | Comments [0]
A R/W deadlock of aggregation in GLD code
I ran into this problem while running heavy traffic over an aggregation and adding and removing interfaces in the aggregation at the same time.
This is a good example of a reader/writer deadlock.
First, let's explain how the deadlock happens.
1. When a TCP packet arrives and triggers an interrupt, it follows the call sequence below:
driver_xxx_intr -> mac_rx -> (a series of TCP functions) -> aggr_m_tx
This call sequence acquires the rw locks in the following order:
(1) mac_rx -> mi_rx_lock (as RW_READER) mac.c, LINE 1145
(2) aggr_m_tx -> lg_lock (as RW_READER) aggr_send.c, LINE 220
/* See below code */
http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/io/mac/mac.c, LINE 1136 - 1167
1136 void
1137 mac_rx(mac_t *mp, mac_resource_handle_t mrh, mblk_t *bp)
1138 {
1139 mac_impl_t *mip = mp->m_impl;
1140 mac_rx_fn_t *mrfp;
1141
1142 /*
1143 * Call all registered receive functions.
1144 */
1145 rw_enter(&mip->mi_rx_lock, RW_READER);
1146 mrfp = mip->mi_mrfp;
1147 if (mrfp == NULL) {
1148 /* There are no registered receive functions. */
1149 freemsgchain(bp);
1150 rw_exit(&mip->mi_rx_lock);
1151 return;
1152 }
1153 do {
1154 mblk_t *recv_bp;
1155
1156 if (mrfp->mrf_nextp != NULL) {
1157 /* XXX Do we bump a counter if copymsgchain() fails? */
1158 recv_bp = copymsgchain(bp);
1159 } else {
1160 recv_bp = bp;
1161 }
1162 if (recv_bp != NULL)
1163 mrfp->mrf_fn(mrfp->mrf_arg, mrh, recv_bp);
1164 mrfp = mrfp->mrf_nextp;
1165 } while (mrfp != NULL);
1166 rw_exit(&mip->mi_rx_lock);
1167 }
When a packet arrives at the interface, the interrupt handler calls into mac_rx.
At line 1145, mip->mi_rx_lock is acquired as RW_READER.
Then, in the aggr code, http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/io/aggr/aggr_send.c, line 220 tries to acquire grp->lg_lock as RW_READER.
212 mblk_t *
213 aggr_m_tx(void *arg, mblk_t *mp)
214 {
215 aggr_grp_t *grp = arg;
216 aggr_port_t *port;
217 mblk_t *nextp;
218 const mac_txinfo_t *mtp;
219
220 rw_enter(&grp->lg_lock, RW_READER);
221
222 if (grp->lg_ntx_ports == 0) {
223 /*
224 * We could have returned from aggr_m_start() before
225 * the ports were actually attached. Drop the chain.
226 */
227 rw_exit(&grp->lg_lock);
228
229 freemsgchain(mp);
230 return (NULL);
231 }
232
233 for (;;) {
234 nextp = mp->b_next;
235 mp->b_next = NULL;
236
237 port = grp->lg_tx_ports[aggr_send_port(grp, mp)];
238 ASSERT(port->lp_state == AGGR_PORT_STATE_ATTACHED);
239
240 rw_exit(&grp->lg_lock);
241
242 /*
243 * We store the transmit info pointer locally in case it
244 * changes between loading mt_fn and mt_arg.
245 */
246 mtp = port->lp_txinfo;
247 if ((mp = mtp->mt_fn(mtp->mt_arg, mp)) != NULL) {
248 mp->b_next = nextp;
249 goto done;
250 }
251
252 if ((mp = nextp) == NULL)
253 goto done;
254
255 rw_enter(&grp->lg_lock, RW_READER);
256 }
257
258 done:
259 return (mp);
260 }
2. When an administrator uses dladm to remove an interface from the aggregation (with dladm remove-aggr), it follows the call sequence below:
aggr_ioctl -> aggr_ioc_remove -> aggr_grp_rem_ports -> aggr_grp_rem_port -> aggr_port_delete -> mac_rx_remove
So lg_lock and mi_rx_lock are acquired in this order:
(1) aggr_grp_rem_ports -> acquire "lg_lock" (as RW_WRITER), aggr_grp.c LINE 861
(2) mac_rx_remove -> acquire "mi_rx_lock" (as RW_WRITER), mac.c LINE 941
/* See below code */
http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/io/aggr/aggr_grp.c, LINE 842 - 909,
842 int
843 aggr_grp_rem_ports(uint32_t key, uint_t nports, laioc_port_t *ports)
844 {
845 int rc = 0, i;
846 aggr_grp_t *grp = NULL;
847 aggr_port_t *port;
848 boolean_t notify = B_FALSE, grp_mac_addr_changed;
849
850 /* get group corresponding to key */
851 rw_enter(&aggr_grp_lock, RW_READER);
852 if (mod_hash_find(aggr_grp_hash, GRP_HASH_KEY(key),
853 (mod_hash_val_t *)&grp) != 0) {
854 rw_exit(&aggr_grp_lock);
855 return (ENOENT);
856 }
857 AGGR_GRP_REFHOLD(grp);
858 rw_exit(&aggr_grp_lock);
859
860 AGGR_LACP_LOCK(grp);
861 rw_enter(&grp->lg_lock, RW_WRITER);
862
863 /* we need to keep at least one port per group */
864 if (nports >= grp->lg_nports) {
865 rc = EINVAL;
866 goto bail;
867 }
868
869 /* first verify that all the groups are valid */
870 for (i = 0; i < nports; i++) {
871 if (aggr_grp_port_lookup(grp, ports[i].lp_devname,
872 ports[i].lp_port) == NULL) {
873 /* port not found */
874 rc = ENOENT;
875 goto bail;
876 }
877 }
878
879 /* remove the specified ports from group */
880 for (i = 0; i < nports && !grp->lg_closing; i++) {
881 /* lookup port */
882 port = aggr_grp_port_lookup(grp, ports[i].lp_devname,
883 ports[i].lp_port);
884 ASSERT(port != NULL);
885
886 /* stop port if group has already been started */
887 if (grp->lg_started) {
888 rw_enter(&port->lp_lock, RW_WRITER);
889 aggr_port_stop(port);
890 rw_exit(&port->lp_lock);
891 }
892
893 /* remove port from group */
894 rc = aggr_grp_rem_port(grp, port, &grp_mac_addr_changed);
895 ASSERT(rc == 0);
896 notify = notify || grp_mac_addr_changed;
897 }
898
899 bail:
900 rw_exit(&grp->lg_lock);
901 AGGR_LACP_UNLOCK(grp);
902 if (notify && !grp->lg_closing)
903 mac_unicst_update(&grp->lg_mac, grp->lg_addr);
904 if (rc == 0 && !grp->lg_closing)
905 mac_resource_update(&grp->lg_mac);
906 AGGR_GRP_REFRELE(grp);
907
908 return (rc);
909 }
http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/io/mac/mac.c, LINE 930 - 953,
930 void
931 mac_rx_remove(mac_handle_t mh, mac_rx_handle_t mrh)
932 {
933 mac_impl_t *mip = (mac_impl_t *)mh;
934 mac_rx_fn_t *mrfp = (mac_rx_fn_t *)mrh;
935 mac_rx_fn_t **pp;
936 mac_rx_fn_t *p;
937
938 /*
939 * Search the 'rx' callback list for the function closure.
940 */
941 rw_enter(&(mip->mi_rx_lock), RW_WRITER);
942 for (pp = &(mip->mi_mrfp); (p = *pp) != NULL; pp = &(p->mrf_nextp)) {
943 if (p == mrfp)
944 break;
945 }
946 ASSERT(p != NULL);
947
948 /* Remove it from the list. */
949 *pp = p->mrf_nextp;
950 kmem_free(mrfp, sizeof (mac_rx_fn_t));
951 rw_exit(&(mip->mi_rx_lock));
952 }
3. The deadlock happens in the following scenario:
(1) Thread 1: dladm calls into "aggr_grp_rem_ports" and acquires "lg_lock" as RW_WRITER.
(2) Thread 2: a packet arrives at the aggregation, and the interrupt handler calls "mac_rx" and acquires "mi_rx_lock" as RW_READER.
(3) Thread 2: mac_rx calls down into "aggr_m_tx" and tries to acquire "lg_lock" as RW_READER, but it is held as RW_WRITER from step (1), so thread 2 blocks.
(4) Thread 1: aggr_grp_rem_ports calls down into "mac_rx_remove" and tries to acquire "mi_rx_lock" as RW_WRITER, but it is held as RW_READER from step (2), so thread 1 blocks.
(5) Neither thread can proceed: the deadlock has happened.
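The scenario above can be modeled as a small wait-for graph: each lock has a holder, each thread is blocked on at most one lock, and a deadlock is a cycle in "thread waits on lock, lock is held by thread". The sketch below is a simplified user-space model, not kernel code. It assumes one holder per lock, which is enough for this two-thread case, and every name in it (waitfor_t, has_deadlock, the enum constants) is made up for illustration.

```c
#define	NTHREADS	2
#define	NLOCKS		2
#define	NONE		(-1)

enum { T_DLADM = 0, T_INTR = 1 };		/* thread 1 and thread 2 */
enum { L_LG_LOCK = 0, L_MI_RX_LOCK = 1 };	/* the two rw locks */

typedef struct waitfor {
	int holder[NLOCKS];	/* thread holding each lock, or NONE */
	int waits_on[NTHREADS];	/* lock each thread is blocked on, or NONE */
} waitfor_t;

/*
 * Returns 1 if following "waits on -> held by" edges from start leads
 * back to start, that is, start is part of a deadlock cycle.
 */
static int
has_deadlock(const waitfor_t *w, int start)
{
	int t = start;
	int steps;

	for (steps = 0; steps < NTHREADS; steps++) {
		int l = w->waits_on[t];

		if (l == NONE)
			return (0);	/* t is runnable: chain ends */
		t = w->holder[l];
		if (t == NONE)
			return (0);	/* lock is free: chain ends */
		if (t == start)
			return (1);	/* cycle back to start */
	}
	return (0);
}
```

Filling in the state after step (4), with dladm holding lg_lock and waiting on mi_rx_lock while the interrupt thread holds mi_rx_lock and waits on lg_lock, makes has_deadlock return 1 for either thread. Clearing either wait breaks the cycle.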
Technorati Tags: OpenSolaris
Posted at 04:35 PM April 12, 2006 by raymond in Sun | Comments [0]