Friday Jun 17, 2005

A race erased
Race conditions can be difficult to debug since they represent
a sequence of events that have to happen in a particular order for the
condition to manifest itself. This blog describes the debugging
process which revealed the sequence of events that lead to the race,
how this also allowed to reproduce the problem (with some
instrumentation)  and ultimately allowing the verification of the fix.

Bug 4986151 was filed some time ago stating that a  internal test
suite was hanging while executing the following command:

ifconfig qfe0 inet6 up

The key, though, was that this hang would happen once in a while (maybe
once in 30 times).

Thirumalai (our resident networking guru) quickly debugged the dump
and pointed out that there was a missing "reference release" that was
causing the hang. Thinking that this would be an easy fix, I happily took
this bug  on; only to have Thirumalai throw a curveball- "You need to find
out the exact set of circumstances that caused this and reproduce this
so you can verify the fix" :-). So much for an easy fix...

But I was up for the challange, so dove into the dump with mdb.
First I took a look at the state of the "network interface" qfe0 under
question:

> 30003802fe8::print ill_t ill_ipif|::print ipif_t ipif_flags         
ipif_flags = 0x8020000

ILL in the ill_t stands for "IP lower level" and represents "inteface
endpoint" as seen by IP. Each ILL can have "ipifs" which represent
the logical instances (eg: qfe0:1) of the physcical interface.

The address of the ILL (i.e 30003802fe8) is obtained by walking the
ILLs via mdb as seen below

So it can be seen that the IFF_NOFAILOVER and the IFF_NOLOCAL
flags are set, looking at the values in if.h

The IFF_NOLOCAL flag is set by ifconfig during the duration of DAD
(duplicate address detection) when bringing the interface 'up'.

Looking closely at the ifconfig source it can been seen that DAD calls
SIOCSLIFFLAGS (the hung ioctl in question) twice. Once to set the
IFF_NOLOCAL|IFF_UP flags (so that it can do the DAD operation by
sending/recieving packets) and then the second time to restore the
original flags (once DAD is completed). SIOCSLIFFLAGS is an ioctl
used to set flags on a network interface.

It is the second SIOCSLIFFLAGS that is actually hung. To understand
this one needs to understand that 'ifconfig <interface> up' actually
triggers first a down on the interface and then an up (in the kernel).

> 30003802fe8::print ill_t ill_dl_up
ill_dl_up = 1
The ill_dl_up being set to TRUE and the ipif_flags being set to
IFF_NOLOCAL indicate that the first SIOCSLIFFLAGS operation was
successful completely.

The ill_dl_up is set in latter part of the up
operation i.e in ip_rput_dlpi_writer (called in response to the
DL_BIND_REQ sent down by ill_dl_up).

When the second SIOCSLIFFLAGS is issued, first the down on the
interface is initiated. The ipif_down function, clears the IPIF_UP
flag, destroys the multicast memberships, deletes IREs associated
with the interface, clears NCEs (neighbor cache entries) and then checks if
the ipif is quiescent (i.e no outstanding refrences on the ILL).

It is the last step that causes the command to hang since the ILL has an
has an outstanding reference count and the ioctl message is queued for
later processing (once the reference count drops to zero but it never does
because of the missing reference release). Note all of what's stated in the
paragraph can be observed to be true by looking at the state of the ipif/ill
the dump (indicating that the down had progressed till the ipif_is_quiescent
check)

> 30003802fe8::print ill_t ill_ipif_up_count
ill_ipif_up_count = 0
30003802fe8::print ill_t ill_ipif|::print ipif_t ipif_flags
ipif_flags = 0x8020000 (note: IFF_UP is 0x1 which is not set)
30003802fe8::print ill_t ill_ipif|::print ipif_t ipif_multicast_up
ipif_multicast_up = 0
30003802fe8::print ill_t ill_nce_cnt                       
ill_nce_cnt = 0
So, to summarize, here is the chronological sequence of events that trigger the
hang:

thread I (the ifconfig thread):

1> 'ifconfig qfe0 inet6 up' triggers the DAD. DAD sets IFF_NOLOCAL
and IFF_UP (via SIOCSLIFFLAGS) in order to send/recieve packets
related to DAD.

2> Setting the IFF_UP flag actually causes the DL_BIND exchange
between IP and the qfe driver and this allows packets to be
sent/recieved.

3> ifconfig completes DAD

<thread II (see below) executes somewhere here>

4> ifconfig issues the second SIOCSLIFFLAGS
to restore the flags. This first intiates the ipif to be brought down
(via ipif_down) and this involves (in order):

a> clearing the IPIF_UP flag and the ill_ipif_up_count
b> destroying the mcast memberships
c> deleting the IREs associated with the interface
d> bringing NDP down (i.e deleting the NCEs)
e> checking if the ipif is quiescent
f> queuing the request if ipif is not quiescent (and returns EINPROGRESS)


thread II (the inbound ICMP thread):

1> An ICMP packet arrives on qfe0 and is one to which there is a
reply, so it ends up in ip_output_v6. However, there is no IRE_CACHE
for this destination so ip_newroute_v6 is called. Note, IRE stands for
Internet Route Entry and every address in the system has an an
internal IRE structure associated with it. IRE_CACHE represent
those "resolved entries" i.e there is complete information available
to send a packet whose destination address is available as an IRE_CACHE.
ip_newroute_v6 intiates the process to create such a complete
entry if one doesn't exist by contacting an external resolver like
ARP for example to get the MAC address of the destination in
question.

2> An ire_ftable_lookup_v6 is done to find a matching route that can
reach the destination

3> A corresponding destination ILL is looked up

4> An attempt is made to locate an appropriate source IRE but this
fails (ipif_select_source_v6 returns NULL since IFF_NOLOCAL is set
on the ipif) and the code branches to the icmp_err_ret failure path

5> In the failure path, ip_hdr_complete_v6 fails since there is no
local/loopack v6 IREs and the reference release on the ILL is not
done. That is,  it leaves the ILL with an outstanding reference
count.

Thread II executes somewhere between step 3 and 4 of thread
I.

To reproduce the problem (and to confirm the above theory), ifconfig
is intrumented such that a delay is introduced between the first
and the second SIOCSLIFFLAGS during the dad operation (in do_dad).
This opens the race window in which the failure path can be
exercised. To exercise the failure path a local packet (say a v6 ping)
is sent to remote link local destination.

On solar3 (the test system) we run the instrumented ifconfig.

solar3# /ifconfig.test qfe0 inet6 plumb
solar3# /ifconfig.test qfe0 inet6 up
ifconfig sleeping for 30 secs
In the 30 sec window a ping is intitiated:
solar3# ping fe80::a00:20ff:fec4:c593
^C

The ping is interrupted after a few seconds.

With ip debug turned on, the following messages are seen:

----
Mar 15 17:20:35 solar3 ip: [ID 844449 kern.notice] ip_newroute_v6: ire_ftable_lookup_v6() returned ire 300017d0c28, sire 0
Mar 15 17:20:35 solar3 ip: [ID 357465 kern.notice] ipif_select_source_v6(qfe0, fe80::a00:20ff:fec4:c593) -> NULL
Mar 15 17:20:35 solar3 ip: [ID 787518 kern.notice] ip_newroute_v6: no src for dst fe80::a00:20ff:fec4:c593
Mar 15 17:20:35 solar3 , 
Mar 15 17:20:35 solar3 ip: [ID 481217 kern.notice] ip_newroute_v6: interface name qfe0
Mar 15 17:20:35 solar3 ip: [ID 558731 kern.notice] ip_newroute_v6: no route
Mar 15 17:20:35 solar3 ip: [ID 354122 kern.notice] ip_hdr_complete_v6:
no source IRE
snipping several ping retries>
Mar 15 17:20:59 solar3 ip: [ID 326427 kern.notice] ipif_down: need to wait, adding pending mp qfe0 ill 3000164d4a8 Mar 15 17:20:59 solar3 ip: [ID 519129 kern.notice] ipif_down returns 150 err ---

and the ifconfig hangs as expected. Looking at the the live system:

solar3# mdb -k
Loading modules: [ unix krtld genunix sd ip usba nca nfs random ptm logindmux ]
> ::walk ill
3000164d728
3000164d4a8
3000164d4a8::print ill_t ill_isv6 ill_refcnt
ill_isv6 = 1
ill_refcnt = 0x4


We see that the ill_refcnt is non-zero. It is 0x4 in this case since
ping tries 20 times to send a packet before giving up (and the ping
was interrupted a few seconds after it is started)

Technorati Tag:
Technorati Tag:
Technorati Tag: