How an IP Tunnel Interface Dynamically Adjusts its Link MTU
Tunnel MTU Discovery
With the launch of OpenSolaris
comes the opportunity to discuss the implementation details behind
existing Solaris features. I'd like to share some of the details behind one
of my contributions to Solaris 10: the implementation of dynamic MTU
calculation for IP tunnel interfaces.
Solaris 8 was the first version of Solaris that implemented the IP in
IP tunneling mechanism described in
RFC1853. It did not,
however, implement the "Tunnel MTU Discovery" section of this RFC.
Tunneling over IPv6 (RFC2473) was implemented very
early in Solaris 10 (and backported to Solaris 9 in Update 1) along with a
Tunnel MTU Discovery mechanism that worked for IPv6 tunnel interfaces only.
Some mechanism was needed that worked for both IPv4 and IPv6 tunnels, and
that was visible to the administrator. One drawback of the existing
Tunnel MTU Discovery implementation for IPv6 tunnels was that there
was no observability into the tunnel MTU (ifconfig's output always showed
a static MTU value that was unrelated to the tunnel interface's actual
MTU).
This work became more important when customers (internal and external to
Sun) started using Solaris' IPsec tunneling to implement VPN solutions. Without
proper Tunnel MTU Discovery, things like TCP MSS calculations can take
longer to converge to usable values, and protocols that don't have any
insight into Path MTU (UDP, for example) cause unnecessary IP
fragmentation. For more on the benefits of Tunnel MTU Discovery, see the
two aforementioned RFCs on IP tunneling.
Without going into too much detail about the inner workings of the ip
and tun modules or every line of code that was changed to implement this
feature, I'd like to focus on two aspects of the implementation. The first
is the mechanism used by the tun module to obtain path MTU information about
the tunnel destination from ip, and the second is the mechanism by which the
ip interface's MTU is dynamically changed when the tun module detects a change
in the tunnel's link MTU.
IRE_DB_REQ_TYPE
In order for the tun module to be able to calculate a useful tunnel MTU, it
needs to know the Path MTU of the tunnel destination. The tunnel destination
is the IP node we'll send encapsulated packets to when sending them through
the tunnel interface. In ifconfig output, it is the "tunnel dst":
# ifconfig ip.tun0
ip.tun0: flags=10008d1<UP,POINTOPOINT,RUNNING,NOARP,MULTICAST,IPv4> mtu 1480 index 4
        inet tunnel src 11.0.0.1 tunnel dst 11.0.0.2
        tunnel hop limit 60
        inet 10.0.0.1 --> 10.0.0.2 netmask ff000000
In the above example, IP packets forwarded into ip.tun0 are encapsulated
into an outer IP header with a source of 11.0.0.1 and a destination of
11.0.0.2. 11.0.0.2 is the "tunnel destination".
The Path MTU to the destination is the size of the largest IP packet that can
be sent to the destination without being fragmented or resulting in an ICMP
fragmentation needed message. The tunnel MTU of a given tunnel is the Path MTU
of the tunnel destination minus any tunneling overhead (the encapsulating IP
header and perhaps IPsec headers if IPsec tunneling is being used).
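As a rough worked example of that arithmetic (a sketch, not the tun module's
code): the helper name and constants below are made up for illustration, with
20 bytes being the size of an IPv4 header without options and 40 bytes the
size of the fixed IPv6 header.

```c
#include <stdint.h>

/* Illustrative constants; these are not taken from the Solaris headers. */
#define SKETCH_IPV4_HDR_LEN 20u /* IPv4 header without options */
#define SKETCH_IPV6_HDR_LEN 40u /* fixed IPv6 header */

/*
 * Hypothetical helper: subtract the encapsulation overhead from the Path
 * MTU to the tunnel destination to get the tunnel's link MTU. Returns 0
 * if the overhead does not fit within the Path MTU at all.
 */
static uint32_t
sketch_tunnel_mtu(uint32_t pmtu, uint32_t outer_hdr_len,
    uint32_t ipsec_overhead)
{
    uint32_t overhead = outer_hdr_len + ipsec_overhead;

    if (pmtu <= overhead)
        return (0);
    return (pmtu - overhead);
}
```

With a Path MTU of 1500 and plain IPv4 encapsulation,
sketch_tunnel_mtu(1500, SKETCH_IPV4_HDR_LEN, 0) gives 1480, which is the mtu
shown for ip.tun0 in the ifconfig output above.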
The ip module keeps this Path MTU information in a per-destination cache (aka
IRE cache) table. The protocol used to keep track of this Path MTU information
is described in RFC1191.
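To give a feel for the bookkeeping RFC1191 asks of a host, here is a minimal
sketch of a single per-destination entry. It is not the Solaris IRE cache:
the structure and function names are hypothetical, and raising the estimate
straight back to the first-hop MTU after the aging interval is a
simplification. The 68-byte floor and the 10-minute aging interval are the
values RFC1191 gives.

```c
#include <stdint.h>

#define SKETCH_MIN_PMTU      68u  /* RFC1191: never go below 68 bytes */
#define SKETCH_PMTU_AGE_SECS 600u /* RFC1191 suggests about 10 minutes */

/* Hypothetical per-destination cache entry. */
typedef struct sketch_pmtu_entry {
    uint32_t pe_dst;        /* destination address */
    uint32_t pe_pmtu;       /* current Path MTU estimate */
    uint32_t pe_link_mtu;   /* MTU of the first hop */
    uint64_t pe_lowered_at; /* time (seconds) the estimate was lowered */
} sketch_pmtu_entry_t;

/* An ICMP "fragmentation needed" error arrived carrying a next-hop MTU. */
static void
sketch_pmtu_too_big(sketch_pmtu_entry_t *pe, uint32_t next_hop_mtu,
    uint64_t now)
{
    if (next_hop_mtu < SKETCH_MIN_PMTU)
        next_hop_mtu = SKETCH_MIN_PMTU;
    /* Only ever lower the estimate in response to an ICMP error. */
    if (next_hop_mtu < pe->pe_pmtu) {
        pe->pe_pmtu = next_hop_mtu;
        pe->pe_lowered_at = now;
    }
}

/* Periodic aging: after the interval, probe for a larger Path MTU. */
static void
sketch_pmtu_age(sketch_pmtu_entry_t *pe, uint64_t now)
{
    if (pe->pe_pmtu < pe->pe_link_mtu &&
        now - pe->pe_lowered_at >= SKETCH_PMTU_AGE_SECS)
        pe->pe_pmtu = pe->pe_link_mtu;
}
```

The aging step is why a consumer of this information cannot read the Path
MTU once and keep it forever: the estimate can go back up later.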
The ip module provides a number of methods of accessing this per-destination
cache. One of them is the ire_ctable_lookup() functional interface, but because
tun and ip are separate STREAMS modules and this functional interface was
previously only safe to use within the ip module's STREAMS perimeter[1], tun
could not use it.
Another method ip provides is the IRE_DB_REQ_TYPE STREAMS message. An
upstream module can send such a message down to ip, and ip will reply
with an IRE_DB_TYPE message and append a copy of the IRE[2] requested to the message (assuming the requested
IRE is found). This is the method used by the tun module. Periodically, tun
sends down this message to get the current Path MTU for its tunnel destination.
For example, in tun_wdata_v4(), it does this when sending a packet down to ip
if the Path MTU information it holds has become too old:
/*
 * Request the destination ire regularly in case Path MTU has
 * increased.
 */
if (TUN_IRE_TOO_OLD(atp))
    tun_send_ire_req(q);
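The general pattern here (use the cached value now, ask for a fresh one when
the cached one is stale, update when the reply arrives) can be sketched in
isolation. Everything below is hypothetical: the structure, the field names,
and the age threshold are illustrative and do not reflect how
TUN_IRE_TOO_OLD() or tun_send_ire_req() are actually defined.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-tunnel state kept by the module. */
typedef struct sketch_tun {
    uint64_t st_ire_stamp;   /* when the Path MTU was last refreshed */
    uint64_t st_ire_max_age; /* how long a cached value is trusted */
    uint32_t st_pmtu;        /* cached Path MTU to the tunnel destination */
    unsigned st_reqs_sent;   /* counts requests sent downstream */
} sketch_tun_t;

static bool
sketch_ire_too_old(const sketch_tun_t *stp, uint64_t now)
{
    return (now - stp->st_ire_stamp > stp->st_ire_max_age);
}

/* Send path: the packet still goes out using the cached value. */
static void
sketch_wdata(sketch_tun_t *stp, uint64_t now)
{
    if (sketch_ire_too_old(stp, now))
        stp->st_reqs_sent++; /* stands in for sending the request to ip */
}

/* Reply path: a fresh Path MTU arrived from below. */
static void
sketch_ire_reply(sketch_tun_t *stp, uint32_t pmtu, uint64_t now)
{
    stp->st_pmtu = pmtu;
    stp->st_ire_stamp = now;
}
```

Because the refresh is asynchronous, the send path never blocks waiting for
an answer; the new value simply takes effect for later packets.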
DL_NOTIFY_REQ/IND and DL_NOTE_SDU_SIZE
Once the tun module has obtained the Path MTU information of the
destination, it needs to recalculate the link MTU of the tunnel interface and
notify the upper instance of ip if the MTU has changed. The ip module can
then update the IP interface's MTU accordingly. The MTU calculation is done
by the tun_update_link_mtu() function, which in turn
calls tun_sendsdusize() to notify the ip module of the new
MTU if it has changed:
/*
 * Given the path MTU to the tunnel destination, calculate tunnel's link
 * mtu. For configured tunnels, we update the tunnel's link MTU and notify
 * the upper instance of IP of the change so that the IP interface's MTU
 * can be updated. If the tunnel is a 6to4 or automatic tunnel, just
 * return the effective MTU of the tunnel without updating it. We don't
 * update the link MTU of 6to4 or automatic tunnels because they tunnel to
 * multiple destinations all with potentially differing path MTU's.
 */
static uint32_t
tun_update_link_mtu(queue_t *q, uint32_t pmtu, boolean_t icmp)
{
    tun_t *atp = (tun_t *)q->q_ptr;
    uint32_t newmtu = pmtu;
    boolean_t sendsdusize = B_FALSE;
    /*
     * If the pmtu provided came from an ICMP error being passed up
     * from below, then the pmtu argument has already been adjusted
     * by the IPsec overhead.
     */
    if (!icmp && (atp->tun_flags & TUN_SECURITY))
        newmtu -= atp->tun_ipsec_overhead;
    if (atp->tun_flags & TUN_L_V4) {
        newmtu -= sizeof (ipha_t);
        if (newmtu < IP_MIN_MTU)
            newmtu = IP_MIN_MTU;
    } else {
        ASSERT(atp->tun_flags & TUN_L_V6);
        newmtu -= sizeof (ip6_t);
        if (atp->tun_encap_lim > 0)
            newmtu -= IPV6_TUN_ENCAP_OPT_LEN;
        if (newmtu < IPV6_MIN_MTU)
            newmtu = IPV6_MIN_MTU;
    }
    if (!(atp->tun_flags & (TUN_6TO4 | TUN_AUTOMATIC))) {
        if (newmtu != atp->tun_mtu) {
            atp->tun_mtu = newmtu;
            sendsdusize = B_TRUE;
        }
        if (sendsdusize)
            tun_sendsdusize(q);
    }
    return (newmtu);
}
Note, there is a cosmetic bug in the above code. The fix would be a good
starter fix for anyone wishing to be introduced to the OpenSolaris
development process. :-) The sendsdusize variable is obviously not needed
and the last if statement can be reduced to:
if (newmtu != atp->tun_mtu &&
    !(atp->tun_flags & (TUN_6TO4 | TUN_AUTOMATIC))) {
    atp->tun_mtu = newmtu;
    tun_sendsdusize(q);
}
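One way to convince yourself that the two forms behave identically is a small
exhaustive check over a stand-alone model of the logic. This is only a sketch:
the flag values and names are hypothetical stand-ins for TUN_6TO4 and
TUN_AUTOMATIC, and "notified" stands in for the call to tun_sendsdusize().

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits standing in for TUN_6TO4 and TUN_AUTOMATIC. */
#define SK_6TO4      0x1u
#define SK_AUTOMATIC 0x2u

typedef struct sk_result {
    uint32_t mtu;      /* the tunnel MTU after the update */
    bool     notified; /* whether the upper ip instance was notified */
} sk_result_t;

/* Models the original form, including the redundant variable. */
static sk_result_t
sk_original(uint32_t flags, uint32_t curmtu, uint32_t newmtu)
{
    sk_result_t r = { curmtu, false };
    bool sendsdusize = false;

    if (!(flags & (SK_6TO4 | SK_AUTOMATIC))) {
        if (newmtu != r.mtu) {
            r.mtu = newmtu;
            sendsdusize = true;
        }
        if (sendsdusize)
            r.notified = true;
    }
    return (r);
}

/* Models the simplified form. */
static sk_result_t
sk_simplified(uint32_t flags, uint32_t curmtu, uint32_t newmtu)
{
    sk_result_t r = { curmtu, false };

    if (newmtu != r.mtu && !(flags & (SK_6TO4 | SK_AUTOMATIC))) {
        r.mtu = newmtu;
        r.notified = true;
    }
    return (r);
}

/* Compare both forms over every flag combination and MTU pairing. */
static bool
sk_equivalent(void)
{
    const uint32_t mtus[] = { 1280, 1480 };

    for (uint32_t flags = 0; flags <= (SK_6TO4 | SK_AUTOMATIC); flags++) {
        for (size_t i = 0; i < 2; i++) {
            for (size_t j = 0; j < 2; j++) {
                sk_result_t a = sk_original(flags, mtus[i], mtus[j]);
                sk_result_t b = sk_simplified(flags, mtus[i], mtus[j]);

                if (a.mtu != b.mtu || a.notified != b.notified)
                    return (false);
            }
        }
    }
    return (true);
}
```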
How does the notification between tun and ip work? It's done via a
Solaris-specific DLPI notification mechanism. The dlpi(7P) man page
describes the mechanism as "Notification Support", and it includes support for
the asynchronous notification of link status (up or down), SDU (service data
unit, or MTU) size, link speed, and other information. The tun module uses the
SDU size notification.
The mechanism works as follows:
- When an IP interface is plumbed, the ip module sends the
underlying driver a DL_NOTIFY_REQ DLPI message. The message contains a
bitfield representing the notifications that ip is interested in. This
is done by ill_dl_phys():
/*
 * Allocate a DL_NOTIFY_REQ and set the notifications we want.
 */
notify_mp = ip_dlpi_alloc(sizeof (dl_notify_req_t) + sizeof (long),
    DL_NOTIFY_REQ);
if (notify_mp == NULL)
    goto bad;
((dl_notify_req_t *)notify_mp->b_rptr)->dl_notifications =
    (DL_NOTE_PHYS_ADDR | DL_NOTE_SDU_SIZE | DL_NOTE_FASTPATH_FLUSH |
    DL_NOTE_LINK_UP | DL_NOTE_LINK_DOWN | DL_NOTE_CAPAB_RENEG);
...
ill_dlpi_send(ill, notify_mp);
- The underlying driver (tun in this case) replies with a DL_NOTIFY_ACK
containing the subset of notifications that it supports. The tun module only
supports DL_NOTE_SDU_SIZE.
- When an event that triggers a change in MTU occurs, the driver (tun)
sends up a DL_NOTIFY_IND message to those DLPI users that were interested
in DL_NOTE_SDU_SIZE notifications. The tun module does this in the tun_sendsdusize() function.
- When ip receives the DL_NOTIFY_IND message containing a
DL_NOTE_SDU_SIZE notification, it updates the IP tunnel interface's MTU
accordingly, and ifconfig shows the new dynamically updated MTU!
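The four steps above amount to a bitmask negotiation followed by an
asynchronous callback. The sketch below models that flow with plain function
calls in place of STREAMS messages. All names and bit values are hypothetical;
the real DL_NOTE_* constants and message layouts are defined by DLPI and are
not reproduced here.

```c
#include <stdint.h>

/* Hypothetical notification bits (not the real DL_NOTE_* values). */
#define SK_NOTE_LINK_UP   0x1u
#define SK_NOTE_LINK_DOWN 0x2u
#define SK_NOTE_SDU_SIZE  0x4u

/* Stand-in for the driver (tun) side. */
typedef struct sk_driver {
    uint32_t sd_supported; /* notifications the driver can generate */
    uint32_t sd_enabled;   /* notifications the upper layer asked for */
    uint32_t sd_sdu;       /* current SDU size (link MTU) */
} sk_driver_t;

/* Stand-in for the upper (ip) side. */
typedef struct sk_ipif {
    uint32_t si_mtu; /* MTU shown for the IP interface */
} sk_ipif_t;

/* Steps 1 and 2: the request carries a mask, the ack returns the subset. */
static uint32_t
sk_notify_req(sk_driver_t *drv, uint32_t wanted)
{
    drv->sd_enabled = wanted & drv->sd_supported;
    return (drv->sd_enabled);
}

/* Step 4: the upper layer handles an indication. */
static void
sk_notify_ind(sk_ipif_t *ipif, uint32_t note, uint32_t data)
{
    if (note == SK_NOTE_SDU_SIZE)
        ipif->si_mtu = data;
}

/* Step 3: the driver's SDU size changed; notify only if it was requested. */
static void
sk_set_sdu(sk_driver_t *drv, sk_ipif_t *ipif, uint32_t sdu)
{
    if (sdu == drv->sd_sdu)
        return;
    drv->sd_sdu = sdu;
    if (drv->sd_enabled & SK_NOTE_SDU_SIZE)
        sk_notify_ind(ipif, SK_NOTE_SDU_SIZE, sdu);
}
```

A driver that supports only SK_NOTE_SDU_SIZE acknowledges just that bit no
matter how many notifications were requested, and a later call such as
sk_set_sdu(&drv, &ipif, 1400) updates the upper layer's MTU to 1400.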
[1] The IP Multithreading feature of the FireEngine project now makes it
possible for other modules to use this functional interface. Some modules
such as ipf (IP Filter) and nattymod (IPsec NAT traversal) already use it.
The tun module can now use it as well, which is something we plan on doing.
[2] An IRE, or internet routing entry, is
a data structure internal to Solaris' IP implementation used to represent
forwarding table entries _and_ per-destination cache entries. Creation and
maintenance of IRE tables is by far the most complex (some would say overly
complex) part of the ip module. The subject of IREs would make for a very
lengthy blog entry on its own.