Kacheong Poon's Weblog
潘嘉昌的Weblog

Tuesday, June 14, 2005

Magic ndd(1M) tunables

We often get requests from customers asking what certain IP/TCP/UDP/... ndd(1M) parameters mean and when to change them.  One common reason is that some of those parameters are thought to be secret magic knobs for improving network performance.  In reality, nearly all of them are not supposed to be changed at all; they exist only for abnormal situations.  Now that OpenSolaris is available, the truth about those parameters is finally revealed!  People can just look at the code and comments!  To illustrate this, I'll describe one TCP ndd(1M) parameter, tcp_use_smss_as_mss_opt, which was added in Solaris 10.

You can find the following piece of code in the usr/src/uts/common/inet/tcp/tcp.c file (you need to scroll down a little to find it).

case TCPS_SYN_RCVD:
	flags |= TH_SYN;

	/*
	 * Reset the MSS option value to be SMSS
	 * We should probably add back the bytes
	 * for timestamp option and IPsec. We
	 * don't do that as this is a workaround
	 * for broken middle boxes/end hosts, it
	 * is better for us to be more cautious.
	 * They may not take these things into
	 * account in their SMSS calculation. Thus
	 * the peer's calculated SMSS may be smaller
	 * than what it can be. This should be OK.
	 */
	if (tcp_use_smss_as_mss_opt) {
		u1 = tcp->tcp_mss;
		U16_TO_BE16(u1, wptr);
	}

The code above is executed when a TCP end point is in the SYN-RECEIVED state (defined as TCPS_SYN_RCVD in the code) and is composing the SYN/ACK segment in response to an incoming SYN segment.  The variable wptr points to where the TCP MSS (Maximum Segment Size) option value is stored in the SYN/ACK.  So if the ndd parameter tcp_use_smss_as_mss_opt is set to a nonzero value, the TCP MSS option value will be set to tcp->tcp_mss.  The field tcp_mss in the tcp_t structure is the actual sending MSS, which is calculated from the TCP MSS option value in the received SYN segment, the length of additional TCP options in each segment, the outgoing network interface's MTU size and possibly IPsec header overhead.  The default value of tcp_use_smss_as_mss_opt is 0.  So why would one want to use the sending MSS as the advertised TCP MSS option value?  The MSS option value is supposed to mean the maximum segment size the local TCP end point can receive, not send.
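To make the effect of the tunable concrete, here is a minimal user-space sketch of the choice described above.  It is not the Solaris code: the names use_smss_as_mss_opt, put_be16 and fill_synack_mss are hypothetical, and put_be16 only assumes that U16_TO_BE16 stores a 16-bit value in network byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the ndd tunable; 0 is the default. */
static int use_smss_as_mss_opt = 0;

/*
 * Store a 16-bit value in network (big-endian) byte order.  This is
 * what the U16_TO_BE16 macro in the excerpt is assumed to do.
 */
static void put_be16(uint16_t v, uint8_t *p)
{
    p[0] = (uint8_t)(v >> 8);
    p[1] = (uint8_t)(v & 0xff);
}

/*
 * Write the 2-byte MSS option value of a SYN/ACK at wptr.
 *   recv_mss: the largest segment the local side can receive
 *   send_mss: the sending MSS already computed for this connection
 * Returns the value that was written.
 */
static uint16_t fill_synack_mss(uint8_t *wptr, uint16_t recv_mss,
                                uint16_t send_mss)
{
    uint16_t v;

    assert(wptr != NULL);
    /* With the tunable set, advertise the sending MSS instead. */
    v = use_smss_as_mss_opt ? send_mss : recv_mss;
    put_be16(v, wptr);
    return v;
}
```

With the flag at 0, fill_synack_mss(buf, 1460, 1452) writes 1460; with the flag set to 1, the same call writes 1452.  The value 1452 is only an illustrative sending MSS.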

The reason is briefly described in the comments above the code.  We introduced this parameter to get around some broken middle boxes.  Without this parameter, the local TCP end point uses the outgoing network interface's MTU size to calculate the TCP MSS option value, since the MTU size of the network interface determines how big a segment the local TCP end point can receive.  If the network interface is a normal Ethernet card with an MTU of 1500 bytes, the MSS option value is 1460 (1500 minus 20 bytes of IPv4 header and 20 bytes of TCP header).  Note that this value is independent of the MSS option value in the SYN segment advertised by the other side of a connection.  The other side should use the same method to calculate its TCP MSS option value.  Each side of a connection then calculates the correct sending MSS from the MSS option value it received.  Everything should work as expected...
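The arithmetic can be sketched as follows.  This is a simplified model, not the Solaris calculation: it assumes IPv4 and TCP headers without options (20 bytes each), ignores IPsec overhead, and the function names advertised_mss and send_mss are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

enum { IPV4_HDR_LEN = 20, TCP_HDR_LEN = 20 };

/* MSS option value to advertise: interface MTU minus the base headers. */
static uint16_t advertised_mss(uint16_t mtu)
{
    assert(mtu > IPV4_HDR_LEN + TCP_HDR_LEN);
    return (uint16_t)(mtu - IPV4_HDR_LEN - TCP_HDR_LEN);
}

/*
 * Sending MSS: limited by both the peer's advertised MSS and what the
 * local interface can carry, minus the TCP option bytes (for example
 * timestamps) that every segment of the connection will include.
 */
static uint16_t send_mss(uint16_t peer_mss_opt, uint16_t local_mtu,
                         uint16_t opt_overhead)
{
    uint16_t local = advertised_mss(local_mtu);
    uint16_t mss = peer_mss_opt < local ? peer_mss_opt : local;

    assert(mss > opt_overhead);
    return (uint16_t)(mss - opt_overhead);
}
```

For example, advertised_mss(1500) is 1460, and send_mss(1460, 1500, 12) is 1448 when each segment carries 12 bytes of options.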

The problem comes when there is a broken middle box, such as a DSL modem/router using PPPoE.  Suppose machine A has a normal Ethernet interface but is connected to the Internet using DSL with PPPoE.  A's TCP stack may not know that PPPoE reduces the usable MTU size.  This is usually not a problem, as path MTU discovery can handle it.  But if the DSL modem/router is broken and does funny things, we get into trouble.  One funny thing is that it may modify the MSS option value A sends out.  The value A sends out should be 1460, but the modem/router can reset it to a lower value based on the PPPoE overhead.  By doing this, the modem/router thinks that it helps solve the path MTU issue.  It then forgets about path MTU discovery altogether and does not send the ICMP messages that path MTU discovery needs in order to work.

Suppose A is trying to talk to machine B, which also uses an Ethernet interface.  The modem/router changes A's TCP MSS option value to X, but it does not change the MSS option value (which should be 1460) in B's SYN/ACK.  While B will not send a segment larger than X bytes to A, A can send a full 1460-byte segment to B.  These full 1460-byte segments to B will be dropped by the modem/router.  And since the modem/router does not participate in path MTU discovery, A's TCP stack will never know about the problem and the connection will just hang.
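The asymmetry can be modelled in a few lines.  This is a toy model under stated assumptions, not a description of any particular modem/router: the helper names box_clamp_mss and box_forwards are hypothetical, headers are taken to be 40 bytes, and the 1492-byte link MTU used in the example (Ethernet minus 8 bytes of PPPoE overhead) is just one plausible value behind X.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { HDRS = 40 };  /* IPv4 + TCP headers, no options (assumption) */

/*
 * What the middle box does to the MSS option in A's outgoing SYN:
 * lower it so that a full segment fits in the link MTU.  In the broken
 * case this is NOT applied to B's SYN/ACK travelling the other way.
 */
static uint16_t box_clamp_mss(uint16_t mss_opt, uint16_t link_mtu)
{
    uint16_t limit;

    assert(link_mtu > HDRS);
    limit = (uint16_t)(link_mtu - HDRS);
    return mss_opt < limit ? mss_opt : limit;
}

/*
 * Whether a segment with the given payload length fits through the
 * link.  A segment that does not fit is dropped silently, with no ICMP
 * message sent back to the sender.
 */
static bool box_forwards(uint16_t payload_len, uint16_t link_mtu)
{
    return (uint32_t)payload_len + HDRS <= link_mtu;
}
```

With a 1492-byte link, box_clamp_mss(1460, 1492) gives X = 1452, so B's segments to A pass through.  A still believes B accepts 1460 bytes, and box_forwards(1460, 1492) is false, so A's full-size segments never arrive.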

We have some customers experiencing exactly this problem with their clients who are behind such broken middle boxes.  While their clients can connect to our customers' servers, the clients cannot do any transactions, as data sent to those servers is dropped by those middle boxes.  One workaround is to lower the MTU size on our customers' servers.  Then the calculated TCP MSS option value will also be smaller.  This is not optimal, as not all of their clients are behind such broken middle boxes.  For the clients that are not, this workaround reduces network performance to our customers' servers because they can no longer send full-size segments.

We introduced the tcp_use_smss_as_mss_opt parameter to work around this problem.  In the above case, B's sending MSS is derived from the MSS option value in A's SYN, which the modem/router has already changed to X.  So if the tunable is set to 1, Solaris TCP (as B) will use X as the TCP MSS option value in its SYN/ACK.  Then A will only send segments as large as X to B.  Clients that are not behind such broken middle boxes can still send full-size segments.
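The difference between the two workarounds can be sketched from the server's side.  This is a simplified illustration with hypothetical function names: it assumes 40 bytes of headers, a 1500-byte server MTU, and it ignores the timestamp and IPsec bytes that the real sending MSS calculation takes into account.

```c
#include <assert.h>
#include <stdint.h>

enum { HDRS = 40, ETHERNET_MTU = 1500 };

/*
 * Workaround 1: lower the server's MTU.  Every client is advertised
 * the same smaller MSS, whether or not it sits behind a broken box.
 */
static uint16_t synack_mss_lowered_mtu(uint16_t lowered_mtu)
{
    assert(lowered_mtu > HDRS);
    return (uint16_t)(lowered_mtu - HDRS);
}

/*
 * Workaround 2: keep the normal MTU and set the tunable.  The SYN/ACK
 * then carries the sending MSS, which follows the MSS option value
 * seen in this particular client's SYN.
 */
static uint16_t synack_mss_tunable(uint16_t syn_mss_opt, int use_smss)
{
    uint16_t recv_mss = ETHERNET_MTU - HDRS;
    uint16_t smss = syn_mss_opt < recv_mss ? syn_mss_opt : recv_mss;

    return use_smss ? smss : recv_mss;
}
```

With the tunable set, a client whose SYN arrives with an MSS of 1452 is answered with 1452, while a client whose SYN carries 1460 is still answered with 1460.  Lowering the MTU to 1492 instead would answer both with 1452.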

If there were no such broken middle boxes, there would really be no need for the tcp_use_smss_as_mss_opt parameter...  Other ndd(1M) parameters were introduced for similarly unusual circumstances.  They are not secret magic knobs.




(June 14, 2005, 11:19:40 AM HKT)
