Tuesday June 14, 2005
Ramblings from the Mountains: Michael Hunter's Weblog
Socket option processing

As part of Solaris 10 (and my first project at Sun) I upgraded the IPv6 APIs to RFC 3493 (Basic Socket Interface Extensions for IPv6) and RFC 3542 (Advanced Sockets Application Program Interface (API) for IPv6). The main part of this project involved adding or changing available socket options. I thought a walk through socket option processing would provide some insight into the command side of the control plane of the Solaris TCP/IP stack.

Given my interest in IPv6 I'll be tracking a call to

setsockopt(sd, IPPROTO_IPV6, IPV6_HOPLIMIT, &option, sizeof(option));

and I'll point out side trips to where other argument combinations might take you. It's easy to generate a dtrace script which will trace setsockopt(). Tracing our call with such a script produces:

dtrace: pid 100887 exited with status 1
CPU FUNCTION
  0 -> setsockopt
  0 -> getsonode
  0 -> getf
  0 -> set_active_fd
  0 <- set_active_fd
  0 <- getf
  0 <- getsonode
  0 -> mutex_owned
  0 <- mutex_owned
  0 -> sotpi_setsockopt
[...]

Here is setsockopt(). Don't confuse this with the routine which you link against; this is the routine which receives control in the kernel.
int
setsockopt(int sock, int level, int option_name, void *option_value,
    socklen_t option_len, int version)
{
    struct sonode *so;
    intptr_t buffer[2];
    void *optval = NULL;
    int error;

    dprint(1, ("setsockopt(%d, %d, %d, %p, %d)\n",
        sock, level, option_name, option_value, option_len));

    if ((so = getsonode(sock, &error, NULL)) == NULL)
        return (set_errno(error));

    [... input validation and retrieval ...]

    error = SOP_SETSOCKOPT(so, level, option_name, optval,
        (t_uscalar_t)option_len);

    [... error handling and exit code ...]

    return (0);
}
I've compressed the first few lines and removed some code to make the point of our investigation clearer. The first thing this routine does is print some debugging output. With dtrace available, we should see fewer debugging statements of this kind in the future.

By using the source code browser you can look at the various routines and data types being used here. Focus first on 'struct sonode'. If you follow the link to the data type you will find a long comment describing the sonode (it represents the socket) and how it is used. The first functional thing we do is call getsonode(), which uses the fd to retrieve the file structure which holds the vnode and thus leads to the sonode.

I'm going to skip the code which validates and stores the options passed into setsockopt(). Instead we are going to focus on SOP_SETSOCKOPT(). The definition of this macro is:
#define SOP_SETSOCKOPT(so, level, optionname, optval, optlen) \
    ((so)->so_ops->sop_setsockopt((so), (level), (optionname), \
        (optval), (optlen)))
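To see the dispatch pattern behind this macro in isolation, here is a small userland sketch. Every name in it (demo_sock, demo_sock_ops, DEMO_SETSOCKOPT and the two handlers) is made up for illustration and is not the Solaris code; only the shape of the indirect call through a per-socket table of function pointers is meant to match.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for illustration; NOT the Solaris definitions. */
struct demo_sock;

struct demo_sock_ops {
    int (*op_setsockopt)(struct demo_sock *so, int level, int name,
        const void *optval, size_t optlen);
};

struct demo_sock {
    const struct demo_sock_ops *ops;   /* jump table for this socket type */
    int value;                         /* state changed by the handler */
};

/* Same shape as SOP_SETSOCKOPT(): an indirect call through the table. */
#define DEMO_SETSOCKOPT(so, level, name, optval, optlen) \
    ((so)->ops->op_setsockopt((so), (level), (name), (optval), (optlen)))

/* First handler: accepts exactly one int-sized value and stores it. */
static int
demo_a_setsockopt(struct demo_sock *so, int level, int name,
    const void *optval, size_t optlen)
{
    (void) level;
    (void) name;
    if (optval == NULL || optlen != sizeof (int))
        return (EINVAL);
    memcpy(&so->value, optval, sizeof (int));
    return (0);
}

/* Second handler: a socket type that supports no options at all. */
static int
demo_b_setsockopt(struct demo_sock *so, int level, int name,
    const void *optval, size_t optlen)
{
    (void) so;
    (void) level;
    (void) name;
    (void) optval;
    (void) optlen;
    return (ENOPROTOOPT);
}

/* One table per socket type; the caller never names a handler directly. */
const struct demo_sock_ops demo_a_ops = { demo_a_setsockopt };
const struct demo_sock_ops demo_b_ops = { demo_b_setsockopt };
```

The caller only ever writes DEMO_SETSOCKOPT(&sock, ...); which handler runs depends entirely on the table the socket was created with. That is why reading the call site alone does not tell you where control goes next.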
What the SOP_SETSOCKOPT() macro tells us is that we are using the jump table stored on the sonode to call the socket-specific setsockopt function. Jump tables can make debugging fun, but dtrace makes figuring out which function is ultimately being called a little easier (at least in specific cases). If we look back at our dtrace output we find the obviously named culprit, sotpi_setsockopt().

Before looking at that we should look around a little to see where it is used. Obviously it's called via the jump table, but how does it get into that jump table? A search turns it up in socktpi.c and ncafs.c. At this point we want to know if this is the only function that processes socket options. A search on sonodeopts_t (the jump table type) turns up another table in socksctp.c. Question answered. In this case we can continue on down the common setsockopt() path and remember that there is another path followed by SCTP and NCA that we will have to investigate some other time.

So back to sotpi_setsockopt(). This function is an interesting mash of history and the standards-driven need to demultiplex option processing. If we look at the structure of this code our eye is first drawn to the two large switch statements. The first of these handles (in some cases partially) socket or TCP level options in common cases. The second fixes up SOL_SOCKET level options which fail. It's the bit of code in between those two statements which packages up the socket option and its related data and sends them to the transport-specific routines:
    optmgmt_req.PRIM_type = T_SVR4_OPTMGMT_REQ;
    optmgmt_req.MGMT_flags = T_NEGOTIATE;
    optmgmt_req.OPT_length = (t_scalar_t)sizeof (oh) + optlen;
    optmgmt_req.OPT_offset = (t_scalar_t)sizeof (optmgmt_req);

    oh.level = level;
    oh.name = option_name;
    oh.len = optlen;

    mp = soallocproto3(&optmgmt_req, sizeof (optmgmt_req),
        &oh, sizeof (oh), optval, optlen, 0, _ALLOC_SLEEP);
    /* Let option management work in the presence of data flow control */
    error = kstrputmsg(SOTOV(so), mp, NULL, 0, 0,
        MSG_BAND|MSG_HOLDSIG|MSG_IGNERROR|MSG_IGNFLOW, 0);
    mp = NULL;
    mutex_enter(&so->so_lock);
    if (error) {
        eprintsoline(so, error);
        goto done;
    }
    error = sowaitprim(so, T_SVR4_OPTMGMT_REQ, T_OPTMGMT_ACK,
        (t_uscalar_t)sizeof (struct T_optmgmt_ack), &mp, 0);
    if (error) {
        eprintsoline(so, error);
        goto done;
    }
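The packaging step in this excerpt boils down to laying a request header, an option header, and the option value back to back in one buffer, with the request header recording where the option header starts and how long the option part is. Here is a minimal userland sketch of that idea. The struct layouts, the placeholder constants, and the names demo_pack3() and demo_build_optreq() are assumptions made for illustration; they are not the TPI structures or the real soallocproto3().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layouts for illustration; NOT the real TPI structures. */
struct demo_req {
    int prim_type;
    int flags;
    int opt_length;   /* bytes of option header plus option value */
    int opt_offset;   /* offset of the option header in the buffer */
};

struct demo_opthdr {
    int level;
    int name;
    int len;          /* bytes of option value following this header */
};

/*
 * Copy three pieces back to back into one freshly allocated buffer.
 * Returns NULL if allocation fails; the caller frees the result.
 */
static unsigned char *
demo_pack3(const void *p1, size_t l1, const void *p2, size_t l2,
    const void *p3, size_t l3, size_t *total)
{
    unsigned char *buf = malloc(l1 + l2 + l3);

    if (buf == NULL)
        return (NULL);
    memcpy(buf, p1, l1);
    memcpy(buf + l1, p2, l2);
    if (l3 != 0)
        memcpy(buf + l1 + l2, p3, l3);
    *total = l1 + l2 + l3;
    return (buf);
}

/* Fill in both headers the way the excerpt does, then pack everything. */
unsigned char *
demo_build_optreq(int level, int name, const void *optval, int optlen,
    size_t *total)
{
    struct demo_req req;
    struct demo_opthdr oh;

    req.prim_type = 1;   /* placeholder, not a real TPI primitive number */
    req.flags = 2;       /* placeholder, not a real management flag */
    req.opt_length = (int)sizeof (oh) + optlen;
    req.opt_offset = (int)sizeof (req);

    oh.level = level;
    oh.name = name;
    oh.len = optlen;

    return (demo_pack3(&req, sizeof (req), &oh, sizeof (oh),
        optval, (size_t)optlen, total));
}
```

A receiver of such a buffer can find the option header at opt_offset and the value right after it, without knowing anything about the caller's variables. That self-describing layout is what lets the message travel down the stream as an opaque block.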
The code before soallocproto3() marshals the option and its data into a structure. soallocproto3() does as we would guess: it allocates some space and copies in the data passed. SOTOV() returns the vnode associated with a sonode. kstrputmsg() puts the resulting message onto the stream associated with the socket. We will ignore for a moment what happens with that message. sowaitprim() waits for a response to that message. This is a blocking operation (as one would expect given the blocking semantics of setsockopt()).

You could figure out which transport-specific routine is handling this socket option in various ways, but given we already have some dtrace output I would tend to use that. A ways down the dtrace output you will find:

[...]
  0 <- soallocproto3
  0 -> kstrputmsg
  0 -> strput
  0 -> mutex_owned
  0 <- mutex_owned
  0 -> canputnext
  0 <- canputnext
  0 -> strmakedata
  0 <- strmakedata
  0 -> mblk_setcred
  0 -> crhold
  0 <- crhold
  0 <- mblk_setcred
  0 -> putnext
  0 -> mutex_owned
  0 <- mutex_owned
  0 -> mutex_owned
  0 <- mutex_owned
  0 -> icmp_wput
  0 <- icmp_wput
  0 -> icmp_wput_other
  0 -> snmpcom_req
  0 -> mi_offset_param
  0 <- mi_offset_param
  0 <- snmpcom_req
  0 <- icmp_wput_other
  0 -> snmpcom_req
  0 -> mi_offset_param
  0 <- mi_offset_param
  0 <- snmpcom_req
  0 <- icmp_wput_other
  0 -> svr4_optcom_req
[...]

So it appears that after a fair bit of work we end up at icmp_wput(). We have M_PROTO data so we fall into the default case of the first switch and call icmp_wput_other() (also confirmed by our trace). In icmp_wput_other() we fall into the first case and need to remember the type of the request we sent down. If we look back to where we built it in sotpi_setsockopt() (code listing above) we notice it's a T_SVR4_OPTMGMT_REQ. At this point we call svr4_optcom_req(). We skip over the first block, which is looking at M_CTLs. We have a T_NEGOTIATE (see earlier code block) so we skip past the parameter-checking switch statement and the following T_DEFAULT case.
Then we enter a for loop checking parameters. The interesting thing to look at is opt_chk_lookup(). What the heck is this doing?
/* Find the option in the opt_arr. */
if ((optd = opt_chk_lookup(opt->level, opt->name,
opt_arr, opt_arr_cnt)) == NULL) {
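One way to picture what a lookup like this does is a linear search over a table of option descriptors keyed on (level, name). The sketch below is only that picture. The struct demo_optdesc, the enum values, and demo_opt_lookup() are invented for illustration; the real opdes_t carries more attributes, and the real opt_chk_lookup() may differ in detail.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical descriptor; the real opdes_t has more fields than this. */
struct demo_optdesc {
    int level;
    int name;
    size_t size;      /* expected size of the option value in bytes */
};

/* Invented level and option numbers, meaningful only in this sketch. */
enum {
    DEMO_LEVEL_IPV6 = 41,
    DEMO_OPT_HOPLIMIT = 1,
    DEMO_OPT_CHECKSUM = 2
};

/* One table per transport; each row describes one supported option. */
const struct demo_optdesc demo_opt_arr[] = {
    { DEMO_LEVEL_IPV6, DEMO_OPT_HOPLIMIT, sizeof (int) },
    { DEMO_LEVEL_IPV6, DEMO_OPT_CHECKSUM, sizeof (int) },
};

#define DEMO_OPT_ARR_CNT (sizeof (demo_opt_arr) / sizeof (demo_opt_arr[0]))

/* Linear search on (level, name); NULL means the option is unknown. */
const struct demo_optdesc *
demo_opt_lookup(int level, int name, const struct demo_optdesc *arr,
    size_t cnt)
{
    size_t i;

    for (i = 0; i < cnt; i++) {
        if (arr[i].level == level && arr[i].name == name)
            return (&arr[i]);
    }
    return (NULL);
}
```

Keeping the per-option facts in a table means the generic option-management code can reject an unknown option, or one with the wrong size, before any transport-specific handler runs.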
If we look back up at the top of svr4_optcom_req() we see

opdes_t *opt_arr = dbobjp->odb_opt_des_arr;

I'm going to jump to the opdes_t array relevant to our current discussion. It's at icmp_opt_arr. It would take some time to discuss the various elements of opdes_t, but suffice it to say that it codifies each socket option and its attributes.

So after checking the options against the information we have, we end up in the switch statement for completing the action. After some setup we call "setfn". We could figure out that this is icmp_opt_set() from our dtrace work, from code examination, or (after some experience) from realizing that is the likely name given its function. At this point we use the information from the original setsockopt() to determine which code block handles IPV6_HOPLIMIT. We end up in the block which looks like:
    case IPV6_HOPLIMIT:
        if (inlen != 0 && inlen != sizeof (int))
            return (EINVAL);
        if (checkonly)
            break;

        if (inlen == 0) {
            ipp->ipp_fields &= ~IPPF_HOPLIMIT;
            ipp->ipp_sticky_ignored |= IPPF_HOPLIMIT;
        } else {
            if (*i1 > 255 || *i1 < -1)
                return (EINVAL);
            if (*i1 == -1)
                ipp->ipp_hoplimit = icmp_ipv6_hoplimit;
            else
                ipp->ipp_hoplimit = *i1;
            ipp->ipp_fields |= IPPF_HOPLIMIT;
        }
        if (sticky) {
            error = icmp_build_hdrs(q, icmp);
            if (error != 0)
                return (error);
        }
        break;
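To make the validation rules in that case block concrete, here is a standalone sketch of just the length and range checks. It is a simplification under stated assumptions: the names demo_pktopts, DEMO_F_HOPLIMIT and demo_set_hoplimit() are invented, the default of 60 is a placeholder rather than the value of icmp_ipv6_hoplimit, and the checkonly, sticky, and header rebuilding parts of the real code are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define DEMO_F_HOPLIMIT        0x1u   /* hypothetical "hop limit set" bit */
#define DEMO_DEFAULT_HOPLIMIT  60     /* placeholder default, an assumption */

/* Hypothetical per-socket option state; not the Solaris structure. */
struct demo_pktopts {
    unsigned int fields;
    int hoplimit;
};

/*
 * Apply a hop limit option value:
 *   length 0           clears the setting
 *   length sizeof(int) with -1 selects the default,
 *                      with 0..255 sets that value
 *   anything else      is rejected with EINVAL and changes nothing
 */
int
demo_set_hoplimit(struct demo_pktopts *po, const void *optval, size_t inlen)
{
    int v;

    if (inlen != 0 && inlen != sizeof (int))
        return (EINVAL);

    if (inlen == 0) {
        po->fields &= ~DEMO_F_HOPLIMIT;
        return (0);
    }

    memcpy(&v, optval, sizeof (v));
    if (v > 255 || v < -1)
        return (EINVAL);

    po->hoplimit = (v == -1) ? DEMO_DEFAULT_HOPLIMIT : v;
    po->fields |= DEMO_F_HOPLIMIT;
    return (0);
}
```

All checks run before any state is modified, so a rejected value leaves the previous hop limit in place, matching the order of tests in the kernel excerpt.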
It wasn't my intention to talk too much about IPV6_HOPLIMIT specifically. The complexity of such a simple thing is created by the API specification, which allows one to set this via IPV6_PKTINFO, via IPV6_HOPLIMIT, or by default. I've run out of time to talk about the rest of the processing, to walk you back up the chain, or to explain what happens for options which are aimed at a layer above where they are handled. But the basic routines and data structures have been introduced. If you need to fix bugs in this area or add functionality I hope this helps you get started.

Posted Jun 14 2005, 07:59:58 AM PDT