Synchronicity
Random thoughts from a random engineer

20090202 Monday February 02, 2009
Flows observability

Crossbow provides the ability to create flows and assign properties such as bandwidth limit and priority to them. The flowadm(1m) manpage has examples and details on how to manage flows.

Once the desired flows are created, you can verify that they are working correctly by running the following command to display per-flow statistics:

flowadm show-flow -s

The output should look something like:

FLOW        IPACKETS  RBYTES      IERRORS OPACKETS  OBYTES      OERRORS
icmpflow    4         392         0       4         392         0
tcpflow     83892     122305400   0       10506     568128      0
udpflow     0         0           0       0         0           0

The above does not tell you which processes are consuming or generating the packets. To answer this question, I came up with the following DTrace script:

#!/usr/sbin/dtrace -qs

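/*
 * Receive side: on entry to mac_rx_srs_process() for a soft ring set whose
 * srs_type has bit 0x02 set (presumably the flow SRS type), flag this
 * thread and remember the flow entry the SRS belongs to.
 */
mac_rx_srs_process:entry
/(((mac_soft_ring_set_t *)args[1])->srs_type & 0x02) != 0/
{
	self->s = 1;
	self->rf = (flow_entry_t *)
	    ((mac_soft_ring_set_t *)args[1])->srs_flent;
}

/*
 * A wakeup issued while inside mac_rx_srs_process() is taken to mean that
 * the woken process (args[1]) consumes this flow's inbound packets.
 */
sched:::wakeup
/self->s/
{
	@inbound[args[1]->pr_fname, self->rf->fe_flow_name] = count();
}

/* Leaving receive processing: stop attributing wakeups to a flow. */
mac_rx_srs_process:return
/self->s/
{
	self->s = 0;
}

/*
 * Transmit side: for lookups with flags == 0x02 (presumably outbound),
 * save the pointer through which mac_flow_lookup() returns the flow entry.
 */
mac_flow_lookup:entry
/args[2] == 0x02/
{
	self->tfp = (flow_entry_t **)args[3];
}

/*
 * On a successful lookup (return value 0) that matched a flow with type
 * bit 0x10 set (presumably a user-created flow), charge the packet to the
 * current process.
 */
mac_flow_lookup:return
/args[1] == 0 && self->tfp != NULL &&
 ((*self->tfp)->fe_type & 0x10) != 0/
{
	@outbound[execname, (*self->tfp)->fe_flow_name] = count();
	self->tfp = NULL;
}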

END
{
	printf("Inbound:\n");
	printa("%-20s %-20s\n", @inbound);
	printf("\nOutbound:\n");
	printa("%-20s %-20s\n", @outbound);
}

To use the script, start some traffic-generating processes and leave the script running for a while, then Ctrl-C the script. The output should look like:

Inbound:
sched                tcpflow
in.routed            icmpflow
ping                 icmpflow
netserver            tcpflow
iperf                tcpflow
Outbound:
iperf                tcpflow
sched                icmpflow
ping                 icmpflow
netserver            tcpflow
sched                tcpflow

The script above attempts to generate a mapping between processes and flows. Due to the asynchrony of the network stack, this cannot always be done reliably. The presence of the 'sched' process indicates that part of the processing was handled anonymously by a kernel thread (e.g. an interrupt or worker thread).

We will have finer-grained observability features in the next phase of Crossbow, and hopefully there will be simpler and more reliable ways of obtaining similar information.

 


Feb 02 2009, 12:33:13 AM PST Permalink

20060509 Tuesday May 09, 2006
Network Stack Simulator
I've had a fascination with virtual machines/emulators for many years. Ages ago, when I took my first OS class, I had the fortunate opportunity of playing with and building various pieces of an OS emulator. That class piqued my interest in operating systems and greatly influenced my career direction. Around the spring of 2000, I had to come up with a project idea for one of my networking classes. I wanted to attempt something significant that time. I decided on implementing a network stack simulator.

The goal of the project was to simulate communication between multiple virtual kernels, which could reside within the same physical machine or be spread across the network. These virtual kernels were not full-fledged virtual machines: they did not simulate hardware, and they did not have the usual OS functionality (scheduling, VM, etc.). All they had was a network stack. The virtual kernels were passive entities that had to be driven by commands (in the form of system calls) sent to them from apps running on the host OS. These host OS apps were linked to a special library that would direct certain sockets API syscalls to a virtual kernel instead of the underlying OS. This library communicated with a virtual kernel via a pseudo-RPC mechanism, which allowed an app linked to the library to, for example, drive a virtual kernel running on a different physical machine. The network stack was based on the 4.4BSD codebase. This choice was mainly due to the availability of documentation (TCP/IP Illustrated Vol. 2) and the simplicity of the codebase relative to other modern OSes.

A number of technical obstacles had to be overcome in order to realize the above architecture, not the least of which was determining what exactly could be borrowed from 4.4BSD and what had to be reimplemented in userland. This was a painstaking process that involved studying every file related to the network stack and identifying its dependencies on the rest of the 4.4BSD kernel. In the end, it turned out that the bulk of the code could be reused without modification. The crucial pieces that had to be reimplemented were:

Dynamic memory allocation - BSD made use of a data structure called the mbuf for holding packet data. There was a wide variety of macros for manipulating mbufs; some assumed the mbuf was aligned in a certain way, and others assumed it was allocated from a large contiguous region carved into page-sized chunks. These semantics were all emulated, albeit non-trivially, using vanilla malloc().

Interrupt priority manipulation - The functions splxxx() (where xxx was a priority level) were used throughout the 4.4BSD kernel for providing mutual exclusion to data structures. The idea was that, by raising the interrupt priority level temporarily, a thread processing certain data structures could prevent other entities that share the same data structures from preempting it. This was emulated using a global mutex and a counter.

Timers - 4.4BSD did have its own implementation of callouts, but I chose not to use it. Instead, a dedicated thread invoked a number of hardcoded callbacks at periodic intervals. This was sufficient for 4.4BSD's network stack, but it would likely not work for other OSes due to their heavier dependence on timers.

NIC driver - Only one driver was ported to userland, and it was stripped of all of its hardware-dependent code. The driver was made to send/receive Ethernet frames to/from a configurable IP multicast group. The use of multicast obviated the need for implementing a separate mechanism for steering packets between virtual kernels.
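
To make the interrupt priority item above more concrete, here is a minimal sketch of how the spl functions could be emulated with one global mutex and a nesting counter. It is not the original code: the names spl_raise(), spl_restore() and spl_level() are hypothetical, and keeping the counter per thread is my assumption about how nested raises would be tracked.

```c
#include <assert.h>
#include <pthread.h>

/* One big lock stands in for "interrupts at this level are blocked". */
static pthread_mutex_t spl_lock = PTHREAD_MUTEX_INITIALIZER;

/* Nesting depth of the calling thread (assumption: tracked per thread). */
static _Thread_local int spl_depth;

/* Hypothetical stand-in for splnet()/splimp(): returns the previous level. */
static int
spl_raise(void)
{
	int prev = spl_depth;

	if (spl_depth == 0)
		pthread_mutex_lock(&spl_lock);	/* outermost raise takes the lock */
	spl_depth++;
	return (prev);
}

/* Hypothetical stand-in for splx(): restore a level from spl_raise(). */
static void
spl_restore(int prev)
{
	assert(prev >= 0 && prev < spl_depth);
	spl_depth = prev;
	if (spl_depth == 0)
		pthread_mutex_unlock(&spl_lock);	/* fully lowered: release */
}

/* Hypothetical accessor, handy for checks and debugging. */
static int
spl_level(void)
{
	return (spl_depth);
}
```

With something like this in place, stack code written in the usual "s = splnet(); ... splx(s);" style could be mapped onto the two functions with a handful of macros, and nested raises inside an already protected region would not deadlock.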

The above pieces and the portable BSD stack code comprised about 80% of the virtual kernel implementation. The other 20% consisted of the syscall layer and the logic for processing syscalls received from remote clients. For the syscall layer, the bulk of the work was related to isolating the syscall code and severing its dependencies on process-context related routines and data structures. The processing of remote syscalls was handled by a dedicated thread called the proxy thread. When a remote app attached to a virtual kernel, a proxy thread and its context would be created. The attach occurred at the first invocation of a sockets syscall and was transparent to the app. The proxy thread's operation could be described as a loop consisting of the tasks: receive syscall message, decode message, invoke syscall with decoded arguments, send syscall result. When the remote app terminated, the associated connection teardown would cause the virtual kernel to implicitly invoke the exit syscall, which would clean up the proxy thread and any data structures (file descriptors, sockets, etc.) created during its lifetime.
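
The proxy thread's loop described above can be sketched roughly as follows. Everything here is made up for illustration: the message layout (struct sc_request, struct sc_reply), the opcodes and the toy handlers vk_socket() and vk_close() are hypothetical, and the real implementation dispatched into the 4.4BSD syscall code instead.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical wire format for one remote syscall and its result. */
struct sc_request { uint32_t op; int32_t arg[3]; };
struct sc_reply { int32_t ret; int32_t err; };

enum { SC_SOCKET, SC_CLOSE, SC_MAX };

/* Toy handlers standing in for the real syscall entry points. */
static int32_t next_fd = 3;
static int32_t vk_socket(const int32_t *a) { (void)a; return (next_fd++); }
static int32_t vk_close(const int32_t *a)
{
	return ((a[0] >= 3 && a[0] < next_fd) ? 0 : -1);
}

typedef int32_t (*sc_handler)(const int32_t *);
static const sc_handler handlers[SC_MAX] = {
	[SC_SOCKET] = vk_socket,
	[SC_CLOSE] = vk_close,
};

/* Read exactly len bytes; returns -1 on EOF or error. */
static int
read_full(int fd, void *buf, size_t len)
{
	char *p = buf;

	while (len > 0) {
		ssize_t n = read(fd, p, len);
		if (n <= 0)
			return (-1);
		p += n;
		len -= (size_t)n;
	}
	return (0);
}

/* Write exactly len bytes; returns -1 on error. */
static int
write_full(int fd, const void *buf, size_t len)
{
	const char *p = buf;

	while (len > 0) {
		ssize_t n = write(fd, p, len);
		if (n <= 0)
			return (-1);
		p += n;
		len -= (size_t)n;
	}
	return (0);
}

/* Serve one client until its connection goes away; returns calls served. */
static int
proxy_loop(int in_fd, int out_fd)
{
	struct sc_request req;
	struct sc_reply rep;
	int served = 0;

	assert(in_fd >= 0 && out_fd >= 0);
	while (read_full(in_fd, &req, sizeof (req)) == 0) {	/* receive */
		if (req.op < SC_MAX) {				/* decode */
			rep.ret = handlers[req.op](req.arg);	/* invoke */
			rep.err = 0;
		} else {
			rep.ret = -1;
			rep.err = ENOSYS;
		}
		if (write_full(out_fd, &rep, sizeof (rep)) != 0)	/* reply */
			break;
		served++;
	}
	/* Connection gone: the implicit exit cleanup would run here. */
	return (served);
}
```

In this sketch the loop ends when the client's side of the connection closes, which is the point where the implicit exit syscall would release the proxy thread's file descriptors and sockets.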

The implementation contained about 30,000 lines of code from 4.4BSD, most of which was unmodified or only slightly modified. I wrote around 3,000 lines myself to tie all the pieces together. This work took about one month to complete, and it was one of the most rewarding projects I've worked on.

May 09 2006, 07:37:17 PM PDT Permalink Comments [1]

20060503 Wednesday May 03, 2006
Introduction
I have been at Sun for about 5.5 years now. I currently belong to the kernel networking group within Solaris. My specialty is in the systems aspects of networking, particularly performance and virtualization. Prior to joining Solaris I was part of the Clustering group, where I designed and implemented APIs and drivers for high-speed interconnects.

In my next few entries, I'll talk more in depth about the various interesting projects I've been involved in.

May 03 2006, 07:35:24 PM PDT Permalink Comments [3]