
Wednesday February 04, 2009
Ben wrote a pretty nice blog on Crossbow
Ben's blog on Crossbow - great overview
Ben wrote a great
blog on Crossbow. Thanks Ben. It gives a good overview of the features.
If you want more detail on the internals, you can read
more details on architecture, or you can build a Virtual Wire - a Network in a Box, which is
explained with an example
here.
(2009-02-04 23:46:55.0)

Sunday December 14, 2008
Crossbow - Network Virtualization Architecture Comes to Life
December 5th, 2008 was a joyous occasion and a humbling one at the
same time. A vision that was created four years earlier was coming to life.
I still remember the summer of 2004 when Sinyaw
threw a challenge at me: can you change the world? And it was in the
fall of the same year that I unveiled the first set of Crossbow slides to
him and Fred Zlotnik over a bottle of wine. After a lot of planning we were
finally ready to start, but there were still hurdles in the way. We were still
trying to finish
Nemo aka GLDv3, a high performance device driver framework which
was absolutely required for Crossbow (we needed absolute control over
the hardware). Nemo finished in mid 2005, but then Nicolas, Yuzo and others left
Sun and went to a startup. Thiru was still trying to finish Yosemite
(the FireEngine follow-on). So in short, 2005 was basically more
planning and prototyping (especially controlling the Rx rings and
dynamic polling) on my part. I think it was early 2006 when work
began on Crossbow in earnest. Kais moved over from the security group,
Nicolas was back at Sun, and Thiru, Eric Cheng, Mike Lim (and of course me)
came together to form the core team (which later expanded to 20+ people
in early 2008). So it was a long-standing dream
and almost three years of hard work that finally came to life when Crossbow Phase
1 integrated in Nevada Build 105 (and will be available in the
OpenSolaris 6.09 release).
Crossbow - H/W Virtualized Lanes that Scale (10gigE over multiple cores)
One of the key tenets of the Crossbow design is the concept of H/W Virtualization
Lanes: essentially tying together a NIC receive and transmit ring, DMA channel,
kernel threads, kernel queues and processing CPUs. There are
no shared locks, counters or anything else. Each lane gets to individually
schedule its packet processing by switching its Rx ring independently
between interrupt mode and poll mode (Dynamic Polling). Now
you can see why Nemo was so
important: without it, the stack couldn't control the H/W, and
the NIC vendors wouldn't have played along with us in
adding the features we wanted (stateless classification, Rx/Tx rings,
etc). Once the lanes are created, we can program the classifier to spread
packets between them based on IP addresses and ports for scaling
reasons. With multiple cores and multiple threads seemingly
the way of life going forward and 10+ gigE of bandwidth (soon we will
have IPoIB working as well), scaling really matters (and we are not
talking about achieving line rates on 10 gigE with jumbo grams - we
are talking about the real world: a mix of small and large packets, tens of thousands of
connections and thousands of threads).
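To make the spreading idea concrete, here is a minimal Python sketch of mapping flows to lanes by hashing addresses and ports. It is only an illustration: the names `pick_lane` and `spread` are hypothetical, the CRC32 hash is an assumption chosen for simplicity, and a real NIC classifier does this in hardware with its own hash function.

```python
import zlib
from collections import Counter

NUM_LANES = 8  # assumption: one lane per Rx ring, as in the 8-ring example


def pick_lane(src_ip, src_port, dst_ip, dst_port, num_lanes=NUM_LANES):
    """Map a flow to a lane index; the same flow always lands on the same lane."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key) % num_lanes


def spread(flows, num_lanes=NUM_LANES):
    """Return how many of the given flows land on each lane."""
    counts = Counter(pick_lane(*flow, num_lanes) for flow in flows)
    return [counts.get(lane, 0) for lane in range(num_lanes)]


if __name__ == "__main__":
    # Hypothetical client flows all talking to one web server port.
    flows = [("10.0.0.%d" % (i % 250 + 1), 1024 + i, "192.168.1.1", 80)
             for i in range(10000)]
    print(spread(flows))
```

Because the lane is a pure function of the flow's addresses and ports, all packets of one connection stay on one lane (and so on one CPU), while different connections are free to land on different lanes.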
To demonstrate the point, I captured a bunch of statistics while
putting the final touches on the data path and getting ready to beat
some world records. The table below shows mpstat output along with
packets per second serviced for the Intel Oplin (10gigE) NIC on a
Niagara2 based system. The NIC has all 8 Rx/Tx rings enabled and has 8
interrupts enabled (one for each Rx ring).
CPU minf mjf xcal  intr  ithr  csw icsw migr smtx srw syscl usr sys wt idl
38     0   0    6    21     3   31    1    5   12   0    86   0   0  0  99
39     0   0 2563  5506  3907 3282   28   34 1170   0   178   0  21  0  78
40     0   0 2553  5117  3948 2410   38  150 1192   0   504   1  21  0  77
41     0   0 2651  5221  4232 2011   25   53 1195   0   210   0  20  0  80
42     0   0 3078  5700  4743 2069   21   28 1285   0   125   0  22  0  78
43     0   0 3280  5837  4777 2118   19   24 1328   0   101   0  22  0  78
44     0   0 3143 19566 18801 1773   50   44 1285   0    68   0  65  0  35
45     0   0 4570  7748  6838 1984   23   27 1697   0   118   0  29  0  71
# netstat -ia 1
input  e1000g  output                input  (Total) output
packets  errs  packets  errs  colls  packets  errs  packets  errs  colls
4        0     1        0     0      61284    0     128820   0     0
3        0     2        0     0      61015    0     129316   0     0
4        0     2        0     0      60878    0     128922   0     0
This
link shows the interrupt binding, mpstat and intrstat output. You
can see that the NIC is trying very hard to spread the load, but
because the stack sees this as one NIC, there is one CPU (number 44)
where all 8 threads collide. It's like an 8-lane highway becoming a
single lane during rush hour.
Now let's look at what happens when Crossbow enables a lane all the way up
the stack for each Rx ring and also enables dynamic polling for each
individually. If you look at the corresponding mpstat and intrstat
output and the packets per second rate, you will see that the lanes
really do work independently of each other, resulting in almost
linear spreading and a much higher rate of packets per second serviced. The
benchmark represents a webserver workload and, needless to say,
Crossbow with dynamic polling on a per Rx ring basis almost tripled the
performance. The raw stats can be seen here.
CPU minf mjf xcal  intr  ithr  csw icsw migr smtx srw syscl usr sys wt idl
37     0   0 2507 11906 10272 4267  265  326  489   0   776   4  28  0  68
38     0   0 2111 11793  9840 6503  336  314  472   0   615   3  32  0  65
39     0   0  500 10409 10164  565    7  125  174   0  1413   6  23  0  70
40     0   1  660 10423  9982  950   23  288  272   0  3834   8  34  0  58
41     0   1  658 10490 10108  847   16  238  237   0  2549   8  29  0  64
42     0   0  584 10605 10299  708   12  181  207   0  1828   7  26  0  67
43     0   0  732 10829 10559  598    9  141  193   0  1485   7  25  0  68
44     0   1  306   487    25 1091   17  282  330   0  4083   9  17  0  74
# netstat -ia 1
input  e1000g  output                input  (Total) output
packets  errs  packets  errs  colls  packets  errs  packets  errs  colls
2        0     1        0     0      267619   0     522226   0     0
2        0     2        0     0      275395   0     539920   0     0
2        0     2        0     0      251023   0     482335   0     0
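For readers who want to put a number on the difference, the short Python sketch below averages the three one-second netstat samples from each run (the "Total" columns) and compares them. Note the assumption: this compares raw packet rates from these few samples only, which is not the same thing as the webserver benchmark score mentioned above, and the helper name `speedup` is just for illustration.

```python
from statistics import mean

# "Total" input/output packets per second from the two netstat listings.
before_in = [61284, 61015, 60878]
before_out = [128820, 129316, 128922]
after_in = [267619, 275395, 251023]
after_out = [522226, 539920, 482335]


def speedup(before, after):
    """Ratio of the average packet rate after vs. before."""
    return mean(after) / mean(before)


in_ratio = speedup(before_in, after_in)
out_ratio = speedup(before_out, after_out)
print(f"input packet rate:  {in_ratio:.2f}x")
print(f"output packet rate: {out_ratio:.2f}x")
```

On these samples the packet rates come out at roughly four times the single-lane run in both directions.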
And finally, below we print some statistics from the MAC per Rx ring data
structure (mac_soft_ring_set_t). For each Rx ring, we track the number
of packets received via the interrupt path, the number received via the poll path,
and the number of chains of fewer than 10 packets, between 10 and 50, and over 50 (each
time we polled the Rx ring). You can see that the polling path brings in
a larger share of the packets, and in bigger chains.

Keep in mind that for most OSes and most NICs, the interrupt path
brings in one packet at a time. This makes the Crossbow architecture more
efficient for scaling as well as for performance at higher loads on high
B/W NICs.
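The kind of per-ring bookkeeping described above can be sketched in a few lines of Python. This is only an illustration of the bucketing: the class `RingStats` and its field names are hypothetical and are not the actual members of mac_soft_ring_set_t.

```python
from dataclasses import dataclass


@dataclass
class RingStats:
    """Hypothetical per-Rx-ring counters (not the real kernel structure)."""
    intr_packets: int = 0      # packets delivered via the interrupt path
    poll_packets: int = 0      # packets picked up via the poll path
    chains_under_10: int = 0   # polls that returned fewer than 10 packets
    chains_10_to_50: int = 0   # polls that returned 10 to 50 packets
    chains_over_50: int = 0    # polls that returned more than 50 packets

    def record_interrupt(self, npackets=1):
        self.intr_packets += npackets

    def record_poll(self, chain_len):
        if chain_len <= 0:
            return  # an empty poll contributes nothing to these counters
        self.poll_packets += chain_len
        if chain_len < 10:
            self.chains_under_10 += 1
        elif chain_len <= 50:
            self.chains_10_to_50 += 1
        else:
            self.chains_over_50 += 1

    def poll_share(self):
        """Fraction of all packets that arrived via polling."""
        total = self.intr_packets + self.poll_packets
        return self.poll_packets / total if total else 0.0
```

With counters like these, a ring under load would show most packets in `poll_packets` and most chains in the larger buckets, which is the pattern the statistics above are meant to show.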
Crossbow and Network Virtualization
Once we have the ability to create these independent H/W lanes,
programming the NIC classifier is easy. Instead of spreading the
incoming traffic for scaling, we program the classifier to send
packets for a MAC address to an individual lane. The MAC addresses are
tied to individual Virtual NICs (VNICs), which are in turn attached to
guest Virtual Machines or Solaris Containers (Zones). The separation
for each virtual machine is driven by the H/W, and its traffic is processed on the
CPUs attached to the virtual machine (the poll thread and interrupts
for the Rx ring for a VNIC are bound to the assigned CPUs). The
picture kind of looks like this

Since we always do dynamic polling for NICs and VNICs, enforcing a bandwidth
limit is pretty easy. One can create a VNIC by simply specifying the
B/W limit, priority and CPU list in one shot, and the poll thread will
enforce the limit by picking up only as many packets as the limit allows. Something
as simple as
freya(67)% dladm create-vnic -l e1000g0 -p maxbw=100,cpus=2 my_guest_vm
The above command will create a VNIC called my_guest_vm with a random MAC
address and assign it a B/W of 100Mbps. All the processing for this VNIC
is tied to CPU 2. It's features like this that make Crossbow an integral part
of the Sun Cloud Computing initiative due to roll out soon.
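As a rough illustration of how a poll thread can enforce such a limit, here is a minimal Python sketch. It assumes the limit is turned into a byte budget per fixed tick (1 ms here) and that the ring is modelled as a queue of packet sizes; the names `bytes_per_tick` and `poll_ring` are hypothetical and the real kernel logic is certainly more involved.

```python
from collections import deque


def bytes_per_tick(maxbw_mbps, tick_ms=1.0):
    """Byte budget for one tick at the given limit in megabits per second."""
    return int(maxbw_mbps * 1_000_000 / 8 * tick_ms / 1000)


def poll_ring(ring, budget):
    """Pull packets (given as sizes in bytes) while they fit in the budget.

    Packets that do not fit stay queued in the ring for a later tick.
    """
    chain = []
    while ring and ring[0] <= budget:
        size = ring.popleft()
        budget -= size
        chain.append(size)
    return chain


if __name__ == "__main__":
    ring = deque([1500] * 10)      # ten full-size packets waiting in the ring
    budget = bytes_per_tick(100)   # 100 Mbps, as in maxbw=100
    chain = poll_ring(ring, budget)
    print(len(chain), "packets picked up,", len(ring), "left in the ring")
```

The point of the design is that excess packets are never pulled into the system in the first place; they simply wait in the Rx ring until the next tick has budget for them.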
Anyway, this should give you a flavour. There is a white paper and more detailed
documents (including how to get started) at the
Crossbow
OpenSolaris page.
(2008-12-14 16:07:10.0)

Thursday August 24, 2006
CrossBow: Solaris Network Virtualization & Resource
Control
1. CrossBow (the name):
It makes some sense to explain the relationship between the technology
(Network Virtualization and Resource Control) and the project name
(CrossBow). It is believed that the crossbow was invented in 341 B.C. in
China, but its use became prevalent in the Middle Ages, especially once steel
was used to make the weapon. The more powerful crossbows could penetrate
armour at 200 yards and gave the typical horse-mounted knight
real nightmares. But the biggest differentiator was the simplicity of
their use. A crossbow could be used effectively after a week of
training, while a comparable single-shot skill with a longbow could
take years of practice.
Similarly, if you take a look at the existing QOS mechanisms on an end
host, they are very difficult to use and normally take a very skilled
administrator to use effectively. Even then, the existing QOS
mechanisms come with heavy performance penalties, which is also pretty
common with any kind of virtualization. In Solaris land, we
have invented a new way of imposing bandwidth resource control as an
attribute of a real or a virtual NIC, such that it is built in as part
of the Solaris network stack and comes without any performance
penalties. Since the virtualization and resource control
aspects are just attributes of the NIC/VNIC (specified when a NIC or
Virtual NIC is created), a normal user can configure them without
needing a doctorate in QOS or virtualization. "CrossBow" was the most
suitable name for this project since we are trying to achieve similar
results in the field of virtualization and resource control as the
weapon did on the battlefield in medieval times.
2. CrossBow (the background):
Crossbow provides the building blocks for network virtualization and
resource control by creating virtual stacks around any service (HTTP,
HTTPS, FTP, NFS, etc.), protocol (TCP, UDP, SCTP, etc.), or virtual
machine such as Containers, Xen and ldoms.
The project allows the system administrator to carve any physical
NIC into multiple virtual NICs which are pretty similar to real NICs
and are administered just like real NICs. Each virtual NIC can be
assigned its own priority and bandwidth on a shared NIC without
causing any performance degradation. The virtual NICs can have their
own NIC hardware resources (Rx/Tx rings, DMA channels), MAC addresses,
kernel threads and queues which are private to the VNIC and are not
shared across all traffic. In the case of Solaris Containers, the
Container can be assigned a virtual stack instance as well, along with
one or more virtual NICs. As such, traffic for one VNIC can be
totally isolated from other traffic and assigned any kind of limits or
guarantees on the amount of bandwidth it can use.
3. Overview:
Project Crossbow extends the reach of Solaris in several markets.
3a. OS/Network/Server Consolidation:
This covers the application, network and server consolidation environments where
both OS and network virtualization play a big role. This market is
typically driven by the cost of owning and managing physical machines
and physical networks. The sweet spot for these horizontally scaled
environments is the 2-4 socket machines, which appear as 4-8 CPU
machines in the case of x86/x64 systems and 32-64 CPU machines in the case of
Sun's new Niagara based servers. From a total cost of ownership
perspective, these blades have only one physical NIC (1Gb or 10Gb) but
are trying to run multiple virtual machines (Xen, Containers, ldoms)
which have to share the NIC resources and the available bandwidth.
The problem gets worse because for three decades we have been designing
applications to go as fast as possible, leaving any congestion control as
the job of the transport layer (if at all). So if one virtual machine
is using UDP based traffic, then other virtual machines on the same
system using TCP traffic will suffer badly. Even within the same transport
(TCP for instance), bulk throughput applications like ftp, http, etc. will
have a very negative impact on interactive traffic and latency
sensitive applications.
The goal of project Crossbow is to let different virtual machines
share the common NIC in a fair manner and allow system administrators
to set preferential policies where necessary (e.g. an ISP selling
limited B/W on a common pipe) without any performance impact.
3b. Traditional QOS and application consolidation:
Existing host based QOS mechanisms are very complex to set up and
typically come with a sizable performance penalty and an increase in
latency. A big part of the problem is the interrupt based delivery
mechanism for inbound packets and the QOS being implemented by a
separate layer (typically between the NIC driver and IP). The network and
transport layers of the host stack are unaware of the QOS layer. The
packets have already been delivered to host memory by means of
interrupts, and the QOS layer needs to classify the packets into various
queues before it can apply the policies. If a packet cannot be
processed because the bandwidth usage for its class is exceeded, it
sits in a queue while still consuming system memory.
Project Crossbow integrates stack virtualization and QOS as part of
the stack architecture itself to offer a large subset of QOS type
functionality at zero performance penalty and with simple administrative
interfaces. It also integrates diffserv with the stack, where a virtual
NIC can set and read the diffserv based labels. Since the Crossbow
architecture is limited to differentiating traffic based on layer
2, 3, and 4 headers only (i.e. the VLAN tag, local MAC address, local
IP address, protocol, and ports), the functionality offered is a subset
of the existing QOS mechanisms, although it covers 90% of the use cases
without any performance penalty. This is the prime reason why project
Crossbow refers to the bandwidth related policies as 'Bandwidth
resource control' instead of QOS.
3c. Horizontally scaled markets:
This is the market segment made up of low priced volume servers
(typically 2-4 socket machines) offering services which require
little or no sharing of data between them. The small servers can be
standalone machines in a rack or blades in a chassis. Grids are another
way to use volume servers to achieve the output of traditional
large SMP machines or mainframes.
In the case of blades which share a common 10Gb NIC to the chassis,
Crossbow again provides fair sharing of bandwidth. In
addition, the APIs Crossbow provides for network management,
virtualization and bandwidth resource control can be used by 3rd party
management software to propagate a common policy throughout the
server farm or all the blades in the chassis. In a Solaris based
homogeneous environment, it's very easy to mark an application or a
virtual machine (based on port or IP address) as critical and
propagate the same policy through all the machines. The diffserv
labels can be added appropriately such that the policy is honoured by
all machines and network elements in the center.
4. Technical problems in existing architectures:
As mentioned earlier, host based QOS systems work as a separate layer
between the NIC driver and the network stack, and as such are pretty inefficient in
providing the QOS services required of them. But that is not all.
The existing interrupt driven packet delivery model precludes any kind
of policy enforcement and fair sharing. When a NIC interrupt is raised,
it is at the highest priority and the CPU has to switch away from whatever
it is processing to deal with the interrupt. Most of the time, the
processing of a critical packet is interrupted to deal with the
arrival of a non-critical packet.
The anonymous packet processing in the kernel is another major problem
in virtualizing the stack and enforcing any kind of bandwidth resource
control (including fairness). 80% of the work is already done for an
incoming packet when the stack determines that no one is actually
interested in the packet and it needs to be dropped. In other words, the
cost of dropping unwanted packets is too high.
Everything in the host flows through common queues and is processed by
common threads, which makes enforcing policies based on traffic type
very difficult. Receiving or transmitting each packet impacts the processing of every
other packet on that particular CPU.
In most virtualized environments, the pseudo NIC in the virtual
machine has no way of knowing about the capabilities of the
real hardware (even simple things like hardware checksum) because of
the presence of the bridge in between, which ends up hurting
performance. In addition, there is no mechanism to share the
NIC in a fair manner. The transition of a typical packet from dom0
to domU also causes severe performance problems.
5. CrossBow Architecture:
The Crossbow architecture starts out by integrating network
virtualization and resource control as part of the stack
architecture. The Solaris 10 network stack has already been designed
for the next decade: connection to CPU affinity is maintained
and the upper stack has tight control over the NIC resources.
Crossbow builds on top of that by pushing the classification of
packets based on services, protocols or virtual machines as far down
as possible. If the NIC hardware itself has the ability to divide onboard
memory into segments/queues (known as Rx and Tx rings), which
preferably have their own DMA channels and MSI-X interrupts, the stack
programs the NIC classifier to classify packets based on configured
policies to different Rx rings. Each Rx/Tx ring is owned by a CPU and
a separate kernel queue known as a serialization queue (squeue), which controls the
rate of packet arrival into the system based on the configured bandwidth.
The Rx/Tx ring, the associated DMA channel, MSI-X interrupt, the
serialization queue, the CPU, and the processing threads are all unique
to the service, protocol or virtual machine in question and can be
assigned a unique MAC address and a Virtual NIC, which becomes the
administrative entity that can be administered like a normal NIC. The
NIC classifier drives the incoming packets to the correct Rx ring, from
where the squeue owning the Rx ring (and VNIC) will pull the packets
in polling mode based on fair sharing of resources or configured
bandwidth. The interrupt mode is used only when the squeue has no
packets to process and the Rx ring is empty. Each individual Rx ring
is dynamically switched between interrupt and polling mode. Incoming
packets that exceed the configured bandwidth limit remain in the NIC
itself, in their corresponding Rx ring, and are pulled into the system
only when they are ready to be processed.
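The switching between interrupt and polling mode can be pictured with a small Python model. This is a sketch under simplifying assumptions, not the Solaris code: the classes `RxRing` and `Squeue` and their methods are hypothetical, and the model ignores CPUs, threads and bandwidth limits.

```python
from collections import deque


class RxRing:
    """Toy model of one Rx ring that is in either interrupt or poll mode."""

    def __init__(self):
        self.packets = deque()
        self.mode = "interrupt"

    def nic_receive(self, pkt):
        """NIC queues a packet; returns True if an interrupt would be raised."""
        self.packets.append(pkt)
        return self.mode == "interrupt"


class Squeue:
    """Toy model of the queue that owns the ring and pulls packets from it."""

    def __init__(self, ring):
        self.ring = ring
        self.processed = []

    def on_interrupt(self):
        # There is work to do: stop taking interrupts and switch to polling.
        self.ring.mode = "poll"

    def poll(self, max_packets):
        """Pull a chain of up to max_packets from the ring."""
        chain = []
        while self.ring.packets and len(chain) < max_packets:
            chain.append(self.ring.packets.popleft())
        self.processed.extend(chain)
        if not self.ring.packets:
            # Ring drained and nothing left to process: back to interrupts.
            self.ring.mode = "interrupt"
        return chain
```

In this model, packets arriving while the ring is in poll mode raise no interrupt and simply wait in the ring until the squeue asks for them.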
The creation of an administrative entity (VNIC) is optional and
typically associated with a virtual machine like Solaris Containers,
Xen or ldoms. For application or protocol based resource control, a
separate data path is created to provide the isolation and resource
control but a VNIC is not configured.
As mentioned above, the VNIC is just an administrative entity. If the
classification has already been done by the NIC to a particular Rx
ring, the packets are delivered directly to the IP layer by means of
function calls when the Rx ring is in interrupt mode, or the squeue residing
in the IP layer pulls the packet chain directly from the Rx ring when in
polling mode. In essence, the entire data link layer is bypassed,
resulting in improved performance and lower latencies. If the VNIC is
placed in promiscuous mode, the data link bypass is abandoned and the
Rx ring delivers packets via the VNIC layer, which creates a copy of
the packet for the promiscuous stream. Similarly, in polling mode, the
squeue's poll entry point is changed to point at the VNIC, which in turn
pulls the packets from the Rx ring, makes a copy and then gives the chain
to the squeue poll thread.
The entire layered architecture is built on function pointers known as
'upcall_func' and 'downcall_func' with corresponding 'upcall_arg' and
'downcall_arg' for context. Every layer provides a pointer to its receive
function as 'upcall_func' and a context cookie as 'upcall_arg' to the layer
below. Similarly, every layer provides a pointer to its transmit
function as 'downcall_func' and a context cookie as 'downcall_arg' to the
layer above. This is how the packet path is constructed. Any layer can
short circuit itself out by providing the 'upcall_func' and
'upcall_arg' of the layer above to the layer below (and the same for the transmit
side if needed). All context cookies for a layer are reference
counted: each layer pointing to a cookie holds a reference, which ensures
that data structures don't get freed until all references are dropped.
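The upcall chaining and the short circuit can be sketched in Python, with plain callables standing in for C function pointers. Only the names 'upcall_func' and 'upcall_arg' come from the text; `Layer`, `layer_recv`, `connect` and `bypass` are hypothetical helpers, and the reference counting of cookies is deliberately left out.

```python
class Layer:
    """One layer of the receive path, holding the upcall of the layer above."""

    def __init__(self, name, log):
        self.name = name
        self.log = log            # shared list recording which layers ran
        self.upcall_func = None   # receive function of the layer above
        self.upcall_arg = None    # context cookie to pass to that function

    def recv(self, pkt):
        self.log.append(self.name)
        if self.upcall_func is not None:
            self.upcall_func(self.upcall_arg, pkt)


def layer_recv(cookie, pkt):
    """Receive entry point; the cookie identifies the receiving layer."""
    cookie.recv(pkt)


def connect(lower, upper):
    """The upper layer hands its receive function and cookie to the lower one."""
    lower.upcall_func = layer_recv
    lower.upcall_arg = upper


def bypass(lower, middle):
    """The middle layer removes itself by passing on the upcall it was given."""
    lower.upcall_func = middle.upcall_func
    lower.upcall_arg = middle.upcall_arg


if __name__ == "__main__":
    log = []
    ring, dls, ip = Layer("rx_ring", log), Layer("dls", log), Layer("ip", log)
    connect(ring, dls)
    connect(dls, ip)
    ring.recv("pkt")
    print(log)      # path through all three layers
    log.clear()
    bypass(ring, dls)
    ring.recv("pkt")
    print(log)      # dls no longer on the path
```

The lower layer never needs to know whether it is calling the layer directly above it or one further up; it just calls whatever function and cookie it was handed.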
In case the NIC hardware does not have classification capability
(unlikely, since most Intel, Broadcom and Sun 1Gb NICs and pretty
much all 10Gb NICs shipping for the past several years have this
capability) or has run out of classification resources, the
architecture provides a classification capability in the MAC layer and
employs soft rings, which are similar in functionality to the NIC hardware
classifier and Rx rings. The NIC hardware layer coupled with the lower MAC
layer and soft rings is termed the 'Pseudo Hardware layer'. A request
by the administrator to create a new VNIC or flow will always return
successfully from the pseudo hardware layer. The pseudo hardware layer
manages the hardware and software classification capability and the Rx
rings and soft rings transparently to the upper layers.
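The "always succeeds" behaviour can be illustrated with a short Python sketch: hand out hardware Rx rings while any are free, then fall back to soft rings. The class `PseudoHardwareLayer` and its methods are hypothetical names for illustration, and the allocation policy shown (first free ring wins) is an assumption.

```python
class PseudoHardwareLayer:
    """Toy allocator: hardware Rx rings first, soft rings once they run out."""

    def __init__(self, hw_rings):
        self.free_hw_rings = list(range(hw_rings))
        self.next_soft_ring = 0
        self.assignments = {}

    def create_flow(self, name):
        """Always returns a ring; the caller need not care which kind it got."""
        if self.free_hw_rings:
            ring = ("hw", self.free_hw_rings.pop(0))
        else:
            ring = ("soft", self.next_soft_ring)
            self.next_soft_ring += 1
        self.assignments[name] = ring
        return ring


if __name__ == "__main__":
    layer = PseudoHardwareLayer(hw_rings=2)
    for name in ("vnic1", "vnic2", "vnic3"):
        print(name, layer.create_flow(name))
```

The upper layers see the same interface either way, which is what lets a request for a new VNIC or flow succeed even on a NIC with no free hardware rings.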
6. Crossbow layers, data structures and packet flow:
It's easier to illustrate this with 2 flows. The first one is for
IP_addr = a.b.c.d && TCP and it goes through the normal path via Upper dls
etc. This is under the assumption that either snoop (or someone else
in DLS) is interested in this flow and we can't bypass data link
processing. The squeue poll function in this case is dls_poll_ring and the
argument is dls_impl_t.
The 2nd flow is for IP_addr = m.n.o.p && port = 80 && TCP,
which is unique and no one is interested in snooping it. In this case,
the dls layer allows itself to be bypassed by setting the upcall_func
and upcall_arg for the soft_ring/Rx_rings to call directly into IP.
The squeue is directly polling the H/W Rx ring in this case.

7. The administrative model:
Crossbow introduces a new command called 'netrcm' and further augments
'dladm', which was introduced as part of the new high performance
device driver framework (GLDv3) in Solaris 10.
'dladm (1M)' - This is primarily used to create, modify and destroy
VNICs based on MAC or IP addresses. The created VNIC is visible to and
managed by ifconfig just like any other NIC and can get its IP address
assigned via DHCP if necessary.
The examples below can illustrate this better:
Example 1: Configuring VNICs
To create two VNIC interfaces with vnic-ids 1 and 2
over a single physical device bge0, enter the following commands:
# dladm create-vnic -d bge0 1
# dladm create-vnic -d bge0 2
The new links will be called vnic1 and vnic2.
Example 2: Configuring VNICs and allocating bandwidth & priority
To create two VNIC interfaces with vnic-ids 1 and 2
over a single physical device bge0, making vnic1 a higher
priority VNIC using a factory assigned MAC address with a guarantee
of up to 90% of the bandwidth, and vnic2 a lower priority VNIC
with a random MAC address and a hard limit of 100Mbps:
# dladm create-vnic -d bge0 -m factory -b 90% -G -p high 1
# dladm create-vnic -d bge0 -m random -b 100M -L -p low 2
Example 3: Configure a VNIC by choosing a factory MAC address
To create a VNIC interface with vnic-id 1 by first
listing the available factory MAC addresses and then using one
of them:
# dladm show-dev -d bge0 -m
bge0
link: up speed: 1000 Mbps duplex: full
MAC addresses:
slot-ident Address In Use
1 0:e0:81:27:d4:47 Yes
2 8:0:20:fe:4e:a5 No
# dladm create-vnic -d bge0 -m factory -n 2 1
# dladm show-dev -d bge0 -m
bge0
link: up speed: 1000 Mbps duplex: full
MAC addresses:
slot-ident Address In Use
1 0:e0:81:27:d4:47 Yes
2 8:0:20:fe:4e:a5 Yes
Example 4: Configuring VNICs sharing a MAC address
To create two VNICs with vnic-ids 1 and 2 by first listing the
available factory assigned MAC addresses and then picking one
that will be shared by the newly created VNICs:
# dladm show-dev -d bge0 -m
bge0
link: up speed: 1000 Mbps duplex: full
MAC addresses:
slot-ident Address In Use
1 0:e0:81:27:d4:47 Yes
2 8:0:20:fe:4e:a5 No
# dladm create-vnic -d bge0 -m shared -n 2 1
# dladm create-vnic -d bge0 -m shared -n 2 2
Example 5: Creating a VNIC with user specified MAC address
To create a VNIC with vnic-id 1 by providing a user specified
MAC address:
# dladm create-vnic -d bge0 -m 8:0:20:fe:4e:b8 1
'netrcm (1M)' - This command is primarily used to provide isolation
and private resources to an application's traffic or a protocol. In
addition, we can also configure bandwidth limits and guarantees for
the flows. Again, some examples illustrate the usage better:
Example 1: Create a policy around mission critical port 443 traffic,
which is the https service.
To create a policy around inbound https traffic on an https server
so that https gets its own dedicated NIC hardware and kernel TCP/IP
resources. The policy-id specified is https-1, which is used
later to modify or delete the policy.
# netrcm add-policy -d bge0 -H transport = TCP local port = 443 https-1
Example 2: Modify an existing policy to add bandwidth resource control
To modify the https-1 policy to add bandwidth control and give it a
high priority:
# netrcm modify-policy -d bge0 -b 90% -G -p high https-1
Example 3: Limit the bandwidth usage of the UDP protocol
To create a policy for the UDP protocol so that it cannot consume more
than 10% of the available bandwidth. The policy-id is limit-udp-1.
# netrcm add-policy -d bge0 -b 10% -L -p low -H transport = UDP limit-udp-1
8. Crossbow Observability - Stats, history and APIs:
Apart from the functionality related to network virtualization and
bandwidth resource control, Crossbow offers a whole range of new
tools and mechanisms to understand bandwidth usage. Administrators
can see real time bandwidth usage for various VNICs or configured
flows (via 'netrcm') without incurring any performance penalty.
The Rx rings and squeues dealing with a particular flow keep track of
normal stats, which are pulled by a userland daemon from time to
time. The daemon also logs the information in special log files, which
allows users to see history at any given time. A user can request
usage for a time period in the past to understand the system behaviour.
Crossbow will provide more tools to help capacity planning by allowing
the system to be put in a capacity planning mode where bandwidth
usage for the top traffic is monitored and displayed.
All the observability and administrative interfaces can be accessed through
APIs, which allow other applications to use and manage the system.
9. Resources:
Crossbow project page on OpenSolaris is a good source of information
http://www.opensolaris.org/os/project/crossbow
The Crossbow mailing list is where all the day to day business for the
project is conducted. Anyone can join the mailing list
crossbow-discuss@opensolaris.org.
Crossbow slide presentation can be found here
Crossbow Team members are:
* Kais Belgaied
* Stephanie Brucker
* Eric Cheng
* Nicolas Droux
* Markus Flierl
* Carol Gayo
* Mohan Iyer
* Darrin Johnson
* Michael Lim
* Rajagopal Kunhappan
* Erik Nordmark
* Ethan Solomita
* Thirumalai Srinivasan
* Sunay Tripathi
* Nicky Veitch
* Bill Watson
* Roamer Lu
Email: first.last@sun.com
(2006-08-24 02:26:02.0)

Sunday April 02, 2006
Project Crossbow: Network Virtualization and Resource Control going live
Project Crossbow - going live on OpenSolaris
Hello and Welcome to project Crossbow!! We are going to
add Network Virtualization and Resource Control to Solaris without degrading
performance.
At this time, we are seeking members from the OpenSolaris community to become
part of the Crossbow i-team. It's the charter of the i-team to gather requirements
and deliver the project including design, docs and testing. We would love
to have members of the community get involved from day one. The participation
opportunities include (but are not limited to):
- helping define the project
- gathering requirements
- designing the project
- writing code
- creating demos
- doing talks and evangelizing the project
Please send an email to me if you are interested. We can promise you
that this will be a thrilling adventure and you will be living on the
bleeding edge of technology! Project Crossbow is brought to you by the
same people who created project FireEngine (new stack architecture),
project Nemo (GLDv3 - new high performance device driver framework),
project Yosemite (UDP performance), etc. to name a few.
Apart from active participation, you can also participate via the
mailing lists and discussion groups where we will be posting various
documents for review and comments apart from day to day discussion.
The project Crossbow page is visible here
You can sign up for the discussion group here
(2006-04-02 20:41:15.0)

Wednesday December 07, 2005
Nemo based e1000g on T2000
Derek Morr points out that the T2000
uses e1000g controllers, which are still DLPI based, so they wouldn't
(yet) get the advantages of Nemo (GLDv3). Very good observation. The
T1000 already uses a Broadcom chip which comes up as bge, which is fully
Nemo based. The T2000 indeed uses a DLPI
based driver in the current Solaris 10 update. Without going
into the why (it's not very pretty), the Nevada and OpenSolaris version
of e1000g is already Nemo based (BTW, the DLPI driver comes up as ipge
on the T2000, which tells you that it's not Nemo based). The Nemo based
patches for e1000g (for S10) should be available soon if not available
already. Pretty soon
the machine will ship with the patches already installed and future
updates will obviously have the Nemo version.
(2005-12-07 21:09:58.0)

Monday November 14, 2005
Solaris Networking - The Magic Revealed (Part I)
Many of you have asked for details on Solaris 10 networking. The
great news is that I have finished writing the treatise on the subject, which
will become a new section in the Solaris Internals book
by Jim Mauro and Richard McDougall.
In the meanwhile, I have used some excerpts to create a mini book (part
I and II) for the Networking
community on OpenSolaris. The Part II containing the new High Performance GLDv3 based
device driver framework, tuning guide for Solaris 10, etc is below. Enjoy! As usual, comments
(good or bad) are welcome.
Solaris Networking - The Magic Revealed
(Part I)
- Background
- Solaris 10 stack
- Overview
- Vertical perimeter
- IP classifier
- Synchronization mechanism
- TCP
- Socket
- Bind
- Connect
- Listen
- Accept
- Close
- Data path
- TCP Loopback
- UDP
- UDP packet drop within the stack
- UDP Module
- UDP and Socket interaction
- Synchronous STREAMS
- STREAMs fallback
- IP
- Plumbing NICs
- IP Network MultiPathing (IPMP)
- Multicast
- Solaris 10 Device Driver framework
- GLDv2 and Monolithic DLPI drivers (Solaris 9 and before)
- GLDv3 - A New Architecture
- GLDv3 Link aggregation architecture
- Checksum offload
- Tuning for performance:
- Future
- Acknowledgments
1 Background
The networking stack of Solaris 1.x was a BSD variant and was pretty
similar to the BSD Reno implementation. The BSD stack worked fine for
low end machines, but Solaris wanted to satisfy the needs of low end
customers as well as enterprise customers and as such migrated to the AT&T
SVR4 architecture, which became Solaris 2.x.
With Solaris 2.x, the networking stack went through a makeover and
transitioned from a BSD style stack to a STREAMS based stack. The STREAMS
framework provided an easy message passing interface which allowed
one STREAMS module to interact with another STREAMS module.
Using the STREAMS inner and outer perimeters, the module writer could
provide mutual exclusion without making the
implementation complex. The cost of setting up a STREAM was high, but
the number of connection setups per second was not an important criterion
because connections were usually long lived (NFS, ftp, etc.), so the
cost of setting up a new stream was amortized
over the life of the connection.
During the late 90s, servers became heavily SMP, running a large
number of CPUs. The cost of switching processing from one CPU to
another became high as the mid to high end machines became more NUMA
centric. Since STREAMS by design did not have any CPU affinity, packets
for a particular connection moved around between different CPUs. It was
apparent that Solaris needed to move away from the STREAMS architecture.
The late 90s also saw the explosion of the web, and the increase in processing power
meant a large number of short lived connections, making connection setup
time equally important. With Solaris 10, the networking stack went
through one more transition where the core pieces (i.e. socket layer,
TCP, UDP, IP, and device driver) use an IP classifier and
serialization queue to improve the connection setup time, scalability,
and packet processing cost. STREAMS are still used to provide the
flexibility that ISVs need to implement additional functionality.
2 Solaris 10 stack
Let's have a look at the new framework and its key components.
Overview
The pre Solaris 10 stack uses STREAMS perimeters and kernel adaptive
mutexes for multi-threading. TCP uses a STREAMS QPAIR perimeter, UDP
uses a STREAMS QPAIR with PUTSHARED, and IP a PERMOD perimeter with
PUTSHARED, with various TCP, UDP, and IP global data structures protected
by mutexes. The stack is executed by user-land threads executing
various system calls, by the network device driver read-side interrupt or
device driver worker thread, and by STREAMS framework worker threads.
The current perimeter is per module, per protocol stack layer,
i.e. a horizontal perimeter. This can, and often does, lead to a packet
being processed on more than one CPU and by more than one thread,
leading to excessive context switching and poor CPU data locality. The
problem is compounded by the various places a packet can get
queued under load and the various threads that finally process the packet.
The "FireEngine" approach is to merge all protocol layers into one
STREAMs module which is fully multi threaded. Inside the merged module,
instead of using per data structure locks, use a per CPU
synchronization mechanism called "vertical perimeter". The "vertical
perimeter" is implemented using a serialization queue abstraction
called "squeue". Each squeue is bound to a CPU and each connection is
in turn bound to a squeue which provides any synchronization and mutual
exclusion needed for the connection specific data structures.
The connection (or context) lookup for inbound packets is done outside
the perimeter, using an IP connection classifier, as soon as the packet
reaches IP. Based on the classification, the connection structure is
identified. Since the lookup happens outside the perimeter, we can bind
a connection to an instance of the vertical perimeter or "squeue" when
the connection is initialized and process all packets for that
connection on the squeue it is bound to, maintaining better cache
locality. More details about the vertical perimeter and classifier are
given in later sections. The classifier also becomes the database for
storing a sequence of function calls necessary for all inbound and
outbound packets. This allows us to change the Solaris networking stack
from the current message passing interface to a BSD style function call
interface. The string of functions created on the fly (event-list) for
processing a packet for a connection is the basis for an eventual new
framework in which other modules and 3rd party high performance modules
can participate.
Vertical perimeter
Squeue guarantees that only a single thread can process a given
connection at any given time thus serializing access to the TCP
connection structure by multiple threads (both from read and write
side) in the merged TCP/IP module. It is similar to the STREAMS QPAIR
perimeter but instead of just protecting a module instance, it protects
the whole connection state from IP to sockfs.
The vertical perimeter or squeue by itself just provides packet
serialization and mutual exclusion for the data structures, but by
creating a per CPU perimeter and binding a connection to the instance
attached to the CPU processing interrupts, we can guarantee much better
data locality.
We could have chosen between creating a per connection perimeter or a
per CPU perimeter, i.e. an instance per connection or per CPU. The
overheads involved with a per connection perimeter and the resulting thread
contention give lower performance, which made us choose a per CPU
instance. For a per CPU instance, we had the choice of queuing the
connection structure for processing or instead just queuing the packet
itself and storing the connection structure pointer in the packet.
The former approach leads to some interesting starvation scenarios
where packets for a connection keep arriving, and the overheads of
preventing such a situation lowered performance. Queuing the
packets themselves allows us to preserve the ordering, is much simpler,
and is thus the approach we have taken for FireEngine.
As mentioned before, each connection instance is assigned to a single
squeue and is thus only processed within the vertical perimeter. As a
squeue is processed by a single thread at a time, all data structures
used to process a given connection from within the perimeter can be
accessed without additional locking. This improves the CPU and
thread context data locality of access to the connection meta
data, the packet meta data, and the packet payload data. In addition,
this will allow the removal of per device driver worker thread schemes,
which are problematic in solving a system wide resource issue, and allow
additional strategic algorithms to be implemented to best handle a
given network interface based on the throughput of the network interface
and the system throughput (e.g. fanning out per connection packet
processing to a group of CPUs). A thread entering the squeue may either
process the packet right away or queue it for later processing by
another thread or the worker thread. The choice depends on the squeue entry
point and on the state of the squeue. Immediate processing is only
possible when no other thread has entered the same squeue. The squeue
is represented by the following abstraction:
typedef struct squeue_s {
    int_t       sq_flag;     /* Flags tells squeue status */
    kmutex_t    sq_lock;     /* Lock to protect the flag etc */
    mblk_t      *sq_first;   /* First Packet */
    mblk_t      *sq_last;    /* Last Packet */
    thread_t    sq_worker;   /* the worker thread for squeue */
} squeue_t;
It's important to note that the squeues are created on the basis of per
H/W execution pipeline, i.e. cores, hyper threads, etc. The stack
processing of the serialization queue (and the H/W execution pipeline)
is limited to one thread at a time, but this actually improves
performance because the new stack ensures that there are no waits for
any resources such as memory or locks inside the vertical perimeter, and
allowing more than one kernel thread to time share the H/W execution
pipeline has more overhead than allowing only one thread to run
uninterrupted.
- Queuing Model - The queue is strictly FIFO (first in first out)
for both read and write side, which ensures that no particular
connection suffers or is starved. A read side or a write side
thread enqueues its packet at the end of the chain. It can then be allowed
to process the packet or signal the worker thread based on the
processing model below.
- Processing Model - After enqueueing its packet, if another thread
is already processing the squeue, the enqueuing thread returns and the
packet is drained later based on the drain model. If the squeue is not
being processed and there are no packets queued, the thread can mark
the squeue as being processed (represented by 'sq_flag') and process
the packet. Once it completes processing the packet, it removes the
'processing in progress' flag and makes the squeue free for future
processing.
- Drain Model - A thread, which was successfully able to process
its own packet, can also drain any packets that were enqueued while it
was processing the request. In addition, if the squeue is not being
processed but there are packets already queued, then instead of queuing
its packet and leaving, the thread can drain the queue and then process
its own packets.
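The three models above can be illustrated with a small user-space sketch. It is only an approximation under stated assumptions: it uses a pthread mutex instead of a kernel mutex, the names sketch_squeue_t, pkt_t, trace_t and sketch_squeue_enter() are hypothetical and are not the Solaris definitions, and it always drains the whole queue rather than applying a time bound.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; not the Solaris kernel definitions. */
typedef struct pkt {
    struct pkt *next;
    int id;
} pkt_t;

typedef void (*pkt_handler_t)(pkt_t *pkt, void *arg);

typedef struct {
    pthread_mutex_t lock;   /* protects every field below */
    int processing;         /* plays the role of the "in progress" flag */
    pkt_t *first;           /* head of the FIFO */
    pkt_t *last;            /* tail of the FIFO */
} sketch_squeue_t;

/* Example handler that records the order in which packets are processed. */
typedef struct {
    int ids[8];
    int count;
} trace_t;

void trace_handler(pkt_t *pkt, void *arg)
{
    trace_t *t = arg;
    if (t->count < 8)
        t->ids[t->count++] = pkt->id;
}

void sketch_squeue_init(sketch_squeue_t *sq)
{
    pthread_mutex_init(&sq->lock, NULL);
    sq->processing = 0;
    sq->first = sq->last = NULL;
}

/* Queuing model: always append at the tail so ordering stays FIFO. */
static void enqueue_locked(sketch_squeue_t *sq, pkt_t *pkt)
{
    pkt->next = NULL;
    if (sq->last != NULL)
        sq->last->next = pkt;
    else
        sq->first = pkt;
    sq->last = pkt;
}

/*
 * Returns 1 if the caller processed (and drained) the queue itself,
 * 0 if another thread owns the squeue and the packet was only queued.
 */
int sketch_squeue_enter(sketch_squeue_t *sq, pkt_t *pkt,
    pkt_handler_t handler, void *arg)
{
    pthread_mutex_lock(&sq->lock);
    enqueue_locked(sq, pkt);
    if (sq->processing) {
        /* Processing model: the current owner will drain our packet. */
        pthread_mutex_unlock(&sq->lock);
        return 0;
    }
    sq->processing = 1;
    /* Drain model: handle everything queued, oldest first. */
    while (sq->first != NULL) {
        pkt_t *cur = sq->first;
        sq->first = cur->next;
        if (sq->first == NULL)
            sq->last = NULL;
        pthread_mutex_unlock(&sq->lock);    /* lock is not held in the handler */
        handler(cur, arg);
        pthread_mutex_lock(&sq->lock);
    }
    sq->processing = 0;
    pthread_mutex_unlock(&sq->lock);
    return 1;
}
```

The lock is dropped while the handler runs, so other threads can still enqueue; the 'processing' flag, not the lock, is what keeps the handler single threaded.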
The worker thread is always allowed to drain the entire queue. Choosing
the correct drain model is quite complicated. The choices are:
- "always queue",
- "process your own packet if you can",
- "time bounded process and drain".
These options can be independently applied to the read thread and the
write thread.
Typically, the draining by an interrupt thread should always be
time bounded "process and drain", while the write thread can choose
between "process your own" and time bounded "process and drain". For
Solaris 10, the write thread behavior is a tunable with the default being
"process your own", while the read side is fixed to "time bounded
process and drain".
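A time bounded drain can be sketched roughly as follows. This is an illustration only: work_t, drain_time_bounded() and the injected clock callback are hypothetical names, the budget is expressed in milliseconds purely for the example, and the "processing" is reduced to setting a flag.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical work item; a real squeue chains mblk_t packets. */
typedef struct work {
    struct work *next;
    int done;               /* set once the item has been processed */
} work_t;

/* Caller-supplied clock returning milliseconds (an assumption). */
typedef long (*clock_ms_fn)(void *clock_state);

/*
 * Process items from *head until the list is empty or budget_ms has
 * elapsed. Unprocessed items stay on *head for a later drain, e.g. by
 * the worker thread. Returns the number of items processed.
 */
size_t drain_time_bounded(work_t **head, clock_ms_fn now, void *clock_state,
    long budget_ms)
{
    long deadline = now(clock_state) + budget_ms;
    size_t processed = 0;

    while (*head != NULL) {
        work_t *item = *head;
        *head = item->next;
        item->next = NULL;
        item->done = 1;     /* stand-in for real protocol processing */
        processed++;
        /* At least one item is always handled before the clock check. */
        if (now(clock_state) >= deadline)
            break;
    }
    return processed;
}

/* Deterministic clock for experiments: advances 1 ms per call. */
long fake_clock_ms(void *clock_state)
{
    long *t = clock_state;
    return (*t)++;
}
```

Passing the clock in as a callback keeps the sketch deterministic; a real implementation would read a kernel time source instead.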
The signaling of the worker thread is another option worth exploring. If
the packet arrival rate is low and a thread is forced to queue its
packet, then the worker thread should be allowed to run as soon as the
entering thread finishes processing the squeue, whenever there is work to be
done.
On the other hand, if the packet arrival rate is high, it may be
desirable to delay waking up the worker thread hoping for an interrupt
to arrive shortly after to complete the drain. Waking up the worker
thread immediately when the packet arrival rate is high creates
unnecessary contention between the worker and interrupt threads.
The default for Solaris 10 is delayed wakeup of the worker thread.
Initial experiments on available servers showed that the best results
are obtained by waking up the worker thread after a 10ms delay.
Placing a request on the squeue requires a per-squeue lock to protect
the state of the queue, but this doesn't introduce scalability problems
because the locks are distributed between CPUs and each is only held for a short
period of time. We also utilize optimizations which avoid
context switches while still preserving the single-threaded semantics
of squeue processing. We create an instance of an squeue per CPU in the
system and bind the worker thread to that CPU. Each connection is then
bound to a specific squeue and thus to a specific CPU as well.
The binding of an squeue to a CPU can be changed but binding of a
connection to an squeue never changes because of the squeue protection
semantics. In the merged TCP/IP case, the vertical perimeter protects
the TCP state for each connection. The squeue instance used by each
connection is chosen either at the "open", "bind" or "connect" time for
outbound connections or at "eager connection creation time" for inbound
ones.
The choice of the squeue instance depends on the relative speeds of the
CPUs and the NICs in the system. There are two cases:
- CPU is faster than the NIC: the incoming connections are assigned
to the "squeue instance" of the interrupted CPU. For the outbound case,
connections are assigned to the squeue instance of the CPU the
application is running on.
- NIC is faster than the CPU: a single CPU is not capable of
handling the NIC. The connections are bound in a random manner across all
available squeues.
For Solaris 10, the determination of the NIC being faster or slower than
the CPU is made by the system administrator by tuning the
global variable 'ip_squeue_fanout'. The default is 'no fanout', i.e.
assign the incoming connection to the squeue attached to the
interrupted CPU. For the purposes of taking a CPU offline, the worker
thread bound to this CPU removes its binding and restores it when the
CPU comes back online. This allows the DR functionality to work
correctly. When packets for a connection arrive on multiple NICs
(and thus interrupt multiple CPUs), they are always processed on the
squeue the connection was originally established on. In Solaris 10, the
vertical perimeter is provided only for TCP based connections. The
interface to the vertical perimeter is at the TCP and IP layer, after
determining that it is a TCP connection. Solaris 10 updates will
introduce the general vertical perimeter for any use.
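The squeue selection policy for inbound connections can be sketched as below. The function and type names (sketch_pick_squeue, sketch_conn_bind, sketch_conn_t) are hypothetical, the 'fanout' parameter stands in for the 'ip_squeue_fanout' tunable, and the random number is supplied by the caller; none of this is the actual Solaris code.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Pick the squeue index for a new inbound connection.
 *   fanout          - models 'ip_squeue_fanout' (0 means no fanout)
 *   interrupted_cpu - CPU that took the NIC interrupt
 *   ncpus           - number of per-CPU squeues
 *   rnd             - caller-supplied random number
 */
size_t sketch_pick_squeue(int fanout, size_t interrupted_cpu, size_t ncpus,
    unsigned int rnd)
{
    assert(ncpus > 0 && interrupted_cpu < ncpus);
    if (!fanout)
        return interrupted_cpu;     /* CPU faster than NIC: stay local */
    return rnd % ncpus;             /* NIC faster than CPU: spread out */
}

/* A connection remembers its squeue; the binding is made only once. */
typedef struct {
    int bound;
    size_t squeue;
} sketch_conn_t;

size_t sketch_conn_bind(sketch_conn_t *c, int fanout, size_t interrupted_cpu,
    size_t ncpus, unsigned int rnd)
{
    if (!c->bound) {
        c->squeue = sketch_pick_squeue(fanout, interrupted_cpu, ncpus, rnd);
        c->bound = 1;
    }
    /* Later packets reuse the binding, whichever CPU they interrupt. */
    return c->squeue;
}
```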
The squeue APIs look like:
squeue_t *squeue_create(squeue_t *, uint32_t, processorid_t, void (*)(), void *, clock_t, pri_t);
void squeue_bind(squeue_t *, processorid_t);
void squeue_unbind(squeue_t *);
void squeue_enter(squeue_t *, mblk_t *, void (*)(), void *);
void squeue_fill(squeue_t *, mblk_t *, void (*)(), void *);
squeue_create() instantiates a new squeue, which uses
squeue_bind()/squeue_unbind() to bind itself to or unbind itself from a
particular CPU. Squeues, once created, are never destroyed.
squeue_enter() is used to try and access the squeue, and the thread
entering is allowed to process and drain the squeue based on the models
discussed before. squeue_fill() is used just to queue a packet on the
squeue to be processed by the worker thread or other threads.
IP classifier
The IP connection fanout mechanism consists of 3 hash tables: a 5-tuple
hash table {protocol, remote and local IP addresses, remote and local
ports} to keep fully qualified TCP (ESTABLISHED) connections, a 3-tuple
lookup consisting of protocol, local address and local port to keep the
listeners, and a single-tuple lookup for protocol listeners. As part of
the lookup, a connection structure (a superset of all connection
information) is returned. This connection structure is called 'conn_t'
and is abstracted below.
typedef struct conn_s {
    kmutex_t     conn_lock;    /* Lock for conn_ref */
    uint32_t     conn_ref;     /* Reference counter */
    uint32_t     conn_flags;   /* Flags */

    struct ill_s *conn_ill;    /* The ill packets are coming on */
    struct ire_s *conn_ire;    /* ire cache for outbound packets */
    tcp_t        *conn_tcp;    /* Pointer to tcp struct */
    void         *conn_ulp;    /* Pointer for upper layer */
    edesc_pf     conn_send;    /* Function to call on write side */
    edesc_pf     conn_recv;    /* Function to call on read side */
    squeue_t     *conn_sqp;    /* Squeue for processing */

    /* Address and Ports */
    struct {
        in6_addr_t connua_laddr;   /* Local address */
        in6_addr_t connua_faddr;   /* Remote address */
    } connua_v6addr;
#define conn_src    V4_PART_OF_V6(connua_v6addr.connua_laddr)
#define conn_rem    V4_PART_OF_V6(connua_v6addr.connua_faddr)
#define conn_srcv6  connua_v6addr.connua_laddr
#define conn_remv6  connua_v6addr.connua_faddr
    union {
        /* Used for classifier match performance */
        uint32_t conn_ports2;
        struct {
            in_port_t tcpu_fport;  /* Remote port */
            in_port_t tcpu_lport;  /* Local port */
        } tcpu_ports;
    } u_port;
#define conn_fport  u_port.tcpu_ports.tcpu_fport
#define conn_lport  u_port.tcpu_ports.tcpu_lport
#define conn_ports  u_port.conn_ports2
    uint8_t      conn_protocol;    /* protocol type */
    kcondvar_t   conn_cv;
} conn_t;
The interesting member to note is the pointer to the squeue or vertical
perimeter. The lookup is done outside the perimeter and the packet is
processed/queued on the squeue the connection is attached to. Also,
conn_recv and conn_send point to the read side and write side
functions. The read side function can be 'tcp_input' if the packet is
meant for TCP.
Also, the connection fan-out mechanism has provisions for supporting
wildcard listeners, i.e. INADDR_ANY. Currently, the connected and bind
tables are primarily for TCP and UDP only. A listener entry is made
during a listen() call. The entry is made into the connected table
after the three-way handshake is complete for TCP.
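The lookup across the three tables can be pictured with the following sketch. It makes several assumptions that are not from the text: plain arrays stand in for the hash tables, only IPv4 is handled, the most specific table is consulted first, and all names (sk_conn_t, sk_classifier_t, sk_classify) are hypothetical rather than the ipcl_* functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_INADDR_ANY 0u    /* wildcard local address */

typedef struct {
    uint8_t  proto;
    uint32_t laddr, faddr;  /* local / remote IPv4 address */
    uint16_t lport, fport;  /* local / remote port */
    int      id;            /* hypothetical tag to tell entries apart */
} sk_conn_t;

/* Arrays stand in for the three hash tables. */
typedef struct {
    const sk_conn_t *connected; size_t nconnected;  /* 5-tuple entries */
    const sk_conn_t *bound;     size_t nbound;      /* 3-tuple listeners */
    const sk_conn_t *by_proto;  size_t nproto;      /* protocol listeners */
} sk_classifier_t;

const sk_conn_t *sk_classify(const sk_classifier_t *cl, uint8_t proto,
    uint32_t laddr, uint16_t lport, uint32_t faddr, uint16_t fport)
{
    size_t i;
    const sk_conn_t *wild = NULL;

    /* 1. Fully qualified (established) connections. */
    for (i = 0; i < cl->nconnected; i++) {
        const sk_conn_t *c = &cl->connected[i];
        if (c->proto == proto && c->laddr == laddr && c->lport == lport &&
            c->faddr == faddr && c->fport == fport)
            return c;
    }
    /* 2. Listeners: an exact local address wins over a wildcard bind. */
    for (i = 0; i < cl->nbound; i++) {
        const sk_conn_t *c = &cl->bound[i];
        if (c->proto != proto || c->lport != lport)
            continue;
        if (c->laddr == laddr)
            return c;
        if (c->laddr == SK_INADDR_ANY && wild == NULL)
            wild = c;
    }
    if (wild != NULL)
        return wild;
    /* 3. Protocol-only listeners. */
    for (i = 0; i < cl->nproto; i++) {
        if (cl->by_proto[i].proto == proto)
            return &cl->by_proto[i];
    }
    return NULL;
}
```

In the real classifier the returned conn_t also carries the squeue pointer, so the caller knows where to process or queue the packet.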
The IP classifier APIs look like:
conn_t *ipcl_conn_create(uint32_t type, int sleep);
void ipcl_conn_destroy(conn_t *connp);

int ipcl_proto_insert(conn_t *connp, uint8_t protocol);
int ipcl_proto_insert_v6(conn_t *connp, uint8_t protocol);
conn_t *ipcl_proto_classify(uint8_t protocol);
int *ipcl_bind_insert(conn_t *connp, uint8_t protocol, ipaddr_t src, uint16_t lport);
int *ipcl_bind_insert_v6(conn_t *connp, uint8_t protocol, const in6_addr_t * src, uint16_t lport);
int *ipcl_conn_insert(conn_t *connp, uint8_t protocol, ipaddr_t src, ipaddr_t dst, uint32_t ports);
int *ipcl_conn_insert_v6(conn_t *connp, uint8_t protocol, in6_addr_t *src, in6_addr_t *dst, uint32_t ports);
void ipcl_hash_remove(conn_t *connp);
conn_t *ipcl_classify_v4(mblk_t *mp);
conn_t *ipcl_classify_v6(mblk_t *mp);
conn_t *ipcl_classify(mblk_t *mp);
The names of the functions are pretty self explanatory.
Synchronization mechanism
Since the stack is fully multi-threaded (barring the per CPU
serialization enforced by the vertical perimeter), it uses a reference
based scheme to ensure that connection instances are available when
needed. The reference count is implemented by the 'conn_t' member
'conn_ref' and protected by 'conn_lock'. The prime purpose of the lock
is not to protect the bulk of 'conn_t' but just the reference count. Each
time some entity takes a reference to the data structure (stores a
pointer to the data structure for later processing), it increments the
reference count by calling the CONN_INC_REF macro, which basically
acquires the 'conn_lock', increments the 'conn_ref' and drops the
'conn_lock'. Each time the entity drops the reference to the connection
instance, it does so using the CONN_DEC_REF macro.
For an established TCP connection, there are guaranteed to be 3
references on it. Each protocol layer has a reference on the instance
(one each for TCP and IP) and the classifier itself has a reference
since it's an established connection. Each time a packet arrives for the
connection and the classifier looks up the connection instance, an extra
reference is placed, which is dropped when the protocol layer finishes
processing that packet. Similarly, any timers running on the connection
instance hold a reference to ensure that the instance is around
whenever a timer fires. The memory associated with the connection
instance is freed once the last reference is dropped.
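The reference scheme can be sketched in user space as follows. This is not the CONN_INC_REF/CONN_DEC_REF source: the sk_* names are hypothetical, a pthread mutex replaces 'conn_lock', and the sk_conns_freed counter exists only so that the moment of freeing can be observed.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical miniature of a connection with a lock-protected refcount. */
typedef struct sk_ref_conn {
    pthread_mutex_t lock;   /* protects only 'ref', like conn_lock */
    unsigned int ref;       /* like conn_ref */
} sk_ref_conn_t;

static int sk_conns_freed;  /* bookkeeping for the example only */

/* Allocate a connection; the first owner holds the initial reference. */
sk_ref_conn_t *sk_conn_alloc(void)
{
    sk_ref_conn_t *c = malloc(sizeof (*c));
    if (c == NULL)
        return NULL;
    pthread_mutex_init(&c->lock, NULL);
    c->ref = 1;
    return c;
}

/* Take a reference: lock, increment, unlock. */
void sk_conn_inc_ref(sk_ref_conn_t *c)
{
    pthread_mutex_lock(&c->lock);
    c->ref++;
    pthread_mutex_unlock(&c->lock);
}

/* Drop a reference; whoever drops the last one frees the memory. */
void sk_conn_dec_ref(sk_ref_conn_t *c)
{
    int last;

    pthread_mutex_lock(&c->lock);
    assert(c->ref > 0);
    last = (--c->ref == 0);
    pthread_mutex_unlock(&c->lock);
    if (last) {
        pthread_mutex_destroy(&c->lock);
        free(c);
        sk_conns_freed++;
    }
}
```

With this sketch, an established connection would sit at a count of 3 (classifier, TCP, IP) and briefly rise to 4 while a looked-up packet is being processed.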
3 TCP
Solaris 10 provides the same view for TCP as previous releases i.e. TCP
appears as a clone device but it is actually a composite, with the TCP
and IP code merged into a single D_MP STREAMS module. The merged TCP/IP
module's STREAMS entry points for open and close are the same as IP's
entry points viz ip_open and ip_close. Based on the major number passed
during open, IP decides whether the open corresponds to a TCP open or
an IP open. The put and service STREAMS entry points for TCP are
tcp_wput, tcp_wsrv and tcp_rsrv. The tcp_wput entry point simply serves
as a wrapper routine and enables sockfs and other modules from the top
to talk to TCP using STREAMS. Note that tcp_rput is missing since IP
calls TCP functions directly. IP's STREAMS entry points remain
unchanged.
The operational part of TCP is fully protected by the vertical
perimeter, which is entered through the squeue_* primitives as illustrated
in Fig 4. Packets flowing from the top enter into TCP through the
wrapper function tcp_wput, which then tries to execute the real TCP
output processing function tcp_output after entering the corresponding
vertical perimeter. Similarly packets coming from the bottom try to
execute the real TCP input processing function tcp_input after entering
the vertical perimeter. There are multiple entry points into TCP
through the vertical perimeter.

Fig. 4
- tcp_input - All inbound data packets and control messages
- tcp_output - All outbound data packets and control messages
- tcp_close_output - On user close
- tcp_timewait_output - timewait expiry
- tcp_rsrv_input - Flowcontrol relief on read side
- tcp_timer - All tcp timers
The Interface between TCP and IP
FireEngine changes the interface between TCP and IP from the existing
STREAMS based message passing interface to a function call based
interface, both in the control and data paths. On the outbound side, TCP
passes a fully prepared packet directly to IP by calling ip_output
while inside the vertical perimeter.
Similarly, control messages are also passed directly as function
arguments. ip_bind_v{4, 6} receives a bind message as an argument,
performs the required action and returns a result mp to the caller. TCP
directly calls ip_bind_v{4, 6} in the connect(), bind() and listen()
paths. IP still retains all its STREAMS entry points, but TCP (/dev/tcp)
becomes a real device driver, i.e. it can't be pushed over other device
drivers.
The basic protocol processing code is unchanged. Let's have a look at
common socket calls and see how they interact with the framework.
Socket
A socket open of TCP or open of /dev/tcp eventually calls into ip_open.
The open then calls into the IP connection classifier and allocates the
per-TCP endpoint control block already integrated with the conn_t. It
chooses the squeue for this connection. In the case of an internal open,
i.e. by sockfs for an acceptor stream, almost nothing is done, and we
delay doing useful work till accept time.
Bind
tcp_bind eventually needs to talk to IP to figure out whether the
address passed in is valid. FireEngine TCP prepares this request as
usual in the form of a TPI message. However, this message is directly
passed as a function argument to ip_bind_v{4, 6}, which returns the
result as another message. The use of messages as parameters is helpful
in leveraging the existing code with minimal change. The port hash
table used by TCP to validate binds still remains in TCP, since the
classifier has no use for it.
Connect
The changes in tcp_connect are similar to tcp_bind. The full bind()
request is prepared as a TPI message and passed as a function argument
to ip_bind_v{4, 6}. IP calls into the classifier and inserts the
connection in the connected hash table. The conn_ hash table in TCP is
no longer used.
Listen
This path is part of tcp_bind. The tcp_bind prepares a local bind TPI
message and passes it as a function argument to ip_bind_v{4, 6}. IP
calls the classifier and inserts the connection in the bind hash table.
The listen hash table of TCP does not exist any more.
Accept
The pre Solaris 10 accept implementation did the bulk of the connection
setup processing in the listener context. The three way handshake was
completed in listener's perimeter and the connection indication was
sent up the listener's STREAM. The messages necessary to perform the
accept were sent down on the listener STREAM and the listener was
single threaded from the point of sending the T_CONN_RES message to TCP
till sockfs received the acknowledgment. If the connection arrival rate
was high, the ability of pre Solaris 10 stack to accept new connections
deteriorated significantly.
Furthermore, there was some additional TCP overhead involved, which
contributed to a slower accept rate. When sockfs opened an acceptor STREAM
to TCP to accept a new connection, TCP was not aware that the data
structures necessary for the new connection had already been
allocated. So it allocated new structures and initialized them, but
later, as part of the accept processing, these were freed. Another major
problem with the pre Solaris 10 design was that packets for a newly
created connection arrived on the listener's perimeter. This required a
check for every incoming packet, and packets landing on the wrong
perimeter needed to be sent to their correct perimeter, causing additional
delay.
The FireEngine model establishes an eager connection (an incoming
connection is called eager till accept completes) in its own perimeter
as soon as a SYN packet arrives thus making sure that packets always
land on the correct connection. As a result it is possible to
completely eliminate the TCP global queues. The connection indication
is still sent to the listener on the listener's STREAM but the accept
happens on the newly created acceptor STREAM (thus, there is no need to
allocate data structures for this STREAM) and the acknowledgment can be
sent on the acceptor STREAM. As a result, sockfs doesn't need to become
single threaded at any time during the accept processing.
The new model was carefully implemented because the new incoming
connection (eager) exists only because there is a listener for it and
both eager and listener can disappear at any time during accept
processing as a result of eager receiving a reset or listener closing.
The eager starts out by placing a reference on the listener so that the
eager reference to the listener is always valid even though the
listener might close. When a connection indication needs to be sent
after the three way handshake is completed, the eager places a
reference on itself so that it can close on receiving a reset but any
reference to it is still valid. The eager sends a pointer to itself as
part of the connection indication message, which is sent via the
listener's STREAM after checking that the listener has not closed. When
the T_CONN_RES message comes down the newly created acceptor STREAM, we
again enter the eager's perimeter and check that the eager has not
closed because of receiving a reset before completing the accept
processing. For TLI/XTI based applications, the T_CONN_RES message is
still handled on the listener's STREAM and the acknowledgment is sent
back on the listener's STREAM, so there is no change in behavior.
Close
Close processing in TCP now does not have to wait till the reference
count drops to zero, since references to the closing queue and
references to the TCP are now decoupled. Close can return as soon as
all references to the closing queue are gone. The TCP data structures
themselves may continue to stay around as a detached TCP in most cases.
The release of the last reference to the TCP frees up the TCP data
structure.
A user initiated close only closes the stream. The underlying TCP
structures may continue to stay around. The TCP then goes through the
FIN/ACK exchange with the peer after all user data is transferred and
enters the TIME_WAIT state where it stays around for a certain duration
of time. This is called a detached TCP. These detached TCPs also need
protection to prevent outbound and inbound processing from happening at
the same time on a given detached TCP.
Data path
TCP does not even need to call IP to transmit the outbound packet in
the most common case, if it can access the IRE. With a merged TCP/IP we
have the advantage of being able to access the cached ire for a
connection, and TCP can putnext the data directly to the link layer
driver based on the information in the IRE. FireEngine does exactly the
above.
TCP Loopback
TCP Fusion is a protocol-less data path for loopback TCP connections in
Solaris 10. The fusion of two local TCP endpoints occurs at connection
establishment time. By default, all loopback TCP connections are fused.
This behavior may be changed by setting the system wide tunable
do_tcp_fusion to 0. Various conditions on both endpoints need to be met for
fusion to be successful:
- They must share a common squeue.
- They must be TCP and not "raw socket".
- They must not require protocol-level processing, i.e. IPsec or IPQoS policy is not present
for the connection.
If it fails, we fall back to the regular TCP data path; if it succeeds,
both endpoints proceed to use tcp_fuse_output() as the transmit path.
tcp_fuse_output() enqueues application data directly onto the peer's
receive queue; no protocol processing is involved. After enqueueing the
data, the sender can either push - by calling putnext(9F) - the data up
the receiver's read queue; or the sender can simply return and let the
receiver retrieve the enqueued data via the synchronous STREAMS entry
point. The latter path is taken if synchronous STREAMS is enabled. It
gets automatically disabled if sockfs no longer resides directly on top
of the TCP module due to a module insertion or removal.
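The fusion conditions listed above can be expressed as a simple predicate. The sketch below is illustrative only: sk_endpoint_t and sk_can_fuse() are hypothetical names, the 'fusion_enabled' argument stands in for the system wide tunable, and the real code may check further conditions not listed in the text.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of the endpoint state relevant to fusion. */
typedef struct {
    int  squeue_id;         /* which squeue the endpoint is bound to */
    bool is_tcp;            /* false for a raw socket */
    bool has_ipsec_policy;  /* IPsec policy present for the connection */
    bool has_ipqos_policy;  /* IPQoS policy present for the connection */
} sk_endpoint_t;

/* Returns true if two loopback endpoints satisfy the listed conditions. */
bool sk_can_fuse(bool fusion_enabled, const sk_endpoint_t *a,
    const sk_endpoint_t *b)
{
    if (!fusion_enabled)
        return false;       /* models the tunable being set to 0 */
    if (a->squeue_id != b->squeue_id)
        return false;       /* must share a common squeue */
    if (!a->is_tcp || !b->is_tcp)
        return false;       /* must be TCP, not a raw socket */
    if (a->has_ipsec_policy || b->has_ipsec_policy ||
        a->has_ipqos_policy || b->has_ipqos_policy)
        return false;       /* protocol-level processing required */
    return true;            /* otherwise use the fused data path */
}
```

When the predicate is false, the connection simply stays on the regular TCP data path.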
Locking in TCP Fusion is handled by the squeue and the mutex tcp_fuse_lock.
One of the requirements for fusion to succeed is that both endpoints
need to be using the same squeue. This ensures that neither side can
disappear while the other side is still sending data. By itself, the squeue
is not sufficient for guaranteeing safe access when synchronous STREAMS
is enabled. The reason is that tcp_fuse_rrw() doesn't enter the squeue,
and its access to tcp_rcv_list and other fusion-related fields needs to
be synchronized with the sender. tcp_fuse_lock is used for this
purpose.
Rate Limit for Small Writes
Flow control for TCP Fusion in synchronous
stream mode is achieved by checking the size of receive buffer and the
number of data blocks, both set to different limits. This is different
than regular STREAMS flow control where cumulative size check dominates
data block count check (STREAMS queue high water mark typically
represents bytes). Each enqueue triggers notifications sent to the
receiving process; a build up of data blocks indicates a slow receiver
and the sender should be blocked or informed at the earliest moment
instead of further wasting system resources. In effect, this is
equivalent to limiting the number of outstanding segments in flight.
The minimum number of allowable enqueued data blocks defaults to 8 and
is changeable via the system wide tunable tcp_fusion_burst_min to
either a higher value or to 0 (the latter disables the burst check).
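The two-limit check can be sketched as below. The structure and function names are hypothetical, and the sketch takes the block limit as a given input instead of deriving it from tcp_fusion_burst_min; a block limit of 0 is treated as "burst check disabled", as described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical view of a fused receiver's queue. */
typedef struct {
    size_t rcv_bytes;       /* bytes currently enqueued */
    size_t rcv_blocks;      /* data blocks currently enqueued */
    size_t max_bytes;       /* receive buffer limit */
    size_t max_blocks;      /* block count limit; 0 disables the check */
} sk_fuse_rcv_t;

/* Returns true if the sender should be flow controlled. */
bool sk_fuse_flow_controlled(const sk_fuse_rcv_t *r)
{
    if (r->rcv_bytes >= r->max_bytes)
        return true;        /* cumulative size limit reached */
    if (r->max_blocks != 0 && r->rcv_blocks >= r->max_blocks)
        return true;        /* too many small writes outstanding */
    return false;
}
```

For example, eight one-byte writes against a 64 KB buffer would trip the block limit of 8 long before the byte limit, which is exactly the slow-receiver case the text describes.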
4 UDP
Apart from the framework improvements, Solaris 10 made additional
changes in the way UDP packets move through the stack. The internal code
name for the project was "Yosemite". Pre Solaris 10, the UDP processing
cost was evenly divided between per packet processing cost and per byte
processing cost. The packet processing cost was generally due to
STREAMS; the stream head processing; and packet drops in the stack and
driver. The per byte processing cost was due to lack of H/W cksum and
unoptimized code branches throughout the network stack.
UDP packet drop within the stack
Although UDP is supposed to be unreliable, local area networks have
become pretty reliable and applications tend to assume that there will
be no packet loss in a LAN environment. This assumption was largely
true, but the pre Solaris 10 stack was not very effective in dealing with
UDP overload and tended to drop packets within the stack itself.
On inbound, packets were dropped at more than one layer throughout the
receive path. For UDP, the most common and obvious place is at the IP
layer due to the lack of resources needed to queue the packets. Another
important yet less apparent place of packet drops is at the network
adapter layer. This type of drop is fairly common when the
machine is dealing with a high rate of incoming packets.
UDP sockfs
The UDP sockfs extension (sockudp) is an alternative path to
socktpi used for handling sockets-based UDP applications. It provides
for a more direct channel between the application and the network stack
by eliminating the stream head and TPI message-passing interface. This
allows for a direct data and function access throughout the socket and
transport layers. This allows the stack to become more efficient and
coupled with UDP H/W checksum offload (even for fragmented UDP),
ensures that UDP packets are rarely dropped within the stack.
UDP Module
The UDP module is fully multi-threaded and runs under the same protection
domain as IP. It allows for a tighter integration of the transport
(UDP) with the layers above and below it. This allows socktpi to make
direct calls to UDP. Similarly UDP may also make direct calls to the
data link layer. In the post GLDv3 world, the data link layer may also
make direct calls to the transport. In addition, utility functions can
be called directly instead of using message-based interface.
UDP needs exclusive operation on a per-endpoint basis when executing
functions that modify the endpoint state. udp_rput_other() deals with
packets with IP options, and processing these packets ends up having to
update the endpoint's option related state. udp_wput_other() deals with
control operations from the top, e.g. connect(3SOCKET) that needs to
update the endpoint state. In the STREAMS world this synchronization
was achieved by using shared inner perimeter entry points, and by using
qwriter_inner() to get exclusive access to the endpoint.
The Solaris 10 model uses an internal, STREAMS-independent perimeter to
achieve the above synchronization and is described below:
- udp_enter() - Enter the UDP endpoint perimeter.
- udp_become_writer() - Become exclusive on the UDP endpoint. Specifies
a function that will be called exclusively, either immediately or later
when the perimeter is available exclusively.
- udp_exit() - Exit the UDP endpoint perimeter.
Entering UDP from the top or from the bottom must be done using
udp_enter(). As in the general case, no locks may be held across this
perimeter. When finished with the exclusive mode, udp_exit() must be
called to get out of the perimeter.
To support this, the new UDP model employs two modes of operation,
namely UDP_MT_HOT mode and UDP_SQUEUE mode. In UDP_MT_HOT mode,
multiple threads may enter a UDP endpoint concurrently. This is used
for sending or receiving normal data and is similar to the putshared
STREAMS entry points. Control operations and other special cases call
udp_become_writer() to become exclusive on a per-endpoint basis, and
this results in a transition to UDP_SQUEUE mode. The squeue by
definition serializes access to the conn_t. When there are no more
pending messages on the squeue for the UDP connection, the endpoint
reverts to UDP_MT_HOT mode. In between, while not all MT threads of an
endpoint have finished, messages are queued in the endpoint and UDP
is in one of two transient modes, i.e. UDP_MT_QUEUED or
UDP_QUEUED_SQUEUE mode.
While in stable modes, UDP keeps track of the number of threads
operating on the endpoint. The udp_reader_count variable represents the
number of threads entering the endpoint as readers while it is in
UDP_MT_HOT mode. Transitioning to UDP_SQUEUE happens when there is only
a single reader, i.e. when this counter drops to 1. Likewise,
udp_squeue_count represents the number of threads operating on the
endpoint's squeue while it is in UDP_SQUEUE mode. The mode transition to
UDP_MT_HOT happens after the last thread exits the endpoint.
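To make the mode transitions a little more concrete, here is a small single-threaded model in C. It is only a sketch and not the Solaris code: ep_t, ep_enter(), ep_become_writer(), ep_exit() and the EP_* modes are hypothetical stand-ins, there is no real locking or squeue, and only one deferred function can be pending. The sketch assumes that the exclusive function runs at once when the caller is the only reader, and is otherwise deferred until the last reader has left.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified modes; the real stack has more states. */
typedef enum { EP_MT_HOT, EP_MT_QUEUED, EP_SQUEUE } ep_mode_t;

typedef void (*ep_func_t)(void *arg);

typedef struct ep {
    ep_mode_t mode;
    unsigned  reader_count;  /* threads inside the endpoint as readers */
    ep_func_t pending;       /* deferred exclusive function, if any */
    void     *pending_arg;
} ep_t;

/* Run fn exclusively, then revert to the multithreaded mode. */
static void ep_run_exclusive(ep_t *ep, ep_func_t fn, void *arg)
{
    ep->mode = EP_SQUEUE;
    fn(arg);
    ep->mode = EP_MT_HOT;
}

/* Enter as a reader. Returns 1 on success, 0 when the endpoint is not
 * in EP_MT_HOT mode (a real stack would queue the message instead). */
int ep_enter(ep_t *ep)
{
    if (ep->mode != EP_MT_HOT)
        return 0;
    ep->reader_count++;
    return 1;
}

/* Called by a thread that has already entered the endpoint. */
void ep_become_writer(ep_t *ep, ep_func_t fn, void *arg)
{
    assert(ep->mode == EP_MT_HOT && ep->reader_count >= 1);
    if (ep->reader_count == 1) {
        ep_run_exclusive(ep, fn, arg);   /* caller is the only reader */
    } else {
        ep->mode = EP_MT_QUEUED;         /* wait for the other readers */
        ep->pending = fn;
        ep->pending_arg = arg;
    }
}

/* Leave the endpoint; the last reader out runs any deferred function. */
void ep_exit(ep_t *ep)
{
    assert(ep->reader_count >= 1);
    ep->reader_count--;
    if (ep->reader_count == 0 && ep->mode == EP_MT_QUEUED) {
        ep_func_t fn = ep->pending;
        void *arg = ep->pending_arg;
        ep->pending = NULL;
        ep_run_exclusive(ep, fn, arg);
    }
}

/* Example exclusive operation: counts how often it was run. */
void ep_count_call(void *arg)
{
    (*(int *)arg)++;
}
```

With a single reader inside, ep_become_writer() runs the function right away. With two readers inside, the endpoint sits in the transient EP_MT_QUEUED mode, new ep_enter() calls are refused, and the function runs when the second reader calls ep_exit().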
Though UDP and IP are running in the same protection domain, they are
still separate STREAMS modules. Therefore, STREAMS plumbing is kept
unchanged and a UDP module instance is always pushed above IP. Although
this causes an extra open and close for every UDP endpoint, it provides
backwards compatibility for some applications that rely on such
plumbing geometry to do certain things, e.g. issuing I_POP on the
stream to obtain direct access to IP.
The actual UDP processing is done within the IP instance. The UDP
module instance does not possess any state about the endpoint and
merely acts as a dummy module, whose presence is to keep the STREAMS
plumbing appearance unchanged.
Solaris 10 allows for the following plumbing modes:
- Normal - IP is first opened and later UDP is pushed directly on
top. This is the default action that happens when a UDP socket or
device is opened.
- SNMP - UDP is pushed on top of a module other than IP. When this
happens it will support only SNMP semantics.
These modes imply that we don't support any intermediate module between
IP and UDP; in fact, Solaris has never supported such a scenario in the
past, as the inter-layer communication semantics between IP and
transport modules are private.
UDP and Socket interaction
A significant event that takes place during the socket(3SOCKET) system
call is the plumbing of the modules associated with the socket's address
family and protocol type. A TCP or UDP socket will most likely result
in sockfs residing directly atop the corresponding transport module.
Before Solaris 10, the socket layer used STREAMS primitives to
communicate with the UDP module. Solaris 10 introduced a function-call
interface which eliminated the need to use a T_UNITDATA_REQ message for
metadata during each transmit from sockfs to UDP. Instead, data and its
ancillary information (i.e. remote socket address) can be provided
directly to an alternative UDP entry point, therefore avoiding the
extra allocation cost.
For transport modules, being directly beneath sockfs allows for
synchronous STREAMS to be used. This enables the transport layer to
buffer incoming data to be later retrieved by the application (via
synchronous STREAMS) when a read operation is issued, therefore
shortening the receive processing time.
Synchronous STREAMS
Synchronous STREAMS is an extension to the traditional STREAMS
interface for message passing and processing. It was originally added
as part of the combined copy and checksum effort. It offers a way for
the entry point of the module or driver to be called in a synchronous
manner with respect to user I/O requests. In traditional STREAMS, the
stream head is the synchronous barrier for such requests. Synchronous
STREAMS provides a mechanism to move this barrier from the stream head
down to a module below.
The TCP implementation of synchronous STREAMS before Solaris 10 was
complicated, due to several factors. A major factor was the combined
checksum and copyin/copyout operations. In Solaris 10, TCP no longer
depends on checksumming during copyin/copyout, so the mechanism was
greatly simplified for use with loopback TCP and UDP on the read side.
The synchronous STREAMS entry points are called during requests such as
read(2) or recv(3SOCKET). Instead of sending the data upstream using
putnext(9F), these modules enqueue the data in their internal receive
queues and allow the send thread to return sooner. This avoids calling
strrput() to enqueue the data at the stream head from within the send
thread context, therefore allowing for better dynamics - reducing the
amount of time taken to enqueue and signal/poll-notify the receiving
application allows the send thread to return faster to do further work,
i.e. things are less serialized than before.
Each time data arrives, the transport module schedules for the
application to retrieve it. If the application is currently blocked
(sleeping) during a read operation, it will be unblocked to allow it to
resume execution. This is achieved by calling STR_WAKEUP_SET() on the
stream. Likewise, when there is no more data available for the
application, the transport module will allow it to be blocked again
during the next read attempt, by calling STR_WAKEUP_CLEAR(). Any new
data that arrives before then will override this state and cause
the subsequent read operation to proceed.
An application may also be blocked in poll(2) until a read event takes
place, or it may be waiting for a SIGPOLL or SIGIO signal if the socket
used is non-blocking. Because of this, the transport module delivers
the event notification and/or signals the application each time it
receives data. This is achieved by calling STR_SENDSIG() on the
corresponding stream.
As part of the read operation, the transport module delivers data to
the application by returning it from its read side synchronous STREAMS
entry point. In the case of loopback TCP, the synchronous STREAMS read
entry point returns the entire content (byte stream) of its receive
queue to the stream head; any remaining data will be re-enqueued at the
stream head awaiting the next read. For UDP, the read entry point
returns only one message (datagram) at a time.
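The read side behaviour for UDP described above can be pictured with a small user-space model in C. This is a sketch only: rcvq_t, rq_enqueue() and rq_read() are hypothetical names, the wakeup field merely models the effect of STR_WAKEUP_SET() and STR_WAKEUP_CLEAR(), and the fixed-size ring is an assumption made to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RQ_MAX_DGRAMS 8    /* arbitrary queue depth for the sketch */
#define RQ_MAX_LEN    64   /* arbitrary maximum datagram size */

typedef struct rcvq {
    char   data[RQ_MAX_DGRAMS][RQ_MAX_LEN];
    size_t len[RQ_MAX_DGRAMS];
    size_t head, count;
    int    wakeup;   /* 1: a read will proceed, 0: a read would block */
} rcvq_t;

/* Receive path: queue the datagram inside the transport instead of
 * pushing it to the stream head, and mark the reader as runnable.
 * Returns 0 on success, -1 if the datagram has to be dropped. */
int rq_enqueue(rcvq_t *q, const void *buf, size_t len)
{
    size_t tail;

    assert(q->count <= RQ_MAX_DGRAMS);
    if (q->count == RQ_MAX_DGRAMS || len > RQ_MAX_LEN)
        return -1;
    tail = (q->head + q->count) % RQ_MAX_DGRAMS;
    memcpy(q->data[tail], buf, len);
    q->len[tail] = len;
    q->count++;
    q->wakeup = 1;
    return 0;
}

/* Read entry point: hands back exactly one datagram per call.
 * Returns the number of bytes copied, or -1 if the caller would block.
 * Bytes that do not fit in buf are discarded with the datagram. */
long rq_read(rcvq_t *q, void *buf, size_t buflen)
{
    size_t n;

    if (q->count == 0) {
        q->wakeup = 0;
        return -1;
    }
    n = q->len[q->head] < buflen ? q->len[q->head] : buflen;
    memcpy(buf, q->data[q->head], n);
    q->head = (q->head + 1) % RQ_MAX_DGRAMS;
    q->count--;
    if (q->count == 0)
        q->wakeup = 0;   /* no more data: the next read blocks */
    return (long)n;
}
```

Queuing two datagrams and then reading returns them one per call, and the wakeup state is cleared again once the queue is empty.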
STREAMS fallback
By default, direct transmission and read side synchronous STREAMS
optimizations are enabled for all UDP and loopback TCP sockets when
sockfs is directly above the corresponding transport module. There are
several cases which require these features to be disabled; when this
happens, message exchange between sockfs and the transport module must
then be done through putnext(9F). The cases are as follows:
- Intermediate Module - A module is configured to be autopushed at
open time on top of the transport module via autopush(1M), or is
I_PUSH'd on a socket via ioctl(2).
- Stream Conversion - The imaginary sockmod module is I_POP'd from
a socket, causing it to be converted from a socket endpoint into a
device stream.
(Note that the I_INSERT or I_REMOVE ioctl is not permitted on a socket
endpoint and therefore a fallback is not required to handle it.)
If a fallback is required, sockfs will notify the transport module that
direct mode is disabled. The notification is sent down by the sockfs
module in the form of an ioctl message, which indicates to the
transport module that putnext(9F) must now be used to deliver data
upstream. This allows for data to flow through the intermediate module
and it provides for compatibility with device stream semantics.
5. IP
As mentioned before, all the transport layers have been merged into the
IP module, which is fully multithreaded and acts as a pseudo device
driver as well as a STREAMS module. The key change in IP was the removal
of IP client functionality and multiplexing the inbound packet stream.
The new IP Classifier (which is still part of the IP module) is
responsible for classifying the inbound packets to the correct
connection instance. The IP module is still responsible for network
layer protocol processing and for plumbing and managing the network
interfaces.
Let's have a quick look at how plumbing of network interfaces,
multipathing, and multicast work in the new stack.
Plumbing NICs
Plumbing is a long sequence of operations involving message exchanges
between IP, ARP and device drivers. Most set ioctls are typically
involved in plumbing operations. A natural model is to serialize these
ioctls per ill, one at a time. For example, plumbing of hme0 and qfe0
can go on in parallel without any interference, but the various set
ioctls on hme0 will all be serialized.
Another possibility is to fine-grain even further and serialize
operations per ipif rather than per ill. This will be beneficial only
if many ipifs are hosted on an ill, and if the operations on different
ipifs don't have any mutual interference. Another possibility is to
completely multithread all ioctls using standard Solaris MT techniques.
But this is needlessly complex and does not have much added value. It
is hard to hold locks across the entire plumbing sequence, which
involves waits and message exchanges with drivers or other modules.
Not much is gained in performance or functionality by allowing
multiple set ioctls on an ipif at the same time, since these
are purely non-repetitive control operations. Broadcast ires are
created on a per ill basis rather than a per ipif basis. Hence trying to
bring up more than one ipif simultaneously on an ill involves extra
complexity in the broadcast ire creation logic. On the other hand,
serializing plumbing operations per ill lends itself easily to the
existing IP code base. During the course of plumbing, IP exchanges
messages with the device driver and ARP. The messages received from the
underlying device driver are also handled exclusively in IP. This is
convenient since we can't hold standard mutex locks across the putnext
in trying to provide mutual exclusion between the write side and read
side activities. Instead of the all-exclusive PERMOD syncq, this effect
can be easily achieved by using a per ill serialization queue.
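The idea of a per ill serialization queue can be sketched in a few lines of C. This is a toy, single-threaded model and not the IP code: serq_t, serq_submit(), step_t and step_run() are hypothetical names, and the fixed queue depth is an assumption. An operation submitted while another one is running on the same queue is deferred instead of being run re-entrantly; a second ill would simply own a second serq_t and proceed independently.

```c
#include <assert.h>
#include <stddef.h>

#define SERQ_DEPTH 16   /* arbitrary depth for the sketch */

typedef void (*serq_op_t)(void *arg);

/* One serialization queue; in this model there is one per ill. */
typedef struct serq {
    int       busy;               /* an operation is currently running */
    serq_op_t ops[SERQ_DEPTH];
    void     *args[SERQ_DEPTH];
    size_t    head, count;
} serq_t;

/* Queue an operation. If nothing is running, drain the queue in FIFO
 * order; otherwise the running drain loop picks the operation up later.
 * Returns 0 on success, -1 if the queue is full. */
int serq_submit(serq_t *q, serq_op_t op, void *arg)
{
    size_t tail;

    if (q->count == SERQ_DEPTH)
        return -1;
    tail = (q->head + q->count) % SERQ_DEPTH;
    q->ops[tail] = op;
    q->args[tail] = arg;
    q->count++;
    if (q->busy)
        return 0;
    q->busy = 1;
    while (q->count > 0) {
        serq_op_t next = q->ops[q->head];
        void *next_arg = q->args[q->head];
        q->head = (q->head + 1) % SERQ_DEPTH;
        q->count--;
        next(next_arg);
    }
    q->busy = 0;
    return 0;
}

/* Example operation: optionally submits a follow-up step to the same
 * queue, then records its own tag in a trace buffer. */
typedef struct trace { char buf[32]; size_t len; } trace_t;

typedef struct step {
    serq_t      *q;
    trace_t     *t;
    char         tag;
    struct step *follow_up;
} step_t;

void step_run(void *arg)
{
    step_t *s = arg;

    if (s->follow_up != NULL)
        serq_submit(s->q, step_run, s->follow_up);
    assert(s->t->len < sizeof (s->t->buf) - 1);
    s->t->buf[s->t->len++] = s->tag;
    s->t->buf[s->t->len] = '\0';
}
```

If step A submits step B from inside its own handler, the trace still reads A then B: B waits on the queue until A has finished, which is the kind of ordering a long plumbing sequence needs.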
IP Network MultiPathing (IPMP)
IPMP operations are all driven around the notion of an IPMP group.
Failover and failback operations operate between two ills, usually part
of the same IPMP group. The ipifs and ilms are moved between the ills.
This involves bringing down the source ill and could involve bringing
up the destination ill. Bringing down or bringing up ills affects
broadcast ires. Broadcast ires need to be grouped per IPMP group to
suppress duplicate broadcast packets that are received. Thus broadcast
ire manipulation affects all members of the IPMP group. Setting
IFF_FAILED or IFF_STANDBY causes evaluation of all ills in the IPMP
group and causes regrouping of broadcast ires. Thus serializing IPMP
operations per IPMP group lends itself easily to the existing code
base. An IPMP group includes both the IPv4 and IPv6 ills.
Multicast
Multicast joins operate on both the ilg and ilm structures. Multiple
threads operating on an ipc (socket) trying to do multicast joins need
to synchronize when operating on the ilg. Multiple threads potentially
operating on different ipcs (socket endpoints) trying to do multicast
joins could eventually end up trying to manipulate the ilm
simultaneously and need to synchronize on the access to the ilm. Both
are amenable to standard Solaris MT techniques. Considering all the
above, i.e. plumbing, IPMP and multicast, the common denominator is to
serialize all the exclusive operations on a per IPMP group basis or, if
IPMP is not enabled, on a phyint basis. E.g. the hme0 v4 and hme0 v6
ills taken together share a phyint. Of the above, multicast has a
potentially higher degree of multithreading, but it has to coexist with
other exclusive operations. For example, we don't want a thread to
create or delete an ilm when a failover operation is already in
progress trying to move ilms between two ills. So the lowest common
denominator is to serialize multicast joins per physical interface or
IPMP group.
(2005-11-14 23:39:37.0)
Solaris Networking - The Magic Revealed (Part II)
Many of you have asked for details on Solaris 10 networking. The
great news is that I finished writing the treatise on the subject, which
will become a new section in the Solaris Internals book
by Jim Mauro and Richard McDougall.
In the meanwhile, I have used some excerpts to create a mini book (part
II) for the Networking community on OpenSolaris. Enjoy! As usual,
comments (good or bad) are welcome.
6. Solaris 10 Device Driver framework
Let's have a quick look at how network device drivers were implemented
before Solaris 10 and why they needed to change with the new Solaris 10
stack.
GLDv2 and Monolithic DLPI drivers (Solaris 9 and before)
Before Solaris 10, the network stack relied on DLPI providers, which are
normally implemented in one of two ways. The following illustration
(Fig. 5) shows a stack based on a so-called monolithic DLPI driver and a
stack based on a driver utilizing the Generic LAN Driver (GLDv2)
module.

Fig. 5
The GLDv2 module essentially behaves as a library. The client still
talks to the driver instance bound to the device but the DLPI protocol
processing is handled by calling into the GLDv2 module, which will then
call back into the driver to access the hardware. Using the GLD module
has a clear advantage in that the driver writer need not re-implement
large amounts of mostly generic DLPI protocol processing. Layer two
(Data-Link) features such as 802.1q Virtual LANs (VLANs) can also be
implemented centrally in the GLD module allowing them to be leveraged
by all drivers. The architecture still poses a problem though when
considering how to implement a feature such as 802.3ad link aggregation
(a.k.a. trunking) where the one-to-one correspondence between network
interface and device is broken.
Both GLDv2 and monolithic drivers depend on DLPI messages and
communicate with upper layers via the STREAMS framework. This mechanism
was not very effective for link aggregation or 10Gb NICs. With the new
stack, a better mechanism was needed which could ensure data locality
and allow the stack to control the device drivers at a much finer
granularity to deal with interrupts.
GLDv3 - A New Architecture
Solaris 10 introduced a new device driver framework called GLDv3
(internal name "project Nemo") along with the new stack. Most of the
major device drivers were ported to this framework, and all future and
10Gb device drivers will be based on it. The framework also provides a
STREAMS-based DLPI layer for backward compatibility (to allow external,
non-IP modules to continue to work).
GLDv3 architecture virtualizes layer two of the network stack. There is
no longer a one-to-one correspondence between network interfaces and
devices. The illustration below (Fig. 6) shows multiple devices
registered with a MAC Services Module (MAC). It also shows two clients:
one traditional client that communicates via DLPI to a Data-Link Driver
(DLD) and one that is kernel based and simply makes direct function
calls into the Data-Link Services Module (DLS).

Fig. 6
GLDv3 Drivers
GLDv3 drivers are similar to GLD drivers. The driver must be linked
with a dependency on misc/mac and misc/dld. It must call
mac_register() with a pointer to an instance of the following structure
to register with the MAC module:
    typedef struct mac {
        const char       *m_ident;
        mac_ext_t        *m_extp;
        struct mac_impl  *m_impl;
        void             *m_driver;
        dev_info_t       *m_dip;
        uint_t           m_port;
        mac_info_t       m_info;
        mac_stat_t       m_stat;
        mac_start_t      m_start;
        mac_stop_t       m_stop;
        mac_promisc_t    m_promisc;
        mac_multicst_t   m_multicst;
        mac_unicst_t     m_unicst;
        mac_resources_t  m_resources;
        mac_ioctl_t      m_ioctl;
        mac_tx_t         m_tx;
    } mac_t;
This structure must persist for the lifetime of the registration, i.e.
it cannot be de-allocated until after mac_unregister() is called. A
GLDv3 driver's _init(9E) entry point is also required to call
mac_init_ops() before calling mod_install(9F), and its _fini(9E) entry
point is required to call mac_fini_ops() after calling mod_remove(9F).
The important members of this 'mac_t' structure are:
- 'm_impl' - This is used by the MAC module to point to its private
data. It must not be read or modified by a driver.
- 'm_driver' - This field should be set by the driver to point at
its private data. This value will be supplied as the first argument to
the driver entry points.
- 'm_dip' - This field must be set to the dev_info_t pointer of the
driver instance calling mac_register().
- 'm_info' - This is an embedded structure defined as follows:
    typedef struct mac_info {
        uint_t     mi_media;
        uint_t     mi_sdu_min;
        uint_t     mi_sdu_max;
        uint32_t   mi_cksum;
        uint32_t   mi_poll;
        boolean_t  mi_stat[MAC_NSTAT];
        uint_t     mi_addr_length;
        uint8_t    mi_unicst_addr[MAXADDRLEN];
        uint8_t    mi_brdcst_addr[MAXADDRLEN];
    } mac_info_t;
mi_media is set to the media type; mi_sdu_min is the minimum payload
size; mi_sdu_max is the maximum payload size; mi_cksum details the
device checksum capabilities flag; mi_poll details whether the driver
supports polling; mi_addr_length is set to the length of the addresses
used by the media; mi_unicst_addr is set to the unicast address of the
device at the point at which mac_register() is called; mi_brdcst_addr is
set to the broadcast address of the media; mi_stat is an array of
boolean values indexed by the following enumeration:
    typedef enum {
        MAC_STAT_IFSPEED = 0,
        MAC_STAT_MULTIRCV, MAC_STAT_BRDCSTRCV,
        MAC_STAT_MULTIXMT, MAC_STAT_BRDCSTXMT,
        MAC_STAT_NORCVBUF, MAC_STAT_IERRORS, MAC_STAT_UNKNOWNS,
        MAC_STAT_NOXMTBUF, MAC_STAT_OERRORS, MAC_STAT_COLLISIONS,
        MAC_STAT_RBYTES, MAC_STAT_IPACKETS,
        MAC_STAT_OBYTES, MAC_STAT_OPACKETS,
        MAC_STAT_ALIGN_ERRORS, MAC_STAT_FCS_ERRORS,
        MAC_STAT_FIRST_COLLISIONS, MAC_STAT_MULTI_COLLISIONS,
        MAC_STAT_SQE_ERRORS, MAC_STAT_DEFER_XMTS,
        MAC_STAT_TX_LATE_COLLISIONS, MAC_STAT_EX_COLLISIONS,
        MAC_STAT_MACXMT_ERRORS, MAC_STAT_CARRIER_ERRORS,
        MAC_STAT_TOOLONG_ERRORS, MAC_STAT_MACRCV_ERRORS,
        MAC_STAT_XCVR_ADDR, MAC_STAT_XCVR_ID, MAC_STAT_XVCR_INUSE,
        MAC_STAT_CAP_1000FDX, MAC_STAT_CAP_1000HDX,
        MAC_STAT_CAP_100FDX, MAC_STAT_CAP_100HDX,
        MAC_STAT_CAP_10FDX, MAC_STAT_CAP_10HDX,
        MAC_STAT_CAP_ASMPAUSE, MAC_STAT_CAP_PAUSE, MAC_STAT_CAP_AUTONEG,
        MAC_STAT_ADV_CAP_1000FDX, MAC_STAT_ADV_CAP_1000HDX,
        MAC_STAT_ADV_CAP_100FDX, MAC_STAT_ADV_CAP_100HDX,
        MAC_STAT_ADV_CAP_10FDX, MAC_STAT_ADV_CAP_10HDX,
        MAC_STAT_ADV_CAP_ASMPAUSE, MAC_STAT_ADV_CAP_PAUSE,
        MAC_STAT_ADV_CAP_AUTONEG,
        MAC_STAT_LP_CAP_1000FDX, MAC_STAT_LP_CAP_1000HDX,
        MAC_STAT_LP_CAP_100FDX, MAC_STAT_LP_CAP_100HDX,
        MAC_STAT_LP_CAP_10FDX, MAC_STAT_LP_CAP_10HDX,
        MAC_STAT_LP_CAP_ASMPAUSE, MAC_STAT_LP_CAP_PAUSE,
        MAC_STAT_LP_CAP_AUTONEG,
        MAC_STAT_LINK_ASMPAUSE, MAC_STAT_LINK_PAUSE,
        MAC_STAT_LINK_AUTONEG, MAC_STAT_LINK_DUPLEX,
        MAC_STAT_LINK_STATE,
        MAC_NSTAT    /* must be the last entry */
    } mac_stat_t;
The macros MAC_MIB_SET(), MAC_ETHER_SET() and MAC_MII_SET() are
provided to set all the values in each of the three groups respectively
to B_TRUE.
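To illustrate how such group macros can work on a boolean statistics array, here is a cut-down user-space sketch in C. Everything in it is an assumption for illustration: the X_* enumeration is a tiny stand-in for the real list, and x_group_set() and the X_*_SET macros are hypothetical, not the definitions behind MAC_MIB_SET(), MAC_ETHER_SET() or MAC_MII_SET().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A much shortened stand-in for the statistics enumeration, laid out
 * as three consecutive groups. */
typedef enum {
    X_STAT_IFSPEED = 0, X_STAT_IPACKETS, X_STAT_OPACKETS,  /* group 1 */
    X_STAT_ALIGN_ERRORS, X_STAT_FCS_ERRORS,                /* group 2 */
    X_STAT_LINK_DUPLEX, X_STAT_LINK_STATE,                 /* group 3 */
    X_NSTAT    /* must be the last entry */
} x_stat_t;

/* Mark every statistic from first to last (inclusive) as supported. */
void x_group_set(bool *stat, x_stat_t first, x_stat_t last)
{
    assert(first <= last && last < X_NSTAT);
    for (int i = (int)first; i <= (int)last; i++)
        stat[i] = true;
}

/* Hypothetical per-group helpers in the spirit of the macros above. */
#define X_MIB_SET(stat) \
    x_group_set((stat), X_STAT_IFSPEED, X_STAT_OPACKETS)
#define X_ETHER_SET(stat) \
    x_group_set((stat), X_STAT_ALIGN_ERRORS, X_STAT_FCS_ERRORS)
#define X_MII_SET(stat) \
    x_group_set((stat), X_STAT_LINK_DUPLEX, X_STAT_LINK_STATE)

/* Count how many statistics the driver has declared as supported. */
size_t x_supported_count(const bool *stat)
{
    size_t n = 0;

    for (int i = 0; i < X_NSTAT; i++)
        if (stat[i])
            n++;
    return n;
}
```

A driver that supports only the first and third groups would call X_MIB_SET() and X_MII_SET() on its array and leave the second group at false.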
MAC Services (MAC) module
Some key Driver Support Functions:
- 'mac_resource_add' -
    extern mac_resource_handle_t mac_resource_add(mac_t *, mac_resource_t *);
Various members are defined as
    typedef void (*mac_blank_t)(void *, time_t, uint_t);
    typedef mblk_t *(*mac_poll_t)(void *, uint_t);
    typedef enum {
        MAC_RX_FIFO = 1
    } mac_resource_type_t;
    typedef struct mac_rx_fifo_s {
        mac_resource_type_t  mrf_type;    /* MAC_RX_FIFO */
        mac_blank_t          mrf_blank;
        mac_poll_t           mrf_poll;
        void                 *mrf_arg;
        time_t               mrf_normal_blank_time;
        uint_t               mrf_normal_pkt_cnt;
    } mac_rx_fifo_t;
    typedef union mac_resource_u {
        mac_resource_type_t  mr_type;
        mac_rx_fifo_t        mr_fifo;
    } mac_resource_t;
This function should be called from the m_resources() entry point to
register individual receive resources (commonly ring buffers of DMA
descriptors) with the MAC module. The returned mac_resource_handle_t
value should then be supplied in calls to mac_rx(). The second argument
to mac_resource_add() specifies the resource being added. Resources are
specified by the mac_resource_t structure. Currently only resources of
type MAC_RX_FIFO are supported. MAC_RX_FIFO resources are described by
the mac_rx_fifo_t structure.
The mac_blank function is meant to be used by upper layers to control
the interrupt rate of the device. Its first argument is the device
context.
The other fields, mrf_normal_blank_time and mrf_normal_pkt_cnt, specify
the default interrupt interval and packet count threshold,
respectively. These parameters may be used as the second and third
arguments to mac_blank when the upper layer wants the driver to revert
to the default interrupt rate.
The interrupt rate is controlled by the upper layer by calling
mac_blank with different arguments. The interrupt rate can be
increased or decreased by the upper layer by passing a multiple of
these values as the last two arguments of mac_blank. Setting these
values to zero disables the interrupts, and the NIC is deemed to be in
polling mode.
mac_poll is the driver-supplied function used by the upper layer to
retrieve a chain of packets (up to the maximum count specified by the
second argument) from the Rx ring corresponding to the mrf_arg supplied
earlier during mac_resource_add (and passed as the first argument to
mac_poll).
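The blank and poll callbacks can be mocked up in user space to show how the two fit together. The C sketch below is not driver code: pkt_t stands in for mblk_t, rx_ring_t is a hypothetical per-ring context, and ring_blank() and ring_poll() only mirror the argument shape of mac_blank_t and mac_poll_t as described above. It assumes that zero for both blank arguments means polling mode.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Simplified packet: a stand-in for a chain of mblk_t messages. */
typedef struct pkt {
    struct pkt *next;
    int         id;
} pkt_t;

/* Hypothetical per-ring context, passed as the first callback argument. */
typedef struct rx_ring {
    pkt_t       *head, *tail;  /* received, not yet delivered upward */
    time_t       blank_time;   /* current interrupt interval */
    unsigned int pkt_cnt;      /* current packet count threshold */
    int          polling;      /* 1 while interrupts are disabled */
} rx_ring_t;

/* Same shape as mac_blank_t: (context, interval, packet count). */
void ring_blank(void *arg, time_t blank_time, unsigned int pkt_cnt)
{
    rx_ring_t *r = arg;

    r->blank_time = blank_time;
    r->pkt_cnt = pkt_cnt;
    r->polling = (blank_time == 0 && pkt_cnt == 0);
}

/* Same shape as mac_poll_t: returns a chain of at most max packets. */
pkt_t *ring_poll(void *arg, unsigned int max)
{
    rx_ring_t *r = arg;
    pkt_t *chain = r->head, *last = NULL;
    unsigned int n = 0;

    while (r->head != NULL && n < max) {
        last = r->head;
        r->head = r->head->next;
        n++;
    }
    if (last == NULL)
        return NULL;           /* ring empty, or max was zero */
    last->next = NULL;         /* cut the chain off from the ring */
    if (r->head == NULL)
        r->tail = NULL;
    return chain;
}

/* What the hardware would do: append a received packet to the ring. */
void ring_rx(rx_ring_t *r, pkt_t *p)
{
    assert(p != NULL);
    p->next = NULL;
    if (r->tail != NULL)
        r->tail->next = p;
    else
        r->head = p;
    r->tail = p;
}

/* Helper: number of packets in a chain. */
unsigned int chain_len(const pkt_t *c)
{
    unsigned int n = 0;

    for (; c != NULL; c = c->next)
        n++;
    return n;
}
```

An upper layer under load would call ring_blank(ring, 0, 0) and then pull packets with ring_poll() at its own pace; calling ring_blank() with the normal interval and packet count hands control back to interrupts.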
Data-Link Services (DLS) Module
The DLS module provides a Data-Link Services interface analogous to DLPI.
The DLS interface is a kernel-level functional interface as opposed to
the STREAMS message-based interface specified by DLPI. This module
provides the interfaces necessary for the upper layer to create and
destroy a data link service. It also provides the interfaces necessary
to plumb and unplumb the NIC. The plumbing and unplumbing of the NIC for
GLDv3-based device drivers is unchanged from the older GLDv2 or
monolithic DLPI device drivers. The major changes are in the data paths,
which allow direct calls, packet chains and much finer-grained control
over the NIC.
Data-Link Driver (DLD)
The Data-Link Driver provides a DLPI using the interfaces provided by
the DLS and MAC modules. The driver is configured using IOCTLs passed
to a control node. These IOCTLs create and destroy separate DLPI
provider nodes. This module deals with DLPI messages necessary to
plumb/unplumb the NIC and provides the backward compatibility for data
path via STREAMs for non GLDv3 aware clients.
GLDv3 Link aggregation architecture
The GLDv3 framework provides support for Link Aggregation as defined by
IEEE 802.3ad. The key design principles while designing this facility
were:
- Allow GLDv3 MAC drivers to be aggregated without code change
- The performance of non-aggregated devices must be preserved
- The performance of aggregated devices should be cumulative of line
rate for each member, i.e. minimal overheads due to aggregation
- Support both manual configuration and Link Aggregation Control
Protocol (LACP)
GLDv3 link aggregation is implemented by means of a pseudo driver called
'aggr'. It registers virtual ports corresponding to link aggregation
groups with the GLDv3 MAC layer. It uses the client interface provided
by the MAC layer to control and communicate with aggregated MAC ports,
as illustrated below in Fig. 7. It also exports a pseudo 'aggr' device
driver which is used by the 'dladm' command to configure and control the
link aggregated interface. Once a MAC port is configured to be part of
a link aggregation group, it cannot be simultaneously accessed by other
MAC clients such as the DLS layer. The exclusive access is enforced
by the MAC layer. LACP is implemented by the 'aggr' driver, which has
access to individual MAC ports or links.

Fig. 7
The GLDv3 aggr driver acts as a normal MAC module to the upper layer and
appears as a standard NIC interface which, once created with 'dladm',
can be configured and managed by 'ifconfig'. The 'aggr' module
registers each MAC port which is part of the aggregation with the upper
layer using the 'mac_resource_add' function, such that the data paths
and interrupts from each MAC port can be independently managed by the
upper layers (see Section 8b). In short, the aggregated interface is
managed as a single interface with possibly one IP address, and the data
paths are managed as individual NICs by unique CPUs/squeues, providing
aggregation capability to Solaris with near zero overheads and linear
scalability with respect to the number of MAC ports that are part of the
aggregation group.
Checksum offload
Solaris 10 improved the H/W checksum offload capability further to
improve overall performance for most applications. A 16-bit one's
complement checksum offload framework has existed in Solaris for some
time. It was originally added as a requirement for Zero Copy TCP/IP in
Solaris 2.6 but was never extended until recently to handle other
protocols. Solaris defines two classes of checksum offload:
- Full - Complete checksum calculation in the hardware, including
pseudo-header checksum computation for TCP and UDP packets. The
hardware is assumed to have the ability to parse protocol headers.
- Partial - "Dumb" one's complement checksum based on start, end
and stuff offsets describing the span of the checksummed data and the
location of the transport checksum field, with no pseudo-header
calculation ability in the hardware.
Adding support for non-fragmented IPv4 cases (unicast or multicast) is
trivial for both transmit and receive, as most modern network adapters
support either class of checksum offload with minor differences in the
interface. The IPv6 cases are not as straightforward, because very few
full-checksum network adapters are capable of handling checksum
calculation for TCP/UDP packets over IPv6.
The fragmented IP cases have similar constraints. On transmit,
checksumming applies to the unfragmented datagram. In order for an
adapter to support checksum offload, it must be able to buffer all of
the IP fragments (or perform the fragmentation in hardware) before
finally calculating the checksum and sending the fragments over the
wire; until then, checksum offloading for outbound IP fragments cannot
be done. On the other hand, the receive fragment reassembly case is
more flexible since most full-checksum (and all partial-checksum)
network adapters are able to compute and provide the checksum value to
the network stack. During fragment reassembly stage, the network stack
can derive the checksum status of the unfragmented datagram by
combining the values altogether.
Things were simplified by not offloading the checksum when IP options
are present. For partial-checksum offload, certain adapters limit the
start offset to a width sufficient for simple IP packets. When the
length of the protocol headers exceeds this limit (due to the presence
of options), the start offset will wrap around, causing an incorrect
calculation. For full-checksum offload, none of the capable adapters is
able to correctly handle the IPv4 source routing option.
When transmit checksum offload takes place, the network stack will
associate eligible packets with ancillary information needed by the
driver to offload the checksum computation to hardware.
In the inbound case, the driver has full control over the packets that
get associated with hardware-calculated checksum values. Once a driver
advertises its capability via DL_CAPAB_HCKSUM, the network stack will
accept full and/or partial-checksum information for IPv4 and IPv6
packets. This process happens for both non-fragmented and fragmented
payloads.
Fragmented packets will first need to go through the reassembly process
because checksum validation happens for fully reassembled datagrams.
During reassembly, the network stack combines the hardware-calculated
checksum value of each fragment.
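Why per-fragment values can simply be combined follows from the arithmetic of the 16-bit one's complement sum, which can be checked with a few lines of C. This is a sketch under stated assumptions, not the stack's reassembly code: ocsum() and ocsum_add() are hypothetical helper names, the sums are left uncomplemented, and every fragment except the last is assumed to have an even length (IP fragment offsets are multiples of 8 bytes, so this holds).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 16-bit one's complement sum over buf, taken as big-endian 16-bit
 * words. An odd trailing byte is padded with a zero byte. The result
 * is the folded sum, not its complement. */
uint16_t ocsum(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    assert(len <= 65535);   /* keeps the 32-bit accumulator safe */
    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
    if (i < len)
        sum += (uint32_t)buf[i] << 8;
    while (sum >> 16)       /* fold the carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* One's complement addition of two partial sums. */
uint16_t ocsum_add(uint16_t a, uint16_t b)
{
    uint32_t s = (uint32_t)a + b;

    return (uint16_t)((s & 0xffff) + (s >> 16));
}
```

Splitting a buffer at an even offset, summing both parts with ocsum() and combining them with ocsum_add() gives the same value as summing the whole buffer in one pass. As a sanity check, the eight bytes 00 01 f2 03 f4 f5 f6 f7 used as the worked example in RFC 1071 sum to 0xddf2.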
'dladm' - New command for datalink administration
Over time, 'ifconfig' has become severely overloaded trying
to manage various layers in the stack. Solaris 10 introduced the 'dladm'
command to manage the data link services and ease the burden on
'ifconfig'. The dladm command operates on three kinds of objects:
- 'link' - Data-links, identified by a name
- 'aggr' - Aggregations of network devices, identified by a key
- 'dev' - Network devices, identified by concatenation of a driver
name and an instance number.
The key of an aggregation must be an integer value between 1 and 65535.
Some devices do not support configurable data-links or aggregations.
The fixed data-links provided by such devices can be viewed using dladm
but not configured.
The GLDv3 framework allows users to select the outbound load balancing
policy across the members of an aggregation while configuring the
aggregation. The policy specifies which dev object is used to send
packets. A policy consists of a list of one or more layer specifiers
separated by commas. A layer specifier is one of the following:
- L2 - Select outbound device according to source and destination
MAC addresses of the packet.
- L3 - Select outbound device according to source and destination
IP addresses of the packet.
- L4 - Select outbound device according to the upper layer protocol
information contained in the packet. For TCP and UDP, this includes
source and destination ports. For IPsec, this includes the SPI
(Security Parameters Index).
For example, to use upper layer protocol information, the following
policy can be used:
-P L4
To use the source and destination MAC addresses as well as the source
and destination IP addresses, the following policy can be used:
-P L2,L3
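One way to picture how a policy string turns into a choice of outbound device is the C sketch below. It is an illustration only: policy_parse(), select_port(), pkt_hdrs_t and the POL_* flags are hypothetical, and FNV-1a is used merely as a convenient stand-in hash, since the hash that the aggregation code really uses is not described here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical policy flags, one bit per layer specifier. */
enum { POL_L2 = 1, POL_L3 = 2, POL_L4 = 4 };

/* The header fields a policy may look at (IPv4, TCP/UDP only). */
typedef struct pkt_hdrs {
    uint8_t  src_mac[6], dst_mac[6];
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
} pkt_hdrs_t;

/* Parse a policy such as "L4" or "L2,L3" into a bit mask.
 * Returns 0 if the string is empty or malformed. */
unsigned policy_parse(const char *s)
{
    unsigned mask = 0;

    while (*s != '\0') {
        if (s[0] != 'L' || s[1] < '2' || s[1] > '4')
            return 0;
        mask |= 1u << (s[1] - '2');
        s += 2;
        if (*s == ',') {
            s++;
            if (*s == '\0')
                return 0;      /* trailing comma */
        } else if (*s != '\0') {
            return 0;
        }
    }
    return mask;
}

/* FNV-1a over a byte range, continuing from hash value h. */
static uint32_t fnv1a(uint32_t h, const void *p, size_t n)
{
    const uint8_t *b = p;

    for (size_t i = 0; i < n; i++) {
        h ^= b[i];
        h *= 16777619u;
    }
    return h;
}

/* Pick an outbound port index in [0, nports) from the selected fields. */
unsigned select_port(const pkt_hdrs_t *h, unsigned policy, unsigned nports)
{
    uint32_t hash = 2166136261u;

    assert(nports > 0);
    if (policy & POL_L2) {
        hash = fnv1a(hash, h->src_mac, sizeof (h->src_mac));
        hash = fnv1a(hash, h->dst_mac, sizeof (h->dst_mac));
    }
    if (policy & POL_L3) {
        hash = fnv1a(hash, &h->src_ip, sizeof (h->src_ip));
        hash = fnv1a(hash, &h->dst_ip, sizeof (h->dst_ip));
    }
    if (policy & POL_L4) {
        hash = fnv1a(hash, &h->src_port, sizeof (h->src_port));
        hash = fnv1a(hash, &h->dst_port, sizeof (h->dst_port));
    }
    return hash % nports;
}
```

Because the choice depends only on the selected fields, all packets of one flow leave through the same device, which keeps them in order. With an L3 policy, two connections between the same pair of hosts share a device; with L4 they may be spread across devices.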
The framework also supports the Link Aggregation Control Protocol (LACP)
for GLDv3-based aggregations, which can be controlled by 'dladm' via
the 'lacp-mode' and 'lacp-timer' sub commands. The 'lacp-mode'
can be set to 'off', 'active' or 'passive'.
When a new device is inserted into a system, a default non-VLAN
data-link will be created for the device during reconfiguration boot or
DR. The configuration of all objects will persist across reboots.
In future, 'dladm' and its private file where all persistent
information is stored ('/etc/datalink.conf') will be used to manage
device-specific parameters which are currently managed via 'ndd',
driver-specific configuration files and /etc/system.
7. Tuning for
performance:
The Solaris 10 stack is tuned to give stellar out of box performance
irrespective of the H/W used. The secret lies in techniques like
dynamically switching between interrupt and polling mode. When the load
is manageable, the NIC is allowed to interrupt per packet, which gives
very good latencies. When the load is very high, the stack switches to
polling mode for better throughput and well bounded latencies. The
defaults are also carefully picked based on H/W configuration. For
instance, the 'tcp_conn_hash_size' tunable was very conservative before
Solaris 10: the default value of 512 hash buckets was selected based on
the lowest supported configuration (in terms of memory). Solaris 10
looks at the free memory at boot time to choose the value for
'tcp_conn_hash_size'. Similarly, when a connection is 'reaped' from the
time wait state, the memory associated with the connection instance is
not freed instantly (again based on the total system memory available)
but is instead put on a 'free_list'. If new connections arrive within a
given period, TCP tries to reuse memory from the 'free_list'; otherwise
the 'free_list' is periodically cleaned up.
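The switch between interrupt and polling mode can be sketched as a
small state machine. This is a simplified model and not the Solaris
implementation: the 'RxRing' class, its method names and the backlog
threshold are hypothetical, chosen only to show the control flow.

```python
from collections import deque

class RxRing:
    """Toy model of a receive ring that flips between interrupt and polling mode."""

    def __init__(self, backlog_threshold=4):
        self.hw_queue = deque()   # packets sitting in the NIC ring
        self.polling = False      # False means the NIC may interrupt
        self.backlog_threshold = backlog_threshold
        self.delivered = []       # packets handed to the stack

    def receive(self, batch):
        """Packets land in the ring; the NIC interrupts only in interrupt mode."""
        self.hw_queue.extend(batch)
        if not self.polling:
            self._interrupt()

    def _interrupt(self):
        if len(self.hw_queue) > self.backlog_threshold:
            # Heavy load: turn interrupts off and let the stack pull packets.
            self.polling = True
        else:
            # Light load: deliver immediately for low latency.
            self._drain(len(self.hw_queue))

    def poll(self, budget):
        """Called by the stack when it is ready for up to `budget` packets."""
        self._drain(budget)
        if not self.hw_queue:
            self.polling = False  # backlog cleared, interrupts back on

    def _drain(self, count):
        for _ in range(min(count, len(self.hw_queue))):
            self.delivered.append(self.hw_queue.popleft())
```

Under light load every packet is delivered from the interrupt path.
Once a burst exceeds the threshold, arrivals only queue up and the
stack drains them in bounded batches via 'poll', which is what keeps
latency bounded while throughput stays high.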
In spite of these features, sometimes it's necessary to tweak some
tunables to deal with extreme cases or specific workloads. We discuss
some tunables below that control the stack behaviour. Care should be
taken to understand the impact, otherwise the system might become
unstable. It's important to note that for the bulk of the applications
and workloads, the defaults will give the best results.
8. Future
The future direction of the Solaris networking stack will continue to
build on better vertical integration between layers, which will improve
locality and performance further. With the advent of chip
multithreading and multi core CPUs, the number of parallel execution
pipelines will continue to increase even on low end systems. A typical
2 CPU machine today is dual core, providing 4 execution pipelines, and
is soon going to have hyperthreading as well.
The NICs are also becoming more advanced, offering multiple interrupts
via MSI-X, small classification capabilities, multiple DMA channels,
and various stateless offloads like large segment offload.
Future work will continue to leverage these H/W trends, including
support for TCP offload engines, Remote Direct Memory Access (RDMA),
and iSCSI. Some other specific things that are being worked on:
- Network stack virtualization - With the industry wide trend of
server consolidation and running multiple virtual machines on the same
physical instance, it's important that the Solaris stack can be
virtualized efficiently.
- B/W Resource control - The same trend that's driving network
virtualization is also driving the need to control the bandwidth usage
for various applications and virtual machines on the same box
efficiently.
- Support for high performance 3rd party modules - The current
Solaris 10 framework is still private to modules from Sun. STREAMS
based modules are the only option for the ISVs and they miss the full
potential of the new framework.
- Forwarding performance - Work is being done to further improve
the Solaris forwarding performance.
- Network security with performance - The world is becoming
increasingly complex and hostile. It's not possible to choose between
performance and security anymore. Both are a requirement. Solaris was
always very strong in security and Solaris 10 makes great strides in
enabling security without sacrificing performance. Focus will continue
on enhancing IPfilter performance and functionality and on a whole new
approach to detecting Denial of Service attacks and dealing with them.
9. Acknowledgments
Many thanks to Thirumalai Srinivasan, Adi Masputra, Nicolas Droux, and
Eric Cheng for contributing parts of this text. Thanks are also due to
all the members of the Solaris networking community for their help.
(2005-11-14 23:37:31.0)
Permalink