Nitin's Weblog

Monday Jan 12, 2009

Resource allocation for network processing using Crossbow features

As you may be aware, Crossbow provides features to allocate resources for network processing through the dladm interface. Administrators can now assign a set of CPUs to a link on the fly, so that traffic processing for that interface stays confined to those CPUs.

I am going to use a T5220 sun4v server with 64 strands for this example. When the machine boots, a default set of CPUs is allocated to the NIC using a simple round-robin scheme. To change the allocation to a specific set of CPUs, use the dladm interface as follows.

bash-3.2# dladm show-linkprop -p cpus e1000g1
LINK         PROPERTY        PERM VALUE          DEFAULT        POSSIBLE
e1000g1      cpus            rw   --             --             --
bash-3.2#

As seen above, no user-defined set of CPUs is assigned.

bash-3.2# echo ::interrupts | mdb -k

        Device   Shared  Type    MSG #   State   INO     Mondo    Pil    CPU
      e1000g#4      no   MSI        5    enbl    0x21    0x21       6      0
        nxge#0      no   MSI-X      9    disbl   0x20    0x20       6     63
        nxge#0      no   MSI-X      8    disbl   0x1f    0x1f       6     62
        nxge#4      no   MSI-X      7    disbl   0x1e    0x1e       6     61
        nxge#4      no   MSI-X      6    disbl   0x1d    0x1d       6     60
      e1000g#3      no   MSI        4    enbl    0x1c    0x1c       6      0
      e1000g#2      no   MSI        3    enbl    0x1b    0x1b       6      0
      e1000g#1      no   MSI        2    enbl    0x1a    0x1a       6     18

The interrupt for e1000g1 is currently associated with CPU 18.

Let's assign a set of CPUs to this NIC.

bash-3.2# dladm set-linkprop -p cpus=14,15,16,17,18,19,20,21 e1000g1
bash-3.2# dladm show-linkprop -p cpus e1000g1
LINK         PROPERTY        PERM VALUE          DEFAULT        POSSIBLE
e1000g1      cpus            rw  14,15,16,17,18,19,20,21 -- --

The above command also retargets the NIC's interrupt to the first CPU
specified in the list. Note that some platforms do not allow retargeting
a single interrupt, so retargeting won't work on such platforms.

bash-3.2# echo ::interrupts | mdb -k

        Device   Shared  Type    MSG #   State   INO     Mondo    Pil    CPU
      e1000g#4      no   MSI        5    enbl    0x21    0x21       6      0
        nxge#0      no   MSI-X      9    disbl   0x20    0x20       6     63
        nxge#0      no   MSI-X      8    disbl   0x1f    0x1f       6     62
        nxge#4      no   MSI-X      7    disbl   0x1e    0x1e       6     61
        nxge#4      no   MSI-X      6    disbl   0x1d    0x1d       6     60
      e1000g#3      no   MSI        4    enbl    0x1c    0x1c       6      0
      e1000g#2      no   MSI        3    enbl    0x1b    0x1b       6      0
      e1000g#1      no   MSI        2    enbl    0x1a    0x1a       6     14

All the supporting threads, including the interrupt, are now confined to the set
of CPUs assigned by the user. One can reset this static assignment using:

bash-3.2# dladm reset-linkprop -p cpus e1000g1
bash-3.2# dladm show-linkprop -p cpus e1000g1
LINK         PROPERTY        PERM VALUE          DEFAULT        POSSIBLE
e1000g1      cpus            rw   --             --             --

The above was a simple example of using Crossbow features to perform resource allocation for network processing.



Monday Mar 10, 2008

Crossbow on CMT


Traditional uniprocessors are unable to drive high-speed interfaces at link speed. Although uniprocessor performance has improved significantly, it is becoming increasingly impractical to push the limits of such processors. There is a large push towards chip-level multiprocessor architectures such as Niagara from Sun. The Crossbow network stack is designed to take advantage of such CMT architectures. It is based on assigning a group of connections to a set of dedicated threads, which can then process messages in parallel. Incoming packets are spread out to multiple rings, and worker threads operate on these rings to process the packets.

To demonstrate this point, here is a simple example of incoming packet fanout as described above. After injecting packets on a Niagara platform, the fanout thread for UDP spreads traffic across 8 softrings. As you can see from the mpstat output below, there are 8 threads processing incoming traffic. In addition, there is a poll thread bringing in packets from the interface and a fanout thread spreading packets out to the softrings.



CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
 0    0   0   77   203  104    0    0    0    0    0     0    0   1   0  99
 1    0   0    0 14246 14245    0    0    0 26447    0     0    0  63   0  37
 2    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
 3    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
 4    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
 5    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
 6    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
 7    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
 8    0   0    0     3    0    3    0    0    0    0     3    0   0   0 100
 9    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
10    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
11    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
12    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
13    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
14    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
15    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
16    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
17    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
18    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
19    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
20    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
21    0   0 4509  3474    0 6945    0    0 10262    0     0    0  43   0  57
22    0   0 3461   516    0 6922    0    0 32105    0     0    0 100   0   0
23    0   0    0  1630    0 3570    0    0 61742    0     0    0  92   0   8
24    0   0    0    14    0   34    0    0 45917    0     0    0 100   0   0
25    0   0    0    49    0  132    0    0 45610    0     0    0 100   0   0
26    0   0    0    82    0  221    0    0 45120    0     0    0 100   0   0
27    0   0    0   103    0  270    0    0 44182    0     0    0  99   0   1
28    0   0    0   813    0 1926    0    0 60564    0     0    0  96   0   4
29    0   0    0   691    0 1652    0    0 60248    0     0    0  97   0   3
30    0   0    0   611    0 1447    0    0 59956    0     0    0  98   0   2
31    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
32    1   0    0     2    0    1    0    0    0    0   203    1   0   0  99
33    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
34    0   0    0     2    1    0    0    0    0    0     0    0   0   0 100
35    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
36    0   0    0     2    0    2    0    0    0    0     0    0   1   0  99
37    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
38    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
39    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
40    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
41    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
42    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
43    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
44    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
45    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
46    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
47    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
48    0   0   42    72    0  143    0    0    0    0     0    0   0   0 100
49    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
50    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
51    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
52    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
53    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
54    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
55    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
56    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
57    0   0    0     2    0    2    0    0    0    0     0    0   0   0 100
58    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
59    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
60    0   0    0     2    0    3    0    0    0    0     1    0   0   0 100
61    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
62    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
63    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
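
The fanout step described above (one thread spreading incoming packets across 8 softrings) can be illustrated with a small sketch. This is not the Crossbow code: the flow_key_t type, the hash function, and the function names are hypothetical, chosen only to show the idea that packets of the same flow always land on the same softring while different flows spread across all of them.

```c
#include <stdint.h>

#define	NUM_SOFTRINGS	8	/* matches the 8 softrings in the example */

/* Hypothetical flow key: fields a fanout hash might look at. */
typedef struct flow_key {
	uint32_t	src_ip;
	uint32_t	dst_ip;
	uint16_t	src_port;
	uint16_t	dst_port;
} flow_key_t;

/* Mix the flow fields into one 32-bit value (illustrative hash only). */
static uint32_t
flow_hash(const flow_key_t *k)
{
	uint32_t h = k->src_ip ^ k->dst_ip;

	h ^= ((uint32_t)k->src_port << 16) | k->dst_port;
	h ^= h >> 16;
	h *= 0x45d9f3bU;
	h ^= h >> 16;
	return (h);
}

/* Pick a softring index; the same flow always maps to the same ring. */
static unsigned int
softring_for_flow(const flow_key_t *k)
{
	return (flow_hash(k) % NUM_SOFTRINGS);
}
```

A fanout thread would call softring_for_flow() for each packet and queue it on that ring, where a dedicated worker thread picks it up. Keeping a flow on one ring preserves packet order within the flow.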



The interrupt thread signals the poll thread on incoming packets. The poll thread disables the interrupt and tries to bring in as many packets as it can. This interweaving between poll and interrupt continues throughout system operation. This is a simple illustration of Crossbow utilising the CMT architecture to drive network load.
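
The hand-off between interrupt and poll can be modelled as a tiny state machine. The sketch below is only a toy model under my own assumptions: rx_ring_t, rx_interrupt() and rx_poll() are hypothetical names, the ring is reduced to a packet counter, and re-enabling the interrupt once the ring is drained is an assumption about how the cycle restarts.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical receive ring, reduced to a counter and two flags. */
typedef struct rx_ring {
	size_t	pending;	/* packets waiting in the ring */
	bool	intr_enabled;	/* ring may raise interrupts */
	bool	poll_signaled;	/* interrupt has woken the poll thread */
} rx_ring_t;

/* Interrupt path: only signal the poll thread, do no packet work. */
static void
rx_interrupt(rx_ring_t *r)
{
	if (r->intr_enabled && r->pending > 0)
		r->poll_signaled = true;
}

/*
 * Poll path: disable the interrupt and pull up to 'budget' packets.
 * Assumption: the interrupt is re-enabled once the ring is empty.
 */
static size_t
rx_poll(rx_ring_t *r, size_t budget)
{
	size_t n;

	if (!r->poll_signaled)
		return (0);

	r->intr_enabled = false;
	n = (r->pending < budget) ? r->pending : budget;
	r->pending -= n;

	if (r->pending == 0) {
		r->poll_signaled = false;
		r->intr_enabled = true;
	}
	return (n);
}
```

Under load the ring rarely drains, so the model stays in polling mode and takes few interrupts; when traffic is light it falls back to interrupts.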

Tuesday Jun 14, 2005

IPoIB on OpenSolaris

Shiny new OpenSolaris is here! It is undoubtedly very exciting to share our efforts with everyone. Here are some code internals and information about configuring an IPoIB node over InfiniBand.

IPoIB is a DLPI-based OpenSolaris driver. It supports unreliable datagram (UD) mode over IB. The IPoIB MAC address is 20 bytes long. The address comprises a GID and a QPN; the GID is relatively stable, while the QPN can change across boots. During initialization, the IPoIB driver tries to join the broadcast group. The Subnet Manager is expected to create this group before the join operation takes place.
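
To make the 20-byte address concrete, here is a minimal sketch that splits one into its QPN and GID parts. It assumes the layout from RFC 4391: one reserved/flags byte, a 24-bit QPN, then the 16-byte GID. The ipoib_addr_t type and the function names are hypothetical and are not taken from the ibd driver.

```c
#include <stdint.h>

#define	IPOIB_ADDRL	20	/* length of an IPoIB link-layer address */

/* Hypothetical decoded form of the 20-byte address. */
typedef struct ipoib_addr {
	uint32_t	qpn;		/* 24-bit queue pair number */
	uint64_t	gid_prefix;	/* upper 64 bits of the GID */
	uint64_t	gid_guid;	/* lower 64 bits of the GID */
} ipoib_addr_t;

/* Read 8 bytes in network (big-endian) order. */
static uint64_t
be64(const uint8_t *p)
{
	uint64_t v = 0;
	int i;

	for (i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return (v);
}

/* Byte 0 is reserved/flags, bytes 1-3 the QPN, bytes 4-19 the GID. */
static ipoib_addr_t
ipoib_parse(const uint8_t mac[IPOIB_ADDRL])
{
	ipoib_addr_t a;

	a.qpn = ((uint32_t)mac[1] << 16) | ((uint32_t)mac[2] << 8) | mac[3];
	a.gid_prefix = be64(&mac[4]);
	a.gid_guid = be64(&mac[12]);
	return (a);
}
```

Fed the ipib address from the ifconfig output later in this post (0:0:4:9:0:0:0:0:0:0:12:34:0:2:c9:1:8:fe:cc:62), this yields QPN 0x409, GID prefix 0x1234 and GUID 0x0002c90108fecc62.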

Here is an example of how to configure an IPoIB node. IPoIB supports most of the common Subnet Managers. Make sure the driver is present and has been added with add_drv.

bash-3.00# modinfo | grep ibd
 79 7bfd8000   8ae4   -   1  ibdm (InfiniBand Device Manager 1.18)
165 7bf66000   7070 232   1  ibd (InfiniBand DLPI Driver 1.17)
bash-3.00# cat /etc/path_to_inst | grep ibd
"/pci@8,600000/pci@1/pci15b3,5a44@0/ibport@1,7fff,ipib" 0 "ibd"
"/pci@8,600000/pci@1/pci15b3,5a44@0/ibport@1,8001,ipib" 1 "ibd"
"/pci@8,600000/pci@1/pci15b3,5a44@0/ibport@2,7fff,ipib" 2 "ibd"
"/pci@8,600000/pci@1/pci15b3,5a44@0/ibport@2,8001,ipib" 3 "ibd"

Partition key 8001 is a valid user-level pkey, which we can use to plumb the ibd interface. In the code, the ibd driver's init routine queries the broadcast group. On success it records the MTU size and then joins the group. The code is shown below (only the query part):


static ibt_status_t
ibd_find_bgroup(ibd_state_t *state)
{
	ibt_mcg_attr_t mcg_attr;
	uint_t numg;
	uchar_t scopes[] = { IB_MC_SCOPE_SUBNET_LOCAL,
	    IB_MC_SCOPE_SITE_LOCAL, IB_MC_SCOPE_ORG_LOCAL,
	    IB_MC_SCOPE_GLOBAL };
	int i, mcgmtu;
	boolean_t found = B_FALSE;

	bzero(&mcg_attr, sizeof (ibt_mcg_attr_t));
	mcg_attr.mc_pkey = state->id_pkey;
	state->id_mgid.gid_guid = IB_MCGID_IPV4_LOW_GROUP_MASK;

	for (i = 0; i < sizeof (scopes)/sizeof (scopes[0]); i++) {
		state->id_scope = mcg_attr.mc_scope = scopes[i];

		/*
		 * Look for the IPoIB broadcast group.
		 */
		state->id_mgid.gid_prefix =
		    (((uint64_t)IB_MCGID_IPV4_PREFIX << 32) |
		    ((uint64_t)state->id_scope << 48) |
		    ((uint32_t)(state->id_pkey << 16)));
		mcg_attr.mc_mgid = state->id_mgid;
		if (ibt_query_mcg(state->id_sgid, &mcg_attr, 1,
		    &state->id_mcinfo, &numg) == IBT_SUCCESS) {
			found = B_TRUE;
			break;
		}

	}

	if (!found) {
		ibd_print_warn(state, "IPoIB broadcast group absent");
		return (IBT_FAILURE);
	}

	/*
	 * Assert that the mcg mtu <= id_mtu. Fill in updated id_mtu.
	 */
	mcgmtu = (128 << state->id_mcinfo->mc_mtu);
	if (state->id_mtu < mcgmtu) {
		ibd_print_warn(state, "IPoIB broadcast group MTU %d "
		    "greater than port's maximum MTU %d", mcgmtu,
		    state->id_mtu);
		return (IBT_FAILURE);
	}
	state->id_mtu = mcgmtu;

	return (IBT_SUCCESS);
}
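
The two pieces of arithmetic in the routine above, building the broadcast GID prefix and decoding the MTU, can be tried in user space. The sketch below is standalone and makes some assumptions: the prefix constant 0xFF10401B comes from the RFC 4391 broadcast GID layout rather than from the driver headers, the 4-byte header length is the IPoIB encapsulation header from the same RFC, and all function names are hypothetical.

```c
#include <stdint.h>

/* Assumed values, from the RFC 4391 layout, not from the driver headers. */
#define	MCGID_IPV4_PREFIX	0xFF10401BU	/* ff1x:401b, scope bits zero */
#define	IPOIB_HDR_LEN		4		/* IPoIB encapsulation header */

/* Upper 64 bits of the broadcast GID for a given scope and pkey. */
static uint64_t
bgroup_gid_prefix(uint8_t scope, uint16_t pkey)
{
	return (((uint64_t)MCGID_IPV4_PREFIX << 32) |
	    ((uint64_t)scope << 48) |
	    ((uint64_t)pkey << 16));
}

/* IB encodes the MTU as a small code: 1 = 256 bytes ... 5 = 4096 bytes. */
static uint32_t
ib_mtu_bytes(uint8_t mtu_code)
{
	return ((uint32_t)128 << mtu_code);
}

/* Payload left for IP once the encapsulation header is subtracted. */
static uint32_t
ipoib_link_mtu(uint8_t mtu_code)
{
	return (ib_mtu_bytes(mtu_code) - IPOIB_HDR_LEN);
}
```

With scope 0x2 (link-local) and pkey 0x8001, the prefix comes out as 0xFF12401B80010000, which lines up with the ff:12:40:1b:80:01:00:00 bytes in the arp output further down. An MTU code of 4 gives 2048 bytes, and 2048 minus the 4-byte header is consistent with the mtu 2044 that ifconfig reports.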

We now plumb the interface, assign an IP address to it, and start using it.

bash-3.00# ifconfig ibd3 plumb
bash-3.00# ifconfig ibd3 192.168.1.1 up
bash-3.00# ifconfig ibd3
ibd3: flags=1000843 mtu 2044 index 3
        inet 192.168.1.1 netmask ffffff00 broadcast 192.168.1.255
        ipib 0:0:4:9:0:0:0:0:0:0:12:34:0:2:c9:1:8:fe:cc:62
bash-3.00# arp -a
Net to Media Table: IPv4
Device   IP Address               Mask      Flags   Phys Addr
------ -------------------- --------------- ----- ---------------
ibd3   224.0.0.2            255.255.255.255       00:ff:ff:ff:ff:12:40:1b:80:01:00:00:00:00:00:00:00:00:00:02 
bash-3.00# ping -i ibd3 sun.com
sun.com is alive



Tuesday Oct 19, 2004

I am an engineer in the NSG organization working on various IB projects, including:
  • IPoIB
  • SDP
  • OpenIB efforts
  • Network Performance