Steffen
Thursday Aug 20, 2009
My test system is running build 05 of the upcoming Solaris 10 10/09 (update 8). The system has four bge interfaces, and I will use numbers 1 and 2. (This should work just as well with previous updates of Solaris 10, and with Sun Trunking in Solaris 9, except for the zones parts. I am using zones just to isolate my traffic generation and easily get it to use a specific data link.)
Starting out, things look like this.
global# dladm show-dev
bge0 link: up speed: 1000 Mbps duplex: full
bge1 link: unknown speed: 0 Mbps duplex: unknown
bge2 link: unknown speed: 0 Mbps duplex: unknown
bge3 link: unknown speed: 0 Mbps duplex: unknown
global# dladm show-link
bge0 type: non-vlan mtu: 1500 device: bge0
bge1 type: non-vlan mtu: 1500 device: bge1
bge2 type: non-vlan mtu: 1500 device: bge2
bge3 type: non-vlan mtu: 1500 device: bge3
global# ifconfig -a4
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
bge0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
inet 129.154.53.125 netmask ffffff00 broadcast 129.154.53.255
ether 0:3:ba:e3:42:8b
I have my switch set up to aggregate ports 1 and 2, and here is how I do it with Solaris 10.
global# dladm create-aggr -d bge1 -d bge2 1
global# dladm show-link
bge0 type: non-vlan mtu: 1500 device: bge0
bge1 type: non-vlan mtu: 1500 device: bge1
bge2 type: non-vlan mtu: 1500 device: bge2
bge3 type: non-vlan mtu: 1500 device: bge3
aggr1 type: non-vlan mtu: 1500 aggregation: key 1
VLAN tagged interfaces are used by accessing the underlying data link with the VLAN tag preceding the data link ID. For bge1 and VLAN 111 that would be bge111001. For aggr1 it would be aggr111001.
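The naming rule above amounts to simple arithmetic: the number in the link name is the VLAN ID times 1000 plus the instance (or aggregation key). Here is a minimal sketch in Python; the function name `vlan_link_name` and its range checks are my own illustration, not part of any Solaris tool.

```python
def vlan_link_name(driver: str, instance: int, vlan_id: int) -> str:
    """Build a Solaris 10 style VLAN link name: driver + (VLAN ID * 1000 + instance)."""
    # Range checks are assumptions for this sketch: 802.1Q VLAN IDs run 1..4094,
    # and the instance has to fit in the three low-order digits.
    if not 1 <= vlan_id <= 4094:
        raise ValueError("VLAN ID must be between 1 and 4094")
    if not 0 <= instance <= 999:
        raise ValueError("instance must be between 0 and 999")
    return f"{driver}{vlan_id * 1000 + instance}"


print(vlan_link_name("bge", 1, 111))   # bge111001
print(vlan_link_name("aggr", 1, 111))  # aggr111001
print(vlan_link_name("aggr", 1, 112))  # aggr112001
```

The three names printed are the ones that appear in this entry.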
For this setup I am using zones zone111 and zone112, each configured with an exclusive IP Instance. The zone configuration looks like this.
global# zonecfg -z zone111 info
zonename: zone111
zonepath: /zones/zone111
brand: native
autoboot: false
bootargs:
pool:
limitpriv:
scheduling-class:
ip-type: exclusive
inherit-pkg-dir:
dir: /lib
inherit-pkg-dir:
dir: /platform
inherit-pkg-dir:
dir: /sbin
inherit-pkg-dir:
dir: /usr
net:
address not specified
physical: aggr111001
defrouter not specified
Once configured, installed, and booted, the network configuration of zone111 is:
global# zlogin zone111 ifconfig -a4
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
aggr111001: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 2
inet 172.16.111.141 netmask ffffff00 broadcast 172.16.111.255
ether 0:3:ba:e3:42:8c
Turns out that configuring this was easy compared to showing that the link aggregation was really working. While the full list of links known when the zones are running includes the aggregation and the VLANs on the aggregation, tools such as netstat or nicstat would not include them. As it turns out they only report on interfaces that are plumbed up in that IP Instance. It will not be possible to plumb either bge1 or bge2 since they are members of the aggregation.
global# dladm show-link
bge0 type: non-vlan mtu: 1500 device: bge0
bge1 type: non-vlan mtu: 1500 device: bge1
bge2 type: non-vlan mtu: 1500 device: bge2
bge3 type: non-vlan mtu: 1500 device: bge3
aggr1 type: non-vlan mtu: 1500 aggregation: key 1
aggr111001 type: vlan 111 mtu: 1500 aggregation: key 1
aggr112001 type: vlan 112 mtu: 1500 aggregation: key 1
global# netstat -i
Name Mtu Net/Dest Address Ipkts Ierrs Opkts Oerrs Collis Queue
lo0 8232 loopback localhost 98 0 98 0 0 0
bge0 1500 pinebarren pinebarren 43101 0 7181 0 0 0
So I ended up using kstat(1M) to get the number of outbound packets. I am interested in outbound as that is what Solaris can affect regarding distributing traffic across links in an aggregation--the switch determines that for inbound traffic.
This example shows data on instance 2 of the bge interface for kstat value opackets.
global# kstat -m bge -i 2 -s opackets
module: bge instance: 2
name: mac class: net
opackets 2542
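To compare the two links, the counter can be read before and after a test and the differences turned into a share per link. The sketch below parses text in the layout shown above. The helper names and the before/after counter values in the usage lines are made up for illustration; only the 2542 sample comes from my system.

```python
def parse_kstat_stat(text: str, statistic: str) -> int:
    """Pull one named statistic out of kstat output in the layout shown above."""
    for line in text.splitlines():
        fields = line.split()
        # Statistic lines have exactly two fields: name and value.
        if len(fields) == 2 and fields[0] == statistic:
            return int(fields[1])
    raise KeyError(statistic)


def outbound_share(before: dict, after: dict) -> dict:
    """Fraction of the new outbound packets that each link carried."""
    deltas = {link: after[link] - before[link] for link in before}
    total = sum(deltas.values())
    return {link: (delta / total if total else 0.0) for link, delta in deltas.items()}


sample = """module: bge instance: 2
name: mac class: net
opackets 2542
"""
print(parse_kstat_stat(sample, "opackets"))

# Hypothetical counter readings taken before and after a transfer.
before = {"bge1": 1000, "bge2": 2542}
after = {"bge1": 1300, "bge2": 2642}
print(outbound_share(before, after))
```

A share close to 1.0 on one link after a single scp is expected, since one connection hashes to one link; the spread only shows up across several connections.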
With kstat I can see that for different connections either bge1 or bge2 has packets going out on it. A good test for me was scp to a remote system. Neither ping nor traceroute caused the necessary hashing to use both links in the aggregation.
Steffen
Wednesday Jun 17, 2009
Separately, the UltraSPARC® T2 processor in the T-series (CMT) has built-in cryptographic processors (one per core, or typically eight per socket) that accelerate secure one-way hashes, public key session establishment, and private key bulk data transfers. The latter is useful for long standing connections and for larger data operations, such as a file transfer.
Prior to Solaris 10 5/09, an scp or sftp file transfer operation had the encryption and decryption done by the CPU. While usually this is not a big deal, as most CPUs do private key crypto reasonably fast, on the CMT systems these operations are relatively slow. Now with SunSSH With OpenSSL PKCS#11 Engine Support in 5/09, the SunSSH server and client will use the cryptographic framework when an UltraSPARC® T2 processor n2cp cryptographic unit is available.
To demonstrate this, I used a T5120 with Logical Domains (LDoms) 1.1 configured running Solaris 10 5/09. Using LDoms helps, as I can assign or remove crypto units on a per-LDom basis. (Since the crypto units are not supported yet with dynamic reconfiguration, a reboot of the LDom instance is required. However, in general, I don't see making that kind of change very often.)
I did all the work in the 'primary' control and service LDom, where I have direct access to the network devices, and can see the LDom configuration. I am listing parts of it here, although this is about Solaris, SunSSH, and the crypto hardware.
medford# ldm list-bindings primary
NAME STATE FLAGS CONS VCPU MEMORY UTIL UPTIME
primary active -n-cv- SP 16 8G 0.1% 22h 16m
MAC
00:14:4f:ac:57:c4
HOSTID
0x84ac57c4
VCPU
VID PID UTIL STRAND
0 0 0.6% 100%
1 1 1.9% 100%
2 2 0.0% 100%
3 3 0.0% 100%
4 4 0.0% 100%
5 5 0.1% 100%
6 6 0.0% 100%
7 7 0.0% 100%
8 8 0.7% 100%
9 9 0.1% 100%
10 10 0.0% 100%
11 11 0.0% 100%
12 12 0.0% 100%
13 13 0.0% 100%
14 14 0.0% 100%
15 15 0.0% 100%
MAU
ID CPUSET
0 (0, 1, 2, 3, 4, 5, 6, 7)
1 (8, 9, 10, 11, 12, 13, 14, 15)
MEMORY
RA PA SIZE
0x8000000 0x8000000 8G
The 'system' has 16 CPUs (hardware strands), two MAUs (those are the crypto units), and 8 GB of memory. I am using e1000g0 for the network and the remote system is a V210 running Solaris Express Community Edition snv_113 SPARC (OK, I am a little behind). The network is 1 GbE.
The command I run is
source# /usr/bin/time scp -i /.ssh/destination /large-file destination:/tmp
source# du -h /large-file
1.3G /large-file
My results with the crypto units were
real 1:13.6
user 32.2
sys 34.5
while without the crypto units
real 2:28.2
user 2:10.9
sys 26.8
The transfer took one half the time and considerably less CPU processing with the crypto units in place (I have two although I think it is using only one since this is a single transfer).
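To put numbers on "one half the time" and "considerably less CPU", the two sets of time(1) values can be converted to seconds and compared. This is a small sketch; the helper names are my own, and the values are the ones reported above.

```python
def to_seconds(stamp: str) -> float:
    """Convert a time(1) value such as '1:13.6' or '32.2' to seconds."""
    seconds = 0.0
    for part in stamp.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def cpu_seconds(run: dict) -> float:
    """User plus system CPU time for one run."""
    return to_seconds(run["user"]) + to_seconds(run["sys"])


with_crypto = {"real": "1:13.6", "user": "32.2", "sys": "34.5"}
without_crypto = {"real": "2:28.2", "user": "2:10.9", "sys": "26.8"}

elapsed_ratio = to_seconds(with_crypto["real"]) / to_seconds(without_crypto["real"])
cpu_ratio = cpu_seconds(with_crypto) / cpu_seconds(without_crypto)

print(f"elapsed time: {elapsed_ratio:.2f} of the run without crypto units")
print(f"CPU (user+sys): {cpu_ratio:.2f} of the run without crypto units")
```

With these figures the elapsed time comes out at roughly 0.50 and the combined CPU time at roughly 0.42 of the run without the crypto units. Most of the saving is in user time, which is where the encryption work was being done.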
So, SunSSH benefits from the built-in cryptographic hardware in the UltraSPARC® T2 processor!
Steffen
Monday Jun 01, 2009
Network virtualization joins a number of other features already in OpenSolaris, such as vanity naming (allowing custom names for data links), snooping on loopback for better observability, a re-architected IPMP with an administrative interface, and Network Automagic (NWAM--automatic configuration of desktop networking based on available wired and wireless network services).
Congratulations to everyone who made all this possible!
Steffen
PS: Regarding the fully supported, please notice the new support prices and durations!
Thursday Apr 16, 2009
This tool was instrumental in increasing my understanding of what was going on with the customer's system, and removed the need to wait for output via emails or just trying to understand things over the phone. The owner of the session, usually the customer, has the option of allowing you to enter commands (without hitting 'Return'), or even allowing the 'Return' as well. It also has logging and chatting capabilities.
When first logging in, it allows you to be the owner of the shell and share that with other participants, or to view someone else's shell session.
Once logged in, you have a terminal window, the people present on the connection, and a chat window. The icon before the name/email address shows whether you have view, type, or full control (the keyboard will also have a down-arrow with it).
Oh, and I forgot about the feature to scribble on the screen. I used that to diagram out an idea I had to solve a zone networking issue, and it helped the others understand what I was proposing a lot quicker!
In the spirit of 'asking for what you want instead of complaining about what you don't have', I submitted a few suggestions, and the owner(s) quickly responded with clarifications.
I see this as a great tool to help future cases where a shared view of operations will improve understanding or service delivery! Thanks to those who created and maintain it!
Steffen
He had been contacted to help a customer who was ready to deploy a web application, and they were experiencing intermittent lack of connection to the web site. Interestingly, they were also using zones, a bunch of them (OK, a handful)--and so right up my alley.
The customer was running a multi-tiered web application on an x4600 (so Solaris on x86 as well!), with the web server, web router, and application tiers in different zones. They were using shared IP Instances, so all the network configuration was being done in the global zone.
Initially, we had to modify some configuration parameters, especially regarding default routes. Since the system was installed with Solaris 10 5/08 and had more recent patches, we could use the defrouter feature introduced in 10/08 to make setting up routes for the non-global zones a little easier. This was needed because the global zone was using only one NIC, and it was not going to be on the networks that the non-global zones were on.
What made the configuration a little unique was that the web server needs a default router to the Internet, while the application server needs a route to other systems behind a different router. Individually, everything is fine. However, the web1 zone also needs to be on the network that the application and web router are on, so it ends up having two interfaces.
Let's look at web1 when only it is running.
web1# ifconfig -a4
lo0:1: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
bge1:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
inet 172.16.1.41 netmask ffffff00 broadcast 172.16.1.255
bge2:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 4
inet 192.168.51.41 netmask ffffff00 broadcast 192.168.51.255
web1# netstat -rn
Routing Table: IPv4
Destination Gateway Flags Ref Use Interface
-------------------- -------------------- ----- ----- ---------- ---------
default 172.16.1.1 UG 1 0 bge1
172.16.1.0 172.16.1.41 U 1 0 bge1:1
192.168.51.0 192.168.51.41 U 1 0 bge2:1
224.0.0.0 172.16.1.41 U 1 0 bge1:1
127.0.0.1 127.0.0.1 UH 5 34 lo0:1
The zone is on two interfaces, bge1 and bge2, and has a default route that uses bge1. However, when zone app1 is running, there is a second default route, on bge2. The same is true if app2 or odr are running. Note that these three zones are only on bge2.
app1# ifconfig -a4
lo0:1: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
bge2:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 4
inet 192.168.51.43 netmask ffffff00 broadcast 192.168.51.255
app1# netstat -rn
Routing Table: IPv4
Destination Gateway Flags Ref Use Interface
-------------------- -------------------- ----- ----- ---------- ---------
default 192.168.51.1 UG 1 0 bge2
192.168.51.0 192.168.51.43 U 1 0 bge2:1
224.0.0.0 192.168.51.43 U 1 0 bge2:1
127.0.0.1 127.0.0.1 UH 3 51 lo0:1
In the meantime, this is what happens in web1.
web1# netstat -rn
Routing Table: IPv4
Destination Gateway Flags Ref Use Interface
-------------------- -------------------- ----- ----- ---------- ---------
default 192.168.51.1 UG 1 0 bge2
default 172.16.1.1 UG 1 0 bge1
172.16.1.0 172.16.1.41 U 1 0 bge1:1
192.168.51.0 192.168.51.41 U 1 0 bge2:4
224.0.0.0 172.16.1.41 U 1 0 bge1:1
127.0.0.1 127.0.0.1 UH 6 132 lo0:4
With any of the other zones running, web1 now has two default routes. And it only happens in web1, as it is the only zone that has both its public-facing data link (bge1) and the shared data link (bge2).
Traffic to any system on either the 192.168.51.0 or 172.16.1.0 network will have no issues. Every time IP needs to determine a new path for a system not on either of those two networks, it will pick a route, and it will round-robin between the two default routes. Thus approximately half the time, connections will fail to establish, or possibly existing connections will not work if they have been idle for a while.
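The "approximately half the time" can be illustrated with a toy model. This sketch assumes strict alternation between the two default routes, which is a simplification of what IP actually does, and it assumes that only the 172.16.1.1 router can reach the destinations in question. The function name `simulate` is hypothetical.

```python
from itertools import cycle


def simulate(default_routes, working_gateway, new_destinations):
    """Count new off-link destinations that get the gateway able to reach them.

    Assumes each new destination simply takes the next default route in turn.
    """
    chooser = cycle(default_routes)
    picks = [next(chooser) for _ in range(new_destinations)]
    reachable = sum(1 for gateway in picks if gateway == working_gateway)
    return reachable, new_destinations - reachable


# The two default routes seen in web1 while another zone is running.
routes = ["192.168.51.1", "172.16.1.1"]
reachable, unreachable = simulate(routes, "172.16.1.1", 10)
print(f"reachable: {reachable}, unreachable: {unreachable}")
```

With a single default route in the list, every destination is reachable, which is the state the customer wanted for web1.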
This is how IP is supposed to work, so there is technically nothing wrong. It is a feature of zones and a shared IP Instance. [2009.06.23: For background on why IP works this way, see James' blog].
The only problem is that this is not what the customer wants!
One option would be to force all traffic between the web and application tier out the bge1 interface, putting it on the wire. This may not be desirable for security reasons, and introduces latencies since traffic now goes on the wire. Another option would be to use exclusive IP Instances for the web servers. For each web zone, and this example only has one, it would require two additional data links (NICs). That would add up. Also, this configuration is targeted to be used with Solaris Cluster's scalable services, and those must be in shared IP Instance zones. Hummm....as I like to say.
We didn't know about the shared IP Instance restriction of Solaris Cluster, and as the customer was considering how they were going to add additional NICs to all the systems, something slowly developed in my mind. How about creating a shared, dummy network between the web and application tier? They had one spare NIC, and with shared IP it does not even need to be connected to a switch port, since IP will loop all traffic back anyway!
The more I thought about it, the more I liked it, and I could not see anything wrong with it. At least not technically as I understood Solaris. Operationally, for the customer, it might be a little awkward.
Here is what I was thinking of...
With this configuration the web1 zone has a default router only to the Internet and it can reach odr, and if necessary, app1 and app2, directly via the new network. And app1 and app2 only have a single default route to get to the Intranet. The nice thing is that bge3 does not even need to be up. That is visible with ifconfig output, where bge3 is not showing a RUNNING flag, which indicates the port is not connected (or in my case has been disabled on the switch).
global# ifconfig -a4
...
bge0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
inet 129.154.53.125 netmask ffffff00 broadcast 129.154.53.255
ether 0:3:ba:e3:42:8b
bge1: flags=1000842<BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
inet 0.0.0.0 netmask 0
ether 0:3:ba:e3:42:8c
bge2: flags=1000842<BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 4
inet 0.0.0.0 netmask 0
ether 0:3:ba:e3:42:8d
bge3: flags=1000802<BROADCAST,MULTICAST,IPv4> mtu 1500 index 5
inet 0.0.0.0 netmask 0
ether 0:3:ba:e3:42:8e
...
And within web1 there is now only one default route.
web1# netstat -rn
Routing Table: IPv4
Destination Gateway Flags Ref Use Interface
-------------------- -------------------- ----- ----- ---------- ---------
default 172.16.1.1 UG 1 17 bge1
172.16.1.0 172.16.1.41 U 1 2 bge1:1
192.168.52.0 192.168.52.41 U 1 2 bge3:1
224.0.0.0 172.16.1.41 U 1 0 bge1:1
127.0.0.1 127.0.0.1 UH 4 120 lo0:1
In the customer's case, multiple systems were being used, so the private networks were connected together so that a web zone on one system could access an odr zone on another. I am showing the simple, single system case since it is so convenient.
If I were using Solaris Express Community Edition (SX-CE) or OpenSolaris 2009.06 Developer Builds, with the Crossbow bits and virtual NICs (VNICs) available, I wouldn't even have needed to use that physical interface. Both are available here.
I hope this trick might help others out in the future.
Steffen
Tuesday Apr 14, 2009
Having one IP address (whether a public or a private, non routable) per data link and also the separate address(es) for the application(s) turns out to be a lot of addresses to allocate and administer. And since the default of five probes spaced two seconds apart meant a failure would take at least ten (10) seconds to be detected, something more was needed.
So in the Solaris 9 timeframe the ability to also do link based failure detection was delivered. It requires specific NICs whose driver has the ability to notify the system that a link has failed. The Introduction to IPMP in the Solaris 10 Systems Administrators Guide on IP Services lists the NICs that support link state notification. Solaris 10 supports configuring IPMP with only link based failure detection.
global# more /etc/hostname.bge[12]
::::::::::::::
/etc/hostname.bge1
::::::::::::::
10.1.14.140/26 group ipmp1 up
::::::::::::::
/etc/hostname.bge2
::::::::::::::
group ipmp1 standby up
On system boot, there will be an indication on the console that since no test addresses are defined, probe-based failure detection is disabled.
Apr 10 10:57:20 in.mpathd[168]: No test address configured on interface bge2; disabling probe-based failure detection on it
Apr 10 10:57:20 in.mpathd[168]: No test address configured on interface bge1; disabling probe-based failure detection on it
Looking at the interfaces configured,
global# ifconfig -a4
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
bge0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
inet 129.154.53.125 netmask ffffff00 broadcast 129.154.53.255
ether 0:3:ba:e3:42:8b
bge1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
inet 10.1.14.140 netmask ffffffc0 broadcast 10.1.14.191
groupname ipmp1
ether 0:3:ba:e3:42:8c
bge1:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
bge2: flags=69000842<BROADCAST,RUNNING,MULTICAST,IPv4,NOFAILOVER,STANDBY,INACTIVE> mtu 0 index 4
inet 0.0.0.0 netmask 0
groupname ipmp1
ether 0:3:ba:e3:42:8d
you will notice that two of the three interfaces have no address (0.0.0.0). Also, the data address is on a physical interface on bge1. At the same time bge2 has the 0.0.0.0 address. On the failure of bge1,
Apr 10 14:34:53 global bge: NOTICE: bge1: link down
Apr 10 14:34:53 global in.mpathd[168]: The link has gone down on bge1
Apr 10 14:34:53 global in.mpathd[168]: NIC failure detected on bge1 of group ipmp1
Apr 10 14:34:53 global in.mpathd[168]: Successfully failed over from NIC bge1 to NIC bge2
global# ifconfig -a4
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
bge0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
inet 129.154.53.125 netmask ffffff00 broadcast 129.154.53.255
ether 0:3:ba:e3:42:8b
bge1: flags=19000802<BROADCAST,MULTICAST,IPv4,NOFAILOVER,FAILED> mtu 0 index 3
inet 0.0.0.0 netmask 0
groupname ipmp1
ether 0:3:ba:e3:42:8c
bge2: flags=21000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,STANDBY> mtu 1500 index 4
inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
groupname ipmp1
ether 0:3:ba:e3:42:8d
bge2:1: flags=21000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,STANDBY> mtu 1500 index 4
inet 10.1.14.140 netmask ffffffc0 broadcast 10.1.14.191
the data address is migrated onto bge2:1. I find this a little confusing. However, I don't know any way around it on Solaris 10. The IPMP Re-architecture makes this a lot easier!
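As an aside on reading the output above: the /26 in /etc/hostname.bge1 and the netmask ffffffc0 and broadcast 10.1.14.191 shown by ifconfig are the same information in different notations. A short sketch using Python's standard ipaddress module shows the conversion; the function name `describe` is just for illustration.

```python
import ipaddress


def describe(cidr: str) -> dict:
    """Show an address/prefix the way ifconfig reports it (hex netmask, broadcast)."""
    interface = ipaddress.ip_interface(cidr)
    network = interface.network
    return {
        "netmask_hex": f"{int(network.netmask):08x}",
        "network": str(network.network_address),
        "broadcast": str(network.broadcast_address),
    }


print(describe("10.1.14.140/26"))
```

The same call with 10.1.14.141/26 gives the identical netmask and broadcast, which is why the zone address later in this entry shows the same values.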
Using Probe-based IPMP with non-global zones
Configuring a shared IP Instance non-global zone and utilizing IPMP managed in the global zone is very easy. The IPMP configuration is very simple. Interface bge1 is active, and bge2 is in stand-by mode.
global# more /etc/hostname.bge[12]
::::::::::::::
/etc/hostname.bge1
::::::::::::::
group ipmp1 up
::::::::::::::
/etc/hostname.bge2
::::::::::::::
group ipmp1 standby up
My zone configuration is:
global# zonecfg -z zone1 info
zonename: zone1
zonepath: /zones/zone1
brand: native
autoboot: false
bootargs:
pool:
limitpriv:
scheduling-class:
ip-type: shared
inherit-pkg-dir:
dir: /lib
inherit-pkg-dir:
dir: /platform
inherit-pkg-dir:
dir: /sbin
inherit-pkg-dir:
dir: /usr
net:
address: 10.1.14.141/26
physical: bge1
Prior to booting, the network configuration is:
global# ifconfig -a4
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
lo0:1: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
zone zone1
inet 127.0.0.1 netmask ff000000
bge0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
inet 129.154.53.125 netmask ffffff00 broadcast 129.154.53.255
ether 0:3:ba:e3:42:8b
bge1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
groupname ipmp1
ether 0:3:ba:e3:42:8c
bge2: flags=21000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,STANDBY> mtu 1500 index 4
inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
groupname ipmp1
ether 0:3:ba:e3:42:8d
After booting, the network looks like this:
global# ifconfig -a4
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
lo0:1: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
zone zone1
inet 127.0.0.1 netmask ff000000
bge0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
inet 129.154.53.125 netmask ffffff00 broadcast 129.154.53.255
ether 0:3:ba:e3:42:8b
bge1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
groupname ipmp1
ether 0:3:ba:e3:42:8c
bge1:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
zone zone1
inet 10.1.14.141 netmask ffffffc0 broadcast 10.1.14.191
bge2: flags=21000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,STANDBY> mtu 1500 index 4
inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
groupname ipmp1
ether 0:3:ba:e3:42:8d
So a simple case for the use of IPMP, without the need for test addresses! Other IPMP configurations, such as more than two data links, or active-active, are also supported with link based failure detection. The more links involved, the more test addresses are saved with link based failure detection. Since writing this entry I was involved in a customer configuration where this is saving several hundred IP addresses and their management (such as avoiding duplicate addresses). That customer is willing to forgo the benefit of probes testing past the local switch port.
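The saving is easy to estimate: probe-based detection needs one test address per data link in addition to the data addresses, while link-based detection needs only the data addresses. A back-of-the-envelope sketch follows; the function name and the fleet size in the usage lines are hypothetical and are not the customer's actual numbers.

```python
def addresses_needed(data_addresses: int, links: int, probe_based: bool) -> int:
    """IP addresses one system needs for an IPMP group.

    Assumes one test address per data link when probe-based detection is used,
    and none with link-based detection only.
    """
    test_addresses = links if probe_based else 0
    return data_addresses + test_addresses


# The two-link example in this entry: one data address, bge1 and bge2.
print(addresses_needed(1, 2, probe_based=True))
print(addresses_needed(1, 2, probe_based=False))

# Hypothetical fleet: 100 systems with three links and one data address each.
systems = 100
saved = systems * (addresses_needed(1, 3, True) - addresses_needed(1, 3, False))
print(f"addresses saved across {systems} systems: {saved}")
```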
Steffen
Monday Jan 19, 2009
Since I was already playing with build 105 because the Crossbow features are now integrated, I decided to apply the IPMP bits to a 105 installation. [Note: The IPMP Re-architecture is expected to be in Solaris Express Community Edition (SX-CE) build 107 or so (due to be out early Feb 2009), and thus in OpenSolaris 2009.spring (I don't know what its final name will be). Early access to IPS packages for OpenSolaris 2008.11 should appear in the bi-weekly developer repository shortly after SX-CE has the feature included. There is no intention to back port the re-architecture to Solaris 10.]
I am impressed! The bits worked right away, and once I got used to the slightly different way of monitoring IPMP, I really liked what I saw.
Being accustomed to using IPMP on Solaris 10 and with Crossbow beta testing previous Nevada bits, I used the long-standing (Solaris 10 and prior) IPMP configuration style I am used to. For my testing, I am using link failure testing only, so no probe addresses are configured. [For examples of the new configuration format, see the section Using the New IPMP Configuration Style below. (15 Feb 2009)]
global# cat /etc/hostname.bge1
group shared
global# cat /etc/hostname.bge2
group shared
global# cat /etc/hostname.bge3
group shared standby
In my test case bge1 and bge2 are active interfaces, and bge3 is a standby interface.
global# ifconfig -a4
bge0: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 2
inet 139.164.63.125 netmask ffffff00 broadcast 139.164.63.255
ether 0:3:ba:e3:42:8b
bge1: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 3
inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
groupname shared
ether 0:3:ba:e3:42:8c
bge2: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 4
inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
groupname shared
ether 0:3:ba:e3:42:8d
bge3: flags=261000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,STANDBY,INACTIVE,CoS> mtu 1500 index 5
inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
groupname shared
ether 0:3:ba:e3:42:8e
ipmp0: flags=8201000842<BROADCAST,RUNNING,MULTICAST,IPv4,CoS,IPMP> mtu 1500 index 6
inet 0.0.0.0 netmask 0
groupname shared
You will notice that all three interfaces are up and part of group shared. What is different from the old IPMP is that automatically another interface was created, with the flag IPMP. This is the interface that will be used for all the data IP addresses. Because I used the old format for the /etc/hostname.* files, the backward compatibility of the new IPMP automatically created the ipmp0 interface and assigned it a name. If I wish to have control over that name, I must configure IPMP slightly differently. More on that later.
The new command ipmpstat(1M) is also introduced to get enhanced information regarding the IPMP configuration.
My test is really about using zones and IPMP, so here is what things look like when I bring up three zones that are also configured the traditional way, with network definitions using the bge interfaces. [Using the new format, I would replace bge with either ipmp0 (keep in mind that 0 (zero) is set dynamically) or shared. For more details on the new format, go to Using the New IPMP Configuration Style below. (15 Feb 2009)]
global# for i in 1 2 3
do
 zonecfg -z shared${i} info net
done
net:
address: 10.1.14.141/26
physical: bge1
defrouter: 10.1.14.129
net:
address: 10.1.14.142/26
physical: bge1
defrouter: 10.1.14.129
net:
address: 10.1.14.143/26
physical: bge2
defrouter: 10.1.14.129
After booting the zones, note that the zones' IP addresses are on logical interfaces on ipmp0, not the previous way of being logical interfaces on bge.
global# ifconfig -a4
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
lo0:1: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
zone shared1
inet 127.0.0.1 netmask ff000000
lo0:2: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
zone shared2
inet 127.0.0.1 netmask ff000000
lo0:3: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
zone shared3
inet 127.0.0.1 netmask ff000000
bge0: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 2
inet 139.164.63.125 netmask ffffff00 broadcast 139.164.63.255
ether 0:3:ba:e3:42:8b
bge1: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 3
inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
groupname shared
ether 0:3:ba:e3:42:8c
bge2: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 4
inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
groupname shared
ether 0:3:ba:e3:42:8d
bge3: flags=261000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,STANDBY,INACTIVE,CoS> mtu 1500 index 5
inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
groupname shared
ether 0:3:ba:e3:42:8e
ipmp0: flags=8201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS,IPMP> mtu 1500 index 6
zone shared1
inet 10.1.14.141 netmask ffffffc0 broadcast 10.1.14.191
groupname shared
ipmp0:1: flags=8201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS,IPMP> mtu 1500 index 6
zone shared2
inet 10.1.14.142 netmask ffffffc0 broadcast 10.1.14.191
ipmp0:2: flags=8201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS,IPMP> mtu 1500 index 6
zone shared3
inet 10.1.14.143 netmask ffffffc0 broadcast 10.1.14.191
For address information, here are the pre and post boot ipmpstat outputs.
global# ipmpstat -a
ADDRESS STATE GROUP INBOUND OUTBOUND
0.0.0.0 down ipmp0 -- --
global# ipmpstat -a
ADDRESS STATE GROUP INBOUND OUTBOUND
10.1.14.143 up ipmp0 bge1 bge2 bge1
10.1.14.142 up ipmp0 bge2 bge2 bge1
10.1.14.141 up ipmp0 bge1 bge2 bge1
What's really neat is that it shows which interface(s) are used for outbound traffic. A different interface will be selected for each new remote IP address. That is the level of outbound load spreading at this time.
global# ipmpstat -g
GROUP GROUPNAME STATE FDT INTERFACES
ipmp0 shared ok -- bge2 bge1 (bge3)
There is no group difference before or after.
global# ipmpstat -g
GROUP GROUPNAME STATE FDT INTERFACES
ipmp0 shared ok -- bge2 bge1 (bge3)
The FDT column lists the probe-based failure detection time, and is empty since that is disabled in this setup. bge3 is listed third and in parentheses since that interface is not being used for data traffic at this time.
global# ipmpstat -i
INTERFACE ACTIVE GROUP FLAGS LINK PROBE STATE
bge3 no ipmp0 is----- up disabled ok
bge2 yes ipmp0 ------- up disabled ok
bge1 yes ipmp0 --mb--- up disabled ok
Also, there are no differences for interface status. In both cases bge1 is used for multicast and broadcast traffic, and bge3 is inactive and in standby mode.
global# ipmpstat -i
INTERFACE ACTIVE GROUP FLAGS LINK PROBE STATE
bge3 no ipmp0 is----- up disabled ok
bge2 yes ipmp0 ------- up disabled ok
bge1 yes ipmp0 --mb--- up disabled ok
The probe and target output is uninteresting in this setup as I don't have probe-based failure detection on. I am including them for completeness.
global# ipmpstat -p
ipmpstat: probe-based failure detection is disabled
global# ipmpstat -t
INTERFACE MODE TESTADDR TARGETS
bge3 disabled -- --
bge2 disabled -- --
bge1 disabled -- --
So let's see what happens on a link 'failure' as I turn off the switch port going to bge1.
On the console, the indication is a link failure.
Jan 15 14:49:07 global in.mpathd[210]: The link has gone down on bge1
Jan 15 14:49:07 global in.mpathd[210]: IP interface failure detected on bge1 of group shared
The various ipmpstat outputs reflect the failure of bge1 and failover to bge3, which had been in standby mode, and to bge2. I had expected both IP addresses to end up on bge3. Instead, IPMP determines how to best spread the IPs across the available interfaces.
The address output shows that .143 is now on bge3 and .141 is now on bge2.
global# ipmpstat -a
ADDRESS STATE GROUP INBOUND OUTBOUND
10.1.14.143 up ipmp0 bge3 bge3 bge2
10.1.14.142 up ipmp0 bge2 bge3 bge2
10.1.14.141 up ipmp0 bge2 bge3 bge2
The group status has changed, with bge1 now shown in brackets as it is in failed mode.
global# ipmpstat -g
GROUP GROUPNAME STATE FDT INTERFACES
ipmp0 shared degraded -- bge3 bge2 [bge1]
The interface status makes it clear that bge1 is down. Broadcast and multicast are now handled by bge2.
global# ipmpstat -i
INTERFACE ACTIVE GROUP FLAGS LINK PROBE STATE
bge3 yes ipmp0 -s----- up disabled ok
bge2 yes ipmp0 --mb--- up disabled ok
bge1 no ipmp0 ------- down disabled failed
As expected, the only difference in the ifconfig output is for bge1, showing that it is in failed state. The zones continue to be shown using the ipmp0 interface. This took me a little bit of getting used to. Before, ifconfig was sufficient to fully see what the state is. Now, I must use ipmpstat as well.
global# ifconfig -a4
...
bge1: flags=211000803<UP,BROADCAST,MULTICAST,IPv4,FAILED,CoS> mtu 1500 index 3
inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
groupname shared
ether 0:3:ba:e3:42:8c
bge2: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 4
inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
groupname shared
ether 0:3:ba:e3:42:8d
bge3: flags=221000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,STANDBY,CoS> mtu 1500 index 5
inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
groupname shared
ether 0:3:ba:e3:42:8e
ipmp0: flags=8201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS,IPMP> mtu 1500 index 6
zone shared1
inet 10.1.14.141 netmask ffffffc0 broadcast 10.1.14.191
groupname shared
ipmp0:1: flags=8201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS,IPMP> mtu 1500 index 6
zone shared2
inet 10.1.14.142 netmask ffffffc0 broadcast 10.1.14.191
ipmp0:2: flags=8201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS,IPMP> mtu 1500 index 6
zone shared3
inet 10.1.14.143 netmask ffffffc0 broadcast 10.1.14.191
"Repairing" the interface, things return to normal.
Jan 15 15:13:03 global in.mpathd[210]: The link has come up on bge1
Jan 15 15:13:03 global in.mpathd[210]: IP interface repair detected on bge1 of group shared
Note that only one IP address ended up getting moved back to bge1.
global# ipmpstat -a
ADDRESS                   STATE  GROUP       INBOUND     OUTBOUND
10.1.14.143               up     ipmp0       bge1        bge2 bge1
10.1.14.142               up     ipmp0       bge2        bge2 bge1
10.1.14.141               up     ipmp0       bge2        bge2 bge1
Interface bge3 is back in standby mode.
global# ipmpstat -g
GROUP       GROUPNAME   STATE     FDT       INTERFACES
ipmp0       shared      ok        --        bge2 bge1 (bge3)
All three interfaces are up, only two are active, and broadcast and multicast stayed on bge2 (no need to change that now).
global# ipmpstat -i
INTERFACE   ACTIVE  GROUP       FLAGS     LINK      PROBE     STATE
bge3        no      ipmp0       is-----   up        disabled  ok
bge2        yes     ipmp0       --mb---   up        disabled  ok
bge1        yes     ipmp0       -------   up        disabled  ok
As a further example of rebalancing of the IP addresses, here is what happens with four IP addresses spread across two interfaces.
global# ipmpstat -a
ADDRESS                   STATE  GROUP       INBOUND     OUTBOUND
10.1.14.144               up     ipmp0       bge2        bge2 bge1
10.1.14.143               up     ipmp0       bge1        bge2 bge1
10.1.14.142               up     ipmp0       bge2        bge2 bge1
10.1.14.141               up     ipmp0       bge1        bge2 bge1
Jan 15 16:19:09 global in.mpathd[210]: The link has gone down on bge1
Jan 15 16:19:09 global in.mpathd[210]: IP interface failure detected on bge1 of group shared
global# ipmpstat -a
ADDRESS                   STATE  GROUP       INBOUND     OUTBOUND
10.1.14.144               up     ipmp0       bge2        bge3 bge2
10.1.14.143               up     ipmp0       bge3        bge3 bge2
10.1.14.142               up     ipmp0       bge2        bge3 bge2
10.1.14.141               up     ipmp0       bge3        bge3 bge2
Jan 15 18:11:35 global in.mpathd[210]: The link has come up on bge1
Jan 15 18:11:35 global in.mpathd[210]: IP interface repair detected on bge1 of group shared
global# ipmpstat -a
ADDRESS                   STATE  GROUP       INBOUND     OUTBOUND
10.1.14.144               up     ipmp0       bge2        bge2 bge1
10.1.14.143               up     ipmp0       bge1        bge2 bge1
10.1.14.142               up     ipmp0       bge2        bge2 bge1
10.1.14.141               up     ipmp0       bge1        bge2 bge1
There is even spreading of the IP addresses across any two active interfaces.
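To check the spreading without reading the table by eye, the INBOUND column can be tallied with a few lines of shell. This is only a sketch: count_inbound is a helper name I made up, and the sample text is the four-address output after the bge1 failure shown above, fed in from a here-document rather than from a live system.

```shell
#!/bin/sh
# Count data addresses per inbound interface in `ipmpstat -a` style output.
# count_inbound is a hypothetical helper, not part of Solaris.
count_inbound() {
    # Skip the header row; column 4 is INBOUND. "--" means no interface.
    awk 'NR > 1 && $4 != "--" { n[$4]++ }
         END { for (i in n) print i, n[i] }' | sort
}

# Sample copied from the output above; on a live system one could pipe
# `ipmpstat -a` into the function instead.
count_inbound <<'EOF'
ADDRESS                   STATE  GROUP       INBOUND     OUTBOUND
10.1.14.144               up     ipmp0       bge2        bge3 bge2
10.1.14.143               up     ipmp0       bge3        bge3 bge2
10.1.14.142               up     ipmp0       bge2        bge3 bge2
10.1.14.141               up     ipmp0       bge3        bge3 bge2
EOF
```

Each output line is an interface name followed by the number of addresses currently bound to it for inbound traffic.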
Using the New IPMP Configuration Style
In the previous examples, I used the old style of configuring IPMP with the /etc/hostname.xyzN files. Those files should work on all older versions of Solaris as well as with the re-architecture bits. This section briefly covers the new format.
A new file that is introduced is the hostname.ipmp-group configuration file. Its name must follow the same format as any other data link name: ASCII characters followed by a number. I will use the same group name as above; however, I have to add a number to the end, so the group name will be shared0. If you don't have the trailing number, the old style of IPMP setup will be used.
I create a file to define the IPMP group. Note that it contains only the keyword ipmp.
global# cat /etc/hostname.shared0
ipmp
The other files for the NICs reference the IPMP group name.
global# cat /etc/hostname.bge1
group shared0 up
global# cat /etc/hostname.bge2
group shared0 up
global# cat /etc/hostname.bge3
group shared0 standby up
One note that may not be obvious: I am not using the keyword -failover, as I am not using test addresses. Thus the interfaces are also not listed as deprecated in the ifconfig output.
global# ifconfig -a4
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
shared0: flags=8201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS,IPMP> mtu 1500 index 2
inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
groupname shared0
bge0: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 3
inet 139.164.63.125 netmask ffffff00 broadcast 139.164.63.255
ether 0:3:ba:e3:42:8b
bge1: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 4
inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
groupname shared0
ether 0:3:ba:e3:42:8c
bge2: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 5
inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
groupname shared0
ether 0:3:ba:e3:42:8d
bge3: flags=261000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,STANDBY,INACTIVE,CoS> mtu 1500 index 6
inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
groupname shared0
ether 0:3:ba:e3:42:8e
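The four new-style files shown earlier follow a simple pattern, so they could be generated by a small script. The following is a minimal sketch that writes them into a scratch directory instead of /etc; write_ipmp_files and its argument order are names I chose for illustration, and no test addresses are configured.

```shell
#!/bin/sh
# Write new-style IPMP hostname files for one group into a target directory.
# write_ipmp_files and its arguments are illustrative names only.
# usage: write_ipmp_files DIR GROUP STANDBY_NIC NIC...
write_ipmp_files() {
    dir=$1; group=$2; standby=$3; shift 3
    # The group file holds only the keyword "ipmp".
    echo "ipmp" > "$dir/hostname.$group"
    for nic in "$@"; do
        if [ "$nic" = "$standby" ]; then
            echo "group $group standby up" > "$dir/hostname.$nic"
        else
            echo "group $group up" > "$dir/hostname.$nic"
        fi
    done
}

# Stage the files in a scratch directory rather than /etc.
stage=${TMPDIR:-/tmp}/ipmp-stage.$$
mkdir -p "$stage"
write_ipmp_files "$stage" shared0 bge3 bge1 bge2 bge3
for f in "$stage"/hostname.*; do
    echo "== $f"; cat "$f"
done
```

Staging the files first makes it easy to review them before copying anything into /etc on the real system.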
After booting the zones, which are still configured to use bge1 or bge2, things look like this.
global# ifconfig -a4
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
lo0:1: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
zone shared1
inet 127.0.0.1 netmask ff000000
lo0:2: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
zone shared2
inet 127.0.0.1 netmask ff000000
lo0:3: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
zone shared3
inet 127.0.0.1 netmask ff000000
shared0: flags=8201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS,IPMP> mtu 1500 index 2
inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
groupname shared0
shared0:1: flags=8201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS,IPMP> mtu 1500 index 2
zone shared1
inet 10.1.14.141 netmask ffffffc0 broadcast 10.1.14.191
shared0:2: flags=8201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS,IPMP> mtu 1500 index 2
zone shared2
inet 10.1.14.142 netmask ffffffc0 broadcast 10.1.14.191
shared0:3: flags=8201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS,IPMP> mtu 1500 index 2
zone shared3
inet 10.1.14.143 netmask ffffffc0 broadcast 10.1.14.191
bge0: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 3
inet 139.164.63.125 netmask ffffff00 broadcast 139.164.63.255
ether 0:3:ba:e3:42:8b
bge1: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 4
inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
groupname shared0
ether 0:3:ba:e3:42:8c
bge2: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 5
inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
groupname shared0
ether 0:3:ba:e3:42:8d
bge3: flags=261000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,STANDBY,INACTIVE,CoS> mtu 1500 index 6
inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
groupname shared0
ether 0:3:ba:e3:42:8e
global# ipmpstat -a
ADDRESS                   STATE  GROUP       INBOUND     OUTBOUND
10.1.14.143               up     shared0     bge1        bge2 bge1
10.1.14.142               up     shared0     bge2        bge2 bge1
10.1.14.141               up     shared0     bge1        bge2 bge1
0.0.0.0                   up     shared0     --          --
global# ipmpstat -g
GROUP       GROUPNAME   STATE     FDT       INTERFACES
shared0     shared0     ok        --        bge2 bge1 (bge3)
global# ipmpstat -i
INTERFACE   ACTIVE  GROUP       FLAGS     LINK      PROBE     STATE
bge3        no      shared0     is-----   up        disabled  ok
bge2        yes     shared0     -------   up        disabled  ok
bge1        yes     shared0     --mb---   up        disabled  ok
Things are the same as before, except that I have now specified the IPMP group name (shared0 instead of the previous ipmp0). I find this very useful, as the name can help identify the purpose. When debugging, IPMP group names that use context-appropriate text should be very helpful.
I find the integration, or rather the backward compatibility, great. Not only will the old or existing IPMP setup work, the existing zonecfg network setup works as well. This means the same configuration files will work pre- and post-re-architecture!
Let's take a look at how things look within a zone.
shared1# ifconfig -a4
lo0:1: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
shared0:1: flags=8201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS,IPMP> mtu 1500 index 2
inet 10.1.14.141 netmask ffffffc0 broadcast 10.1.14.191
shared1# netstat -rnf inet
Routing Table: IPv4
  Destination           Gateway           Flags  Ref     Use     Interface
-------------------- -------------------- ----- ----- ---------- ---------
default              10.1.14.129          UG        1          2 shared0
10.1.14.128          10.1.14.141          U         1          0 shared0:1
127.0.0.1            127.0.0.1            UH        1         33 lo0:1
The zone's network is on the link shared0 using a logical IP, and everything else looks as it has always looked. This output is actually while bge1 is down. IPMP hides all the details in the non-global zone.
Using Probe-based Failover
The configurations so far have used link-based failure detection. IPMP can also do probe-based failure detection, where ICMP packets are sent to other nodes on the network. This allows for failure detection well beyond what link-based detection can do, including the whole switch and items past it, up to and including routers. In order to use probe-based failure detection, test addresses are required on the physical NICs. For my configuration, I use test addresses on a completely different subnet, and my router is another system running Solaris 10. The router happens to be a zone with two NICs, configured as an exclusive IP Instance.
I am using a completely different subnet because I want to isolate the global zone from the non-global zones. The setup also uses the defrouter zonecfg option, and I don't want to interfere with that.
The IPMP setup is as follows. I have added test addresses on the 172.16.10.0/24 subnet, and the interfaces are set to not fail over.
global# cat /etc/hostname.shared0
ipmp
global# cat /etc/hostname.bge1
172.16.10.141/24 group shared0 -failover up
global# cat /etc/hostname.bge2
172.16.10.142/24 group shared0 -failover up
global# cat /etc/hostname.bge3
172.16.10.143/24 group shared0 -failover standby up
This is the state of the system before bringing up any zones.
global# ifconfig -a4
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
shared0: flags=8201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS,IPMP> mtu 1500 index 2
inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
groupname shared0
bge0: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 3
inet 139.164.63.125 netmask ffffff00 broadcast 139.164.63.255
ether 0:3:ba:e3:42:8b
bge1: flags=209040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER,CoS> mtu 1500 index 4
inet 172.16.10.141 netmask ffffff00 broadcast 172.16.10.255
groupname shared0
ether 0:3:ba:e3:42:8c
bge2: flags=209040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER,CoS> mtu 1500 index 5
inet 172.16.10.142 netmask ffffff00 broadcast 172.16.10.255
groupname shared0
ether 0:3:ba:e3:42:8d
bge3: flags=269040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER,STANDBY,INACTIVE,CoS> mtu 1500 index 6
inet 172.16.10.143 netmask ffffff00 broadcast 172.16.10.255
groupname shared0
ether 0:3:ba:e3:42:8e
The ipmpstat output is different now.
global# ipmpstat -a
ADDRESS                   STATE  GROUP       INBOUND     OUTBOUND
0.0.0.0                   up     shared0     --          --
global# ipmpstat -g
GROUP       GROUPNAME   STATE     FDT       INTERFACES
shared0     shared0     ok        10.00s    bge2 bge1 (bge3)
global# ipmpstat -i
INTERFACE   ACTIVE  GROUP       FLAGS     LINK      PROBE     STATE
bge3        no      shared0     is-----   up        ok        ok
bge2        yes     shared0     -------   up        ok        ok
bge1        yes     shared0     --mb---   up        ok        ok
The Failure Detection Time is now set. And the probe information option lists an ongoing update of the probe results.
global# ipmpstat -p
TIME      INTERFACE   PROBE  NETRTT    RTT       RTTAVG    TARGET
0.14s     bge3        426    0.48ms    0.56ms    0.68ms    172.16.10.16
0.24s     bge2        426    0.50ms    0.98ms    0.74ms    172.16.10.16
0.26s     bge1        424    0.42ms    0.71ms    1.72ms    172.16.10.16
1.38s     bge1        425    0.42ms    0.50ms    1.57ms    172.16.10.16
1.79s     bge2        427    0.54ms    0.86ms    0.76ms    172.16.10.16
1.93s     bge3        427    0.45ms    0.53ms    0.66ms    172.16.10.16
2.79s     bge1        426    0.38ms    0.56ms    1.44ms    172.16.10.16
2.85s     bge2        428    0.34ms    0.41ms    0.71ms    172.16.10.16
3.15s     bge3        428    0.44ms    4.55ms    1.14ms    172.16.10.16
^C
The target information option shows the current probe targets.
global# ipmpstat -t
INTERFACE   MODE       TESTADDR            TARGETS
bge3        multicast  172.16.10.143       172.16.10.16
bge2        multicast  172.16.10.142       172.16.10.16
bge1        multicast  172.16.10.141       172.16.10.16
Once the zones are up and running and bge1 is down, the status output changes accordingly.
global# ipmpstat -a
ADDRESS                   STATE  GROUP       INBOUND     OUTBOUND
10.1.14.143               up     shared0     bge2        bge3 bge2
10.1.14.142               up     shared0     bge3        bge3 bge2
10.1.14.141               up     shared0     bge2        bge3 bge2
0.0.0.0                   up     shared0     --          --
global# ipmpstat -g
GROUP       GROUPNAME   STATE     FDT       INTERFACES
shared0     shared0     degraded  10.00s    bge3 bge2 [bge1]
global# ipmpstat -i
INTERFACE   ACTIVE  GROUP       FLAGS     LINK      PROBE     STATE
bge3        yes     shared0     -s-----   up        ok        ok
bge2        yes     shared0     --mb---   up        ok        ok
bge1        no      shared0     -------   down      failed    failed
global# ipmpstat -p
TIME      INTERFACE   PROBE  NETRTT    RTT       RTTAVG    TARGET
0.46s     bge2        839    0.43ms    0.98ms    1.17ms    172.16.10.16
1.15s     bge3        840    0.32ms    0.37ms    0.65ms    172.16.10.16
1.48s     bge2        840    0.37ms    0.45ms    1.08ms    172.16.10.16
2.56s     bge3        841    0.45ms    0.54ms    0.63ms    172.16.10.16
3.17s     bge2        841    0.40ms    0.51ms    1.01ms    172.16.10.16
3.93s     bge3        842    0.40ms    0.47ms    0.61ms    172.16.10.16
4.61s     bge2        842    0.63ms    0.75ms    0.98ms    172.16.10.16
5.17s     bge3        843    0.38ms    0.46ms    0.59ms    172.16.10.16
5.72s     bge2        843    0.36ms    0.44ms    0.91ms    172.16.10.16
^C
global# ipmpstat -t
INTERFACE   MODE       TESTADDR            TARGETS
bge3        multicast  172.16.10.143       172.16.10.16
bge2        multicast  172.16.10.142       172.16.10.16
bge1        multicast  172.16.10.141       172.16.10.16
Without showing the details here, the non-global zones continue to function.
Bringing all three interfaces down, things look like this.
Jan 19 13:51:22 global in.mpathd[61]: The link has gone down on bge2
Jan 19 13:51:22 global in.mpathd[61]: IP interface failure detected on bge2 of group shared0
Jan 19 13:52:04 global in.mpathd[61]: The link has gone down on bge3
Jan 19 13:52:04 global in.mpathd[61]: All IP interfaces in group shared0 are now unusable
global# ipmpstat -a
ADDRESS                   STATE  GROUP       INBOUND     OUTBOUND
10.1.14.143               up     shared0     --          --
10.1.14.142               up     shared0     --          --
10.1.14.141               up     shared0     --          --
0.0.0.0                   up     shared0     --          --
global# ipmpstat -g
GROUP       GROUPNAME   STATE     FDT       INTERFACES
shared0     shared0     failed    10.00s    [bge3 bge2 bge1]
global# ipmpstat -i
INTERFACE   ACTIVE  GROUP       FLAGS     LINK      PROBE     STATE
bge3        no      shared0     -s-----   down      failed    failed
bge2        no      shared0     -------   down      failed    failed
bge1        no      shared0     -------   down      failed    failed
global# ipmpstat -p
^C
global# ipmpstat -t
INTERFACE   MODE       TESTADDR            TARGETS
bge3        multicast  172.16.10.143       --
bge2        multicast  172.16.10.142       --
bge1        multicast  172.16.10.141       --
The whole IPMP group shared0 is down, all appropriate ipmpstat output reflects that, and no probes are listed and no probe RTT reports are updated.
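Since the STATE column of the group output moves from ok to degraded to failed, a simple check can flag any group that needs attention. This is a sketch under my own assumptions: check_groups is a made-up helper, and it reads text in the human-readable `ipmpstat -g` layout shown above from standard input.

```shell
#!/bin/sh
# Report IPMP groups whose STATE column is not "ok" in `ipmpstat -g` output.
# check_groups is a hypothetical helper name; it returns non-zero when at
# least one group is degraded or failed.
check_groups() {
    awk 'NR > 1 && $3 != "ok" { print $2 ": " $3; bad = 1 }
         END { exit bad }'
}

# Sample row copied from the degraded case above; on a live system one
# could run `ipmpstat -g | check_groups` instead.
check_groups <<'EOF' || echo "at least one group needs attention"
GROUP       GROUPNAME   STATE     FDT       INTERFACES
shared0     shared0     degraded  10.00s    bge3 bge2 [bge1]
EOF
```

The exit status makes the helper usable from a monitoring script without parsing its text output again.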
An additional scenario might be to have two separate paths, and have something other than a link failure force the failover.
Tuesday Jan 13, 2009
Thursday Jan 08, 2009
The feature I have been waiting for the most is the virtual NICs (VNICs). This allows me to create multiple data links using a single physical network interface, such as on my laptop. Each data link can be assigned to a different zone, and with exclusive IP Instance zones, each zone can have separate IP management and characteristics. The most useful one for me is to have one zone working on the native local network, and another zone with IPsec enabled, for a VPN connection.
Previously, I have demonstrated how to do this with two NICs and with one NIC and VNICs. I also have an example of how to achieve this with VLANs.
Now that Crossbow is integrated, things are much simpler!
Some Specifics
First thing I did was create a VNIC. Note that the dladm(1M) commands have changed slightly, both in general and for VNICs. To see what physical NICs are available, I use show-phys. On my laptop it looks like this. (The option used to be show-dev.)
global# dladm show-phys
LINK        MEDIA       STATE    SPEED    DUPLEX     DEVICE
ath0        WiFi        down     0        unknown    ath0
bge0        Ethernet    up       1000     full       bge0
Data links are the entities that can be assigned to a zone, so let's see those.
global# dladm show-link
LINK        CLASS    MTU    STATE    OVER
ath0        phys     1500   down     --
bge0        phys     1500   up       --
Now I create a VNIC.
global# dladm create-vnic -l bge0 vpn0
global# dladm show-link
LINK        CLASS    MTU    STATE    OVER
ath0        phys     1500   down     --
bge0        phys     1500   up       --
vpn0        vnic     1500   up       bge0
I used the basic create-vnic format, where I only specified the link over which to create the VNIC. I let Solaris determine the MAC address, and I did not assign any other properties to the VNIC. The name for a data link must start with characters and end with a number. Thus I chose vpn0 to make it clear to me what I want to use it for. I could have called it vpn123456789, showing that the number part can be quite large.
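The naming rule is easy to check before calling dladm. Here is a minimal sketch that tests only the rule as stated above (starts with a letter, ends with a digit); valid_link_name is a name I made up, and dladm may well apply further restrictions, such as a length limit, that this does not model.

```shell
#!/bin/sh
# Check a proposed data link name against the rule described above:
# it must start with a letter and end with a digit.
# valid_link_name is a hypothetical helper, not a Solaris command.
valid_link_name() {
    case $1 in
        [A-Za-z]*[0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

for name in vpn0 vpn123456789 vpn 0vpn; do
    if valid_link_name "$name"; then
        echo "$name: ok"
    else
        echo "$name: rejected"
    fi
done
```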
I now create a zone, and I chose the following configuration.
global# zonecfg -z vpn info
zonename: vpn
zonepath: /zones/vpn
brand: native
autoboot: false
bootargs:
pool:
limitpriv:
scheduling-class:
ip-type: exclusive
inherit-pkg-dir:
dir: /lib
inherit-pkg-dir:
dir: /platform
inherit-pkg-dir:
dir: /sbin
inherit-pkg-dir:
dir: /usr
net:
address not specified
physical: vpn0
defrouter not specified
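For reference, a configuration along these lines could be entered from a zonecfg command file rather than interactively. This is a sketch, not the exact steps I used (I cloned an existing zone): make_zone_cmds and the file location are names chosen here for illustration, and the plain create is assumed to supply the default sparse-zone inherit-pkg-dir entries.

```shell
#!/bin/sh
# Emit zonecfg commands for an exclusive-IP zone that owns one data link.
# make_zone_cmds and the output file name are illustrative assumptions.
make_zone_cmds() {
    zonepath=$1; link=$2
    cat <<EOF
create
set zonepath=$zonepath
set autoboot=false
set ip-type=exclusive
add net
set physical=$link
end
verify
commit
EOF
}

cmdfile=${TMPDIR:-/tmp}/vpn.zonecfg
make_zone_cmds /zones/vpn vpn0 > "$cmdfile"
cat "$cmdfile"
# On a Solaris system the file could then be applied with:
#   zonecfg -z vpn -f "$cmdfile"
```

No address is set in the net resource, since an exclusive IP Instance zone configures its own addresses from inside the zone.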
The key items are the ip-type and net physical entries. The zone is an exclusive IP Instance zone, and I only assigned the vpn0 data link to it. The zone is a sparse zone, and inheriting an extra directory for IPsec to work is no longer required. (I was curious whether this had been fixed.)
After installing (I made a clone of an existing zone) and before booting the zone, I copied a customized sysidcfg file into the zone.
global# cat /zones/vpn/root/etc/sysidcfg
system_locale=C
terminal=xterm
network_interface=PRIMARY {
dhcp
protocol_ipv6=no
}
nfs4_domain=dynamic
security_policy=NONE
name_service=NONE
timezone=US/Eastern
service_profile=limited_net
timeserver=localhost
root_password=YyDStVVvtZX6.
Upon booting, the zone gets an IP address via DHCP. This will be useful for being on a variety of networks. When using wireless, I won't have to change the zone's configuration. I will, however, have to recreate vpn0 on top of ath0.
Now I can happily be on a public and the corporate network at the same time. This example has me using the non-global zone to run VPN within. However, depending on my needs at the moment, I could have the global zone be VPNed in, and the non-global zone be on the public network. It is just a matter of where I run the VPN software.
global# ifconfig -a4
lo0: flags=2001000849 mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
ath0: flags=201000802 mtu 1500 index 2
inet 0.0.0.0 netmask 0
ether 0:b:6b:80:bc:59
bge0: flags=201004843 mtu 1500 index 3
inet 192.168.15.104 netmask ffffff00 broadcast 192.168.15.255
ether 0:c0:9f:5b:43:33
vpn# ifconfig -a4
lo0: flags=2001000849 mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
vpn0: flags=201004843 mtu 1500 index 2
inet 192.168.15.105 netmask ffffff00 broadcast 192.168.15.255
ether 2:8:20:86:53:e3
ip.tun0: flags=10010008d1 mtu 1366 index 3
inet tunnel src 192.168.15.105 tunnel dst 192.168.101.183
tunnel security settings --> use 'ipsecconf -ln -i ip.tun0'
tunnel hop limit 60
inet 192.168.48.27 --> 192.168.76.43 netmask ffffffff
This demonstrates one of the features of Crossbow. I will now be able to do a lot more with zones, while taking advantage of IP Instances, without needing multiple NICs. This is great for customer demos. I have not covered items such as the virtual switch that is created, or the ability to snoop traffic between zones now, or all the resource monitoring and controls that Crossbow offers. More on that elsewhere and in the future.
P.S. Crossbow affects and works with a lot of the generic LAN driver (GLD) framework, and delivers a new MAC interface, utilizes improvements in dladm, data link naming (vanity naming from Project Clearview), and lots more, and thus is a lot of code changes. There is a high level of interest in getting the VNIC features into Solaris 10. If you have a strong need for that, please add a Service Record using your support channel to Change Request 6790102.
Wednesday Mar 26, 2008
First, get the latest BFU package from the ON (OS/Net) Consolidation. I typically only use the SUNWonbld tar file for my hardware.
Download the bits you want to install, such as those for Crossbow Beta or Clearview's snoop on loopback
To make life a little simpler, I add the following to root's .profile file.
if [ -d /opt/onbld ]
then
    FASTFS=/opt/onbld/bin/`uname -p`/fastfs ; export FASTFS
    BFULD=/opt/onbld/bin/`uname -p`/bfuld ; export BFULD
    GZIPBIN=/usr/bin/gzip ; export GZIPBIN
    PATH=$PATH:/opt/onbld/bin
fi
Now to apply the bits. After unpacking the bits into a temporary location, let's say /tmp/bfu, install the onbld package.
# pkgadd -d onbld all
Processing package instance from
OS-Net Build Tools(sparc) 11.11,REV=2008.03.18.14.39
Copyright 2008 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
...
Installation of was successful.
#
I re-read my .profile, and verify that the necessary BFU variables are set.
# . /.profile
# echo $FASTFS
/opt/onbld/bin/sparc/fastfs
Now apply the BFU (this one is for Crossbow beta). You must use the full pathname!
Note: you may want to do this from the console, in case you lose your network connection.
# bfu `pwd`/nightly-nd
Copying /opt/onbld/bin/bfu to /tmp/bfu.1000
Executing /tmp/bfu.1000 /tmp/bfu/nightly-nd
...
Entering post-bfu protected environment (shell: ksh).
Edit configuration files as necessary, then reboot.
bfu#
Note that you end up in the BFU shell. Now issue an automatic conflict resolution check.
bfu# /opt/onbld/bin/acr
Getting ACR information from /tmp/bfu/nightly-nd... ok
updating //platform/sun4v/boot_archive
Finished. See /tmp/acr.nhaqVi/allresults for complete log.
bfu#
bfu# exit
Exiting post-bfu protected environment. To reenter, type:
LD_NOAUXFLTR=1 LD_LIBRARY_PATH=/tmp/bfulib LD_LIBRARY_PATH_64=/tmp/bfulib/64 PATH=/tmp/bfubin /tmp/bfubin/ksh
#
It's time to reboot and run with the new bits!
Monday Mar 10, 2008
Network Virtualization
Requirement: You need more NICs than are installed or supported on the system. Use zones with exclusive IP Instance, but share a single NIC or small number of NICs.
Feature: Any Crossbow-supported NIC can now be split up into several VNICs, and those VNICs can be assigned to different zones. Optionally, resource management can be applied to any or all VNICs.
Benefit: Zones that need network administrative isolation can share a single NIC. Traffic between zones with exclusive IP Instances can be contained within the system if the zones use VNICs on the same NIC. Resource management can be used to limit CPU or network bandwidth associated with a zone by applying controls on a VNIC.
How to Demonstrate:
- create zones if they don't exist
- configure zones as ip-type=exclusive
- create VNICs
- assign VNICs to zones
- boot zones
- observe distributed traffic
- optionally apply resource controls and observe
- create VNICs
- assign IP addresses to VNICs
- run services bound to separate IP addresses
- observe distributed traffic
- optionally apply resource controls and observe
Network Traffic Observability
Requirement: Need to measure and monitor network traffic for different services on the system.
Feature: Bytes and packets received and transmitted can be counted and monitored.
Benefit: Better understanding of network traffic patterns, and potential data points to make future resource control decisions. Opportunity to do chargeback based on network usage.
How to Demonstrate:
- create one or more VNICs using dladm
- create one or more flows using flowadm
- show data in real-time using dladm or flowadm
- show historical data
- show for data link/NIC, VNIC, and flow
Network Resource Management
Requirement: Limit the amount of network bandwidth used by a service. Control which CPU(s) are used to process network traffic for a service.
Feature: Limits on the maximum network traffic in bits/second can be set. Network traffic processing can be directed to one or more CPUs, providing for better response time for the network stack, or ensuring that network stack processing will not interfere with other resource consumers on the system.
Benefit: Finer control of resource utilization. Ability to set quality of service. Prevention of resource starvation by competing consumers. Denial of Service attack defense.
How to Demonstrate:
- create one or more VNICs using dladm
- create one or more flows using flowadm
- set bandwidth caps on VNICs or flows
- set CPU binding on VNICs or flows
- see limits enforced under heavy network load by observing the application(s)' data throughput, for example, metrics from
- wget
- ftp
- dladm
- flowadm statistics
- your own application metric(s)
- show different CPU utilization or distribution using mpstat
Note: bandwidth guarantees are not available at this time.
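As a rough illustration of the bandwidth-cap steps in the list above, the following dry-run sketch only prints the commands it would issue. The names vnic1 and httpflow are made up, the run wrapper is my own, and the dladm/flowadm option syntax is an assumption based on the tools as later integrated; the beta bits may use a different syntax, so check the man pages on your build first.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) commands that would cap bandwidth
# on a VNIC and on a flow. vnic1 and httpflow are made-up names, and the
# dladm/flowadm syntax is assumed, not verified against the beta.
DRYRUN=${DRYRUN:-yes}

run() {
    if [ "$DRYRUN" = yes ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# A VNIC over bge0, capped at 100 Mbps.
run dladm create-vnic -l bge0 vnic1
run dladm set-linkprop -p maxbw=100M vnic1
# A flow for inbound web traffic on bge0, capped at 50 Mbps.
run flowadm add-flow -l bge0 -a transport=tcp,local_port=80 httpflow
run flowadm set-flowprop -p maxbw=50M httpflow
```

Setting DRYRUN=no in the environment would execute the commands instead of printing them.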
Network Performance Improvements
Requirement: Faster network processing. More efficient network processing.
Features: Improved datagram processing within the IP stack. Automatic switching between interrupt and polling to speed packet processing and remove interrupt overhead.
Benefit: Existing network applications will run faster, with lower latency, higher throughput, and more CPU available to other services. No application changes are required.
How to Demonstrate:
Compare your application's performance differences
- using Solaris Nevada build 81 vs. Crossbow beta
- using Solaris 10 vs. Crossbow beta
Improved IP Forwarding
Requirement: Faster forwarding of IP datagrams.
Feature: Faster forwarding of IP datagrams, especially as routing/forwarding tables get large.
Benefit: Solaris is a better platform for routers and firewalls.
How to Demonstrate:
Compare your router's performance differences
- using Solaris Nevada build 81 vs. Crossbow beta
- using Solaris 10 vs. Crossbow beta
Additional Info
Nicolas' Private Virtual Network
Sunay's blog on network in a box
Thursday Feb 14, 2008
The code is available as a customized Nevada build 81 image, or you can install the BFU bits on top of an existing build 81 install. It may work with a slightly older or newer build (I did some testing with build 82), but that has not been fully tested.
Plans are to put the features into Nevada after the beta period and your feedback.
Thanks to the engineering for all the effort in getting this out! Many of my customers have been waiting for this to become available.
The SPARC patches are:
- 137042-01 SunOS 5.10: zoneadmd patch
- 118777-12 SunOS 5.10: Sun GigaSwift Ethernet 1.0 driver patch
The x86 patches are:
- 137043-01 SunOS 5.10_x86: zoneadmd patch
- 118778-11 SunOS 5.10_x86: Sun GigaSwift Ethernet 1.0 driver patch
I have not been able to try out the released patches myself, yet.
Steffen
Friday Feb 08, 2008
The new or enhanced features in S10 Update 5 Beta Release to be tested include but are not limited to:
- Infiniband flash update tool
- Sockets Direct Protocol (SDP)
- Persistent Group Reservation for iSCSI target
- iSNS Client for iSCSI target
- IP addressing ability for IBTF interfaces
- Graphical User Interface for PostgreSQL (pgAdmin 3)
- Support for download of new firmware into SATA drives
- Support for Enhanced Intel SpeedStep power management technology
- SunVTS 7.0
- SAS multipathing support
- Flash 9
- New Instant Messaging Client (pidgin 2.0)
- Virtual Network Computing (VNC)
- Capping CPU resource usage
The beta will include VNICs, flows, improved IP forwarding, hardware classification. More details to come.
Wednesday Jan 30, 2008
137042-01 (SPARC) 137043-01 (i386, x86, x64)
The patches should be available in about two weeks, after final internal and customer testing. If you have a service contract, you can get a temporary T-patch as interim relief, with all the caveats of a T-patch. Folks with an escalation should already have been notified. The fix will also be delivered in the next update of Solaris 10. It did not make the Beta of that update (Update 5), however. Don't forget, you also need the ce patch:
118777-12 (SPARC) 118778-11 (i386, x86, x64)
Happy IP-Instancing with ce!!
Monday Jan 28, 2008
Solaris has historically allowed only 256 stdio streams to be open, where the file descriptors are below 256. So applications can quickly run out of file descriptors when doing lots of fopen() calls. For 32-bit applications, it has not been possible to increase this limit, as it could cause binary compatibility issues for older applications (compatibility going back as far back as those compiled on SunOS 4.x). The dup(2) system call has been used to move other file descriptors above 256 to free up slots for fopen. But the application is still limited to a maximum of 256 stdio streams!
With the release of Solaris 10 8/07 (often referred to as update 4), there is a new interface to extend the FILE facility. Programming details are in the man page enable_extended_FILE_stdio(3C). And if you don't want to make any code changes, extendedFILE(5) describes how to do this for existing applications and binaries.
I am working with a customer who needs to host over 1,400 web sites. We are using portions of the coolstack, as well as customized versions of Apache and PHP. With virtual hosting, the setup quickly hit the 256 stdio file limit!
With a small change to apachectl, it is now possible to host all 1,400+ web sites within a single instance of Apache. I added the following to the configuration section of apachectl:
ulimit -n 3000
LD_PRELOAD_32=/usr/lib/extendedFILE.so.1 ; export LD_PRELOAD_32
The ulimit -n 3000 increases the number of file descriptors a process can have open to 3000, up from the default of 256. Since apachectl is run as root, or with sufficient privileges using Role Based Access Control, this is permitted.
The LD_PRELOAD_32 setting allows me to have the library provide special versions of library functions or system calls. In this case, it does special things when fopen is called, and automatically uses dup(2) to free up the lower 256 file descriptors.
The enable_extended_FILE_stdio(3C) man page lists some of the requirements for an application to work well with this interposition library, such as not doing direct access into the fields of the FILE structure. Since Apache is using stdio for log files, it is unlikely that Apache is accessing the structures directly.
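A slightly more defensive variant of the two apachectl lines might look like the sketch below. It is not what I deployed: preload_if_present is a helper name made up for illustration, and it only sets LD_PRELOAD_32 when the library file is actually present on the system.

```shell
#!/bin/sh
# Sketch: raise the descriptor limit, then preload the extended FILE library
# only if it is present. preload_if_present is a hypothetical helper name.
preload_if_present() {
    lib=$1
    if [ -f "$lib" ]; then
        LD_PRELOAD_32=$lib ; export LD_PRELOAD_32
        return 0
    fi
    echo "$lib not found; leaving LD_PRELOAD_32 unset" >&2
    return 1
}

# Raising the limit can fail without sufficient privileges, so report it.
ulimit -n 3000 2>/dev/null || echo "could not set descriptor limit to 3000" >&2
preload_if_present /usr/lib/extendedFILE.so.1 || true
echo "descriptor limit is now: `ulimit -n`"
```

On a release older than Solaris 10 8/07 the library is missing, so the script leaves the environment alone and Apache simply stays within the 256-stream limit.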
Testing with the customer's configuration has Apache serving up all 1,400 web sites using a single instance of the httpd server! Cool, success at last!
Thursday Dec 20, 2007
Zone configuration information:
global# zonecfg -z ce1 info net
net:
address not specified
physical: ce1
global#
And the view from the non-global zone:
ce1# zonename
ce1
ce1# cat /etc/release
Solaris Express Community Edition snv_80 SPARC
Copyright 2008 Sun Microsystems, Inc. All Rights Reserved.
Use is subject to license terms.
Assembled 17 December 2007
ce1# ifconfig -a
lo0: flags=2001000849 mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
ce1: flags=1000843 mtu 1500 index 2
inet 192.168.200.153 netmask ffffff00 broadcast 192.168.200.255
ether 0:3:ba:68:1d:5f
lo0: flags=2002000849 mtu 8252 index 1
inet6 ::1/128
ce1#
More when the soak time in Nevada is complete and the backport to Solaris 10 is available.
Thanks to the engineers who put energy into these fixes!
Happy Holidays!
Steffen
[1] As of 20 December 2007, build 80 is available within Sun only. Availability on opensolaris.org will be announced on opensolaris-announce@opensolaris.org.
Wednesday Dec 05, 2007
- x7285a - Sun PCI-X Dual GigE UTP Low Profile. RoHS-6 compliant
- x7286a - Sun PCI-X GigE MMF Low Profile, RoHS-6 compliant
Thursday Nov 15, 2007
Many of the users running systems with GigaSwift NICs are also interested in running zones with exclusive IP Instances.
However, the ce driver is a DLPI style-1 driver, not the GLDv3 driver required by IP Instances. Because of the large installed base of GigaSwift NICs, one consideration has been to convert the ce driver to GLDv3. The challenge is that many users of this NIC also tune its characteristics with ndd(1M), and converting ce to GLDv3 would essentially eliminate those tunables. There is work in progress to provide a shim that lets non-GLDv3 drivers work within the GLDv3 framework. This won't be delivered into Solaris Nevada or OpenSolaris until early next year, and will then need to be backported to Solaris 10.
What do we do for all those users who are currently using ce in the meantime?
Change Requests 6606507 and 6616075 are being worked on to support the ce driver in zones with an exclusive IP Instance. CR 6616075 is for zoneadmd(1M) changes to issue an ioctl when an interface (the "physical" part of the net directive in the zone configuration) is not GLDv3. These are separate CRs because zoneadmd is in Solaris ON (the OS and Networking consolidation) while ce is outside of ON.
The changes are ready to be put into Nevada and OpenSolaris, where the code will undergo a mandatory soak test period of four to six weeks. Once everything passes Nevada testing and the changes are integrated into Solaris 10, the patch will be created, tested, and issued.
NOTE: Updated 10 December 2007 to correct the bug ID for the zoneadmd part.
Updated 12 March 2008: The patches are now available. See the entry dated Wednesday Jan 30, 2008.
Monday Nov 05, 2007
Solaris 10 8/07 includes a new feature for zone networking. IP Instances is the facility to give a non-global zone its own complete control over the IP stack, which previously was shared with and controlled by the global zone.
A zone that has an exclusive IP Instance can set interface parameters using ifconfig(1M), put an interface into promiscuous mode to run snoop(1M), be a DHCP client or server, set ndd(1M) variables, have its own IPsec policies, etc.
One requirement for an exclusive IP Instance is that it must have exclusive access to a link name. This is any NIC, VLAN-tagged NIC component, or aggregation at this time. When they become available, virtual NICs will make this much simpler, as a single NIC can be presented to the zones using a number of VNICs, effectively multiplexing access to that NIC. A link name is an entry that can be found in /dev, such as /dev/bge0, /dev/bge321001 (VLAN tag 321 on bge1), aggr2, and so on.
To see what link names are available on a system, use dladm(1M) with the show-link option. For example:
global# dladm show-link
bge0 type: non-vlan mtu: 1500 device: bge0
bge1 type: non-vlan mtu: 1500 device: bge1
bge2 type: non-vlan mtu: 1500 device: bge2
bge3 type: non-vlan mtu: 1500 device: bge3
As folks have started to use IP Instances to isolate their zones, they have noticed that they don't have enough link names (I'll use just "link" in the rest of this entry) to assign to the zones that they have configured, or wish to configure, as exclusive. So, how does a global zone administrator configure a large number of zones as exclusive?
Let's consider the following situation, where there are three tiers of a web service, where each tier is on a different network.
If each server has only one NIC, the total number of switch ports required is at least eight (8). If each server has a management port, that is another eight ports, even if they are on a different, management network. Add to that at least three switch ports going to the router.
Consolidating the servers onto a single Solaris 10 instance using exclusive IP Instances requires at least eight NICs for the services (one per service), and at least one for the global zone and management. (We'll ignore service processor requirements, since they are separate anyway, and access could be either via a serial interface or a network.)
One option to consider is using VLANs and VLAN tagging. When using VLAN tagging, the sender puts additional information onto the Ethernet frame, which allows the receiver to associate that frame with a specific VLAN. The specification allows up to 4094 VLAN tags, from 1 to 4094. For more information on administering VLANs in Solaris 10, see Administering Virtual Local Area Networks in the Solaris 10 System Administrator Collection.
VLANs are a method to collapse multiple Ethernet broadcast domains (whether hubs or switches) into a single network unit (usually a switch). [Typically, a single IP subnet, such as 192.168.54.0/24, is on a broadcast domain.] Within such a switch frame, you can have a large number of virtual switches, consolidating network infrastructure while still isolating broadcast domains. Often, the use of VLANs is completely hidden from the systems tied to the switch, as a port on the switch is configured for only one VLAN. With VLAN tagging, a single port can allow a system to connect to multiple VLANs, and therefore multiple networks. Both the switch and the system must be configured for VLAN tagging for this to work properly. VLAN tagging has been used for years, and is robust and reliable.
Any one network interface can have multiple VLANs configured for it, but a single VLAN ID can only exist once on each interface. Thus it is possible to put multiple networks or broadcast domains on a single interface. It is not possible to put more than one VLAN of any broadcast domain on a single interface. For example, you can put VLANs 111, 112, and 113 on interface bge1, but you cannot put VLAN 111 on bge1 more than once. You can, however, put VLAN 111 on interfaces bge1 and bge2.
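A quick way to sanity-check a planned assignment against this rule is to look for any link name that shows up twice. The following is only a sketch; dup_links is a hypothetical helper name, and it works on a list of planned link names rather than on a live system.

```shell
#!/bin/sh
# Hypothetical helper (not a Solaris command): print every link name
# that appears more than once in the argument list. A repeated name
# means the same VLAN ID was planned twice on the same interface.
dup_links() {
    printf '%s\n' "$@" | sort | uniq -d
}

# VLANs 111, 112, 113 on bge1 plus VLAN 111 on bge2: no duplicates,
# so nothing is printed.
dup_links bge111001 bge112001 bge113001 bge111002

# VLAN 111 planned twice on bge1: the offending name is printed.
dup_links bge111001 bge112001 bge111001
```

An empty result means each VLAN-tagged link is used at most once in the plan.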
Using the case shown above, if the three web servers are on the same network, say 10.1.111.0/24, you would want to have three interfaces that are all connected to a VLAN capable switch, and configure each interface with a VLAN tag that is the same as the VLAN ID on the switch.
For example, if the VLAN tag is 111 and the interfaces are bge1 through bge3, the link names you would assign to the three web servers would be bge111001, bge111002, and bge111003.
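The naming pattern in these examples is the driver name followed by the VLAN ID times 1000 plus the interface instance number. As a small sketch of that arithmetic, the function below builds the name; vlan_link is a hypothetical helper, not part of Solaris, and the three-digit instance limit is my reading of the pattern.

```shell
#!/bin/sh
# Hypothetical helper: build a VLAN-tagged link name.
# Usage: vlan_link <driver> <instance> <vlan-id>
vlan_link() {
    driver=$1 instance=$2 vid=$3
    # VLAN tags run from 1 to 4094.
    if [ "$vid" -lt 1 ] || [ "$vid" -gt 4094 ]; then
        echo "vlan_link: VLAN ID $vid is out of range" >&2
        return 1
    fi
    # Assumption: the instance must fit in the low three digits.
    if [ "$instance" -lt 0 ] || [ "$instance" -gt 999 ]; then
        echo "vlan_link: instance $instance is out of range" >&2
        return 1
    fi
    # expr is used so this also runs in an old Bourne shell.
    echo "${driver}`expr "$vid" \* 1000 + "$instance"`"
}

# VLAN 111 on bge1 through bge3
for i in 1 2 3; do
    vlan_link bge "$i" 111
done
```

The loop prints bge111001, bge111002, and bge111003, matching the names above.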
Introducing zones into the setup, the web servers can be run in three separate zones, and with exclusive IP Instances, they can be totally separate and each assigned a VLAN-tagged interface. Web Server 1 could have bge111001, Web Server 2 could have bge111002, and Web Server 3 could have bge111003.
global# zonecfg -z web1 info net
net:
address not specified
physical: bge111001
global# zonecfg -z web2 info net
net:
address not specified
physical: bge111002
global# zonecfg -z web3 info net
net:
address not specified
physical: bge111003
Within the zones, you could configure IP addresses 10.1.111.1/24 through 10.1.111.3/24.
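For reference, here is a sketch of how the zonecfg input for such zones could be generated. The zone_cmds function is a hypothetical helper that only prints zonecfg subcommands, and the zonepath under /zones is an assumption; adjust both to your own layout.

```shell
#!/bin/sh
# Hypothetical helper: print the zonecfg subcommands for a zone with
# an exclusive IP Instance on the given link. Nothing is applied here.
# Usage: zone_cmds <zonename> <link>
zone_cmds() {
    zone=$1 link=$2
    cat <<EOF
create
set zonepath=/zones/$zone
set ip-type=exclusive
add net
set physical=$link
end
verify
commit
EOF
}

# Show the commands for the three web zones.
for i in 1 2 3; do
    echo "# commands for web$i"
    zone_cmds "web$i" "bge11100$i"
done
```

To apply one of these, the output could be saved to a file and passed to zonecfg with the -f option, for example zonecfg -z web1 -f /tmp/web1.cfg. No address is set in the net resource, since an exclusive zone configures its own addresses.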
Similarly, for the authentication tier, using VLAN ID 112, you could assign the zones auth1 through auth3 to bge112001, bge112002, and bge112003, respectively. And for application servers app1 and app2 on VLAN ID 113, bge113001 and bge113002. This can be repeated until some limit is reached, whether it is network bandwidth, system resource limits, or the maximum number of concurrent VLANs on either the switch or Solaris.
This configuration could look like the following diagram.
Web Server 1, Auth Server 1, and Application Server 1 share the use of NIC1, yet are all on different VLANs (111, 112, and 113, respectively). The same for instances 2 and 3, except that there is no third application server. All traffic between the three web servers will stay within the switch, as will traffic between the authentication servers. Traffic between the tiers is passed between the IP networks by the router. NICg is showing that the global zone also has a network interface.
Using this technique, the maximum number of zones with exclusive IP Instances you could deploy on a single system that are on the same subnet is limited to the number of interfaces that are capable of doing VLAN tagging. In the above example, with three bge interfaces on the system, the maximum number of exclusive zones on a single subnet would be three. (I have intentionally reserved bge0 for the global zone, but it would be possible to use it as well, making sure the global zone uses a different VLAN ID altogether, such as 1 or 2.)
Wednesday May 30, 2007
Security policy states that if a system is 'punched in', it must not be on the public network at the same time. In other words, while the VPN tunnel is up, access to the Internet directly is restricted, especially access from the Internet to the system. While a system is on the VPN, it cannot also be your Internet-facing personal web server, for example.
Bringing up the VPN is an interactive process, requiring a challenge/response sequence. If you are like me, you may have a system at home and while at work need to access from that system some data on the corporate network. This is a catch-22, since the connection you use remotely to activate the VPN breaks as you start the VPN establishment process (enforcing the policy of being on only one network at a time).
Introduce Solaris Containers, or zones. Each zone looks like its own system. However, they share a single kernel and single IP. But wait, there is this new thing called IP Instance that allows zones configured as having an exclusive IP Instance to have their own IP (they already have their own TCP and UDP for all practical purposes). And wouldn't it be great if I could do this with just one NIC? Hey, Project Crossbow has IP Instances and VNICs. Great!
Now for the reality check. As I was told not so long ago, Rome was not built in one day. IP Instances are in Solaris Nevada and targeted for Solaris 10 7/07. VNICs are only available in a snapshot applied via BFU to Nevada build 61. [See also Note 1 below.]
So, let's see how to do this with just IP Instances.
First, each instance needs its own NIC. Since there are at least two instances (the global zone and one non-global zone), I need at least two NICs. Not all NICs support IP Instances, so the one(s) for the non-global zone(s) must use GLDv3 drivers.
In my case, I am using a Sun Blade 100 with an on-board eri 100Mbps Ethernet interface. I purchased an Intel PRO/1000 MT Server NIC, which uses the e1000g driver. Here is a list of NICs that are known to work with IP Instances and VNICs.
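One way to spot which links are candidates is to filter the dladm show-link output. On a system like this one, a non-GLDv3 driver such as eri is reported with type legacy. The sketch below relies on that output format; gldv3_links is a hypothetical helper name, and the sample input stands in for a live dladm run.

```shell
#!/bin/sh
# Hypothetical helper: read "dladm show-link" output on stdin and
# print the names of links whose type is not "legacy".
# Assumes the line format "<name> type: <type> mtu: ...".
gldv3_links() {
    awk '$2 == "type:" && $3 != "legacy" { print $1 }'
}

# On a live system this would be: dladm show-link | gldv3_links
# Sample input for a Sun Blade 100 with an added e1000g NIC:
gldv3_links <<EOF
eri0 type: legacy mtu: 1500 device: eri0
e1000g0 type: non-vlan mtu: 1500 device: e1000g0
EOF
```

With the sample input, only e1000g0 is printed, so that is the link to give to the non-global zone.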
After installing Solaris Nevada, I created my non-global zone with the following configuration:
global# zonecfg -z vpnzone info
zonename: vpnzone
zonepath: /zones/vpnzone
brand: native
autoboot: true
bootargs:
pool:
limitpriv:
scheduling-class:
ip-type: exclusive
inherit-pkg-dir:
dir: /lib
inherit-pkg-dir:
dir: /platform
inherit-pkg-dir:
dir: /sbin
inherit-pkg-dir:
dir: /usr
inherit-pkg-dir:
dir: /etc/crypto/certs
fs:
dir: /usr/local
special: /zones/vpnzone/usr-local
raw not specified
type: lofs
options: []
net:
address not specified
physical: e1000g0
global#
I had to include an additional inherit directive for this sparse zone, because currently some of the crypto stuff is not duplicated into a non-global zone. Without this, even the digest command would fail, for example. I needed to provide a private directory for /usr/local since that is where the Punchin packages get installed by default. Once I installed and configured vpnzone, I was able to install and configure the Punchin client.
However, this required two NICs. So to use just one, I created a VNIC for my VPN zone.
global# dladm show-dev
eri0 link: unknown speed: 0Mb duplex: unknown
e1000g0 link: up speed: 100Mb duplex: full
global# dladm show-link
eri0 type: legacy mtu: 1500 device: eri0
e1000g0 type: non-vlan mtu: 1500 device: e1000g0
global# dladm create-vnic -d e1000g0 -m 0:4:23:e0:5f:1 1
global# dladm show-link
eri0 type: legacy mtu: 1500 device: eri0
e1000g0 type: non-vlan mtu: 1500 device: e1000g0
vnic1 type: non-vlan mtu: 1500 device: vnic1
global# ifconfig -a
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
e1000g0: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 2
inet 192.168.1.58 netmask ffffff00 broadcast 192.168.1.255
ether 0:4:23:e0:5f:6b
global#
I chose to provide my own MAC address, based on the address of the base NIC. I modified the non-global zone configuration:
global# zonecfg -z vpnzone info
zonename: vpnzone
zonepath: /zones/vpnzone
brand: native
autoboot: true
bootargs:
pool:
limitpriv:
scheduling-class:
ip-type: exclusive
inherit-pkg-dir:
dir: /lib
inherit-pkg-dir:
dir: /platform
inherit-pkg-dir:
dir: /sbin
inherit-pkg-dir:
dir: /usr
inherit-pkg-dir:
dir: /etc/crypto/certs
fs:
dir: /usr/local
special: /zones/vpnzone/usr-local
raw not specified
type: lofs
options: []
net:
address not specified
physical: vnic1
global#
Now I can access the system at home while I am not there, zlogin into vpnzone, punchin, and be connected to our internal network. This is really significant for me, since at home I have 6Mbps download compared to only 600Kbps in the office. So downloading the DVD ISO that I used to create this setup took one tenth of the time at home that it would have at work.
[1] I also used the SUNWonbld package. This package is specific to build 61!
Because I install BFUs a lot, I have added the following to my .profile:
if [ -d /opt/onbld ]
then
    FASTFS=/opt/onbld/bin/`uname -p`/fastfs ; export FASTFS
    BFULD=/opt/onbld/bin/`uname -p`/bfuld ; export BFULD
    GZIPBIN=/usr/bin/gzip ; export GZIPBIN
    PATH=$PATH:/opt/onbld/bin
fi
Saturday May 12, 2007
With IP Instances in Solaris Nevada build 57 and targeted for Solaris 10 7/07, there is the ability to configure zones with exclusive IP Instances, thus forcing all traffic leaving a zone out onto the network. This introduces additional network stack processing on both the transmit and the receive side. Prompted by some customer questions regarding this, I performed a simple test to measure the difference.
On two systems, a V210 with two 1.336GHz CPUs and 8GB memory, and an x4200 with two dual-core Opteron XXXX and 8GB memory, I ran FTP transfers between zones. My switch is a Netgear GS716T Smart Switch with 1Gbps ports. The V210 has four bge interfaces and the x4200 has four e1000g interfaces.
I created four zones. Zones x1 and x2 have eXclusive IP Instances, while zones s1 and s2 have Shared IP Instances (IP is shared with the global zone). Both systems are running Solaris 10 7/07 build 06.
Relevant zonecfg info is as follows (all zones are sparse):
v210# zonecfg -z x1 info
zonename: x1
zonepath: /localzones/x1
...
ip-type: exclusive
net:
address not specified
physical: bge1
v210# zonecfg -z s1 info
zonename: s1
zonepath: /localzones/s1
...
ip-type: shared
net:
address: 10.10.10.11/24
physical: bge3
As a test user in each zone, I created a file using 'mkfile 1000m /tmp/file1000m'. Then I used ftp to transfer it between zones. No tuning was done whatsoever. The results are as follows.
V210: (bge)
Exclusive to Exclusive
x1# /usr/bin/time ftp x2 << EOF^Jcd /tmp^Jbin^Jput file1000m^JEOF
real 17.0
user 0.2
sys 11.2
Exclusive to Shared
x1# /usr/bin/time ftp s2 << EOF^Jcd /tmp^Jbin^Jput file1000m^JEOF
real 17.3
user 0.2
sys 11.6
Shared to Shared
s2# /usr/bin/time ftp s1 << EOF^Jcd /tmp^Jbin^Jput file1000m^JEOF
real 6.6
user 0.1
sys 5.3
X4200: (e1000g)
Exclusive to Exclusive
x1# /usr/bin/time ftp x2 << EOF^Jcd /tmp^Jbin^Jput file1000m^JEOF
real 9.1
user 0.0
sys 4.0
Exclusive to Shared
x1# /usr/bin/time ftp s2 << EOF^Jcd /tmp^Jbin^Jput file1000m^JEOF
real 9.1
user 0.0
sys 4.1
Shared to Shared
s2# /usr/bin/time ftp s1 << EOF^Jcd /tmp^Jbin^Jput file1000m^JEOF
real 4.0
user 0.0
sys 3.5
I ran each test several times and picked a result that seemed average across the runs. Not very scientific, and a table might be nicer.
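To put the elapsed times in perspective, they can be turned into rough throughput figures by dividing the 1000 MB file size by the "real" seconds. This is only a back-of-the-envelope sketch; the throughput function and the run labels are hypothetical names of mine, and the numbers are the ones reported above.

```shell
#!/bin/sh
# Hypothetical helper: read "<label> <real-seconds>" lines on stdin
# and print the label with the throughput in MB/s for a 1000 MB file
# (the size created by "mkfile 1000m").
throughput() {
    awk '{ printf "%s %.1f\n", $1, 1000 / $2 }'
}

throughput <<EOF
V210-excl-to-excl 17.0
V210-excl-to-shared 17.3
V210-shared-to-shared 6.6
X4200-excl-to-excl 9.1
X4200-excl-to-shared 9.1
X4200-shared-to-shared 4.0
EOF
```

By this arithmetic the V210 moves roughly 59 MB/s when an exclusive zone is involved and roughly 150 MB/s shared to shared, while the X4200 moves roughly 110 MB/s and 250 MB/s, respectively.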
Something I noticed that surprised me was that time spent in IP and the driver is measurable on the V210 with bge, and much less so on the x4200 with e1000g.
This blog copyright 2009 by stw