Thursday February 14, 2008 | Nicolas Droux' Blog |
Private virtual networks for Solaris xVM and Zones with Crossbow

Virtualization is great: save money, save lab space, and save the planet. So far so good! But how do you connect these virtual machines, allocate them their share of the bandwidth, and let them talk to the rest of the physical world? This is where the OpenSolaris Project Crossbow comes in.

Today we are releasing a new pre-release snapshot of Crossbow, an exciting OpenSolaris project which enables network virtualization in Solaris, network bandwidth partitioning, and improved scalability of network traffic processing. This release includes a new feature which allows you to build complete virtual networks that are isolated from the physical network. Virtual machines and Zones can be connected to these virtual networks and isolated from the rest of the physical network through firewall/NAT, etc. This is useful when you want to prototype a distributed application before deploying it on a physical network, or if you want to isolate and hide your virtual network.

This article shows how Crossbow can be used together with NAT to build a complete virtual network connecting multiple Zones within a Solaris host. The same technique applies to xVM Server x64 as well, since xVM uses Crossbow for its network virtualization needs. A detailed description of the Crossbow virtualization architecture can be found in my document here. In this example, we will build the following network:
First we need to build our virtual network. This can be done very simply with Crossbow etherstubs. An etherstub is a pseudo Ethernet NIC which can be created with dladm(1M). VNICs can then be created on top of that etherstub. The Crossbow MAC layer of the stack will implicitly create a virtual switch between all the VNICs sharing the same etherstub. In the following example we create an etherstub and three VNICs for our virtual network.
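The command listing for this step is not preserved in this copy, so here is a minimal sketch of what it could look like. The names etherstub0, vnic0, and vnic1 come from the text; vnic2 and the `run` dry-run helper are assumptions added for illustration, and the exact dladm syntax may differ between Crossbow snapshots.

```shell
# Dry run by default: each command is printed, not executed.
# Set APPLY=1 (as root, on a Crossbow-enabled build) to run them for real.
run() { if [ "${APPLY:-0}" = 1 ]; then "$@"; else echo "$@"; fi; }

create_vnet() {
    # The etherstub stands in for a physical NIC.
    run dladm create-etherstub etherstub0
    # Three VNICs on the same etherstub share one implicit virtual switch.
    for n in 0 1 2; do
        run dladm create-vnic -l etherstub0 "vnic$n"
    done
}

create_vnet
```

Run as is, the script only prints the four dladm commands, which makes it easy to review them before applying anything.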
By default Crossbow will assign a random MAC address to the VNICs, as we can see from the following command:
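The original output is not preserved here. Assuming the show-vnic subcommand of dladm(1M), the query would be along these lines (the `run` helper is a hypothetical dry-run wrapper, not part of Solaris):

```shell
# Dry run by default; set APPLY=1 on a Crossbow-enabled system to execute.
run() { if [ "${APPLY:-0}" = 1 ]; then "$@"; else echo "$@"; fi; }

list_vnics() {
    # Lists each VNIC with the link it sits on and its MAC address.
    run dladm show-vnic
}

list_vnics
```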
You could also assign a bandwidth limit to each VNIC by setting the maxbw property during VNIC creation. At this point we are done creating our virtual network. In the case of xVM, you would specify "etherstub0" instead of a physical NIC to connect the xVM domain to the virtual network. This would cause xVM to automatically create a VNIC on top of etherstub0 when booting the virtual machine. xVM configuration is described in the xVM configuration guide. Now that we have our VNICs we can create our Zones. Zone test1 can be created as follows:
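The zone configuration itself is missing from this copy. The sketch below uses standard zonecfg(1M) commands; ip-type=exclusive and vnic1 come from the text, while the zonepath, the command file location, and the `run` helper are assumptions chosen for illustration.

```shell
# Dry run by default; set APPLY=1 (as root) to execute the zone commands.
run() { if [ "${APPLY:-0}" = 1 ]; then "$@"; else echo "$@"; fi; }

# Hypothetical locations; adjust to your system.
ZONEPATH="${ZONEPATH:-/export/zones/test1}"
CMDFILE="${CMDFILE:-${TMPDIR:-/tmp}/test1.zonecfg}"

# zonecfg command file: an exclusive-IP zone that owns vnic1.
cat > "$CMDFILE" <<EOF
create
set zonepath=$ZONEPATH
set ip-type=exclusive
add net
set physical=vnic1
end
commit
EOF

create_zone_test1() {
    run zonecfg -z test1 -f "$CMDFILE"
    run zoneadm -z test1 install
}

create_zone_test1
```

Keeping the configuration in a command file makes it easy to repeat the same steps for further zones by changing only the zone name and the VNIC.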
Note that in this case the zone is assigned its own IP instance ("set ip-type=exclusive"). This allows the zone to configure its own VNIC which is connected to our virtual network.

Now it's time to set up NAT between our external network and our internal virtual network. We'll be setting up NAT with IP Filter, which is part of OpenSolaris, based on the excellent NAT write-up by Rich Teer. In our example the global zone will be used to interface our private virtual network with the physical network. The global zone connects to the physical network via eri0, and to the virtual private network via vnic0, as shown by the figure above. The eri0 interface is configured the usual way, and in our case its address is assigned using DHCP:
We will assign a static IP address to vnic0 in the global zone:
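The exact commands are not preserved here. A minimal sketch with ifconfig(1M) follows; the address 192.168.0.1 and the /24 netmask are assumptions, picked only to sit on the same subnet as the 192.168.0.100 address given to zone test1 later in this article. The `run` helper is a hypothetical dry-run wrapper.

```shell
# Dry run by default; set APPLY=1 (as root) to execute.
run() { if [ "${APPLY:-0}" = 1 ]; then "$@"; else echo "$@"; fi; }

config_vnic0() {
    run ifconfig vnic0 plumb
    # Assumed address for the global zone's side of the virtual network.
    run ifconfig vnic0 inet 192.168.0.1 netmask 255.255.255.0 up
}

config_vnic0
```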
Note that the usual configuration variables (e.g. /etc/hostname.
Then we can enable NAT on the eri0 interface. We're using a simple NAT configuration in /etc/ipf/ipnat.conf:
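The rules from the original post are missing in this copy. As a sketch of a simple IP Filter NAT configuration of this kind: the 192.168.0.0/24 network is an assumption inferred from the zone address used below, and the script writes to a scratch file rather than /etc/ipf/ipnat.conf so it can be reviewed first.

```shell
# Write a sample NAT configuration to a scratch file (hypothetical path).
CONF="${CONF:-${TMPDIR:-/tmp}/ipnat.conf.example}"

cat > "$CONF" <<'EOF'
# TCP/UDP from the private network leaves via eri0 with remapped ports.
# 0/32 means "use the address currently assigned to eri0", which suits DHCP.
map eri0 192.168.0.0/24 -> 0/32 portmap tcp/udp auto
# Everything else (e.g. ICMP) is translated without port mapping.
map eri0 192.168.0.0/24 -> 0/32
EOF

cat "$CONF"
```

Once reviewed, the contents can be copied to /etc/ipf/ipnat.conf.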
We also need to enable IP filtering on our physical network-facing NIC eri0. We run "ipnat -l" to verify that our NAT rules have been enabled.
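A sketch of this step, under two assumptions not stated in the text: that IP Filter is started through its SMF service, and that IPv4 forwarding has to be switched on in the global zone for it to route between vnic0 and eri0. Only "ipnat -l" is taken from the article; the `run` helper is a hypothetical dry-run wrapper.

```shell
# Dry run by default; set APPLY=1 (as root) to execute.
run() { if [ "${APPLY:-0}" = 1 ]; then "$@"; else echo "$@"; fi; }

enable_nat() {
    # Assumed: the global zone must forward packets between vnic0 and eri0.
    run routeadm -u -e ipv4-forwarding
    # Assumed: starting the service loads the rules from /etc/ipf/ipnat.conf.
    run svcadm enable network/ipfilter
    # From the article: list the NAT rules and active sessions.
    run ipnat -l
}

enable_nat
```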
Now we can boot our zones:
Here I assigned the address 192.168.0.100 to the vnic1 assigned to zone test1:
Note that the zone appears to be on a network and has what looks like a regular NIC with a regular MAC address. In reality, this zone is connected to a virtual network isolated from the physical network. From that non-global zone, we can now reach out to the physical network via NAT running in the global zone:
From the global zone, we can query NAT to see the translations taking place:
Of course this is only the tip of the iceberg. You could deploy NAT from a non-global zone itself, deploy a virtual router on your virtual network, enable additional filtering rules, and so on. Nor are you limited to a single virtual network: you can create multiple virtual networks within a host, route between them, etc. We are exploring some of the possibilities as part of the Crossbow and Virtual Network Machines projects.

Posted by droux ( Feb 14 2008, 06:27:29 PM PST )

Virtual Switching in Solaris with Crossbow VNICs

Virtual NICs, also known as VNICs, are a core component of Project Crossbow. They allow physical NICs to be shared by multiple Zones or virtual machines such as Xen domains. VNICs appear to the rest of the system as regular NICs. VNICs can be assigned a subset of the hardware resources (interrupts, rings, etc.) made available by the underlying hardware.

In order to provide connectivity between the multiple Zones or virtual machines sharing a single physical NIC, the VNIC layer also provides a data-path between the VNICs defined on top of the same underlying NIC. The VNICs sharing the same underlying NIC appear to be part of the same segment, i.e. connected to the same virtual switch. The virtual switch concept also allows fully virtual networks to be built within a machine.

A couple of days ago I posted a first draft design document describing the concept of virtual switches, how they are implemented by VNICs in Solaris, and how they can be used in practice.

Posted by droux ( Apr 02 2007, 10:13:11 PM PDT )

OpenSolaris virtualization technologies

It was great to be in Seattle last week to kick off a new year of Tech Days. The monthly events this year include a day dedicated to OpenSolaris, which is a great opportunity to catch up with OpenSolaris projects, hear from OpenSolaris engineers, and connect with the local community.
Watch for our tour coming to a city near you. In Seattle, I had the pleasure of giving a presentation on OpenSolaris Virtualization Technologies covering a wide range of topics such as Zones, BrandZ, Xen, and CrossBow. My presentation is now available online. (I'd like to thank Tim, Todd, Joost, David, and Nils for providing some material for this presentation.) Other excellent talks at the OpenSolaris day included a presentation by Stephen on building and deploying OpenSolaris, a great presentation from Glenn on OpenSolaris security features, as well as an energized talk by Jim on OpenSolaris POD (Performance Observability and Debugging). Teresa concluded this busy day by leading a BOF which led to the creation of a new OpenSolaris user group in Seattle.
Technorati Tag: Solaris
Posted by droux ( Sep 12 2006, 09:47:53 AM PDT )

OpenSolaris Day at Tech Days in Seattle, September 6

The first stop of the OpenSolaris world tour as part of the Tech Days will be in Seattle on September 6th. Attendance at the OpenSolaris day is free, but you need to register very soon, as space is limited. I've been working on a presentation on the OpenSolaris Virtualization Technologies that I will be presenting in Seattle. See the agenda for the OpenSolaris day in particular or the whole event for more info.

Technorati Tag: OpenSolaris
Posted by droux ( Aug 30 2006, 05:18:25 PM PDT )

CrossBow early access bits now available on OpenSolaris.org

I just spent most of last week driving the first release of CrossBow on OpenSolaris. The first part of my week was dedicated to some intense coding to putback about 15 fixes and features to the project gate in time for the release, and was followed by a marathon of documentation and process to get the bits available to the rest of the world [1]. CrossBow redefines network virtualization and resource control. Check out our project if you haven't already, try the goods, and join us at crossbow-discuss@opensolaris.org.

[1] Thanks to Carol Gayo for setting up the initial download page, Michael Lim and Gopi Kunhappan for sanity checking the binaries before the release, as well as Dan Groves and Stephen Lau for their help building and posting the images.

Technorati Tag: Solaris
Posted by droux ( Aug 27 2006, 09:21:21 PM PDT )

Project Nemo now live on OpenSolaris

Project Nemo is now an official OpenSolaris project. Nemo, a.k.a. GLDv3, is a high-performance device driver framework which provides 802.3ad Link Aggregation and VLAN support for off-the-shelf device drivers. The following drivers are currently based on the Nemo (a.k.a. GLDv3) framework: bge, e1000g, xge, nge, rge, ixgb. Project Nemo was also a recipient of Sun's 2006 Chairman Award for Innovation.
Nemo also kicks-a** in the performance area with features such as direct function calls and packet chaining between IP and device driver, IP controlling the NIC and dynamically blanking interrupts, and support for stateless hardware offloading. The project page will point you to the Nemo source code in Solaris, design documents, list of active and future projects, etc.
Technorati Tag: Solaris
Posted by droux ( Jun 08 2006, 02:51:40 PM PDT )

Know your network data-links - dladm(1M) can help

Data-links in Solaris correspond to NICs, aggregations, or VLANs. The dladm(1M) command line tool provides an easy way to list them all, including some of their properties. dladm(1M) was introduced as part of Project Nemo in Solaris Nevada, Solaris 10 Update 1, and OpenSolaris.

There are two dladm subcommands that I want to cover here. The first is show-link, which shows the non-VLAN and VLAN data links, their MTU, and underlying device or aggregation. The second is show-dev, which shows the network devices present on the machine and their hardware state, such as link state, link speed, and duplex information.

The following example shows dladm in action on jurassic, a large server that hosts the home directories and email for the members of the OS development organization at Sun. Jurassic allows us to "eat our own dog food" by always running the latest Solaris development build in a production environment.
From the example above, you can see that jurassic has a bunch of data-links corresponding to devices of various kinds, some of which are combined to form two aggregations, on top of which nine VLANs are defined. To further improve high availability, these VLANs are grouped to form nine IPMP groups spanning two separate physical switches. IPMP groups are not yet listed by dladm show-link but will be after the IPMP rearchitecture completes. show-dev is a related dladm(1M) subcommand which lists only physical NICs along with their physical link state. The following shows dladm show-dev in action on jurassic.
So there you have it, dladm away! Also check out the dladm(1M) man page for a full description of these subcommands and more.
Technorati Tag: Solaris
Posted by droux ( Jun 07 2006, 06:26:45 PM PDT )

Link Aggregation vs IP Multipathing

We introduced Link Aggregation capabilities (based on IEEE 802.3ad) in Solaris as part of the Nemo project (a.k.a. GLDv3). I described the Solaris Link Aggregation architecture in a previous blog entry, and it is also documented on docs.sun.com and by the dladm(1M) man page. Link aggregations provide high availability and higher throughput by aggregating multiple interfaces at the MAC layer. IP Multipathing (IPMP) provides features such as higher availability at the IP layer. It is described on docs.sun.com.

Both IPMP and Link Aggregation are based on the grouping of network interfaces, and some of their features overlap, such as higher availability. These technologies are however implemented at different layers of the stack, and have different strengths and weaknesses. The list below is my attempt to compare and contrast Link Aggregation and IPMP. I should disclaim that I was responsible for designing and implementing Link Aggregation in Solaris, but I made every effort to keep the list below balanced and neutral :-)
It's also worth pointing out that IPMP can be deployed on top of aggregation to maximize performance and availability. Of course these two technologies are still being actively developed, and some of the shortcomings of either technology listed above will be addressed with time. IPMP for instance is currently undergoing a rearchitecture which is described in this document on OpenSolaris.org. Several improvements to Nemo are also in progress, such as making Link Aggregations available to any device on the system, as described by the Nemo Unification design document. Thanks to Peter Memishian for checking my facts on IPMP and for his contributions.
Technorati Tag: Solaris
Posted by droux ( May 03 2006, 12:52:16 PM PDT )

We recently opened Project Crossbow on OpenSolaris.org. Crossbow enables network virtualization and resource control on Solaris. Virtual NICs (Virtual Network Interface Cards, or VNICs) are major components of the Crossbow architecture. Since I'm responsible for this part of the project, I wanted to give you a brief introduction to VNICs, and how they are used.

Virtualization in general is a very attractive proposition, and widely used today to consolidate hosts and services. Solaris Zones, which have been available since Solaris 10, are one method by which a Solaris instance can be partitioned into multiple runtimes sharing the same Solaris kernel. Xen is another virtualization project which allows multiple virtual machines, each with its own (possibly different) kernel, to run on the same hardware host.

VNICs allow carving up physical NICs, or aggregations of NICs, to form virtual NICs. These NICs behave just like any other network card for the rest of the system. They have MAC addresses, can be plumbed and configured from ifconfig, etc. VNICs can be assigned to zones or virtual machines (for example Xen domains) running on the machine.

One of the benefits of Crossbow is that VNICs can be assigned their own bandwidth limits or guarantees. These limits effectively allow assigning a part of the underlying NIC bandwidth to zones or virtual machines. The enforcement of that limit is done by the squeues which are assigned to the VNIC.

When the physical NIC provides hardware classification capabilities and multiple receive rings, these receive rings are assigned to VNICs directly, and the classifier is programmed to allow traffic received for a given VNIC to land directly on the hardware rings assigned to the VNIC. This allows VNICs to be implemented without performance penalty.
When the underlying physical NIC doesn't provide these hardware capabilities, the MAC layer on top of the NIC driver does the software classification to the VNICs through software rings. The following figure shows two VNICs defined on top of the same physical NIC, and assigned to two separate zones.
The figure above also shows another option of VNICs, which consists of assigning multiple hardware rings to a single VNIC. Some of these rings can then be assigned to separate services or protocols, and be given different bandwidth or priority requirements. In the example above, zone1 assigned its own ring to https traffic, for which it can assign a higher bandwidth. As you can see, the VNIC is a powerful construct and a pillar of Crossbow. If you are interested in Project Crossbow, please read more about it on our OpenSolaris project page. Our discussion forum awaits your comments or questions.
Technorati Tag: Solaris
Posted by droux ( Apr 13 2006, 11:12:42 PM PDT )

Project Nemo design document now available

By popular demand the Project Nemo design document is now available on OpenSolaris.org. Project Nemo aims to improve the performance of high-performance network drivers in Solaris and to accelerate their development and adoption. Project Nemo was integrated in Solaris Nevada build 12 and Solaris 10 Update 1, and is still being improved to support additional advanced networking hardware features, broaden support for different device types and legacy drivers, etc. We're also hoping to make Project Nemo an official OpenSolaris project. Watch this space for further news on that subject.
Technorati Tag: Solaris
Posted by droux ( Mar 14 2006, 01:16:18 AM PST )

Solaris Link Aggregations (2): Configuration

In my previous entry, I described the architecture of the Solaris Link Aggregations. Today, we'll take a quick look at how easily this feature can be used to create aggregations of NICs with higher bandwidth and availability.

Currently, only devices that plug into the GLDv3 (a.k.a. Nemo) framework can be aggregated. Out of the box, this currently includes bge (1 Gb/s Broadcom based), e1000g (1 Gb/s Intel based), and xge (10 Gb/s Neterion based). More drivers are being ported from DLPI or GLDv2 to GLDv3, and the Nemo Unification project, currently underway and led by Cathy, is going to provide a shim layer that will allow all DLPI-based drivers to plug into the GLDv3 framework.

Suppose your machine has four Gigabit Ethernet NICs, bge0-3, that you want to aggregate. (Our newest servers such as the Niagara-based Sun Fire T2000 and T1000, as well as our AMD Opteron-based Sun Fire X4100 and X4200 servers, already come with four on-board Gigabit Ethernet ports, and it's also possible to add single, dual, or even quad Gigabit Ethernet adapters to a system.) To aggregate these network interfaces, you simply run the following command:

# dladm create-aggr -d bge0 -d bge1 -d bge2 -d bge3 1

That's it! You now have a 4 Gb/s pipe to your machine (yes, it loves to scale, I'll show you in a future article). The previous command caused a new device "aggr1" to be created, which you can plumb and configure with ifconfig(1M) like any other device, for example:

# ifconfig aggr1 plumb
# ifconfig aggr1 inet 192.168.1.1 up
All the aggregation configuration information is automatically persistent
across reboots, so you don't have to edit any other file than
the usual /etc/hostname.
The full set of options of the create-aggr subcommand is described
in detail in the dladm(1M) man page. Some of these options allow
enabling LACP, changing the traffic distribution policy,
setting an explicit MAC address (by default, the
aggregation driver uses the address of one of the constituent ports), etc.
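As an illustration of those options, here is a sketch that builds a second aggregation with LACP enabled and an explicit distribution policy. The flag spellings (-P for policy, -l for LACP mode, -T for LACP timer) follow the dladm(1M) of that era as best I recall and should be checked against your man page; the key 2, the choice of bge0 and bge1, and the `run` helper are assumptions for the example.

```shell
# Dry run by default; set APPLY=1 (as root) to execute.
run() { if [ "${APPLY:-0}" = 1 ]; then "$@"; else echo "$@"; fi; }

create_lacp_aggr() {
    # Hash on IP and transport headers, LACP in active mode, short timer.
    run dladm create-aggr -P L3,L4 -l active -T short -d bge0 -d bge1 2
}

create_lacp_aggr
```

The switch ports on the other end need to be configured for LACP as well for the aggregation to come up in this mode.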
Note that the last argument of the create-aggr subcommand above
corresponds to the key of the aggregation, which you can pick but
must be unique on your machine (a future version of dladm will pick one for you.)
That key is used as the PPA of the aggregation data-link that can be configured
using ifconfig(1M). In the example above, the specified key value was 1, so the data-link
name is aggr1.
Another dladm(1M) subcommand you may find useful for now is show-aggr,
which allows you to display the status of an aggregation and its constituent
ports, as well as traffic distribution statistics, etc.
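For the aggregation with key 1 created above, a sketch of the two most common invocations follows. The -s flag for statistics is taken from the dladm(1M) of that era as best I recall, so treat it as an assumption; the `run` helper is a hypothetical dry-run wrapper.

```shell
# Dry run by default; set APPLY=1 to execute on a system with aggr1.
run() { if [ "${APPLY:-0}" = 1 ]; then "$@"; else echo "$@"; fi; }

show_aggr1() {
    run dladm show-aggr 1      # status of aggregation 1 and its ports
    run dladm show-aggr -s 1   # per-port traffic distribution statistics
}

show_aggr1
```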
Technorati Tag: Solaris
Posted by droux
( Feb 23 2006, 12:12:59 AM PST )
Solaris Link Aggregations (1): The Architecture
One of my recent jobs was to architect and implement the Solaris Link Aggregation component of project GLDv3, a.k.a. Nemo, which has been part of OpenSolaris since day one, and recently shipped as part of Solaris 10 Update 1. (One of my other jobs was Technical Lead of GLDv3 itself for its integration into Solaris 11/Nevada, but that's a story worth a separate blog entry.)

Link aggregations consist of groups of Network Interface Cards (NICs) that provide increased bandwidth and higher availability. Network traffic is distributed among the members of an aggregation, and the failure of a single NIC should not affect the availability of the aggregation as long as there are other functional NICs in the same group. Link aggregations have been successfully deployed on Solaris within Sun as well as in customers' production environments.

Since there has been a lot of interest in that area, I decided to start a series of short articles introducing the concepts behind this feature, its implementation in Solaris, and how it can be (easily!) deployed. For now I will give an overview of the aggregation architecture, and will dig deeper into details in future articles.

The following figure represents the aggregation driver and how it relates to other Nemo components. It does not represent the data paths, which are slightly different from what is represented there.

The MAC layer, which is part of GLDv3, is the central point of access to Network Interface Cards (NICs) in the kernel. At the top, it provides a client interface that allows a client to send and receive packets to and from NICs, as well as configure, stop and start NICs. At the bottom, the MAC layer provides a provider interface which is used by NIC drivers to interface with the network stack. In the figure above, the client is the Data-Link Service (DLS) which provides SAP demultiplexing and VLAN support for the rest of the stack. The Data-Link Driver (DLD) provides a STREAMS interface between Nemo and DLPI consumers.
We'll get into more details on DLS and DLD, which are also part of GLDv3, in a future article. Sunay also posted a general description of these components in his blog.

The core of the link aggregation feature is provided by the "aggr" kernel pseudo driver. This driver acts as both a MAC client and a MAC provider. The aggr driver implements a MAC provider interface so that it looks like any other MAC device, which allows us to manage aggregation devices as if they were regular NICs from the rest of Solaris. We'll discuss the source of aggr in a future article.

Each aggregation of NICs is called an "aggregation group". Aggregation groups are uniquely identified by a key, an integer value unique on the system. We'll talk more about key values when we get into the administrative model. Note there is only one pseudo instance of the aggr driver. Each aggregation group is instantiated as a MAC port of that pseudo instance. Aggregations are managed (i.e. created, deleted, modified, queried) by the dladm(1M) command line utility, which communicates with the aggregation driver through a private control interface.

The aggregation group is also a consumer of the MAC client interface, which it uses to control the individual NICs that are part of aggregation groups. The aggregation driver controls (starts/stops/etc.) individual NICs, and sets the MAC address of the individual NICs according to the MAC address of the aggregation itself, which can be automatically picked from one of the constituent ports, or set statically through dladm(1M). The aggregation driver also specifies the send and receive routines needed for the transmission of packets through the aggregation.

Another advantage of the MAC layer and its use by the aggregation driver is that any GLDv3 device can be part of an aggregation, without any special support. Currently bge (1 Gb/s Broadcom based), xge (10 Gb/s Neterion based), and e1000g (1 Gb/s Intel based) devices can be combined to form link aggregations.
Obviously there's a lot more to talk about. Stay tuned for future articles in this series with more information on the administration model, detailed design issues, and the data-path. Thanks for listening...

Technorati Tag: Solaris

Let's get the ball rolling

It's been a busy 2004, and 2005 is already on its way to breaking all records... You are reading the first entry of my blog. Welcome! But a short introduction is in order. I am part of the Solaris Network Performance Team. We made the Solaris 10 (http://www.sun.com/software/solaris) network stack *fast*; you must surely have heard of FireEngine (http://www.sun.com/bigadmin/content/networkperf), and we're just getting started. Solaris is going full throttle. So expect to find more very soon about the projects we are wrapping up, tips, behind the scenes, and more general Solaris information. Stay tuned!

Posted by droux ( Mar 25 2005, 01:21:07 AM PST )