Monday May 04, 2009
One of my former colleagues, Joe Yanushpolsky (josephy100 -AT- gmail.com), was recently involved in moving a latency-sensitive Linux application to Solaris as part of a platform consolidation. The code was old and required access to kernel routines not available under BrandZ. With VirtualBox providing a virtual x86 system, the task was easier than expected.
Here is Joe's report of his tests.
The goals included allowing many people to independently run this application while sharing a server. It would be important to isolate each user from other users. But the resource controls included with VirtualBox were not sufficiently granular for the overall purpose. Solaris Containers (zones) have a richer set of resource controls. Would it be possible to combine Containers and VirtualBox?
The answer was 'yes' - I tried two slightly different methods. Each method starts by installing VirtualBox in the global zone to set up a device entry and some of the software. Details are provided later. After that is complete, the two methods differ.
When the process is complete, you have a guest OS, shown here via X Windows.
Not only did the code run well, but it did so in a sparse-root non-global zone.
Well that was easy! How about Windows?
Now, this is interesting. As long as the client VM is supported by VirtualBox, it can be installed and run in a Solaris/OpenSolaris Container. I immediately thought of several useful applications of this combination of virtualization technologies:
Here are the highlights of "How to install." For more details, follow instructions in the VirtualBox User manual.
Friday May 01, 2009
I fixed all three bugs and posted v1.4.1 on the web site: http://opensolaris.org/os/project/zonestat/
Specifically:
Wednesday Apr 08, 2009
I have posted Zonestat v1.4 at the Zone Statistics project page (click on "Files" in the left navbar).
Zonestat is a 'dashboard' for Solaris Containers. It shows resource consumption of each Container (aka Zone) and a comparison of consumption against limits you have set.
Changes from v1.3:
Note that the addition of a timestamp to -P output changes the output format for "machine-readable" output.
For most people, the most important change will be the use of DTrace to collect CPU% data. This has two effects. The first effect is improved correctness. The prstat command - used in v1.3 and earlier - can horribly underestimate CPU cycles consumed because it can miss many short-lived processes. The mpstat command has its own problems with mis-counting CPU usage. So I expanded on a solution Jim Fiori offered, which uses DTrace to answer the question "which zone is using a CPU right now?"
The other benefit to DTrace is the improvement in performance of Zonestat.
The less popular, but still interesting additions include:
Please send questions and requests to zones-discuss@opensolaris.org .
Wednesday Apr 01, 2009
If you have patched a system with many zones, you have learned that it takes longer than patching a system without zones. The more zones there are, the longer it takes. In some cases, this can raise application downtime to an unacceptable duration.
Fortunately, there are a few methods which can be used to reduce application downtime. This document mentions many of them, and then describes the performance enhancements of two of them. But the bulk of this rather bulky entry is the description and results of my newest computer... "experiments."
Executive Summary, for the Attention-Span Challenged
It's important to distinguish between application downtime, service downtime, zone downtime, and platform downtime. 'Service' is the service being provided by an application or set of applications. To users, that's the most important measure. As long as they can access the service, they're happy. (Doesn't take much, does it?)
If a service depends on the proper operation of each of its component applications, planned or unplanned downtime of one application will result in downtime of the service. Some software, e.g. web server software, can be deployed in multiple, load-balanced systems so that the service will not experience downtime even if one of the software instances is down.
Applying an operating system patch may require service downtime, application downtime, zone downtime or platform downtime, depending on the patch and the entity being patched. Because in many cases patch application will require application downtime, the goal of the methods mentioned below, especially parallel patching of zones, is to minimize elapsed downtime to achieve a patched, running system.
Disclaimer 1: the "Zones Parallel Patching" patch ("ZPP") is still in testing and has not yet been released. It is expected to be released mid-CY2009. That may change. Further, the specific code changes may change, which may change the results described below.
Disclaimer 2: the experiment described below, and its results, are specific to one type of system (Sun Fire T2000) and one patch (120543-14 - "the Apache patch"). Other hardware and other patches will produce different results.
I wanted to better understand two methods of accelerating the patching of zoned systems, especially when used in combination. Currently, a patch applied to the global zone will normally be applied to all non-global zones, one zone at a time. This is a conservative approach to the task of patching multiple zones, but doesn't take full advantage of the multi-tasking abilities of Solaris.
I learned that a proposed patch was created that enables the system administrator to apply a patch in the global zone which patches the global and then patches multiple zones at the same time. The parallelism (i.e. "the number of zones that are patched at one time") can be chosen before the patch is applied. If there are multiple "Solaris CPUs" in the system, multiple CPUs can be performing computational steps at the same time. Even if there aren't many CPUs, one zone's patching process can be using a CPU while another's is writing to a disk drive.
<tangent topic="Solaris vCPU"> I use the phrase "Solaris CPUs" to refer to the view that Solaris has of CPUs. In the old days, a CPU was a CPU - one chip, one computational entity, one ALU, one FPU, etc. Now there are many factors to consider - CPU sockets, CPU cores per socket, hardware threads per core, etc. Solaris now considers "vCPUs" - virtual processors - as entities on which to schedule processes. Solaris considers each of these a vCPU:
Separately, I realized that one part of patching is disk-intensive. Many disk-intensive workloads benefit from writing to a solid-state disk (SSD) because of the performance benefit of those devices over spinning-rust disk drives (HDD).
So finally (hurrah!) the goal of this adventure: how much performance advantage would I achieve with the combination of parallel patching and an SSD, compared to sequential patching of zones on an HDD?
I took advantage of an opportunity to test both of these methods to accelerate patching. The system was a Sun Fire T2000 with two HDDs and one SSD. The system had 32 vCPUs, was not using Logical Domains, and was running Solaris 10 10/08. Solaris was installed on the first HDD. Both HDDs were 72GB drives. The SSD was a 32GB device. (Thank you, Pat!)
For some of the tests I also applied the ZPP. (Thank you, Enda!) For some of the tests I used zones that had zonepaths on the SSD; the rest 'lived' on the second HDD.
As with all good journeys, this one had some surprises. And, as with all good research reports, this one has a table with real data. And graphs later on.
To get a general feel for the different performance of an HDD vs. an SSD, I created a zone on each - using the secondary HDD - and made clones of it. Sometimes I made just one clone at a time, other times I made ten clones simultaneously. The iostat(1M) tool showed me the following performance numbers:
| | r/s | w/s | kr/s | kw/s | wait | actv | svc_t | %w | %b |
|---|---|---|---|---|---|---|---|---|---|
| clone x1 on HDD | 56 | 227 | 833 | 3323 | 0 | 6.4 | 23 | 1 | 72 |
| clone x1 on SSD | 35 | 379 | 115 | 6165 | 0 | 0 | 0 | 1 | 6 |
| clone x10 on HDD | 35 | 470 | 182 | 3274 | 31 | 95 | 246 | 25 | 99 |
| clone x10 on SSD | 354 | 2958 | 2413 | 30262 | 4 | 15 | 4 | 10 | 34 |
At light load - just one clone at a time - the SSD performs better than the HDD, but at heavy load the SSD performs much, much better: going by the table, roughly nine times the write throughput (kw/s), six times the write IOPS (w/s) and 13 times the read throughput (kr/s) of the HDD, and the device driver and SSD still have room for more (34% busy vs. 99% busy).
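To see which multiple belongs to which column, it helps to divide the two "clone x10" rows column by column. The short Python sketch below does only that; the numbers are copied from the table, while the names (`HDD_X10`, `SSD_X10`, `speedup`) are mine and not part of any tool.

```python
# iostat figures copied from the "clone x10" rows of the table above.
COLUMNS = ("r/s", "w/s", "kr/s", "kw/s")
HDD_X10 = dict(zip(COLUMNS, (35, 470, 182, 3274)))
SSD_X10 = dict(zip(COLUMNS, (354, 2958, 2413, 30262)))


def speedup(ssd, hdd):
    """Return the SSD/HDD ratio for each column, rounded to one decimal."""
    return {col: round(ssd[col] / hdd[col], 1) for col in hdd}


if __name__ == "__main__":
    for col, ratio in speedup(SSD_X10, HDD_X10).items():
        print(f"{col:>5}: {ratio}x")
```

Read operations and read throughput come out around 10x and 13x, write operations around 6x, and write throughput around 9x.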
Cloning a zone consists almost entirely of copying files. Patching has a higher proportion of computation, but those results gave me high hopes for patching. I wasn't disappointed. (Evidently, every good research report also includes foreshadowing.)
In addition to measuring the performance boost of the ZPP I wanted to know if that patch would help - or hurt - a system without using its parallelization feature. (I didn't have a particular reason to expect non-parallelized improvement, but occasionally I'm an optimist. Besides, if the performance with the ZPP was different without actually using parallelization, it would skew the parallelized numbers.) So before installing the patch, I measured the length of time to apply a patch. For all of my measurements, I used patch 120543-14 - the Apache patch. At 15MB, it's not a small patch, nor is it a very large patch. (The "Baby Bear" patch, perhaps? --Ed.) It's big enough to tax the system and allow reasonable measurements, but small enough that I could expect to gather enough data to draw useful conclusions, without investing a year of time...
So, before applying the ZPP, and without any zones on the system, I applied the Apache patch. I measured the elapsed time because our goal is to minimize elapsed time of patch application. Then I removed the Apache patch.
Then I added a zone to the system, on the secondary HDD, and re-applied the Apache patch to the system, which automatically applied it to the zone as well. I removed the patch, created two more zones, and applied the same patch yet again. Finally, I compared the elapsed time of all three measurements. Patching the global zone alone took about 120 seconds. Patching with one non-global zone took about 175 seconds: 120 for the global zone and 55 for the zone. Patching three zones took about 285 seconds: 120 seconds for the global zone and 55 seconds for each of the three zones.
Theoretically, the length of time to patch each zone should be consistent. Testing that theory, I created a total of 16 zones and then applied the Apache patch. No surprises: 55 seconds per zone.
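Those measurements fit a simple linear model: a fixed cost for the global zone plus a constant cost per non-global zone. Here is a minimal Python sketch of that model. The two constants are the approximate timings reported above for this T2000 and this patch; the function name is made up for illustration, and the model says nothing about other hardware or other patches.

```python
GLOBAL_ZONE_SECONDS = 120  # approximate time to patch the global zone alone
PER_ZONE_SECONDS = 55      # approximate extra time per non-global zone


def sequential_patch_seconds(zones):
    """Estimate elapsed time when zones are patched one at a time."""
    if zones < 0:
        raise ValueError("zones must be zero or more")
    return GLOBAL_ZONE_SECONDS + PER_ZONE_SECONDS * zones


if __name__ == "__main__":
    for n in (0, 1, 3, 16):
        print(f"{n:2d} zones: about {sequential_patch_seconds(n)} seconds")
```

With 16 zones the model works out to about 1000 seconds, which is why sequential patching becomes painful as the zone count grows.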
To test non-parallel performance of the ZPP, I applied it, kept the default setting of "no parallelization," and then re-did those tests. Application of the Apache patch did not change in behavior nor in elapsed time per zone, from zero to 16 zones. (However, I had a faint feeling that Solaris was beginning to question my sanity. "Get inline," I told it...)
How about the SSD - would it improve patch performance with zero or more zones? I removed the HDD zones and installed a succession of zones - zero to 16 - on the SSD and applied the Apache patch each time. The SSD did not help at all - the patch still took 55 seconds per zone. Evidently this particular patch is not I/O bound, it is CPU bound.
But applying the ZPP does not, by default, parallelize anything. To tell the patch tools that you would like some level of parallelization, e.g. "patch four zones at the same time," you must edit a specific file in the /etc/patch directory and supply a number, e.g. '4'. After you have done that, if parallel patching is possible, it will happen automatically. Multiple zones (e.g. four) will be patched at the same time by a patchadd process running in each zone. Because that patchadd is running in a zone, it will use the CPUs that the zone is assigned to use - default or otherwise. This also means that the zone's patchadd process is subject to all of the other resource controls assigned to the zone, if any.
That seems like magic! What's the trick? A few paragraphs back, I mentioned the 'trick': using all of the scalability of Solaris and, in this case, CMT systems. Patching a system without ZPP - especially one without a running application - leaves plenty of throughput performance "on the table." Patching multiple zones simultaneously uses CPU cycles - presumably cycles that would have been idle. And it uses I/O channel and disk bandwidth - also, hopefully, available bandwidth. Essentially, ZPP is shortening the elapsed time by using more CPU cycles and I/O bandwidth now instead of using them later.
So the main caution is "make sure there is sufficient compute and I/O capacity to patch multiple zones at the same time."
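To get a feel for what a parallelism setting can buy at best, here is a back-of-the-envelope Python sketch. It is not a measurement and not how the patch tools compute anything. It assumes (my assumption) that the global zone is patched first, that each zone still takes about 55 seconds no matter how many are patched at once, and that there is no contention or overhead, so it is a lower bound only. The function name and its parameters are hypothetical.

```python
import math

GLOBAL_ZONE_SECONDS = 120  # approximate global-zone time from the tests above
PER_ZONE_SECONDS = 55      # approximate per-zone time from the tests above


def best_case_parallel_seconds(zones, parallelism):
    """Idealized lower bound: zones are patched in batches of `parallelism`.

    Assumes per-zone time does not grow when zones are patched concurrently,
    which real systems with limited CPU or I/O capacity will not honor.
    """
    if zones < 0 or parallelism < 1:
        raise ValueError("zones must be >= 0 and parallelism must be >= 1")
    batches = math.ceil(zones / parallelism)
    return GLOBAL_ZONE_SECONDS + PER_ZONE_SECONDS * batches


if __name__ == "__main__":
    for p in (1, 2, 4, 8, 16):
        print(f"P={p:2d}: at least {best_case_parallel_seconds(16, p)} seconds")
```

The closer a real run gets to this bound, the more spare compute and I/O capacity the system had; a run that falls well short of it is a hint that the chosen parallelism exceeds what the hardware can absorb.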
But whenever multiple apps are running on the same system at the same time, the operating system must perform extra tasks to enable them to run safely. It doesn't matter if the "app" is a database server or 'patchadd.' So is ZPP using any "extra" CPU, i.e. is there any CPU overhead?
Along the way, I collected basic CPU statistics, including system and user time. The next graph shows that the amount of total CPU time (user+sys) increased slightly. The overhead was less than 10% for up to 8 zones. Another coincidence? I don't know, but at that point the overhead was roughly 1% per zone. The overhead increased faster beyond P=8, indicating that, perhaps, a good rule of thumb is P="number of unused cores." Of course, if the system is using Dynamic Resource Pools or dedicated-cpus, the rule might need to be changed accordingly. TANSTAAFL.
Getting the ZPP requires waiting until mid-year. Getting SSDs is easy - they're available for the Sun 7210 and 7410 Unified Storage Systems and for Sun systems.
Tuesday Feb 10, 2009
Recently, Thomson Reuters "demonstrated that RMDS [Reuters Market Data System software] performs better in a virtualized environment with Solaris Containers than it does with a number of individual Sun server machines."
This enabled Thomson Reuters to break the "million-messages-per-second barrier."
The performance improvement is probably due to the extremely high bandwidth, low latency characteristics of inter-Container network communications. Because all inter-Container network traffic is accomplished with memory transfers - using default settings - packets 'move' at computer memory speeds, which are much better than common 100Mbps or 1Gbps ethernet bandwidth. Further, that network performance is much more consistent without extra hardware - switches and routers - that can contribute to latency.
Articles can be found at: http://finance.yahoo.com/news/Sun-Microsystems-and-Thomson-bw-14306924.html
Thursday Jan 29, 2009
The most significant new functionality is a feature called "Zone Clusters" which, at this point, 'merely' provides support for Oracle RAC nodes in Zones. In other words, you can create an Oracle RAC cluster, using individual zones in a Solaris Cluster as RAC nodes.
Further, because a Solaris Cluster can contain multiple Zone Clusters, it can contain multiple Oracle RAC clusters. For details about configuring a zone cluster, see "Configuring a Zone Cluster" in the Sun Cluster Software Installation Guide and the clzonecluster(1CL) man page.
The second new feature is support for exclusive-IP zones. Note that this applies only to failover data services, not to scalable data services or zone clusters.
Friday Jan 09, 2009
To shine some light on the topic of Zones and security, Glenn Brunette and I recently co-authored a new Sun BluePrint with an overly long name - "Understanding the Security Capabilities of Solaris Zones Software." You can find it at http://www.sun.com/blueprints.
Monday Nov 24, 2008
It's - already - time for a zonestat update. I was never happy with the method that zonestat used to discover the mappings of zones to resource pools, but wanted to get v1.1 "out the door" before I had a chance to improve on its use of zonecfg(1M). The obvious problem, which at least one person stumbled over, was the fact that you can re-configure a zone while it's running. After doing that, the configuration information doesn't match the current mapping of zone to pool, and zonestat became confused.
Anyway, I found the time to replace the code in zonestat which discovered zone-to-pool mappings with a more sophisticated method. The new method uses ps(1) to learn the PID of each zone's [z]sched process. Then it uses "poolbind -q <PID>" to look up the pool for that process. The result is more accurate data, but the ps command does use more CPU cycles.
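The first half of that method can be sketched in a few lines of Python. This is an illustration of the idea, not zonestat's actual code: I am assuming the input comes from something like "ps -e -o zone,pid,comm", the sample text is invented, and the function names are hypothetical. The sketch only builds the "poolbind -q" command lines; it does not run them or parse their output.

```python
# Invented sample of what "ps -e -o zone,pid,comm" might print (assumption).
SAMPLE_PS_OUTPUT = """\
    ZONE   PID COMMAND
  global     0 sched
  global     1 /sbin/init
    db01  4021 zsched
    db01  4390 /usr/sbin/nscd
"""


def sched_pids(ps_output):
    """Map each zone name to the PID of its sched or zsched process."""
    pids = {}
    for line in ps_output.splitlines():
        fields = line.split()
        # Skip the header and anything that is not "zone pid command".
        if len(fields) != 3 or fields[0] == "ZONE":
            continue
        zone, pid, command = fields
        if command in ("sched", "zsched"):
            pids[zone] = int(pid)
    return pids


def poolbind_queries(pids):
    """Build one 'poolbind -q <PID>' argument list per zone."""
    return [["poolbind", "-q", str(pid)] for pid in pids.values()]


if __name__ == "__main__":
    for argv in poolbind_queries(sched_pids(SAMPLE_PS_OUTPUT)):
        print(" ".join(argv))
```

Because the lookup is keyed on a running process rather than on zonecfg(1M) data, it reflects the pool the zone is bound to right now, even if the configuration was edited after boot.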
While performing surgery on zonestat, I also:
You can find it at http://opensolaris.org/os/project/zonestat.
Tuesday Nov 18, 2008
I began to list them: prstat(1M), poolstat(1M), ipcs(1), kstat(1M), rcapstat(1), prctl(1), ... Obviously it would be easier to monitor Containers if there was one 'dashboard' to view. Such a dashboard would enable zone administrators to easily review zones' usage of system resources and decide if further investigation is necessary. Also, if there is a system-wide shortage of a resource, this tool would be the first out of the toolbox, simplifying the task of finding the 'resource hog.'
Its output looks like this:
|----Pool-----|------CPU-------|----------------Memory----------------|
|---|--Size---|-----Pset-------|---RAM---|---Shm---|---Lkd---|---VM---|
Zonename| IT| Max| Cur| Cap|Used|Shr|S%| Cap| Use| Cap| Use| Cap| Use| Cap| Use
-------------------------------------------------------------------------------
global 0D 66K 2 0.1 100 25 986M 139K 18E 2.2M 18E 754M
db01 0D 66K 2 0.1 200 50 1G 122M 536M 0.0 536M 0 1G 135M
web02 0D 66K 2 0.4 0.0 100 25 100M 11M 20M 0.0 20M 0 268M 8M
==TOTAL= --- ---- 2 ---- 0.2 -- -- 4.3G 1.1G 4.3G 139K 4.1G 2.2M 5.3G 897M
The 'Pool' columns provide information about the
Dynamic Resource Pool
in which the zone's processes are running. The two-character 'I' column
displays the pool ID (number) and the 'T' column indicates the type of
pool - 'D' for 'default', 'P' for 'private' (using the dedicated-cpu
feature of zonecfg) or 'S' for 'shared.' The two 'Size' columns show
the quantity of CPUs assigned to the pool in which the zone is running.
The 'CPU Pset' columns show each zone's CPU usage and any caps that have been set. The first two columns show CPU quantities - CPU cores for x86, SPARC64 and all UltraSPARC systems except CMT (T1, T2, T2+). On CMT systems, Solaris considers every hardware thread ('strand') to be a CPU, and calls them 'vCPUs.'
The last two CPU columns - 'Shr' and 'S%' - show the number of FSS shares assigned to the zone, and that number as a percentage of the total shares in that zone's pool. In the example above, all the zones share the default pset, and the zone 'db01' has 200 of the pool's 400 shares, so it should receive 50% of the CPU power of the pool at a minimum.
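The 'S%' arithmetic is simple enough to sketch in Python. The share counts below are taken from the sample output above; the function name is hypothetical, and the integer truncation is my assumption, chosen because a later example displays 66 for a zone holding 200 of 300 shares.

```python
def share_percentages(shares):
    """Return each zone's FSS shares as a whole-number percent of its pool.

    `shares` maps zone name -> FSS shares for zones in ONE pool.
    Percentages are truncated, not rounded (assumption).
    """
    total = sum(shares.values())
    if total == 0:
        return {zone: 0 for zone in shares}
    return {zone: (100 * n) // total for zone, n in shares.items()}


if __name__ == "__main__":
    default_pool = {"global": 100, "db01": 200, "web02": 100}
    for zone, pct in share_percentages(default_pool).items():
        print(f"{zone:8s} {default_pool[zone]:4d} shares  {pct:3d}%")
```

Keep in mind that these percentages are guaranteed minimums under contention. A zone can use more CPU than its share when other zones in the pool are idle.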
The 'Memory' columns show the caps and usage for RAM, locked memory and virtual memory. Note that virtual memory is RAM plus swap space.
The syntax of zonestat is very similar to the other *stat tools:
zonestat [-l] [interval [count]]
The output shown above is generated with the -l flag, which means "show the limits (caps) that have been set." Without -l, only usage columns are displayed.
Here is more output, showing some of the conclusions that can be drawn from the data. I have added parenthetical numbers on the right-hand side in order to refer to specific lines of output.
|----Pool-----|------CPU-------|----------------Memory----------------|
|---|--Size---|-----Pset-------|---RAM---|---Shm---|---Lkd---|---VM---|
Zonename| IT| Max| Cur| Cap|Used|Shr|S%| Cap| Use| Cap| Use| Cap| Use| Cap| Use
-------------------------------------------------------------------------------
global 0D 66K 2 0.1 100 HH 983M 139K 18E 2.2M 18E 752M
==TOTAL= --- ---- 2 ---- 0.1 -- -- 4.3G 983M 4.3G 139K 4.1G 2.2M 5.3G 752M
--------
global 0D 66K 2 0.1 100 HH 983M 139K 18E 2M 18E 752M
==TOTAL= --- ---- 2 ---- 0.1 -- -- 4.3G 983M 4.3G 139K 4.1G 2.2M 5.3G 752M
Note that none of the non-global zones are running. Because the global
zone is the only zone running in its pool, its 100 FSS shares represent
100% of the shares in its pool. To save a column of output, I indicate
that with 'HH' instead of '100'.
The "==TOTAL=" line provides two types of information, depending on the column type. For usage information, the sum of the resource used is shown. For example, "RAM Use" shows the amount of RAM used by all zones, including the global zone. For resource controls, either the system's amount of the resource is shown, e.g. "RAM Cap", or hyphens are displayed.
Note that there is a maximum amount of RAM that can be locked in a Solaris system. This prevents all of memory from being locked down, which would prevent the virtual memory system from running. In the output above, this system will only allow 4.1GB of RAM to be locked.
Also note that the amount of VM used is less than the amount of RAM used. This is because the memory pages which contain a program's instructions are not backed by swap disk, but by the file system itself. Those 'text' pages take up RAM, but do not take up swap space.
|----Pool-----|------CPU-------|----------------Memory----------------|
|---|--Size---|-----Pset-------|---RAM---|---Shm---|---Lkd---|---VM---|
Zonename| IT| Max| Cur| Cap|Used|Shr|S%| Cap| Use| Cap| Use| Cap| Use| Cap| Use
-------------------------------------------------------------------------------
global 0D 66K 2 0.1 100 50 984M 139K 18E 2.2M 18E 753M
z3 0D 66K 2 0.1 100 50 1.0G 30M 536M 0.0 536M 0.0 1.0G 27M
==TOTAL= --- ---- 2 ---- 0.2 -- -- 4.3G 1.0G 4.3G 139K 4.1G 2.2M 5.3G 780M
A zone has booted. It has caps for RAM, shared memory, locked
memory, and VM. The default pool now has a total of 200 shares: 100
for each zone. Therefore, each zone has 50% of the shares in that pool.
This provides a good reason to change the global zone's FSS value from
its default of one share to a larger value as soon as you add the first
zone to a system.
global 0D 66K 2 0.1 100 50 984M 139K 18E 2.2M 18E 753M
z3 0D 66K 2 0.3 100 50 1G 93M 536M 0.0 536M 0.0 1G 95M
==TOTAL= --- ---- 2 ---- 0.4 -- -- 4.3G 1.1G 4.3G 139K 4.1G 2.2M 5.3G 848M
--------
global 0D 66K 2 0.1 100 50 981M 139K 18E 2.2M 18E 753M
z3 0D 66K 2 0.4 100 50 1G 122M 536M 0.0 536M 0.0 1G 135M
==TOTAL= --- ---- 2 ---- 0.5 -- -- 4.3G 1.1G 4.3G 139K 4.1G 2.2M 5.3G 888M
The zone 'z3' is still booting, and is using 0.4 CPUs worth of CPU cycles.
global 0D 66K 2 0.1 100 50 984M 139K 18E 2.2M 18E 753M
z3 0D 66K 2 0.3 100 50 1G 122M 536M 536M 0.0 1G 135M
==TOTAL= --- ---- 2 ---- 0.4 -- -- 4.3G 1.1G 4.3G 139K 4.1G 2.2M 5.3G 888M
--------
global 0D 66K 2 0.1 100 50 984M 139K 18E 2.2M 18E 753M
z3 0D 66K 2 0.2 100 50 1G 122M 536M 536M 0.0 1G 135M
==TOTAL= --- ---- 2 ---- 0.3 -- -- 4.3G 1.1G 4.3G 139K 4.1G 2.2M 5.3G 888M
--------
global 0D 66K 2 0.1 100 33 986M 139K 18E 2.2M 18E 754M
z3 0D 66K 2 0.1 100 33 1G 122M 536M 536M 0.0 1G 135M
web02 0D 66K 2 0.4 0.0 100 33 100M 11M 20M 20M 0.0 268M 8M
==TOTAL= --- ---- 2 ---- 0.2 -- -- 4.3G 1.1G 4.3G 139K 4.1G 2.2M 5.3G 897M
A third zone has booted. This zone has a CPU-cap of 0.4 CPUs. It also has
memory caps, including a RAM cap that is less than the amount of memory that
zone 'z3' is using. If the two zones need the same amount, web02 should begin
paging before long. Let's see what happens...
--------
global 0D 66K 2 0.1 1 33 985M 139K 18E 2.2M 18E 754M
z3 0D 66K 2 0.1 1 33 1G 122M 536M 536M 0.0 1G 135M
web02 0D 66K 2 0.4 0.1 1 33 100M 29M 20M 20M 0.0 268M 36M
==TOTAL= --- ---- 2 ---- 0.3 -- -- 4.3G 1.1G 4.3G 139K 4.1G 2.2M 5.3G 925M
--------
global 0D 66K 2 0.1 1 33 984M 139K 18E 2.2M 18E 754M
z3 0D 66K 2 0.1 1 33 1G 122M 536M 536M 0.0 1G 135M
web02 0D 66K 2 0.4 0.2 1 33 100M 63M 20M 20M 0.0 268M 138M
==TOTAL= --- ---- 2 ---- 0.4 -- -- 4.3G 1.2G 4.3G 139K 4.1G 2.2M 5.3G 1.0G
|----Pool-----|------CPU-------|----------------Memory----------------|
|---|--Size---|-----Pset-------|---RAM---|---Shm---|---Lkd---|---VM---|
Zonename| IT| Max| Cur| Cap|Used|Shr|S%| Cap| Use| Cap| Use| Cap| Use| Cap| Use
-------------------------------------------------------------------------------
global 0D 66K 2 0.1 1 33 985M 139K 18E 2.2M 18E 754M
z3 0D 66K 2 0.0 1 33 1G 122M 536M 536M 0.0 1.0G 135M
web02 0D 66K 2 0.4 0.2 1 33 100M 87M 20M 20M 0.0 268M 185M
==TOTAL= --- ---- 2 ---- 0.3 -- -- 4.3G 1.2G 4.3G 139K 4.1G 2.2M 5.3G 1.1G
--------
global 0D 66K 2 0.1 1 33 985M 139K 18E 2.2M 18E 754M
z3 0D 66K 2 0.0 1 33 1G 122M 536M 536M 0.0 1.0G 135M
web02 0D 66K 2 0.4 0.2 1 33 100M 100M 20M 20M 0.0 268M 112M
==TOTAL= --- ---- 2 ---- 0.3 -- -- 4.3G 1.2G 4.3G 139K 4.1G 2.2M 5.3G 1.0G
--------
global 0D 66K 2 0.1 1 33 984M 139K 18E 2.2M 18E 754M
z3 0D 66K 2 0.0 1 33 1G 122M 536M 536M 0.0 1.0G 135M
web02 0D 66K 2 0.4 0.3 1 33 100M 112M 20M 20M 0.0 268M 117M
==TOTAL= --- ---- 2 ---- 0.4 -- -- 4.3G 1.2G 4.3G 139K 4.1G 2.2M 5.3G 1.0G
As expected, web02 exceeds its RAM cap. Now rcapd should address the problem.
--------
global 0D 66K 2 0.1 1 33 981M 139K 18E 2.2M 18E 754M
z3 0D 66K 2 0.0 1 33 1G 119M 536M 536M 0.0 1.0G 135M
web02 0D 66K 2 0.4 0.3 1 33 100M 111M 20M 20M 0.0 268M 127M
==TOTAL= --- ---- 2 ---- 0.4 -- -- 4.3G 1.2G 4.3G 139K 4.1G 2.2M 5.3G 1.0G
One of two things has happened: either a process in web02 freed up memory, or
rcapd caused pageouts. rcapstat(1) will tell us which it is. Also, the
increase in VM usage indicates that more memory was allocated than freed, so
it's more likely that rcapd was active during this period.
--------
global 0D 66K 2 0.1 1 33 981M 139K 18E 2.2M 18E 754M
z3 0D 66K 2 0.0 1 33 1.0G 119M 536M 536M 0.0 1.0G 135M
web02 0D 66K 2 0.4 0.2 1 33 100M 110M 20M 20M 0.0 268M 133M
==TOTAL= --- ---- 2 ---- 0.3 -- -- 4.3G 1.2G 4.3G 139K 4.1G 2.2M 5.3G 1.0G
--------
global 0D 66K 2 0.1 1 33 978M 139K 18E 2.2M 18E 754M
z3 0D 66K 2 0.0 1 33 1.0G 116M 536M 536M 0.0 1.0G 135M
web02 0D 66K 2 0.4 0.2 1 33 100M 91M 20M 20M 0.0 268M 133M
==TOTAL= --- ---- 2 ---- 0.3 -- -- 4.3G 1.2G 4.3G 139K 4.1G 2.2M 5.3G 1.0G
At this point 'web02' is safely under its RAM cap. If this zone began to do
'real' work, it would continually be under memory pressure, and the value in
'Memory:RAM:Use' would fluctuate around 100M. When setting a RAM cap, it
is very important to choose a reasonable value to avoid causing
unnecessary paging.
One final example, taken from a different configuration of zones:
|----Pool-----|------CPU-------|----------------Memory----------------|
|---|--Size---|-----Pset-------|---RAM---|---Shm---|---Lkd---|---VM---|
Zonename| IT| Max| Cur| Cap|Used|Shr|S%| Cap|Used| Cap|Used| Cap|Used| Cap|Used
-------------------------------------------------------------------------------
global 0D 66K 1 0.0 0.0 200 66 1.2G 18E 343K 18E 2.6M 18E 1.1G
zB 0D 66K 1 0.2 0.0 100 33 124M 18E 0.0 18E 0.0 18E 138M
zA 1P 1 1 0.0 0.1 1 HH 31M 18E 0.0 18E 0.0 18E 24M
==TOTAL= --- ---- 2 ---- 0.1 --- -- 4.3G 1.4G 4.3G 343K 4.1G 2.6M 5.3G 1.2G
The global zone and zone 'zB' share the default pool. Because the global
zone has 200 FSS shares, compared to zB's 100 shares, global zone
processes will get 2/3 of the processing power of the default pool if
there is contention for that CPU. However, that is unlikely, because zB
is capped at 0.2 CPUs worth of compute time.
Zone 'zA' is in its own private resource pool. It has exclusive access to the one dedicated CPU in that pool.
Zonestat's biggest problem is due to its brute-force nature. It runs a few commands for each zone that is running. This can consume many CPU cycles, and can take a few seconds to run with many zones. Performance improvements to zonestat are underway.
With mpstat, it is not difficult to create surprising results, e.g. on a CMT system, set a CPU cap on a zone in a pool, and run a few CPU-bound processes: the "Pset Used" column will not reach the CPU cap. This is due to the method used by mpstat to calculate its data.
Prstat only computes its data occasionally, ignoring anything that happened between samples. This leads to undercounting CPU usage for zones with many short-lived processes.
I wrote code to gather data from each, but prstat seemed more useful, so for now the output comes from prstat.
The future of zonestat might include these:
Friday May 02, 2008
This is particularly interesting for Sun's CMT systems - those systems based on the UltraSPARC-T1, -T2, and -T2+ (aka Niagara, Niagara 2, Niagara 2+). Those systems are well known for their high performance-per-watt characteristics, an important consideration as data centers exhaust their power capacity and the price of fossil fuels rises.
Solaris 8 (and 9) Containers can also take advantage of the impressive scalability of the Sun SPARC Enterprise M-series systems - from 4 to 64 dual-core SPARC CPUs. Because of the ability to mix Solaris 8 Containers and Solaris 9 Containers, alongside Solaris 10 Containers, you can move dozens of older SPARC systems into just a few new SPARC systems.
You can find product details, a videotaped demonstration, and free download at http://www.sun.com/software/solaris/containers/index.jsp.
Wednesday Apr 09, 2008
Since 2005, Solaris 10 has offered the Solaris Containers feature set, creating isolated virtual Solaris environments for Solaris 10 applications. Although almost all Solaris 8 applications run unmodified in Solaris 10 Containers, sometimes it would be better to just move an entire Solaris 8 system - all of its directories and files, configuration information, etc. - into a Solaris 10 Container. This has become very easy - just three commands.
Sun offers a Solaris Binary Compatibility Guarantee which demonstrates the significant effort that Sun invests in maintaining compatibility from one Solaris version to the next. Because of that effort, almost all applications written for Solaris 8 run unmodified on Solaris 10, either in a Solaris 10 Container or in the Solaris 10 global zone.
However, there are still some data centers with many Solaris 8 systems. In some situations it is not practical to re-test all of those applications on Solaris 10. It would be much easier to just move the entire contents of the Solaris 8 file systems into a Solaris Container and consolidate many Solaris 8 systems into a much smaller number of Solaris 10 systems.
For those types of situations, and some others, Sun now offers Solaris 8 Containers. These use the "Branded Zones" framework available in OpenSolaris and first released in Solaris 10 in August 2007. A Solaris 8 Container provides an isolated environment in which Solaris 8 binaries - applications and libraries - can run without modification. To a user logged in to the Container, or to an application running in the Container, there is very little evidence that this is not a Solaris 8 system.
The Solaris 8 Container technology rests on a very thin layer of software which performs system call translations - from Solaris 8 system calls to Solaris 10 system calls. This is not binary emulation, and the number of system calls with any difference is small, so the performance penalty is extremely small - typically less than 3%.
Not only is this technology efficient, it's very easy to use. There are five steps, but two of them can be combined into one:
Almost any Solaris 8 revision or patch level will work, but Sun strongly recommends applying the most recent patches to that system. The Solaris 10 system must be running Solaris 10 8/07, and requires the following minimum patch levels:
s10-system# pkgadd -d . SUNWs8brandr SUNWs8brandu SUNWs8p2v
Now we can patch the Solaris 10 system, using the patches listed above.
After patches have been applied, it's time to archive the Solaris 8 system. In order to remove the "archive transfer" step I'll turn the Solaris 10 system into an NFS server and mount it on the Solaris 8 system. The archive can be created by the Solaris 8 system, but stored on the Solaris 10 system. There are several tools which can be used to create the archive: Solaris flash archive tools, cpio, pax, etc. In this example I used flarcreate, which first became available on Solaris 8 2/04.
s10-system# share /export/home/s8-archives
s8-system# mount s10-system:/export/home/s8-archives /mnt
s8-system# flarcreate -S -n atl-sewr-s8 /mnt/atl-sewr-s8.flar
Creation of the archive takes longer than any other step - 15 minutes to an hour, or even more, depending on the size of the Solaris 8 file systems.
With the archive in place, we can configure and install the Solaris 8 Container. In this demonstration the Container was "sys-unconfig'd" by using the -u option. The opposite of that is -p, which preserves the system configuration information of the Solaris 8 system.
s10-system# zonecfg -z test8
zonecfg:test8> create -t SUNWsolaris8
zonecfg:test8> set zonepath=/zones/roots/test8
zonecfg:test8> add net
zonecfg:test8:net> set address=129.152.2.81
zonecfg:test8:net> set physical=vnet0
zonecfg:test8:net> end
zonecfg:test8> exit
s10-system# zoneadm -z test8 install -u -a /export/home/s8-archives/atl-sewr-s8.flar
Log File: /var/tmp/test8.install.995.log
Source: /export/home/s8-archives/atl-sewr-s8.flar
Installing: This may take several minutes...
Postprocessing: This may take several minutes...
Result: Installation completed successfully.
Log File: /zones/roots/test8/root/var/log/test8.install.995.log
This step should take 5-10 minutes. After the Container has been
installed, it can be booted.
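The interactive zonecfg session shown earlier can also be written as a command file, which is easier to review and repeat. Here is a minimal sketch that only prints the commands. The function name s8_zone_commands is hypothetical, and the values are the ones used in this example.

```shell
#!/bin/sh
# Hypothetical helper: print zonecfg commands for a Solaris 8 Container.
# Arguments: zonepath, IP address, physical NIC.
# Nothing is configured here; the commands are only written to stdout.
s8_zone_commands() {
    zonepath=$1
    address=$2
    nic=$3
    cat <<EOF
create -t SUNWsolaris8
set zonepath=$zonepath
add net
set address=$address
set physical=$nic
end
EOF
}

# Print the commands for the test8 Container used in this post.
s8_zone_commands /zones/roots/test8 129.152.2.81 vnet0
```

If the output is saved to a file, it should be possible to hand it to zonecfg with the -f option (for example, zonecfg -z test8 -f followed by the file name). This assumes zonecfg commits the configuration when it reaches the end of the file, as it does at the end of an interactive session.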
s10-system# zoneadm -z test8 boot
s10-system# zlogin -C test8
At this point I was connected to the Container's console. It asked the usual system configuration questions, and then rebooted:
[NOTICE: Zone rebooting]
SunOS Release 5.8 Version Generic_Virtual 64-bit
Copyright 1983-2000 Sun Microsystems, Inc. All rights reserved
Hostname: test8
The system is coming up. Please wait.
starting rpc services: rpcbind done.
syslog service starting.
Print services started.
Apr 1 18:07:23 test8 sendmail[3344]: My unqualified host name (test8) unknown; sleeping for retry
The system is ready.
test8 console login: root
Password:
Apr 1 18:08:04 test8 login: ROOT LOGIN /dev/console
Last login: Tue Apr 1 10:47:56 from vpn-129-150-80-
Sun Microsystems Inc. SunOS 5.8 Generic Patch February 2004
# bash
bash-2.03# psrinfo
0 on-line since 04/01/2008 03:56:38
1 on-line since 04/01/2008 03:56:38
2 on-line since 04/01/2008 03:56:38
3 on-line since 04/01/2008 03:56:38
bash-2.03# ifconfig -a
lo0:1: flags=1000849 mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
vnet0:1: flags=1000843 mtu 1500 index 2
        inet 129.152.2.81 netmask ffffff00 broadcast 129.152.2.255
At this point the Solaris 8 Container exists. It's accessible on the local network, existing applications can be run in it, or new software can be added to it, or existing software can be patched.
To extend the example, here is the output from the commands I used to limit this Solaris 8 Container to only use a subset of the 32 virtual CPUs on that Sun Fire T2000 system.
s10-system# zonecfg -z test8
zonecfg:test8> add dedicated-cpu
zonecfg:test8:dedicated-cpu> set ncpus=2
zonecfg:test8:dedicated-cpu> end
zonecfg:test8> exit
bash-3.00# zoneadm -z test8 reboot
bash-3.00# zlogin -C test8
Console:
[NOTICE: Zone rebooting]
SunOS Release 5.8 Version Generic_Virtual 64-bit
Copyright 1983-2000 Sun Microsystems, Inc. All rights reserved
Hostname: test8
The system is coming up. Please wait.
starting rpc services: rpcbind done.
syslog service starting.
Print services started.
Apr 1 18:14:53 test8 sendmail[3733]: My unqualified host name (test8) unknown; sleeping for retry
The system is ready.
test8 console login: root
Password:
Apr 1 18:15:24 test8 login: ROOT LOGIN /dev/console
Last login: Tue Apr 1 18:08:04 on console
Sun Microsystems Inc. SunOS 5.8 Generic Patch February 2004
# psrinfo
0 on-line since 04/01/2008 03:56:38
1 on-line since 04/01/2008 03:56:38
Finally, to learn more about Solaris 8 Containers:
For those who were counting, the "three commands" were, at a minimum, flarcreate, zonecfg and zoneadm.
Tuesday Apr 08, 2008
Solaris Containers have a 'zonepath' ('home') which can be a directory on the root file system or on a non-root file system. Until Solaris 10 8/07 was released, a local file system was required for this directory. Containers that are on non-root file systems have used UFS, ZFS, or VxFS. All of those are local file systems - putting Containers on NAS has not been possible. With Solaris 10 8/07, that has changed: a Container can now be placed on remote storage via iSCSI.
Each Container has its own root directory. Although viewed as the root directory from within that Container, that directory is also a non-root directory in the global zone. For example, a Container's root directory might be called /zones/roots/myzone/root in the global zone.
The configuration of a Container includes something called its "zonepath." This is the directory which contains a Container's root directory (e.g. /zones/roots/myzone/root) and other directories used by Solaris. Therefore, the zonepath of myzone in the example above would be /zones/roots/myzone.
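The relationship between a zonepath and the Container's root directory is plain path arithmetic. Here is a small sketch of it. The function names zone_root and global_path are hypothetical, and the second one assumes the file lives in the Container's own file system rather than in a directory mounted from elsewhere.

```shell
#!/bin/sh
# Hypothetical helpers illustrating the zonepath / root relationship.

# Given a zonepath, print the Container's root directory
# as it is seen from the global zone.
zone_root() {
    printf '%s/root\n' "$1"
}

# Given a zonepath and a path as seen inside the Container,
# print the matching path in the global zone.
global_path() {
    zonepath=$1
    inner=$2
    printf '%s/root%s\n' "$zonepath" "$inner"
}

zone_root /zones/roots/myzone
global_path /zones/roots/myzone /etc/hosts
```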
The global zone administrator can choose any directory to be a Container's zonepath. That directory could just be a directory on the root partition of Solaris, though in that case some mechanism should be used to prevent that Container from filling up the root partition. Another alternative is to use a separate partition for that Container, or one shared among multiple Containers. In the latter case, a quota should be used for each Container.
Local file systems have been used for zonepaths. However, many people have strongly expressed a desire for the ability to put Containers on remote storage. One significant advantage to placing Containers on NAS is the simplification of Container migration - moving a Container from one system to another. When using a local file system, the contents of the Container must be transmitted from the original host to the new host. For small, sparse zones this can take as little as a few seconds. For large, whole-root zones, this can take several minutes - a whole-root zone is an entire copy of Solaris, taking up as much as 3-5 GB. If remote storage can be used to store a zone, the zone's downtime can be as little as a second or two, during which time a file system is unmounted on one system and mounted on another.
Here are some significant advantages to iSCSI over SANs:
Unfortunately, a Container cannot 'live' on an NFS server, and it's not clear if or when that limitation will be removed.
iSCSI is simply "SCSI communication over IP." In this case, SCSI commands and responses are sent between two iSCSI-capable devices, which can be general-purpose computers (Solaris, Windows, Linux, etc.) or specific-purpose storage devices (e.g. Sun StorageTek 5210 NAS, EMC Celerra NS40, etc.). There are two endpoints to iSCSI communications: the initiator (client) and the target (server). A target publicizes its existence. An initiator binds to a target.
The industry's design for iSCSI includes a large number of features, including security. Solaris implements many of those features. Details can be found:
In Solaris, the command iscsiadm(1M) configures an initiator, and the command iscsitadm(1M) configures a target.
The target system is an LDom on a T2000, and looks like this:
System Configuration: Sun Microsystems sun4v
Memory size: 1024 Megabytes
SUNW,Sun-Fire-T200
SunOS ldg1 5.10 Generic_127111-07 sun4v sparc SUNW,Sun-Fire-T200
Solaris 10 8/07 s10s_u4wos_12b SPARC
The initiator system is another LDom on the same T2000 - although there is no requirement that LDoms are used, or that they be on the same computer if they are used.
System Configuration: Sun Microsystems sun4v
Memory size: 896 Megabytes
SUNW,Sun-Fire-T200
SunOS ldg4 5.11 snv_83 sun4v sparc SUNW,Sun-Fire-T200
Solaris Nevada snv_83a SPARC
The first configuration step is the creation of the storage underlying the iSCSI target. Although UFS could be used, let's improve the robustness of the Container's contents and put the target's storage under control of ZFS. I don't have extra disk devices to give to ZFS, so I'll make some and use them for a zpool - in real life you would use disk devices here:
Target# mkfile 150m /export/home/disk0
Target# mkfile 150m /export/home/disk1
Target# zpool create myscsi mirror /export/home/disk0 /export/home/disk1
Target# zpool status
  pool: myscsi
 state: ONLINE
 scrub: none requested
config:
        NAME                  STATE   READ WRITE CKSUM
        myscsi                ONLINE     0     0     0
        /export/home/disk0    ONLINE     0     0     0
        /export/home/disk1    ONLINE     0     0     0
Now I can create a zvol - an emulation of a disk device:
Target# zfs list
NAME           USED  AVAIL  REFER  MOUNTPOINT
myscsi          86K   258M  24.5K  /myscsi
Target# zfs create -V 200m myscsi/jvol0
Target# zfs list
NAME           USED  AVAIL  REFER  MOUNTPOINT
myscsi         200M  57.9M  24.5K  /myscsi
myscsi/jvol0  22.5K   258M  22.5K  -
Creating an iSCSI target device from a zvol is easy:
Target# iscsitadm list target
Target# zfs set shareiscsi=on myscsi/jvol0
Target# iscsitadm list target
Target: myscsi/jvol0
    iSCSI Name: iqn.1986-03.com.sun:02:c8a82272-b354-c913-80f9-db9cb378a6f6
    Connections: 0
Target# iscsitadm list target -v
Target: myscsi/jvol0
    iSCSI Name: iqn.1986-03.com.sun:02:c8a82272-b354-c913-80f9-db9cb378a6f6
    Alias: myscsi/jvol0
    Connections: 0
    ACL list:
    TPGT list:
    LUN information:
        LUN: 0
            GUID: 0x0
            VID: SUN
            PID: SOLARIS
            Type: disk
            Size: 200M
            Backing store: /dev/zvol/rdsk/myscsi/jvol0
            Status: online
Configuring the iSCSI initiator takes a little more work. There are three methods to find targets; I will use a simple one, SendTargets. Once Solaris has been told to use that method, it only needs the IP address of the target.
Note that the example below uses "iscsiadm list ..." several times, without any output. The purpose is to show the difference in output before and after the command(s) between them.
First let's look at the disks available before configuring iSCSI on the initiator:
Initiator# ls /dev/dsk
c0d0s0 c0d0s2 c0d0s4 c0d0s6 c0d1s0 c0d1s2 c0d1s4 c0d1s6
c0d0s1 c0d0s3 c0d0s5 c0d0s7 c0d1s1 c0d1s3 c0d1s5 c0d1s7
We can view the currently enabled discovery methods, and enable the one we want to use:
Initiator# iscsiadm list discovery
Discovery:
        Static: disabled
        Send Targets: disabled
        iSNS: disabled
Initiator# iscsiadm list target
Initiator# iscsiadm modify discovery --sendtargets enable
Initiator# iscsiadm list discovery
Discovery:
        Static: disabled
        Send Targets: enabled
        iSNS: disabled
At this point we just need to tell Solaris which IP address we want to use as a target. It takes care of all the details, finding all disk targets on the target system. In this case, there is only one disk target.
Initiator# iscsiadm list target
Initiator# iscsiadm add discovery-address 129.152.2.90
Initiator# iscsiadm list target
Target: iqn.1986-03.com.sun:02:c8a82272-b354-c913-80f9-db9cb378a6f6
Alias: myscsi/jvol0
TPGT: 1
ISID: 4000002a0000
Connections: 1
Initiator# iscsiadm list target -v
Target: iqn.1986-03.com.sun:02:c8a82272-b354-c913-80f9-db9cb378a6f6
Alias: myscsi/jvol0
TPGT: 1
ISID: 4000002a0000
Connections: 1
CID: 0
IP address (Local): 129.152.2.75:40253
IP address (Peer): 129.152.2.90:3260
Discovery Method: SendTargets
Login Parameters (Negotiated):
Data Sequence In Order: yes
Data PDU In Order: yes
Default Time To Retain: 20
Default Time To Wait: 2
Error Recovery Level: 0
First Burst Length: 65536
Immediate Data: yes
Initial Ready To Transfer (R2T): yes
Max Burst Length: 262144
Max Outstanding R2T: 1
Max Receive Data Segment Length: 8192
Max Connections: 1
Header Digest: NONE
Data Digest: NONE
The initiator automatically finds the iSCSI remote storage, but
we need to turn this into a disk device. (Newer builds seem to not
need this step, but it won't hurt. Looking in /devices/iscsi will
help determine whether it's needed.)
Initiator# devfsadm -i iscsi
Initiator# ls /dev/dsk
c0d0s0 c0d0s3 c0d0s6 c0d1s1 c0d1s4 c0d1s7 c1t7d0s2 c1t7d0s5
c0d0s1 c0d0s4 c0d0s7 c0d1s2 c0d1s5 c1t7d0s0 c1t7d0s3 c1t7d0s6
c0d0s2 c0d0s5 c0d1s0 c0d1s3 c0d1s6 c1t7d0s1 c1t7d0s4 c1t7d0s7
Initiator# ls -l /dev/dsk/c1t7d0s0
lrwxrwxrwx 1 root root 100 Mar 28 00:40 /dev/dsk/c1t7d0s0 -> ../../devices/iscsi/disk@0000iqn.1986-03.com.sun%3A02%3Ac8a82272-b354-c913-80f9-db9cb378a6f60001,0:a
Now that the local device entry exists, we can do something useful with it. Installing a new file system requires the use of format(1M) to partition the "disk" but it is assumed that the reader knows how to do that. However, here is the first part of the format dialogue, to show that format lists the new disk device with its unique identifier - the same identifier listed in /devices/iscsi.
Initiator# format
Searching for disks...done
c1t7d0: configured with capacity of 199.98MB
AVAILABLE DISK SELECTIONS:
0. c0d0
/virtual-devices@100/channel-devices@200/disk@0
1. c0d1
/virtual-devices@100/channel-devices@200/disk@1
2. c1t7d0
/iscsi/disk@0000iqn.1986-03.com.sun%3A02%3Ac8a82272-b354-c913-80f9-db9cb378a6f60001,0
Specify disk (enter its number): 2
selecting c1t7d0
[disk formatted]
Disk not labeled. Label it now? no
Let's jump to the end of the partitioning steps, after assigning all of
the available disk space to partition 0:
partition> print
Current partition table (unnamed):
Total disk cylinders available: 16382 + 2 (reserved cylinders)
Part      Tag    Flag     Cylinders        Size            Blocks
  0       root    wm       0 - 16381      199.98MB    (16382/0/0) 409550
  1 unassigned    wu       0                0         (0/0/0)          0
  2     backup    wu       0 - 16381      199.98MB    (16382/0/0) 409550
  3 unassigned    wm       0                0         (0/0/0)          0
  4 unassigned    wm       0                0         (0/0/0)          0
  5 unassigned    wm       0                0         (0/0/0)          0
  6 unassigned    wm       0                0         (0/0/0)          0
  7 unassigned    wm       0                0         (0/0/0)          0
partition> label
Ready to label disk, continue? y
The new raw disk needs a file system.
Initiator# newfs /dev/rdsk/c1t7d0s0
newfs: construct a new file system /dev/rdsk/c1t7d0s0: (y/n)? y
/dev/rdsk/c1t7d0s0: 409550 sectors in 16382 cylinders of 5 tracks, 5 sectors
200.0MB in 1024 cyl groups (16 c/g, 0.20MB/g, 128 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
32, 448, 864, 1280, 1696, 2112, 2528, 2944, 3232, 3648,
Initializing cylinder groups:
....................
super-block backups for last 10 cylinder groups at:
405728, 406144, 406432, 406848, 407264, 407680, 408096, 408512, 408928, 409344
Back on the target:
Target# zfs list
NAME           USED  AVAIL  REFER  MOUNTPOINT
myscsi         200M  57.9M  24.5K  /myscsi
myscsi/jvol0  32.7M   225M  32.7M  -
Finally, the initiator has a new file system, on which we can install a zone.
Initiator# mkdir /zones/newroots
Initiator# mount /dev/dsk/c1t7d0s0 /zones/newroots
Initiator# zonecfg -z iscuzone
iscuzone: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:iscuzone> create
zonecfg:iscuzone> set zonepath=/zones/newroots/iscuzone
zonecfg:iscuzone> add inherit-pkg-dir
zonecfg:iscuzone:inherit-pkg-dir> set dir=/opt
zonecfg:iscuzone:inherit-pkg-dir> end
zonecfg:iscuzone> exit
Initiator# zoneadm -z iscuzone install
Preparing to install zone.
Creating list of files to copy from the global zone.
Copying <2762> files to the zone.
Initializing zone product registry.
Determining zone package initialization order.
Preparing to initialize <1162> packages on the zone.
...
Initialized <1162> packages on zone.
Zone is initialized.
Installation of these packages generated warnings:
The file contains a log of the zone installation.
There it is: a Container on an iSCSI target on a ZFS zvol.
You can use Solaris Live Upgrade (LU) to patch or upgrade a system with Containers. If the Containers are on a traditional file system which uses UFS (e.g. /, /export/home), LU will automatically do the right thing. Further, if you create a UFS file system on an iSCSI target and install one or more Containers on it, the alternate boot environment (ABE) will also need file space for its copy of those Containers. To mimic the layout of the original boot environment (BE) you could use another UFS file system on another iSCSI target. The lucreate command would look something like this:
# lucreate -m /:/dev/dsk/c0t0d0s0:ufs -m /zones:/dev/dsk/c1t7d0s0:ufs -n newBE
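When a boot environment spans several file systems, the -m options are easy to mistype. Here is a rough sketch that assembles the command line from a list of mountpoint:device pairs and prints it for review. The function name lucreate_command is hypothetical, every file system is assumed to be UFS, and nothing is executed.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) an lucreate command line.
# Arguments: new BE name, then one mountpoint:device pair per file system.
# Every file system is assumed to be UFS, as in the example in this post.
lucreate_command() {
    be_name=$1
    shift
    cmd="lucreate"
    for pair in "$@"; do
        cmd="$cmd -m $pair:ufs"
    done
    printf '%s -n %s\n' "$cmd" "$be_name"
}

# Rebuild the command shown in this post.
lucreate_command newBE /:/dev/dsk/c0t0d0s0 /zones:/dev/dsk/c1t7d0s0
```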
Friday Mar 21, 2008
Here's another example of Containers that can manage their own affairs.
Sometimes you want to closely manage the devices that a Solaris Container uses. This is easy to do from the global zone: by default a Container does not have direct access to devices. It does have indirect access to some devices, e.g. via a file system that is available to the Container.
By default, zones use NICs that they share with the global zone, and perhaps with other zones. In the past these were just called "zones." Starting with Solaris 10 8/07, these are now referred to as "shared-IP zones." The global zone administrator manages all networking aspects of shared-IP zones.
Sometimes it would be easier to give direct control of a Container's devices to its owner. An excellent example of this is the option of allowing a Container to manage its own network interfaces. This enables it to configure IP Multipathing for itself, as well as IP Filter and other network features. Using IPMP increases the availability of the Container by creating redundant network paths to the Container. When configured correctly, this can prevent the failure of a network switch, network cable or NIC from blocking network access to the Container.
As described at docs.sun.com, to use IP Multipathing you must choose two network devices of the same type, e.g. two ethernet NICs. Those NICs are placed into an IPMP group through the use of the command ifconfig(1M). Usually this is done by placing the appropriate ifconfig parameters into files named /etc/hostname.<NIC-instance>, e.g. /etc/hostname.bge0.
An IPMP group is associated with an IP address. Packets leaving any NIC in the group have a source address of the IPMP group. Packets with a destination address of the IPMP group can enter through either NIC, depending on the state of the NICs in the group.
Delegating network configuration to a Container requires use of the new IP Instances feature. It's easy to create a zone that uses this feature, making this an "exclusive-IP zone." One new line in zonecfg(1M) will do it:
zonecfg:twilight> set ip-type=exclusive
Of course, you'll need at least two network devices in the IPMP group. Using IP Instances will dedicate these two NICs to this Container exclusively. Also, the Container will need direct access to the two network devices. Configuring all of that looks like this:
global# zonecfg -z twilight
zonecfg:twilight> create
zonecfg:twilight> set zonepath=/zones/roots/twilight
zonecfg:twilight> set ip-type=exclusive
zonecfg:twilight> add net
zonecfg:twilight:net> set physical=bge1
zonecfg:twilight:net> end
zonecfg:twilight> add net
zonecfg:twilight:net> set physical=bge2
zonecfg:twilight:net> end
zonecfg:twilight> add device
zonecfg:twilight:device> set match=/dev/net/bge1
zonecfg:twilight:device> end
zonecfg:twilight> add device
zonecfg:twilight:device> set match=/dev/net/bge2
zonecfg:twilight:device> end
zonecfg:twilight> exit
As usual, the Container must be installed and booted with zoneadm(1M):
global# zoneadm -z twilight install
global# zoneadm -z twilight boot
Now you can login to the Container's console and answer the usual configuration questions:
global# zlogin -C twilight
<answer questions>
<the zone automatically reboots>
After the Container reboots, you can configure IPMP. There are two methods. One uses link-based failure detection and one uses probe-based failure detection.
Link-based detection requires the use of a NIC which supports this feature. Some NICs that support this are hme, eri, ce, ge, bge, qfe and vnet (part of Sun's Logical Domains). They are able to detect failure of the link immediately and report that failure to Solaris. Solaris can then take appropriate steps to ensure that network traffic continues to flow on the remaining NIC(s).
Other NICs do not support this link-based failure detection, and must use probe-based detection. This method uses ICMP packets ("pings") from the NICs in the IPMP group to detect failure of a NIC. This requires one IP address per NIC, in addition to the IP address of the group.
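The address requirement of each method is simple arithmetic. Here is a small sketch of it, assuming a single data address for the group, as in this post. The function name ipmp_address_count is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: how many IP addresses an IPMP group needs,
# assuming one data address for the whole group.
# Arguments: detection method ("link" or "probe"), number of NICs.
ipmp_address_count() {
    method=$1
    nics=$2
    case $method in
        link)  echo 1 ;;                 # only the group's data address
        probe) echo $((nics + 1)) ;;     # one test address per NIC, plus the data address
        *)     echo "unknown method: $method" >&2; return 1 ;;
    esac
}

ipmp_address_count link 2
ipmp_address_count probe 2
```

For the two-NIC group in this post, probe-based detection therefore needs three addresses, each with its own entry in /etc/inet/hosts.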
Regardless of the method used, configuration can be accomplished manually or via files /etc/hostname.<NIC-instance>. First I'll describe the manual method.
# ifconfig bge1 plumb
# ifconfig bge1 twilight group ipmp0 up
# ifconfig bge2 plumb
# ifconfig bge2 group ipmp0 up
Note that those commands only achieve the desired network configuration until the next time that Solaris boots. To configure Solaris to do the same thing when it next boots, you must put the same configuration information into configuration files. Inserting those parameters into configuration files is also easy:
/etc/hostname.bge1:
twilight group ipmp0 up
/etc/hostname.bge2:
group ipmp0 up
Those two files will be used to configure networking the next time that Solaris boots. Of course, an IP address entry for twilight is required in /etc/inet/hosts.
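The two file bodies above follow one pattern: the NIC that carries the data address names the host first, and every NIC names the group. Here is a minimal sketch that prints the body for either case. The function name hostname_file_contents is hypothetical, and it covers only the link-based configuration shown here.

```shell
#!/bin/sh
# Hypothetical helper: print the body of /etc/hostname.<NIC> for a
# link-based IPMP group. It only writes to stdout; no file is changed.
# Arguments: IPMP group name, host name for the data address
# (pass an empty string for a NIC that carries no data address).
hostname_file_contents() {
    group=$1
    data_host=$2
    if [ -n "$data_host" ]; then
        printf '%s group %s up\n' "$data_host" "$group"
    else
        printf 'group %s up\n' "$group"
    fi
}

hostname_file_contents ipmp0 twilight   # body for /etc/hostname.bge1
hostname_file_contents ipmp0 ""         # body for /etc/hostname.bge2
```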
If you have entered the ifconfig commands directly, you are finished. You can test your IPMP group with the if_mpadm command, which can be run in the global zone, to test an IPMP group in the global zone, or can be run in an exclusive-IP zone, to test one of its groups:
# ifconfig -a
...
bge1: flags=201000843 mtu 1500 index 4
        inet 129.152.2.72 netmask ffff0000 broadcast 129.152.255.255
        groupname ipmp0
        ether 0:14:4f:f8:9:1d
bge2: flags=201000843 mtu 1500 index 5
        inet 0.0.0.0 netmask ff000000
        groupname ipmp0
        ether 0:14:4f:fb:ca:b
...
# if_mpadm -d bge1
# ifconfig -a
...
bge1: flags=289000842 mtu 0 index 4
        inet 0.0.0.0 netmask 0
        groupname ipmp0
        ether 0:14:4f:f8:9:1d
bge2: flags=201000843 mtu 1500 index 5
        inet 0.0.0.0 netmask ff000000
        groupname ipmp0
        ether 0:14:4f:fb:ca:b
bge2:1: flags=201000843 mtu 1500 index 5
        inet 129.152.2.72 netmask ffff0000 broadcast 129.152.255.255
...
# if_mpadm -r bge1
# ifconfig -a
...
bge1: flags=201000843 mtu 1500 index 4
        inet 129.152.2.72 netmask ffff0000 broadcast 129.152.255.255
        groupname ipmp0
        ether 0:14:4f:f8:9:1d
bge2: flags=201000843 mtu 1500 index 5
        inet 0.0.0.0 netmask ff000000
        groupname ipmp0
        ether 0:14:4f:fb:ca:b
...
If you are using link-based detection, that's all there is to it!
As mentioned above, using probe-based detection requires more IP addresses:
/etc/hostname.bge1:
twilight netmask + broadcast + group ipmp0 up addif twilight-test-bge1 \
deprecated -failover netmask + broadcast + up
/etc/hostname.bge2:
twilight-test-bge2 deprecated -failover netmask + broadcast + group ipmp0 up
Three entries for hostname and IP address pairs will, of course, be needed in /etc/inet/hosts.
All that's left is a reboot of the Container. If a reboot is not practical at this time, you can accomplish the same effect by using ifconfig(1M) commands:
twilight# ifconfig bge1 plumb
twilight# ifconfig bge1 twilight netmask + broadcast + group ipmp0 up addif \
twilight-test-bge1 deprecated -failover netmask + broadcast + up
twilight# ifconfig bge2 plumb
twilight# ifconfig bge2 twilight-test-bge2 deprecated -failover netmask + \
broadcast + group ipmp0 up
Whether link-based failure detection or probe-based failure detection is used, we have a Container with these network properties:
Tuesday Feb 05, 2008
Tuesday Oct 16, 2007
It's time for a "shameless plug"...
If you would like to develop deeper Solaris skills, LISA'07 offers some excellent opportunities. LISA is a conference organized by Usenix, and is intended for Large Installation System Administrators. This year, LISA will be held in Dallas, Texas, November 11-16. It includes vendor exhibits, training sessions and invited talks. This year the keynote address will be delivered by John Strassner, Motorola Fellow and Vice President, and is entitled "Autonomic Administration: HAL 9000 Meets Gene Roddenberry."
Many tutorials will be available, including four full-day sessions focusing on Solaris:
Early-bird registration ends this Friday, October 19 and saves $Hundreds compared to the Procrastinator's Rate.