Monday May 04, 2009
One of my former colleagues, Joe Yanushpolsky (josephy100 -AT- gmail.com), was recently involved in moving a latency-sensitive Linux application to Solaris as part of a platform consolidation. The code was old, and it required access to kernel routines not available under BrandZ. Using VirtualBox as a virtual x86 system, the task was easier than expected.
Here is Joe's report of his tests.
The goals included allowing many people to independently run this application while sharing a server. It would be important to isolate each user from other users. But the resource controls included with VirtualBox were not sufficiently granular for the overall purpose. Solaris Containers (zones) have a richer set of resource controls. Would it be possible to combine Containers and VirtualBox?
The answer was 'yes' - I tried two slightly different methods. Each method starts by installing VirtualBox in the global zone to set up a device entry and some of the software. Details are provided later. After that is complete, the two methods differ.
When the process is complete, you have a guest OS, shown here via X Windows.
Not only did the code run well, but it did so in a sparse-root non-global zone.
Well that was easy! How about Windows?
Now, this is interesting. As long as the client VM is supported by VirtualBox, it can be installed and run in a Solaris/OpenSolaris Container. I immediately thought of several useful applications of this combination of virtualization technologies:
Here are the highlights of "How to install." For more details, follow instructions in the VirtualBox User manual.
Friday May 01, 2009
I fixed all three bugs and posted v1.4.1 on the web site: http://opensolaris.org/os/project/zonestat/
Specifically:
Thursday Apr 23, 2009
Recently a young person asked me why Pluto isn't a planet. This seemed like a good educational opportunity. The explanation I used is much simpler than the official, scientific explanation - with its "planetary discriminants" and "aggregate masses" - and turned out better than I anticipated, so I thought I would share it with you. It seems appropriate for people with at least a fourth or fifth grade education.
The study of science results in "an organized body of knowledge gained through ... research." Observational science gathers data about the universe and classifies objects, life forms, etc. according to characteristics of those things.
Applying those concepts to our solar system, we can measure those objects and group those with similar characteristics together.
Let's try that.
One of the most obvious characteristics of solar system objects is their
composition. All of these objects fall neatly into one of three categories
as shown by the accompanying graph:

Although there is overlap in density between the gaseous objects and the
rocky/icy objects, no overlap between their sizes (masses) exists, as
this next graph shows (Earth is arbitrarily assigned a value of 1,000,000
and the rest are scaled to that value):

Visually, three groupings are discernible: the gaseous objects, the large rocky objects, and everything else. Mathematically, the groupings are separated by more than an order of magnitude. In other words, the smallest member of one group is at least ten times the mass of the largest member of the next group. Uranus is more than 14 times the mass of Earth, and Mercury is almost 20 times the mass of Eris, which is more massive than Pluto.
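The mass gaps between the groups can be checked with a little arithmetic. The sketch below is only an illustration: the masses are approximate, rounded values assumed for this example (they are not read from the graph), so treat the ratios as rough.

```python
# Approximate masses in kilograms (assumed, rounded values for illustration).
MASS_KG = {
    "Uranus": 8.68e25,
    "Earth": 5.97e24,
    "Mercury": 3.30e23,
    "Eris": 1.66e22,
    "Pluto": 1.30e22,
}

def mass_ratio(heavier: str, lighter: str) -> float:
    """Return how many times more massive `heavier` is than `lighter`."""
    return MASS_KG[heavier] / MASS_KG[lighter]

# Smallest gaseous object vs. largest rocky object.
print(f"Uranus / Earth : {mass_ratio('Uranus', 'Earth'):.1f}")
# Smallest large rocky object vs. the most massive of the remaining bodies.
print(f"Mercury / Eris : {mass_ratio('Mercury', 'Eris'):.1f}")
print(f"Eris / Pluto   : {mass_ratio('Eris', 'Pluto'):.2f}")
```

Both gaps come out larger than a factor of ten, which is the separation the paragraph above describes.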
Besides physical characteristics, the most useful ones are the orbital elements.
All members of our solar system orbit the sun, or orbit another non-stellar object which in turn orbits the sun. These orbits are, almost entirely, described by Newtonian mechanics, the basic elements of which were first described by Johannes Kepler in 1609. Although an orbit has several characteristics, the simplest of them is the semi-major axis, which is often called the "average" distance between the orbiting body and the sun. Although not quite accurate, it's close enough for this purpose.
Here is a graph of 17 of the most important bodies in our solar system. It
shows the semi-major axis of each, relative to the semi-major axis of Earth,
which is called an Astronomical Unit.
It includes the four major rocky objects, the four gaseous objects, all
five of the currently recognized dwarf planets, three asteroids, and five
other relevant objects. Note that the orbital distances of Eris (68 AU)
and Sedna (526 AU) are off the scale of this graph.

Again, three groups appear in the graph: the inner rocky objects, the
gaseous objects, and the outer bodies. Separation between objects increases
from the inner bodies to the outer ones, but suddenly, starting with Orcus,
the separation between orbital distances shrinks considerably.
From those three characteristics (density, mass, and orbital distance), it seems clear that there are at least three groups of major bodies in the solar system:
| Object Type | Density g/cm^3 | Mass (Earth=1) | Semi-Major Axis (AU, Earth=1) |
|---|---|---|---|
| Inner, rocky | >3.8 | >0.1 | <2 |
| Giant, gaseous | <1.5 | >14 | 5-32 |
| Distant, icy, rocky | <2 (except one) | <1/200th | >35 |
All three of those groupings can be displayed in one graph, which shows
the distinction between the three different groups at a glance:

In the graph above, the green bars show the range of values for the inner, rocky bodies, scaled so that the highest value is 100. The blue bars show the ranges of values for the gaseous bodies. The purplish bars show the ranges of values for the outer icy, rocky bodies.
Note that only two ranges overlap: density for the gaseous and the icy, rocky bodies. For all of the other characteristics, there are clear gaps between the ranges of the groups.
Now that we have clear groupings, the question becomes "which of those groups should be planets?" I hope it's obvious that the first two categories should be included in the list of 'planets.' The third group can be included or not, depending on how you want to define the term 'planet.'
However, there are two other factors which help me to decide. Initially, 'planets' were the five wandering lights that weren't the Sun and Moon. These seven wandering lights were so important that early western cultures assigned a day of worship to each, leading eventually to the names of our days.
If 'planets' started with the five wandering stars, it makes sense to add other bodies which have similar characteristics - Uranus and Neptune - yielding a total of eight planets. But none of the others - Pluto, Haumea, Quaoar, etc. - are like the original wandering lights in the sky.
Further, if we were to include the outer, rocky, icy bodies in the list of planets, the list grows significantly. Today, the list would include 13 members, but another 40 known objects might be categorized with Eris, Pluto, et al., and another 150 or more are probably out there. If the category 'planet' can have 8 members or 200, I'll go with 8.
Finally, regarding the question "is it 'right' to 'demote' Pluto?" The list of planets has grown and shrunk several times throughout history. More than 25 bodies have been labeled 'planets' only to be 'demoted' later. Pluto is nothing special in this regard.
Wednesday Apr 08, 2009
I have posted Zonestat v1.4 at the Zone Statistics project page (click on "Files" in the left navbar).
Zonestat is a 'dashboard' for Solaris Containers. It shows resource consumption of each Container (aka Zone) and a comparison of consumption against limits you have set.
Changes from v1.3:
Note that the addition of a timestamp to -P output changes the output format for "machine-readable" output.
For most people, the most important change will be the use of DTrace to collect CPU% data. This has two effects. The first effect is improved correctness. The prstat command - used in v1.3 and earlier - can horribly underestimate CPU cycles consumed because it can miss many short-lived processes. The mpstat command has its own problems with mis-counting CPU usage. So I expanded on a solution Jim Fiori offered, which uses DTrace to answer the question "which zone is using a CPU right now?"
The other benefit to DTrace is the improvement in performance of Zonestat.
The less popular, but still interesting additions include:
Please send questions and requests to zones-discuss@opensolaris.org .
Wednesday Apr 01, 2009
If you have patched a system with many zones, you have learned that it takes longer than patching a system without zones. The more zones there are, the longer it takes. In some cases, this can raise application downtime to an unacceptable duration.
Fortunately, there are a few methods which can be used to reduce application downtime. This document mentions many of them, and then describes the performance enhancements of two of them. But the bulk of this rather bulky entry is the description and results of my newest computer... "experiments."
Executive Summary, for the Attention-Span Challenged
It's important to distinguish between application downtime, service downtime, zone downtime, and platform downtime. 'Service' is the service being provided by an application or set of applications. To users, that's the most important measure. As long as they can access the service, they're happy. (Doesn't take much, does it?)
If a service depends on the proper operation of each of its component applications, planned or unplanned downtime of one application will result in downtime of the service. Some software, e.g. web server software, can be deployed in multiple, load-balanced systems so that the service will not experience downtime even if one of the software instances is down.
Applying an operating system patch may require service downtime, application downtime, zone downtime or platform downtime, depending on the patch and the entity being patched. Because in many cases patch application will require application downtime, the goal of the methods mentioned below, especially parallel patching of zones, is to minimize elapsed downtime to achieve a patched, running system.
Disclaimer 1: the "Zones Parallel Patching" patch ("ZPP") is still in testing and has not yet been released. It is expected to be released mid-CY2009. That may change. Further, the specific code changes may change, which may change the results described below.
Disclaimer 2: the experiment described below, and its results, are specific to one type of system (Sun Fire T2000) and one patch (120543-14 - "the Apache patch"). Other hardware and other patches will produce different results.
I wanted to better understand two methods of accelerating the patching of zoned systems, especially when used in combination. Currently, a patch applied to the global zone will normally be applied to all non-global zones, one zone at a time. This is a conservative approach to the task of patching multiple zones, but doesn't take full advantage of the multi-tasking abilities of Solaris.
I learned that a proposed patch was created that enables the system administrator to apply a patch in the global zone which patches the global and then patches multiple zones at the same time. The parallelism (i.e. "the number of zones that are patched at one time") can be chosen before the patch is applied. If there are multiple "Solaris CPUs" in the system, multiple CPUs can be performing computational steps at the same time. Even if there aren't many CPUs, one zone's patching process can be using a CPU while another's is writing to a disk drive.
<tangent topic="Solaris vCPU"> I use the phrase "Solaris CPUs" to refer to the view that Solaris has of CPUs. In the old days, a CPU was a CPU - one chip, one computational entity, one ALU, one FPU, etc. Now there are many factors to consider - CPU sockets, CPU cores per socket, hardware threads per core, etc. Solaris now considers "vCPUs" - virtual processors - as entities on which to schedule processes. Solaris considers each of these a vCPU:
Separately, I realized that one part of patching is disk-intensive. Many disk-intensive workloads benefit from writing to a solid-state disk (SSD) because of the performance benefit of those devices over spinning-rust disk drives (HDD).
So finally (hurrah!) the goal of this adventure: how much performance advantage would I achieve with the combination of parallel patching and an SSD, compared to sequential patching of zones on an HDD?
I took advantage of an opportunity to test both of these methods to accelerate patching. The system was a Sun Fire T2000 with two HDDs and one SSD. The system had 32 vCPUs, was not using Logical Domains, and was running Solaris 10 10/08. Solaris was installed on the first HDD. Both HDDs were 72GB drives. The SSD was a 32GB device. (Thank you, Pat!)
For some of the tests I also applied the ZPP. (Thank you, Enda!) For some of the tests I used zones that had zonepaths on the SSD; the rest 'lived' on the second HDD.
As with all good journeys, this one had some surprises. And, as with all good research reports, this one has a table with real data. And graphs later on.
To get a general feel for the different performance of an HDD vs. an SSD, I created a zone on each - using the secondary HDD - and made clones of it. Sometimes I made just one clone at a time, other times I made ten clones simultaneously. The iostat(1) tool showed me the following performance numbers:
| Test | r/s | w/s | kr/s | kw/s | wait | actv | svc_t | %w | %b |
|---|---|---|---|---|---|---|---|---|---|
| clone x1 on HDD | 56 | 227 | 833 | 3323 | 0 | 6.4 | 23 | 1 | 72 |
| clone x1 on SSD | 35 | 379 | 115 | 6165 | 0 | 0 | 0 | 1 | 6 |
| clone x10 on HDD | 35 | 470 | 182 | 3274 | 31 | 95 | 246 | 25 | 99 |
| clone x10 on SSD | 354 | 2958 | 2413 | 30262 | 4 | 15 | 4 | 10 | 34 |
At light load - just one clone at a time - the SSD performs better than the HDD, but at heavy load the SSD performs much, much better, e.g. about nine times the write throughput and more than six times the write IOPS of the HDD, and the device driver and SSD still have room for more (34% busy vs. 99% busy).
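For readers who want to check the heavy-load comparison, here is a small sketch that computes the SSD-to-HDD ratios directly from the two "clone x10" rows of the table. The dictionary layout and function name are just illustrative choices.

```python
# iostat figures copied from the table above: (writes/s, KB written/s, %busy).
IOSTAT = {
    "clone x10 on HDD": (470, 3274, 99),
    "clone x10 on SSD": (2958, 30262, 34),
}

def ratio(metric_index: int) -> float:
    """SSD value divided by HDD value for one metric under heavy load."""
    ssd = IOSTAT["clone x10 on SSD"][metric_index]
    hdd = IOSTAT["clone x10 on HDD"][metric_index]
    return ssd / hdd

write_iops_ratio = ratio(0)
write_throughput_ratio = ratio(1)
print(f"write IOPS:       {write_iops_ratio:.1f}x")
print(f"write throughput: {write_throughput_ratio:.1f}x")
```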
Cloning a zone consists almost entirely of copying files. Patching has a higher proportion of computation, but those results gave me high hopes for patching. I wasn't disappointed. (Evidently, every good research report also includes foreshadowing.)
In addition to measuring the performance boost of the ZPP, I wanted to know if that patch would help - or hurt - a system without using its parallelization feature. (I didn't have a particular reason to expect non-parallelized improvement, but occasionally I'm an optimist. Besides, if the performance with the ZPP was different without actually using parallelization, it would skew the parallelized numbers.) So before installing the patch, I measured the length of time to apply a patch. For all of my measurements, I used patch 120543-14 - the Apache patch. At 15MB, it's not a small patch, nor is it a very large patch. (The "Baby Bear" patch, perhaps? --Ed.) It's big enough to tax the system and allow reasonable measurements, but small enough that I could expect to gather enough data to draw useful conclusions, without investing a year of time...
So, before applying the ZPP, and without any zones on the system, I applied the Apache patch. I measured the elapsed time because our goal is to minimize elapsed time of patch application. Then I removed the Apache patch.
Then I added a zone to the system, on the secondary HDD, and re-applied the Apache patch to the system, which automatically applied it to the zone as well. I removed the patch, created two more zones, and applied the same patch yet again. Finally, I compared the elapsed time of all three measurements. Patching the global zone alone took about 120 seconds. Patching with one non-global zone took about 175 seconds: 120 for the global zone and 55 for the zone. Patching three zones took about 285 seconds: 120 seconds for the global zone and 55 seconds for each of the three zones.
Theoretically, the length of time to patch each zone should be consistent. Testing that theory, I created a total of 16 zones and then applied the Apache patch. No surprises: 55 seconds per zone.
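Those measurements fit a simple linear model. The sketch below encodes it; the constants are the rounded timings reported above for this one T2000 and this one patch, and the function name is made up for illustration, so don't expect the same numbers elsewhere.

```python
GLOBAL_ZONE_SECONDS = 120   # measured time to patch the global zone alone
PER_ZONE_SECONDS = 55       # measured additional time per non-global zone

def sequential_patch_seconds(zones: int) -> int:
    """Estimated elapsed time when zones are patched one at a time."""
    if zones < 0:
        raise ValueError("zones must be non-negative")
    return GLOBAL_ZONE_SECONDS + PER_ZONE_SECONDS * zones

for n in (0, 1, 3, 16):
    print(f"{n:2d} zones: ~{sequential_patch_seconds(n)} s")
```

With 16 zones the model predicts roughly 1000 seconds of sequential patching, which is the elapsed time that parallel patching tries to cut down.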
To test non-parallel performance of the ZPP, I applied it, kept the default setting of "no parallelization," and then re-did those tests. Application of the Apache patch did not change in behavior nor in elapsed time per zone, from zero to 16 zones. (However, I had a faint feeling that Solaris was beginning to question my sanity. "Get inline," I told it...)
How about the SSD - would it improve patch performance with zero or more zones? I removed the HDD zones and installed a succession of zones - zero to 16 - on the SSD and applied the Apache patch each time. The SSD did not help at all - the patch still took 55 seconds per zone. Evidently this particular patch is not I/O bound; it is CPU bound.
But applying the ZPP does not, by default, parallelize anything. To tell the patch tools that you would like some level of parallelization, e.g. "patch four zones at the same time," you must edit a specific file in the /etc/patch directory and supply a number, e.g. '4'. After you have done that, if parallel patching is possible, it will happen automatically. Multiple zones (e.g. four) will be patched at the same time by a patchadd process running in each zone. Because that patchadd is running in a zone, it will use the CPUs that the zone is assigned to use - default or otherwise. This also means that the zone's patchadd process is subject to all of the other resource controls assigned to the zone, if any.
That seems like magic! What's the trick? A few paragraphs back, I mentioned the 'trick': using all of the scalability of Solaris and, in this case, CMT systems. Patching a system without ZPP - especially one without a running application - leaves plenty of throughput performance "on the table." Patching multiple zones simultaneously uses CPU cycles - presumably cycles that would have been idle. And it uses I/O channel and disk bandwidth - also, hopefully, available bandwidth. Essentially, ZPP is shortening the elapsed time by using more CPU cycles and I/O bandwidth now instead of using them later.
So the main caution is "make sure there is sufficient compute and I/O capacity to patch multiple zones at the same time."
But whenever multiple apps are running on the same system at the same time, the operating system must perform extra tasks to enable them to run safely. It doesn't matter if the "app" is a database server or 'patchadd.' So is ZPP using any "extra" CPU, i.e. is there any CPU overhead?
Along the way, I collected basic CPU statistics, including system and user time. The next graph shows that the amount of total CPU time (user+sys) increased slightly. The overhead was less than 10% for up to 8 zones. Another coincidence? I don't know, but at that point the overhead was roughly 1% per zone. The overhead increased faster beyond P=8, indicating that, perhaps, a good rule of thumb is P="number of unused cores." Of course, if the system is using Dynamic Resource Pools or dedicated-cpus, the rule might need to be changed accordingly. TANSTAAFL.
Getting the ZPP requires waiting until mid-year. Getting SSDs is easy - they're available for the Sun 7210 and 7410 Unified Storage Systems and for Sun systems.
Wednesday Mar 04, 2009
AMD has been slowly spinning off its chip fab business for the past few years, and in the process is building a new US$4.2 billion fab in the northeast USA - near Albany, New York. Cash-strapped AMD was only able to do this via a joint venture with an investment fund headquartered in Abu Dhabi.
More details are available at eWeek, PCWorld, and the major newspaper in Albany, the Times Union.
Tuesday Feb 10, 2009
Recently, Thomson Reuters "demonstrated that RMDS [Reuters Market Data System software] performs better in a virtualized environment with Solaris Containers than it does with a number of individual Sun server machines."
This enabled Thomson Reuters to break the "million-messages-per-second barrier."
The performance improvement is probably due to the extremely high bandwidth, low latency characteristics of inter-Container network communications. Because all inter-Container network traffic is accomplished with memory transfers - using default settings - packets 'move' at computer memory speeds, which are much better than common 100Mbps or 1Gbps ethernet bandwidth. Further, that network performance is much more consistent without extra hardware - switches and routers - that can contribute to latency.
Articles can be found at: http://finance.yahoo.com/news/Sun-Microsystems-and-Thomson-bw-14306924.html
Thursday Jan 29, 2009
The most significant new functionality is a feature called "Zone Clusters" which, at this point, 'merely' provides support for Oracle RAC nodes in Zones. In other words, you can create an Oracle RAC cluster, using individual zones in a Solaris Cluster as RAC nodes.
Further, because a Solaris Cluster can contain multiple Zone Clusters, it can contain multiple Oracle RAC clusters. For details about configuring a zone cluster, see "Configuring a Zone Cluster" in the Sun Cluster Software Installation Guide and the clzonecluster(1CL) man page.
The second new feature is support for exclusive-IP zones. Note that this only applies to failover data services, not scalable data services nor with zone clusters.
Friday Jan 09, 2009
To shine some light on the topic of Zones and security, Glenn Brunette and I recently co-authored a new Sun BluePrint with an overly long name - "Understanding the Security Capabilities of Solaris Zones Software." You can find it at http://www.sun.com/blueprints.
Tuesday Jan 06, 2009
A completely random thought: is there a name for the motion of a carousel horse? One of those would be informative and impish:
[For you overly serious types - Yes, I know that both of those answers are incorrect because 'donut' and 'torus' refer to 3-dimensional surfaces. My question and answers are not meant to be geometrically correct. However, if there really is a term describing the combination of sinusoidal and circular motions of a carousel horse, I would like to know what it is.]
Monday Dec 22, 2008
However, with every challenge there are opportunities - in this case, photographic ones. So I fired up the DSLR and started snapping - pictures, not wires.
One pine tree in my backyard was so laden with ice that its tip - normally
25 feet in the air - was dangling in the pond. It looks like the pine tree
was thirsty and is taking a drink. (Click on the image to see a larger image.)

Later, the surface of the pond froze, trapping the tip in the pond.
Fortunately - for the tree - the surface melted two days later, allowing it
to shake itself free.
A birch tree in the front yard performed a similar feat, but it looked
more like it was bowing. I doubt it was trying to lick the snow - it knows better.

I like the loss of background clutter that night - and flash! - brings to shots
like that one.
Another birch was bent, and its upper half reached, like fingers, through
the branches of a Shadblow tree, itself coated in ice.

Another night shot, a large pine seems to loom gloomily over a 6-foot
blue spruce. Normally its arms jut out parallel to the ground, but the ice
pinned its arms to its sides.

But by far the worst damage nearby was a 40-to-45-foot pine tree in
the backyard. For years, it has been leaning out over the pond. No more -
the weight of the ice snapped it in two, about eight feet up the trunk.
In the first image, only the remaining trunk is obvious...

...but in the next picture, it's clear that the tree decided to "take a dip"
in the pond. To give you some scale, the pond is 40 feet wide. The tree
reached all the way across and stripped some branches off of a tree on
the far side of the pond.

As someone mentioned to me - the ice storm was "just Mother Nature doing some pruning."
P.S. Nighttime brought another interesting view: moonlight refracting through the ice on
tree branches.

Wednesday Dec 10, 2008
Monday Nov 24, 2008
It's - already - time for a zonestat update. I was never happy with the method that zonestat used to discover the mappings of zones to resource pools, but wanted to get v1.1 "out the door" before I had a chance to improve on its use of zonecfg(1M). The obvious problem, which at least one person stumbled over, was the fact that you can re-configure a zone while it's running. After doing that, the configuration information doesn't match the current mapping of zone to pool, and zonestat became confused.
Anyway, I found the time to replace the code in zonestat which discovered zone-to-pool mappings with a more sophisticated method. The new method uses ps(1) to learn the PID of each zone's [z]sched process. Then it uses "poolbind -q <PID>" to look up the pool to which that process is bound. The result is more accurate data, but the ps command does use more CPU cycles.
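To make the two-step lookup concrete, here is a rough sketch in Python. It is not zonestat's actual code. The ps column list (zone,pid,comm) and the "PID POOLNAME" shape of the poolbind -q output are assumptions on my part, and the sample text is made up; check the man pages on your own system before relying on either.

```python
import subprocess

def parse_sched_pids(ps_output: str) -> dict[str, int]:
    """Map zone name -> PID of its sched/zsched process.

    Expects lines of "ZONE PID COMMAND" (assumed `ps -e -o zone,pid,comm`).
    """
    pids = {}
    for line in ps_output.splitlines()[1:]:        # skip the header line
        fields = line.split(None, 2)
        if len(fields) == 3 and fields[2].strip() in ("sched", "zsched"):
            pids[fields[0]] = int(fields[1])
    return pids

def parse_poolbind(poolbind_output: str) -> str:
    """Extract the pool name from assumed "PID POOLNAME" output."""
    return poolbind_output.split()[1]

def zone_to_pool() -> dict[str, str]:
    """Query a live Solaris system; not exercised by the sample below."""
    ps_out = subprocess.run(["ps", "-e", "-o", "zone,pid,comm"],
                            capture_output=True, text=True, check=True).stdout
    mapping = {}
    for zone, pid in parse_sched_pids(ps_out).items():
        out = subprocess.run(["poolbind", "-q", str(pid)],
                             capture_output=True, text=True, check=True).stdout
        mapping[zone] = parse_poolbind(out)
    return mapping

# Made-up sample text standing in for real command output.
SAMPLE_PS = """\
    ZONE   PID COMMAND
  global     0 sched
  global     1 /sbin/init
    db01  4321 zsched
"""
print(parse_sched_pids(SAMPLE_PS))
print(parse_poolbind("4321\tpool_default\n"))
```

Keeping the parsing separate from the subprocess calls means the logic can be tried on any machine, even one without zones.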
While performing surgery on zonestat, I also:
You can find it at http://opensolaris.org/os/project/zonestat.
Tuesday Nov 18, 2008
I began to list them: prstat(1M), poolstat(1M), ipcs(1), kstat(1M), rcapstat(1), prctl(1), ... Obviously it would be easier to monitor Containers if there was one 'dashboard' to view. Such a dashboard would enable zone administrators to easily review zones' usage of system resources and decide if further investigation is necessary. Also, if there is a system-wide shortage of a resource, this tool would be the first out of the toolbox, simplifying the task of finding the 'resource hog.'
Its output looks like this:
|----Pool-----|------CPU-------|----------------Memory----------------|
|---|--Size---|-----Pset-------|---RAM---|---Shm---|---Lkd---|---VM---|
Zonename| IT| Max| Cur| Cap|Used|Shr|S%| Cap| Use| Cap| Use| Cap| Use| Cap| Use
-------------------------------------------------------------------------------
global 0D 66K 2 0.1 100 25 986M 139K 18E 2.2M 18E 754M
db01 0D 66K 2 0.1 200 50 1G 122M 536M 0.0 536M 0 1G 135M
web02 0D 66K 2 0.4 0.0 100 25 100M 11M 20M 0.0 20M 0 268M 8M
==TOTAL= --- ---- 2 ---- 0.2 -- -- 4.3G 1.1G 4.3G 139K 4.1G 2.2M 5.3G 897M
The 'Pool' columns provide information about the
Dynamic Resource Pool
in which the zone's processes are running. The two-character 'I' column
displays the pool ID (number) and the 'T' column indicates the type of
pool - 'D' for 'default', 'P' for 'private' (using the dedicated-cpu
feature of zonecfg) or 'S' for 'shared.' The two 'Size' columns show
the quantity of CPUs assigned to the pool in which the zone is running.
The 'CPU Pset' columns show each zone's CPU usage and any caps that have been set. The first two columns show CPU quantities - CPU cores for x86, SPARC64 and all UltraSPARC systems except CMT (T1, T2, T2+). On CMT systems, Solaris considers every hardware thread ('strand') to be a CPU, and calls them 'vCPUs.'
The last two CPU columns - 'Shr' and 'S%' - show the number of FSS shares assigned to the zone, and the percentage of the total shares in that zone's pool which that number represents. In the example above, all the zones share the default pset, and the zone 'db01' has 200 of the pool's 400 shares, so it should receive 50% of the CPU power of the pool at a minimum.
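The 'S%' arithmetic is simple enough to sketch. This is an illustration of the calculation only, not zonestat's own code; the function name is hypothetical and the share counts are the ones in the sample output above.

```python
def share_percentages(shares_by_zone: dict[str, int]) -> dict[str, float]:
    """Return each zone's share of its pool's total FSS shares, in percent."""
    total = sum(shares_by_zone.values())
    if total == 0:
        raise ValueError("pool has no shares assigned")
    return {zone: 100.0 * shares / total
            for zone, shares in shares_by_zone.items()}

# Share counts from the sample output above; all zones are in the default pool.
pool = {"global": 100, "db01": 200, "web02": 100}
for zone, pct in share_percentages(pool).items():
    print(f"{zone:8s} {pct:.0f}%")
```

Only zones bound to the same pool should be summed together, because shares are compared within a pool, not across the whole system.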
The 'Memory' columns show the caps and usage for RAM, shared memory, locked memory and virtual memory. Note that virtual memory is RAM plus swap space.
The syntax of zonestat is very similar to the other *stat tools:
zonestat [-l] [interval [count]]
The output shown above is generated with the -l flag, which means "show the limits (caps) that have been set." Without -l, only usage columns are displayed.
Here is more output, showing some of the conclusions that can be drawn from the data. I have added parenthetical numbers in the right-hand margin in order to refer to specific lines of output.
|----Pool-----|------CPU-------|----------------Memory----------------|
|---|--Size---|-----Pset-------|---RAM---|---Shm---|---Lkd---|---VM---|
Zonename| IT| Max| Cur| Cap|Used|Shr|S%| Cap| Use| Cap| Use| Cap| Use| Cap| Use
-------------------------------------------------------------------------------
global 0D 66K 2 0.1 100 HH 983M 139K 18E 2.2M 18E 752M
==TOTAL= --- ---- 2 ---- 0.1 -- -- 4.3G 983M 4.3G 139K 4.1G 2.2M 5.3G 752M
--------
global 0D 66K 2 0.1 100 HH 983M 139K 18E 2M 18E 752M
==TOTAL= --- ---- 2 ---- 0.1 -- -- 4.3G 983M 4.3G 139K 4.1G 2.2M 5.3G 752M
Note that none of the non-global zones are running. Because the global
zone is the only zone running in its pool, its 100 FSS shares represent
100% of the shares in its pool. To save a column of output, I indicate
that with 'HH' instead of '100'.
The "==TOTAL=" line provides two types of information, depending on the column type. For usage information, the sum of the resource used is shown. For example, "RAM Use" shows the amount of RAM used by all zones, including the global zone. For resource controls, either the system's amount of the resource is shown, e.g. "RAM Cap", or hyphens are displayed.
Note that there is a maximum amount of RAM that can be locked in a Solaris system. This prevents all of memory from being locked down, which would prevent the virtual memory system from running. In the output above, this system will only allow 4.1GB of RAM to be locked.
Also note that the amount of VM used is less than the amount of RAM used. This is because the memory pages which contain a program's instructions are not backed by swap disk, but by the file system itself. Those 'text' pages take up RAM, but do not take up swap space.
|----Pool-----|------CPU-------|----------------Memory----------------|
|---|--Size---|-----Pset-------|---RAM---|---Shm---|---Lkd---|---VM---|
Zonename| IT| Max| Cur| Cap|Used|Shr|S%| Cap| Use| Cap| Use| Cap| Use| Cap| Use
-------------------------------------------------------------------------------
global 0D 66K 2 0.1 100 50 984M 139K 18E 2.2M 18E 753M
z3 0D 66K 2 0.1 100 50 1.0G 30M 536M 0.0 536M 0.0 1.0G 27M
==TOTAL= --- ---- 2 ---- 0.2 -- -- 4.3G 1.0G 4.3G 139K 4.1G 2.2M 5.3G 780M
A zone has booted. It has caps for RAM, shared memory, locked
memory, and VM. The default pool now has a total of 200 shares: 100
for each zone. Therefore, each zone has 50% of the shares in that pool.
This provides a good reason to change the global zone's FSS value from
its default of one share to a larger value as soon as you add the first
zone to a system.
global 0D 66K 2 0.1 100 50 984M 139K 18E 2.2M 18E 753M
z3 0D 66K 2 0.3 100 50 1G 93M 536M 0.0 536M 0.0 1G 95M
==TOTAL= --- ---- 2 ---- 0.4 -- -- 4.3G 1.1G 4.3G 139K 4.1G 2.2M 5.3G 848M
--------
global 0D 66K 2 0.1 100 50 981M 139K 18E 2.2M 18E 753M
z3 0D 66K 2 0.4 100 50 1G 122M 536M 0.0 536M 0.0 1G 135M
==TOTAL= --- ---- 2 ---- 0.5 -- -- 4.3G 1.1G 4.3G 139K 4.1G 2.2M 5.3G 888M
The zone 'z3' is still booting, and is using 0.4 CPUs worth of CPU cycles.
global 0D 66K 2 0.1 100 50 984M 139K 18E 2.2M 18E 753M
z3 0D 66K 2 0.3 100 50 1G 122M 536M 536M 0.0 1G 135M
==TOTAL= --- ---- 2 ---- 0.4 -- -- 4.3G 1.1G 4.3G 139K 4.1G 2.2M 5.3G 888M
--------
global 0D 66K 2 0.1 100 50 984M 139K 18E 2.2M 18E 753M
z3 0D 66K 2 0.2 100 50 1G 122M 536M 536M 0.0 1G 135M
==TOTAL= --- ---- 2 ---- 0.3 -- -- 4.3G 1.1G 4.3G 139K 4.1G 2.2M 5.3G 888M
--------
global 0D 66K 2 0.1 100 33 986M 139K 18E 2.2M 18E 754M
z3 0D 66K 2 0.1 100 33 1G 122M 536M 536M 0.0 1G 135M
web02 0D 66K 2 0.4 0.0 100 33 100M 11M 20M 20M 0.0 268M 8M
==TOTAL= --- ---- 2 ---- 0.2 -- -- 4.3G 1.1G 4.3G 139K 4.1G 2.2M 5.3G 897M
A third zone has booted. This zone has a CPU-cap of 0.4 CPUs. It also has
memory caps, including a RAM cap that is less than the amount of memory that
zone 'z3' is using. If the two zones need the same amount, web02 should begin
paging before long. Let's see what happens...
--------
global 0D 66K 2 0.1 1 33 985M 139K 18E 2.2M 18E 754M
z3 0D 66K 2 0.1 1 33 1G 122M 536M 536M 0.0 1G 135M
web02 0D 66K 2 0.4 0.1 1 33 100M 29M 20M 20M 0.0 268M 36M
==TOTAL= --- ---- 2 ---- 0.3 -- -- 4.3G 1.1G 4.3G 139K 4.1G 2.2M 5.3G 925M
--------
global 0D 66K 2 0.1 1 33 984M 139K 18E 2.2M 18E 754M
z3 0D 66K 2 0.1 1 33 1G 122M 536M 536M 0.0 1G 135M
web02 0D 66K 2 0.4 0.2 1 33 100M 63M 20M 20M 0.0 268M 138M
==TOTAL= --- ---- 2 ---- 0.4 -- -- 4.3G 1.2G 4.3G 139K 4.1G 2.2M 5.3G 1.0G
|----Pool-----|------CPU-------|----------------Memory----------------|
|---|--Size---|-----Pset-------|---RAM---|---Shm---|---Lkd---|---VM---|
Zonename| IT| Max| Cur| Cap|Used|Shr|S%| Cap| Use| Cap| Use| Cap| Use| Cap| Use
-------------------------------------------------------------------------------
global 0D 66K 2 0.1 1 33 985M 139K 18E 2.2M 18E 754M
z3 0D 66K 2 0.0 1 33 1G 122M 536M 536M 0.0 1.0G 135M
web02 0D 66K 2 0.4 0.2 1 33 100M 87M 20M 20M 0.0 268M 185M
==TOTAL= --- ---- 2 ---- 0.3 -- -- 4.3G 1.2G 4.3G 139K 4.1G 2.2M 5.3G 1.1G
--------
global 0D 66K 2 0.1 1 33 985M 139K 18E 2.2M 18E 754M
z3 0D 66K 2 0.0 1 33 1G 122M 536M 536M 0.0 1.0G 135M
web02 0D 66K 2 0.4 0.2 1 33 100M 100M 20M 20M 0.0 268M 112M
==TOTAL= --- ---- 2 ---- 0.3 -- -- 4.3G 1.2G 4.3G 139K 4.1G 2.2M 5.3G 1.0G
--------
global 0D 66K 2 0.1 1 33 984M 139K 18E 2.2M 18E 754M
z3 0D 66K 2 0.0 1 33 1G 122M 536M 536M 0.0 1.0G 135M
web02 0D 66K 2 0.4 0.3 1 33 100M 112M 20M 20M 0.0 268M 117M
==TOTAL= --- ---- 2 ---- 0.4 -- -- 4.3G 1.2G 4.3G 139K 4.1G 2.2M 5.3G 1.0G
As expected, web02 exceeds its RAM cap. Now rcapd should address the problem.
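Spotting this condition by eye gets tedious, so here is a minimal Python sketch that compares a "Use" value against its "Cap" value. It assumes the K/M/G/E suffixes are powers of 1024, which is a guess about zonestat's formatting; `parse_size` and `over_cap` are hypothetical helper names, not part of zonestat.

```python
# Assumed: suffixes are binary multiples (K = 1024 bytes, and so on).
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3,
         "T": 1024**4, "P": 1024**5, "E": 1024**6}

def parse_size(text):
    """Convert a string such as '100M' or '1.0G' to bytes.

    A bare number such as '0.0' is treated as a byte count.
    """
    text = text.strip()
    if text and text[-1].upper() in UNITS:
        return float(text[:-1]) * UNITS[text[-1].upper()]
    return float(text)

def over_cap(cap, used):
    """Return True when usage is above the cap."""
    return parse_size(used) > parse_size(cap)

print(over_cap("100M", "112M"))  # web02 RAM in the last sample -> True
print(over_cap("1G", "122M"))    # z3 RAM -> False
print(over_cap("18E", "754M"))   # global zone VM, effectively uncapped -> False
```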
--------
global 0D 66K 2 0.1 1 33 981M 139K 18E 2.2M 18E 754M
z3 0D 66K 2 0.0 1 33 1G 119M 536M 536M 0.0 1.0G 135M
web02 0D 66K 2 0.4 0.3 1 33 100M 111M 20M 20M 0.0 268M 127M
==TOTAL= --- ---- 2 ---- 0.4 -- -- 4.3G 1.2G 4.3G 139K 4.1G 2.2M 5.3G 1.0G
One of two things has happened: either a process in web02 freed up memory, or
rcapd caused pageouts. rcapstat(1M) will tell us which it is. Also, the
increase in VM usage indicates that more memory was allocated than freed, so
it's more likely that rcapd was active during this period.
--------
global 0D 66K 2 0.1 1 33 981M 139K 18E 2.2M 18E 754M
z3 0D 66K 2 0.0 1 33 1.0G 119M 536M 536M 0.0 1.0G 135M
web02 0D 66K 2 0.4 0.2 1 33 100M 110M 20M 20M 0.0 268M 133M
==TOTAL= --- ---- 2 ---- 0.3 -- -- 4.3G 1.2G 4.3G 139K 4.1G 2.2M 5.3G 1.0G
--------
global 0D 66K 2 0.1 1 33 978M 139K 18E 2.2M 18E 754M
z3 0D 66K 2 0.0 1 33 1.0G 116M 536M 536M 0.0 1.0G 135M
web02 0D 66K 2 0.4 0.2 1 33 100M 91M 20M 20M 0.0 268M 133M
==TOTAL= --- ---- 2 ---- 0.3 -- -- 4.3G 1.2G 4.3G 139K 4.1G 2.2M 5.3G 1.0G
At this point 'web02' is safely under its RAM cap. If this zone began to do
'real' work, it would continually be under memory pressure, and the value in
'Memory:RAM:Use' would fluctuate around 100M. When setting a RAM cap, it
is very important to choose a reasonable value to avoid causing
unnecessary paging.
One final example, taken from a different configuration of zones:
|----Pool-----|------CPU-------|----------------Memory----------------|
|---|--Size---|-----Pset-------|---RAM---|---Shm---|---Lkd---|---VM---|
Zonename| IT| Max| Cur| Cap|Used|Shr|S%| Cap|Used| Cap|Used| Cap|Used| Cap|Used
-------------------------------------------------------------------------------
global 0D 66K 1 0.0 0.0 200 66 1.2G 18E 343K 18E 2.6M 18E 1.1G
zB 0D 66K 1 0.2 0.0 100 33 124M 18E 0.0 18E 0.0 18E 138M
zA 1P 1 1 0.0 0.1 1 HH 31M 18E 0.0 18E 0.0 18E 24M
==TOTAL= --- ---- 2 ---- 0.1 --- -- 4.3G 1.4G 4.3G 343K 4.1G 2.6M 5.3G 1.2G
The global zone and zone 'zB' share the default pool. Because the global
zone has 200 FSS shares, compared to zB's 100 shares, global zone
processes will get 2/3 of the processing power of the default pool if
there is contention for that CPU. However, that is unlikely, because zB
is capped at 0.2 CPUs worth of compute time.
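To see how shares and a CPU cap interact, here is a rough Python model. It is a simplification for illustration, not the FSS scheduler's actual algorithm: it assumes every zone in the pool is CPU-bound, splits the pool's CPUs by share count, clips any zone at its cap, and hands the remainder to the other zones. The function name `cpu_under_contention` is hypothetical.

```python
def cpu_under_contention(shares, caps, pool_cpus):
    """Estimate CPUs per zone when every zone in the pool is CPU-bound.

    shares: zone -> FSS shares (assumed positive)
    caps:   zone -> CPU cap, only for zones that have one
    """
    alloc = {}
    remaining = dict(shares)
    cpus_left = float(pool_cpus)
    while remaining:
        total = sum(remaining.values())
        # Zones whose share-based slice would exceed their cap.
        capped = [zone for zone, count in remaining.items()
                  if zone in caps and cpus_left * count / total > caps[zone]]
        if not capped:
            for zone, count in remaining.items():
                alloc[zone] = cpus_left * count / total
            break
        # Clip those zones at their cap and redistribute what is left.
        for zone in capped:
            alloc[zone] = caps[zone]
            cpus_left -= caps[zone]
            del remaining[zone]
    return alloc

shares = {"global": 200, "zB": 100}
print(cpu_under_contention(shares, {}, 1))           # shares only
print(cpu_under_contention(shares, {"zB": 0.2}, 1))  # zB capped at 0.2 CPUs
```

Without the cap, the model gives the global zone 2/3 of the CPU. With the cap, zB can never claim its full third, so the contention described above rarely arises.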
Zone 'zA' is in its own private resource pool. It has exclusive access to the one dedicated CPU in that pool.
Zonestat's biggest problem is due to its brute-force nature. It runs a few commands for each zone that is running. This can consume many CPU cycles, and can take a few seconds to run with many zones. Performance improvements to zonestat are underway.
With mpstat, it is not difficult to create surprising results. For example, on a CMT system, set a CPU cap on a zone in a pool and run a few CPU-bound processes: the "Pset Used" column will not reach the CPU cap. This is due to the method used by mpstat to calculate its data.
Prstat only computes its data occasionally, ignoring anything that happened between samples. This leads to undercounting CPU usage for zones with many short-lived processes.
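The undercounting effect can be shown with a toy model in Python. This is not how prstat is implemented; it only assumes that a sampling tool can account for a process if that process is alive at one of the sample instants. Each process is a hypothetical (start, stop) pair in milliseconds and is assumed to be CPU-bound for its whole life.

```python
def true_cpu_ms(procs):
    """Total CPU time actually consumed by all processes."""
    return sum(stop - start for start, stop in procs)

def sampled_cpu_ms(procs, interval_ms, end_ms):
    """CPU time visible to a tool that only sees processes alive at a sample."""
    sample_times = range(interval_ms, end_ms + 1, interval_ms)
    return sum(stop - start for start, stop in procs
               if any(start <= t < stop for t in sample_times))

# One long-running process plus ten short-lived ones (0.2 seconds each).
long_running = [(0, 12_000)]
short_lived = [(k * 1000 + 100, k * 1000 + 300) for k in range(10)]
procs = long_running + short_lived

print(true_cpu_ms(procs))                   # 14000
print(sampled_cpu_ms(procs, 5000, 10_000))  # 12000
```

With a five-second interval, none of the short-lived processes exist at a sample instant, so their two seconds of CPU time never appear in the sampled total.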
I wrote code to gather data from each, but prstat seemed more useful, so for now the output comes from prstat.
The future of zonestat might include these:
Friday Sep 05, 2008
DBAs are in for a rude awakening.
A database runs most efficiently when all of the data is held in RAM. Insufficient RAM causes some data to be sent to a disk drive for later retrieval. This process, called 'paging', can have a huge performance impact. This can be shown numerically by comparing the time to retrieve data from disk (about 10,000,000 nanoseconds) to the access time for RAM (about 20 ns).
Databases are the backbone of most Internet services. If a database does not perform well, no amount of improvement of the web servers or application servers will achieve good performance of the overall service. That explains the large amount of effort that is invested in tuning database software and database design.
These tasks are complicated by the difficulty of scaling a single database to many systems in the way that web servers and app servers can be replicated. Because of those challenges, most databases are implemented on one computer. But that single system must have enough RAM for the database to perform well.
Over the years, DBAs have come to expect systems to have lots of memory, either enough to hold the entire database or at least enough for all commonly accessed data. When implementing a database, the DBA is asked "how much memory does it need?" The answer is often padded to allow room for growth. That number is then increased to allow room for the operating system, monitoring tools, and other infrastructure software.
And everyone was happy.
But then server virtualization was (re-)invented to enable workload consolidation.
Server virtualization is largely about workload isolation - preventing the actions and requirements of one workload from affecting the others. This includes constraining the amount of resources consumed by each workload. Without such constraints, one workload could consume all of the resources of the system, preventing other workloads from functioning effectively. Most virtualization technologies include features to do this - to schedule time using the CPU(s), to limit use of network bandwidth... and to cap the amount of RAM a workload can use.
That's where DBAs get nervous.
I have participated in several virtualization architecture conversations which
included:
Me: "...and you'll want to cap the amount of RAM that each workload
can use."
DBA: "No, we can't limit database RAM."
Taken out of context, that statement sounds like "the database needs infinite RAM." (That's where the CFO gets nervous...)
I understand what the DBA is trying to say:
DBA: "If the database doesn't have sufficient RAM, its performance will
be horrible, and so will the performance of the web and app servers that
depend on it."
I completely agree with that statement.
The misunderstanding is the belief that a capped database is expected to use less memory than before. It is not. The "rude awakening" is adjusting one's mindset to accept that a RAM cap on a virtualized workload is the same as having a finite amount of RAM - just like a real server.
This also means that system architects must understand and respect the DBA's point of view, and that a virtual server must have available to it the same amount of RAM that it would need in a dedicated system. If a non-consolidated database needed 8GB of RAM to run well in a dedicated system, it will still need 8GB of RAM to run well in a consolidated environment.
If each workload has enough resources available to it, the system and all of its workloads will perform well.
And they all computed happily ever after.
P.S. A system running multiple consolidated workloads will need more memory than any one of the unconsolidated systems had - but less than the aggregate amount they had.
Considering that need, and the fact that most single-workload systems were running at 10-15% CPU utilization, I advise people configuring virtual server platforms to focus more effort on ensuring that the computer has enough memory for all of its workloads, and less effort on achieving sufficient CPU performance. If the system is 'short' on CPU power by 10%, performance will be 10% less than expected. That rarely matters. But if the system is 'short' on memory by 10%, excessive paging can cause transaction times to increase by 10 times, 100 times, or more.
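The size of that penalty follows from the access times quoted earlier in this post (about 20 ns for RAM, about 10,000,000 ns for disk). The Python sketch below computes an average access time for a given fraction of accesses that must go to disk. The fault rates are illustrative assumptions, not measurements, and the model ignores caching, queueing, and everything else a real system does.

```python
RAM_NS = 20            # approximate RAM access time, from the text
DISK_NS = 10_000_000   # approximate disk retrieval time, from the text

def effective_access_ns(fault_rate):
    """Average access time when fault_rate of accesses go to disk."""
    return (1 - fault_rate) * RAM_NS + fault_rate * DISK_NS

print(f"disk is {DISK_NS // RAM_NS:,} times slower than RAM")

# Hypothetical fault rates, chosen only to show the trend.
for rate in (0.0, 0.00001, 0.0001, 0.001):
    slowdown = effective_access_ns(rate) / RAM_NS
    print(f"fault rate {rate:.3%}: about {slowdown:,.0f} times the RAM access time")
```

Even if only one access in a thousand goes to disk, the average access in this model is about 500 times slower than RAM, which is the kind of multiplier that turns a small memory shortfall into a large performance problem.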