From My Brain to Your Browser
Jeff Victor's Blog
Friday Sep 05, 2008
got (enough) memory?

DBAs are in for a rude awakening.

A database runs most efficiently when all of the data is held in RAM. Insufficient RAM causes some data to be sent to a disk drive for later retrieval. This process, called 'paging', can have a huge performance impact. This can be shown numerically by comparing the time to retrieve data from disk (about 10,000,000 nanoseconds) to the access time for RAM (about 20 ns): a difference of roughly 500,000 times.
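
To put rough numbers on that, here is a small sketch of the average access time when some fraction of accesses has to go to disk. It uses only the two round latency figures quoted above; the miss fractions and the function name `effective_ns` are made up for illustration, and real paging behavior is more complicated than a single weighted average.

```shell
#!/bin/sh
# Round latency figures from the text: RAM ~20 ns, disk ~10,000,000 ns.
RAM_NS=20
DISK_NS=10000000

# effective_ns MISS_FRACTION -> average nanoseconds per access,
# weighting RAM and disk latency by how often each is hit.
effective_ns() {
    awk -v m="$1" -v ram="$RAM_NS" -v disk="$DISK_NS" \
        'BEGIN { printf "%.0f\n", (1 - m) * ram + m * disk }'
}

echo "disk/RAM ratio: $((DISK_NS / RAM_NS))"
# Illustrative miss fractions only, not measurements.
for miss in 0 0.0001 0.001 0.01; do
    echo "miss fraction $miss -> $(effective_ns "$miss") ns per access"
done
```

Even if only one access in ten thousand goes to disk, the average in this simple model rises from 20 ns to about 1020 ns, roughly 50 times slower.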

Databases are the backbone of most Internet services. If a database does not perform well, no amount of improvement of the web servers or application servers will achieve good performance of the overall service. That explains the large amount of effort that is invested in tuning database software and database design.

These tasks are complicated by the difficulty of scaling a single database to many systems in the way that web servers and app servers can be replicated. Because of those challenges, most databases are implemented on one computer. But that single system must have enough RAM for the database to perform well.

Over the years, DBAs have come to expect systems to have lots of memory, either enough to hold the entire database or at least enough for all commonly accessed data. When implementing a database, the DBA is asked "how much memory does it need?" The answer is often padded to allow room for growth. That number is then increased to allow room for the operating system, monitoring tools, and other infrastructure software.

And everyone was happy.

But then server virtualization was (re-)invented to enable workload consolidation.

Server virtualization is largely about workload isolation - preventing the actions and requirements of one workload from affecting the others. This includes constraining the amount of resources consumed by each workload. Without such constraints, one workload could consume all of the resources of the system, preventing other workloads from functioning effectively. Most virtualization technologies include features to do this - to schedule time using the CPU(s), to limit use of network bandwidth... and to cap the amount of RAM a workload can use.

That's where DBAs get nervous.

I have participated in several virtualization architecture conversations which included:
Me: "...and you'll want to cap the amount of RAM that each workload can use."
DBA: "No, we can't limit database RAM."

Taken out of context, that statement sounds like "the database needs infinite RAM." (That's where the CFO gets nervous...)

I understand what the DBA is trying to say:
DBA: "If the database doesn't have sufficient RAM, its performance will be horrible, and so will the performance of the web and app servers that depend on it."

I completely agree with that statement.

The misunderstanding is the belief that a capped database is expected to use less memory than before. It is not. The "rude awakening" is adjusting one's mindset to accept the notion that a RAM cap on a virtualized workload is the same as having a finite amount of RAM - just like a real server.

This also means that system architects must understand and respect the DBA's point of view, and that a virtual server must have available to it the same amount of RAM that it would need in a dedicated system. If a non-consolidated database needed 8GB of RAM to run well in a dedicated system, it will still need 8GB of RAM to run well in a consolidated environment.
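
For Solaris Containers, one way to express such a cap is the capped-memory resource in zonecfg, on releases that support it. The sketch below only prints the zonecfg subcommands; it does not run zonecfg. The helper name `capped_memory_cmds` and the zone name `dbzone` in the comment are hypothetical, and the 8g figure is simply the example number from the paragraph above.

```shell
#!/bin/sh
# capped_memory_cmds SIZE -> zonecfg subcommands that cap a zone's physical RAM.
# This only prints text; nothing is configured by running it.
capped_memory_cmds() {
    cat <<EOF
add capped-memory
set physical=$1
end
EOF
}

capped_memory_cmds 8g

# Assumed usage on a Solaris 10 system (zone name "dbzone" is hypothetical):
#   capped_memory_cmds 8g > /tmp/dbzone-cap.cfg
#   zonecfg -z dbzone -f /tmp/dbzone-cap.cfg
```

The point of the example is that the cap is set to what the database needed on its dedicated system, not to something smaller.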

If each workload has enough resources available to it, the system and all of its workloads will perform well.

And they all computed happily ever after.

P.S. A system running multiple workloads will need more memory than any one of the unconsolidated systems had - but less than the aggregate amount they had.

Considering that need, and the fact that most single-workload systems were running at 10-15% CPU utilization, I advise people configuring virtual server platforms to focus more effort on ensuring that the computer has enough memory for all of its workloads, and less effort on achieving sufficient CPU performance. If the system is 'short' on CPU power by 10%, performance will be 10% less than expected. That rarely matters. But if the system is 'short' on memory by 10%, excessive paging can cause transaction times to increase by 10 times, 100 times, or more.
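
The sizing rule in the P.S. can be sketched with some arithmetic. All of the numbers here are hypothetical, and the model assumes that consolidation leaves one shared copy of the operating system and tools (as with OS-level virtualization); with one OS per guest the saving would be smaller.

```shell
#!/bin/sh
# Hypothetical figures in MB. Each old server held one workload plus its
# own copy of the OS and tools (assumed to be 1024 MB per server).
OS_OVERHEAD_MB=1024
WORKLOADS_MB="8192 2048 4096"

# Memory needed after consolidation: every workload keeps its full
# allocation, but only ONE copy of the OS overhead is counted.
consolidated_mb() {
    total=$OS_OVERHEAD_MB
    for w in $WORKLOADS_MB; do total=$((total + w)); done
    echo "$total"
}

# Memory the separate servers held in total: one OS overhead per server.
aggregate_mb() {
    total=0
    for w in $WORKLOADS_MB; do total=$((total + w + OS_OVERHEAD_MB)); done
    echo "$total"
}

echo "largest single server: $((8192 + OS_OVERHEAD_MB)) MB"
echo "consolidated system:   $(consolidated_mb) MB"
echo "aggregate of servers:  $(aggregate_mb) MB"
```

With these made-up inputs the consolidated system needs 15360 MB: more than the largest single server (9216 MB) and less than the 17408 MB the three servers held together.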

Posted at 11:48AM Sep 05, 2008 by Jeffrey Victor in Technology  |  Comments[2]

Saturday Aug 30, 2008
Mad Cows, Lame Ducks
I rarely blog about politics, but a recent news article incensed me enough that I thought it worth sharing.

It seems that the current executive branch of the United States government doesn't like its citizens. Back in April 2004, the N.Y. Times reported that the U.S. Department of Agriculture "refused to allow a Kansas beef producer to test all of its cattle for mad cow disease."

One might think that testing food for the ability to cause a fatal, untreatable disease would be encouraged by the US government. But in this case, apparently the Agriculture Department is choosing sides in a business struggle, putting the interests of one group of corporations - large meat packers - ahead of another group - small meat packers. In all of this, the Bush administration is putting business before public health, yielding to the wishes of "larger meat packers opposed to such testing" who "fear they too will have to conduct the expensive tests" in order to remain competitive in the US. (recent Associated Press article)

To be sure, Creekstone is not acting on a wholly altruistic basis. They want to be able to sell their products into the Japanese beef market, which has repeatedly prohibited beef imports from the US because of concerns over "mad cow" disease. Creekstone wants to enhance its competitiveness by proving that its beef is safe.

I'm not suggesting that a government should require this test on beef products. But if one company wants to take steps that protect public health, the federal government should not get involved, especially if the goal is merely to protect the profits of Big Business.

Posted at 11:24AM Aug 30, 2008 by Jeffrey Victor in Politics, Global Topics  | 

Thursday Aug 21, 2008
Virtual Eggs, One Basket
One of the hottest computer industry trends is virtualization. If you skim off the hype, there is still a great deal to be excited about. Who doesn't like reducing the number of servers to manage, and reducing the electric power consumed by servers and by the machines that move the heat they create? (Though I suppose that the power utilities and the coal and natural gas companies aren't too thrilled by virtualization...)

But there are some critical factors which limit the consolidation of workloads into virtualized environments (VE's). One often-overlooked factor is that the technology which controls VE's is a single point of failure (SPOF) for all of the VE's it is managing. If that component (a hypervisor for virtual machines, an OS kernel for operating system-level virtualization, etc.) has a bug which affects its guests, they may all be impacted. In the worst case, all of the guests will stop working.

One example of that was the recent licensing bug in VMware. If the newest version of VMware ESX was in use, the hypervisor would not permit guests to start after August 12. EMC created a patch to fix the problem, but solving it and testing the fix took enough time that some customers could not start some workloads for about one day. For some details, see http://www.networkworld.com/news/2008/081208-vmware-bug.html and http://www.deploylinux.net/matt/2008/08/all-your-vms-belong-to-us.html.

Clearly, the lesson from this is the importance of designing your consolidated environments with this factor in mind. For example, you should never configure both nodes of a high-availability (HA) cluster as guests of the same hypervisor. In general, don't assume that the hypervisor is perfect - it's not - and that it can't fail - it can.
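
A check for the HA rule above is easy to automate. This is a minimal sketch that assumes you can produce a three-column inventory of guest name, hypervisor host, and cluster name; the inventory contents and the function names `inventory` and `shared_spof` are hypothetical.

```shell
#!/bin/sh
# Hypothetical inventory: guest name, hypervisor host, HA cluster name.
inventory() {
    cat <<EOF
db-a   host1 dbcluster
db-b   host1 dbcluster
web-a  host1 webcluster
web-b  host2 webcluster
EOF
}

# Report every cluster that has more than one node on the same hypervisor,
# i.e. every cluster for which that hypervisor is a single point of failure.
shared_spof() {
    awk '{ n[$3 " on " $2]++ }
         END { for (k in n) if (n[k] > 1) print k }'
}

inventory | shared_spof
```

For this sample inventory the only line reported is `dbcluster on host1`, because both database nodes are guests of the same hypervisor.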

Posted at 12:32PM Aug 21, 2008 by Jeffrey Victor in Technology  | 

Tuesday Jun 03, 2008
Virtualization ^2
Have you ever wanted to try a Solaris feature using your non-Solaris desktop or laptop? Now you can! Jeff Savit shows us a method to use the OpenSolaris LiveCD to run OpenSolaris right off the CD - without modifying your existing system - even if your computer usually runs MacOS, Windows, or Linux. His example demonstrates the creation of Solaris Containers on that system. They are very temporary Containers - they only last until that particular LiveCD session is halted - but they are full-featured, enabling you to apply resource controls, access the network, mount other file systems, etc.

(The other) Jeff explains it all very well, but I'll emphasize that you can just follow his lead, or you can boot your computer directly from the OpenSolaris LiveCD, use OpenSolaris for a little while, and then re-boot back into your 'normal' operating system. VirtualBox is wonderful technology, but is not necessary to temporarily run OpenSolaris.

Posted at 09:00AM Jun 03, 2008 by Jeffrey Victor in Solaris  | 

Monday Jun 02, 2008
USENIX Technical Conference
If you would like to develop deeper Solaris skills, the Usenix Technical Conference offers some excellent opportunities. This year, the conference will be held in Boston on June 22-27. It includes vendor exhibits, training sessions and invited talks. This year the keynote address is by David Patterson, Director of the U.C. Berkeley Parallel Computing Laboratory, on "The Parallel Revolution Has Started: Are You Part of the Solution or Part of the Problem?"

Many tutorials will be available, including these five full-day sessions:

My session covers the concepts, uses and administrative interfaces of Solaris Resource Management, focusing on Solaris Containers. It will include a lab session using the new OpenSolaris LiveCD. If you don't (yet...) run Solaris or OpenSolaris on your laptop, you will still be able to boot OpenSolaris from the CD and temporarily use Containers on your laptop, without modifying your normal laptop operating system.

Early-bird registration saves $$$, but ends this Friday, June 6.

Posted at 09:00AM Jun 02, 2008 by Jeffrey Victor in Solaris  | 

Friday May 02, 2008
Effortless Upgrade: Solaris 9 System to Solaris 9 Container
Sun released Solaris 9 Containers earlier this week. This set of software packages enables you to move an existing Solaris 9 system into a Solaris Container on a Solaris 10 system. It also allows you to create new Solaris 9 Containers on Solaris 10 systems. In other words, it's just like Solaris 8 Containers but for Solaris 9 systems.

This is particularly interesting for Sun's CMT systems - those systems based on the UltraSPARC-T1, -T2, and -T2+ (aka Niagara, Niagara 2, Niagara 2+). Those systems are well known for their high performance-per-watt characteristics, an important consideration as data centers exhaust their power capacity and the price of fossil fuels rises.

Solaris 8 (and 9) Containers can also take advantage of the impressive scalability of the Sun SPARC Enterprise M-series systems - from 4 to 64 dual-core SPARC CPUs. Because of the ability to mix Solaris 8 Containers and Solaris 9 Containers, alongside Solaris 10 Containers, you can move dozens of older SPARC systems into just a few new SPARC systems.

You can find product details, a videotaped demonstration, and free download at http://www.sun.com/software/solaris/containers/index.jsp.

Posted at 09:49AM May 02, 2008 by Jeffrey Victor in Solaris 10 Containers  | 

Wednesday Apr 09, 2008
Effortless Upgrade: Solaris 8 System to Solaris 8 Container
[Update 23 Aug 2008: Solaris 9 Containers was released a few months back. Reply to John's comment: yes, the steps to use Solaris 9 Containers are the same as the steps for Solaris 8 Containers]

Since 2005, Solaris 10 has offered the Solaris Containers feature set, creating isolated virtual Solaris environments for Solaris 10 applications. Although almost all Solaris 8 applications run unmodified in Solaris 10 Containers, sometimes it would be better to just move an entire Solaris 8 system - all of its directories and files, configuration information, etc. - into a Solaris 10 Container. This has become very easy - just three commands.

Sun offers a Solaris Binary Compatibility Guarantee which demonstrates the significant effort that Sun invests in maintaining compatibility from one Solaris version to the next. Because of that effort, almost all applications written for Solaris 8 run unmodified on Solaris 10, either in a Solaris 10 Container or in the Solaris 10 global zone.

However, there are still some data centers with many Solaris 8 systems. In some situations it is not practical to re-test all of those applications on Solaris 10. It would be much easier to just move the entire contents of the Solaris 8 file systems into a Solaris Container and consolidate many Solaris 8 systems into a much smaller number of Solaris 10 systems.

For those types of situations, and some others, Sun now offers Solaris 8 Containers. These use the "Branded Zones" framework available in OpenSolaris and first released in Solaris 10 in August 2007. A Solaris 8 Container provides an isolated environment in which Solaris 8 binaries - applications and libraries - can run without modification. To a user logged in to the Container, or to an application running in the Container, there is very little evidence that this is not a Solaris 8 system.

The Solaris 8 Container technology rests on a very thin layer of software which performs system call translations - from Solaris 8 system calls to Solaris 10 system calls. This is not binary emulation, and the number of system calls with any difference is small, so the performance penalty is extremely small - typically less than 3%.

Not only is this technology efficient, it's very easy to use. There are five steps, but two of them can be combined into one:

  1. install Solaris 8 Containers packages on the Solaris 10 system
  2. patch systems if necessary
  3. archive the contents of the Solaris 8 system
  4. move the archive to the Solaris 10 system (if step 3 placed the archive on a file system accessible to the Solaris 10 system, e.g. via NFS, this step is unnecessary)
  5. configure and install the Solaris 8 Container, using the archive
The rest of this blog entry is a demonstration of Solaris 8 Containers, using a Sun Ultra 60 workstation for the Solaris 8 system and a Logical Domain on a Sun Fire T2000 for the Solaris 10 system. I chose those two only because they were available to me. Any two SPARC systems could be used as long as one can run Solaris 8 and one can run Solaris 10.

Almost any Solaris 8 revision or patch level will work, but Sun strongly recommends applying the most recent patches to that system. The Solaris 10 system must be running Solaris 10 8/07, and requires the following minimum patch levels:

The first step is installation of the Solaris 8 Containers packages. You can download the Solaris 8 Containers software packages from http://sun.com/download. Installing those packages on the Solaris 10 system is easy and takes 5-10 seconds:
s10-system# pkgadd -d . SUNWs8brandr  SUNWs8brandu  SUNWs8p2v
Now we can patch the Solaris 10 system, using the patches listed above.

After patches have been applied, it's time to archive the Solaris 8 system. In order to remove the "archive transfer" step I'll turn the Solaris 10 system into an NFS server and mount it on the Solaris 8 system. The archive can be created by the Solaris 8 system, but stored on the Solaris 10 system. There are several tools which can be used to create the archive: Solaris flash archive tools, cpio, pax, etc. In this example I used flarcreate, which first became available on Solaris 8 2/04.

s10-system# share /export/home/s8-archives
s8-system# mount s10-system:/export/home/s8-archives /mnt
s8-system# flarcreate -S -n atl-sewr-s8 /mnt/atl-sewr-s8.flar
Creation of the archive takes longer than any other step - 15 minutes to an hour, or even more, depending on the size of the Solaris 8 file systems.

With the archive in place, we can configure and install the Solaris 8 Container. In this demonstration the Container was "sys-unconfig'd" by using the -u option. The opposite of that is -p, which preserves the system configuration information of the Solaris 8 system.

s10-system# zonecfg -z test8
zonecfg:test8> create -t SUNWsolaris8
zonecfg:test8> set zonepath=/zones/roots/test8
zonecfg:test8> add net
zonecfg:test8:net> set address=129.152.2.81
zonecfg:test8:net> set physical=vnet0
zonecfg:test8:net> end
zonecfg:test8> exit
s10-system# zoneadm -z test8 install -u -a /export/home/s8-archives/atl-sewr-s8.flar
              Log File: /var/tmp/test8.install.995.log
            Source: /export/home/s8-archives/atl-sewr-s8.flar
        Installing: This may take several minutes...
    Postprocessing: This may take several minutes...

            Result: Installation completed successfully.
          Log File: /zones/roots/test8/root/var/log/test8.install.995.log
This step should take 5-10 minutes. After the Container has been installed, it can be booted.
s10-system# zoneadm -z test8 boot
s10-system# zlogin -C test8
At this point I was connected to the Container's console. It asked the usual system configuration questions, and then rebooted:
[NOTICE: Zone rebooting]

SunOS Release 5.8 Version Generic_Virtual 64-bit
Copyright 1983-2000 Sun Microsystems, Inc.  All rights reserved

Hostname: test8
The system is coming up.  Please wait.
starting rpc services: rpcbind done.
syslog service starting.
Print services started.
Apr  1 18:07:23 test8 sendmail[3344]: My unqualified host name (test8) unknown; sleeping for retry
The system is ready.

test8 console login: root
Password:
Apr  1 18:08:04 test8 login: ROOT LOGIN /dev/console
Last login: Tue Apr  1 10:47:56 from vpn-129-150-80-
Sun Microsystems Inc.   SunOS 5.8       Generic Patch   February 2004
# bash
bash-2.03# psrinfo
0       on-line   since 04/01/2008 03:56:38
1       on-line   since 04/01/2008 03:56:38
2       on-line   since 04/01/2008 03:56:38
3       on-line   since 04/01/2008 03:56:38
bash-2.03# ifconfig -a
lo0:1: flags=1000849 mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
vnet0:1: flags=1000843 mtu 1500 index 2
        inet 129.152.2.81 netmask ffffff00 broadcast 129.152.2.255

At this point the Solaris 8 Container exists. It's accessible on the local network, existing applications can be run in it, or new software can be added to it, or existing software can be patched.
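
If you need to repeat this for many Solaris 8 systems, the interactive zonecfg session shown earlier could be turned into a command file. The sketch below only prints such a file; the helper name `s8_zone_cmds` is hypothetical, and the values passed to it are the ones from this demonstration.

```shell
#!/bin/sh
# s8_zone_cmds ZONEPATH ADDRESS NIC
# Prints zonecfg subcommands equivalent to the interactive session above.
# Nothing is configured by running this; it only writes text to stdout.
s8_zone_cmds() {
    cat <<EOF
create -t SUNWsolaris8
set zonepath=$1
add net
set address=$2
set physical=$3
end
commit
EOF
}

s8_zone_cmds /zones/roots/test8 129.152.2.81 vnet0

# Assumed usage on the Solaris 10 system:
#   s8_zone_cmds /zones/roots/test8 129.152.2.81 vnet0 > /tmp/test8.cfg
#   zonecfg -z test8 -f /tmp/test8.cfg
```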

To extend the example, here is the output from the commands I used to limit this Solaris 8 Container to only use a subset of the 32 virtual CPUs on that Sun Fire T2000 system.

s10-system# zonecfg -z test8
zonecfg:test8> add dedicated-cpu
zonecfg:test8:dedicated-cpu> set ncpus=2
zonecfg:test8:dedicated-cpu> end
zonecfg:test8> exit
bash-3.00# zoneadm -z test8 reboot
bash-3.00# zlogin -C test8

Console:
[NOTICE: Zone rebooting]

SunOS Release 5.8 Version Generic_Virtual 64-bit
Copyright 1983-2000 Sun Microsystems, Inc.  All rights reserved

Hostname: test8
The system is coming up.  Please wait.
starting rpc services: rpcbind done.
syslog service starting.
Print services started.
Apr  1 18:14:53 test8 sendmail[3733]: My unqualified host name (test8) unknown; sleeping for retry
The system is ready.
test8 console login: root
Password:
Apr  1 18:15:24 test8 login: ROOT LOGIN /dev/console
Last login: Tue Apr  1 18:08:04 on console
Sun Microsystems Inc.   SunOS 5.8       Generic Patch   February 2004
# psrinfo
0       on-line   since 04/01/2008 03:56:38
1       on-line   since 04/01/2008 03:56:38

Finally, to learn more about Solaris 8 Containers:

For those who were counting, the "three commands" were, at a minimum, flarcreate, zonecfg and zoneadm.
Posted at 07:35AM Apr 09, 2008 by Jeffrey Victor in Solaris 10 Containers  |  Comments[2]

Tuesday Apr 08, 2008
ZoiT: Solaris Zones on iSCSI Targets (aka NAC: Network-Attached Containers)

Introduction

Solaris Containers have a 'zonepath' ('home') which can be a directory on the root file system or on a non-root file system. Until Solaris 10 8/07 was released, a local file system was required for this directory. Containers that are on non-root file systems have used UFS, ZFS, or VxFS. All of those are local file systems - putting Containers on NAS has not been possible. With Solaris 10 8/07, that has changed: a Container can now be placed on remote storage via iSCSI.

Background

Solaris Containers (aka Zones) are Sun's operating system level virtualization technology. They allow a Solaris system (more accurately, an 'instance' of Solaris) to have multiple, independent, isolated application environments. A program running in a Container cannot detect or interact with a process in another Container.

Each Container has its own root directory. Although viewed as the root directory from within that Container, that directory is also a non-root directory in the global zone. For example, a Container's root directory might be called /zones/roots/myzone/root in the global zone.

The configuration of a Container includes something called its "zonepath." This is the directory which contains a Container's root directory (e.g. /zones/roots/myzone/root) and other directories used by Solaris. Therefore, the zonepath of myzone in the example above would be /zones/roots/myzone.

The global zone administrator can choose any directory to be a Container's zonepath. That directory could just be a directory on the root partition of Solaris, though in that case some mechanism should be used to prevent that Container from filling up the root partition. Another alternative is to use a separate partition for that Container, or one shared among multiple Containers. In the latter case, a quota should be used for each Container.

Local file systems have been used for zonepaths. However, many people have strongly expressed a desire for the ability to put Containers on remote storage. One significant advantage to placing Containers on NAS is the simplification of Container migration - moving a Container from one system to another. When using a local file system, the contents of the Container must be transmitted from the original host to the new host. For small, sparse zones this can take as little as a few seconds. For large, whole-root zones, this can take several minutes - a whole-root zone is an entire copy of Solaris, taking up as much as 3-5 GB. If remote storage can be used to store a zone, the zone's downtime can be as little as a second or two, during which time a file system is unmounted on one system and mounted on another.
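
A back-of-the-envelope estimate shows why copying a whole-root zone takes minutes. The zone sizes come from the 3-5 GB range above; the 30 MB/s effective throughput is purely an assumption for illustration, and `copy_seconds` is a hypothetical helper.

```shell
#!/bin/sh
# copy_seconds SIZE_MB THROUGHPUT_MB_PER_S
# Whole seconds needed to transmit SIZE_MB, rounded up.
copy_seconds() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

THROUGHPUT=30   # assumed effective MB/s between the two hosts

for size_mb in 3072 4096 5120; do
    echo "$size_mb MB zone: about $(copy_seconds "$size_mb" "$THROUGHPUT") seconds to copy"
done
```

Under that assumption a 4 GB zone takes a little over two minutes to copy, compared with the second or two needed to unmount a file system on one host and mount it on another.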

Here are some significant advantages to iSCSI over SANs:

  1. the ability to use commodity Ethernet switching gear, which tends to be less expensive than SAN switching equipment
  2. the ability to manage storage bandwidth via standard, mature, commonly used IP QoS features
  3. iSCSI networks can be combined with non-iSCSI IP networks to reduce the hardware investment and consolidate network management. If that is not appropriate, the two networks can be separate but use the same type of equipment, reducing costs and types of in-house infrastructure management expertise.

Unfortunately, a Container cannot 'live' on an NFS server, and it's not clear if or when that limitation will be removed.

iSCSI Basics

iSCSI is simply "SCSI communication over IP." In this case, SCSI commands and responses are sent between two iSCSI-capable devices, which can be general-purpose computers (Solaris, Windows, Linux, etc.) or specific-purpose storage devices (e.g. Sun StorageTek 5210 NAS, EMC Celerra NS40, etc.). There are two endpoints to iSCSI communications: the initiator (client) and the target (server). A target publicizes its existence. An initiator binds to a target.

The industry's design for iSCSI includes a large number of features, including security. Solaris implements many of those features. Details can be found:

In Solaris, the command iscsiadm(1M) configures an initiator, and the command iscsitadm(1M) configures a target.

Steps

This section demonstrates the installation of a Container onto a remote file system that uses iSCSI for its transport.

The target system is an LDom on a T2000, and looks like this:

System Configuration:  Sun Microsystems  sun4v
Memory size: 1024 Megabytes
SUNW,Sun-Fire-T200
SunOS ldg1 5.10 Generic_127111-07 sun4v sparc SUNW,Sun-Fire-T200
Solaris 10 8/07 s10s_u4wos_12b SPARC
The initiator system is another LDom on the same T2000 - although there is no requirement that LDoms be used, or that they be on the same computer if they are used.
System Configuration:  Sun Microsystems  sun4v
Memory size: 896 Megabytes
SUNW,Sun-Fire-T200
SunOS ldg4 5.11 snv_83 sun4v sparc SUNW,Sun-Fire-T200
Solaris Nevada snv_83a SPARC
The first configuration step is the creation of the storage underlying the iSCSI target. Although UFS could be used, let's improve the robustness of the Container's contents and put the target's storage under control of ZFS. I don't have extra disk devices to give to ZFS, so I'll make some and use them for a zpool - in real life you would use disk devices here:
Target# mkfile 150m /export/home/disk0
Target# mkfile 150m /export/home/disk1
Target# zpool create myscsi mirror /export/home/disk0 /export/home/disk1
Target# zpool status
  pool: myscsi
 state: ONLINE
 scrub: none requested
config:

        NAME                  STATE     READ WRITE CKSUM
        myscsi                ONLINE       0     0     0
          /export/home/disk0  ONLINE       0     0     0
          /export/home/disk1  ONLINE       0     0     0
Now I can create a zvol - an emulation of a disk device:
Target# zfs list
NAME    USED  AVAIL  REFER  MOUNTPOINT
myscsi   86K   258M  24.5K  /myscsi
Target# zfs create -V 200m myscsi/jvol0
Target# zfs list
NAME           USED  AVAIL  REFER  MOUNTPOINT
myscsi         200M  57.9M  24.5K  /myscsi
myscsi/jvol0  22.5K   258M  22.5K  -
Creating an iSCSI target device from a zvol is easy:
Target# iscsitadm list target
Target# zfs set shareiscsi=on myscsi/jvol0
Target# iscsitadm list target
Target: myscsi/jvol0
    iSCSI Name: iqn.1986-03.com.sun:02:c8a82272-b354-c913-80f9-db9cb378a6f6
    Connections: 0
Target# iscsitadm list target -v
Target: myscsi/jvol0
    iSCSI Name: iqn.1986-03.com.sun:02:c8a82272-b354-c913-80f9-db9cb378a6f6
    Alias: myscsi/jvol0
    Connections: 0
    ACL list:
    TPGT list:
    LUN information:
        LUN: 0
            GUID: 0x0
            VID: SUN
            PID: SOLARIS
            Type: disk
            Size:  200M
            Backing store: /dev/zvol/rdsk/myscsi/jvol0
            Status: online
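
The iSCSI Name in that output uses the "iqn" naming format: the literal `iqn`, a year-month date, a reversed domain name, and an optional unique string after a colon. Here is a small sketch that splits the name shown above into those parts; the function names are hypothetical.

```shell
#!/bin/sh
# The iSCSI Name reported by "iscsitadm list target" above.
IQN='iqn.1986-03.com.sun:02:c8a82272-b354-c913-80f9-db9cb378a6f6'

# Layout: iqn.<yyyy-mm>.<reversed domain>:<unique string>
iqn_date()      { echo "$1" | sed -n 's/^iqn\.\([0-9]\{4\}-[0-9]\{2\}\)\..*/\1/p'; }
iqn_authority() { echo "$1" | sed -n 's/^iqn\.[0-9]\{4\}-[0-9]\{2\}\.\([^:]*\).*/\1/p'; }
iqn_unique()    { echo "$1" | sed -n 's/^[^:]*:\(.*\)/\1/p'; }

echo "date:      $(iqn_date "$IQN")"
echo "authority: $(iqn_authority "$IQN")"
echo "unique:    $(iqn_unique "$IQN")"
```

The same name shows up again later on the initiator, both in the `iscsiadm list target` output and inside the device path under /devices/iscsi.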

Configuring the iSCSI initiator takes a little more work. There are three methods to find targets. I will use a simple one. After telling Solaris to use that method, it only needs to know what the IP address of the target is.

Note that the example below uses "iscsiadm list ..." several times, without any output. The purpose is to show the difference in output before and after the command(s) between them.

First let's look at the disks available before configuring iSCSI on the initiator:

Initiator# ls /dev/dsk
c0d0s0  c0d0s2  c0d0s4  c0d0s6  c0d1s0  c0d1s2  c0d1s4  c0d1s6
c0d0s1  c0d0s3  c0d0s5  c0d0s7  c0d1s1  c0d1s3  c0d1s5  c0d1s7
We can view the currently enabled discovery methods, and enable the one we want to use:
Initiator# iscsiadm list discovery
Discovery:
        Static: disabled
        Send Targets: disabled
        iSNS: disabled
Initiator# iscsiadm list target
Initiator# iscsiadm modify discovery --sendtargets enable
Initiator# iscsiadm list discovery
Discovery:
        Static: disabled
        Send Targets: enabled
        iSNS: disabled
At this point we just need to tell Solaris which IP address we want to use as a target. It takes care of all the details, finding all disk targets on the target system. In this case, there is only one disk target.
Initiator# iscsiadm list target
Initiator# iscsiadm add discovery-address 129.152.2.90
Initiator# iscsiadm list target
Target: iqn.1986-03.com.sun:02:c8a82272-b354-c913-80f9-db9cb378a6f6
        Alias: myscsi/jvol0
        TPGT: 1
        ISID: 4000002a0000
        Connections: 1
Initiator# iscsiadm list target -v
Target: iqn.1986-03.com.sun:02:c8a82272-b354-c913-80f9-db9cb378a6f6
        Alias: myscsi/jvol0
        TPGT: 1
        ISID: 4000002a0000
        Connections: 1
                CID: 0
                  IP address (Local): 129.152.2.75:40253
                  IP address (Peer): 129.152.2.90:3260
                  Discovery Method: SendTargets
                  Login Parameters (Negotiated):
                        Data Sequence In Order: yes
                        Data PDU In Order: yes
                        Default Time To Retain: 20
                        Default Time To Wait: 2
                        Error Recovery Level: 0
                        First Burst Length: 65536
                        Immediate Data: yes
                        Initial Ready To Transfer (R2T): yes
                        Max Burst Length: 262144
                        Max Outstanding R2T: 1
                        Max Receive Data Segment Length: 8192
                        Max Connections: 1
                        Header Digest: NONE
                        Data Digest: NONE
The initiator automatically finds the iSCSI remote storage, but we need to turn this into a disk device. (Newer builds seem to not need this step, but it won't hurt. Looking in /devices/iscsi will help determine whether it's needed.)
Initiator# devfsadm -i iscsi
Initiator# ls /dev/dsk
c0d0s0    c0d0s3    c0d0s6    c0d1s1    c0d1s4    c0d1s7    c1t7d0s2  c1t7d0s5
c0d0s1    c0d0s4    c0d0s7    c0d1s2    c0d1s5    c1t7d0s0  c1t7d0s3  c1t7d0s6
c0d0s2    c0d0s5    c0d1s0    c0d1s3    c0d1s6    c1t7d0s1  c1t7d0s4  c1t7d0s7
Initiator# ls -l /dev/dsk/c1t7d0s0
lrwxrwxrwx   1 root     root         100 Mar 28 00:40 /dev/dsk/c1t7d0s0 ->
../../devices/iscsi/disk@0000iqn.1986-03.com.sun%3A02%3Ac8a82272-b354-c913-80f9-db9cb378a6f60001,0:a
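
The link target looks odd because the colons in the iSCSI name appear as `%3A`. A quick sketch of decoding it, to confirm that the device really is the target created earlier. The helper `decode_colons` is hypothetical and handles only `%3A`, which is the only escape that appears in this particular path.

```shell
#!/bin/sh
# Last component of the /devices path shown in the ls -l output above.
DEV='disk@0000iqn.1986-03.com.sun%3A02%3Ac8a82272-b354-c913-80f9-db9cb378a6f60001,0:a'

# Replace each %3A escape with the ':' it stands for.
decode_colons() { echo "$1" | sed 's/%3[Aa]/:/g'; }

decode_colons "$DEV"
```

After decoding, the target's iSCSI Name (iqn.1986-03.com.sun:02:c8a82272-b354-c913-80f9-db9cb378a6f6) is visible in the middle of the device name.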

Now that the local device entry exists, we can do something useful with it. Installing a new file system requires the use of format(1M) to partition the "disk" but it is assumed that the reader knows how to do that. However, here is the first part of the format dialogue, to show that format lists the new disk device with its unique identifier - the same identifier listed in /devices/iscsi.
Initiator# format
Searching for disks...done

c1t7d0: configured with capacity of 199.98MB

AVAILABLE DISK SELECTIONS:
       0. c0d0 
          /virtual-devices@100/channel-devices@200/disk@0
       1. c0d1 
          /virtual-devices@100/channel-devices@200/disk@1
       2. c1t7d0 
          /iscsi/disk@0000iqn.1986-03.com.sun%3A02%3Ac8a82272-b354-c913-80f9-db9cb378a6f60001,0
Specify disk (enter its number): 2
selecting c1t7d0
[disk formatted]
Disk not labeled.  Label it now? no

Let's jump to the end of the partitioning steps, after assigning all of the available disk space to partition 0:
partition> print
Current partition table (unnamed):
Total disk cylinders available: 16382 + 2 (reserved cylinders)

Part      Tag    Flag     Cylinders        Size            Blocks
  0       root    wm       0 - 16381      199.98MB    (16382/0/0) 409550
  1 unassigned    wu       0                0         (0/0/0)          0
  2     backup    wu       0 - 16381      199.98MB    (16382/0/0) 409550
  3 unassigned    wm       0                0         (0/0/0)          0
  4 unassigned    wm       0                0         (0/0/0)          0
  5 unassigned    wm       0                0         (0/0/0)          0
  6 unassigned    wm       0                0         (0/0/0)          0
  7 unassigned    wm       0                0         (0/0/0)          0

partition> label
Ready to label disk, continue? y

The new raw disk needs a file system.
Initiator# newfs /dev/rdsk/c1t7d0s0
newfs: construct a new file system /dev/rdsk/c1t7d0s0: (y/n)? y
/dev/rdsk/c1t7d0s0:     409550 sectors in 16382 cylinders of 5 tracks, 5 sectors
        200.0MB in 1024 cyl groups (16 c/g, 0.20MB/g, 128 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
 32, 448, 864, 1280, 1696, 2112, 2528, 2944, 3232, 3648,
Initializing cylinder groups:
....................
super-block backups for last 10 cylinder groups at:
 405728, 406144, 406432, 406848, 407264, 407680, 408096, 408512, 408928, 409344

Back on the target:
Target# zfs list
NAME           USED  AVAIL  REFER  MOUNTPOINT
myscsi         200M  57.9M  24.5K  /myscsi
myscsi/jvol0  32.7M   225M  32.7M  -
Finally, the initiator has a new file system, on which we can install a zone.
Initiator# mkdir /zones/newroots
Initiator# mount /dev/dsk/c1t7d0s0 /zones/newroots
Initiator# zonecfg -z iscuzone
iscuzone: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:iscuzone> create
zonecfg:iscuzone> set zonepath=/zones/newroots/iscuzone
zonecfg:iscuzone> add inherit-pkg-dir
zonecfg:iscuzone:inherit-pkg-dir> set dir=/opt
zonecfg:iscuzone:inherit-pkg-dir> end
zonecfg:iscuzone> exit
Initiator# zoneadm -z iscuzone install
Preparing to install zone <iscuzone>.
Creating list of files to copy from the global zone.
Copying <2762> files to the zone.
Initializing zone product registry.
Determining zone package initialization order.
Preparing to initialize <1162> packages on the zone.
...
Initialized <1162> packages on zone.
Zone <iscuzone> is initialized.
Installation of these packages generated warnings: 
The file  contains a log of the zone installation.

There it is: a Container on an iSCSI target on a ZFS zvol.

Zone Lifecycle, and Tech Support

There is more to management of Containers than creating them. When a Solaris instance is upgraded, all of its native Containers are upgraded as well. Some upgrade methods work better with certain system configurations than others. This is true for UFS, ZFS, other local file system types, and iSCSI targets that use any of them for underlying storage.

You can use Solaris Live Upgrade (LU) to patch or upgrade a system with Containers. If the Containers are on a traditional file system which uses UFS (e.g. /, /export/home), LU will automatically do the right thing. Further, if you create a UFS file system on an iSCSI target and install one or more Containers on it, the alternate boot environment (ABE) will also need file space for its copy of those Containers. To mimic the layout of the original boot environment (BE) you could use another UFS file system on another iSCSI target. The lucreate command would look something like this:

# lucreate -m /:/dev/dsk/c0t0d0s0:ufs   -m /zones:/dev/dsk/c1t7d0s0:ufs -n newBE

Conclusion

If you want to put your Solaris Containers on NAS storage, Solaris 10 8/07 will help you get there, using iSCSI.

Posted at 11:38AM Apr 08, 2008 by Jeffrey Victor in Solaris 10 Containers  |  Comments[1]

Friday Mar 21, 2008
High-availability Networking for Solaris Containers

Here's another example of Containers that can manage their own affairs.

Sometimes you want to closely manage the devices that a Solaris Container uses. This is easy to do from the global zone: by default a Container does not have direct access to devices. It does have indirect access to some devices, e.g. via a file system that is available to the Container.

By default, zones use NICs that they share with the global zone, and perhaps with other zones. In the past these were just called "zones." Starting with Solaris 10 8/07, these are now referred to as "shared-IP zones." The global zone administrator manages all networking aspects of shared-IP zones.

Sometimes it would be easier to give direct control of a Container's devices to its owner. An excellent example of this is the option of allowing a Container to manage its own network interfaces. This enables it to configure IP Multipathing for itself, as well as IP Filter and other network features. Using IPMP increases the availability of the Container by creating redundant network paths to the Container. When configured correctly, this can prevent the failure of a network switch, network cable or NIC from blocking network access to the Container.

As described at docs.sun.com, to use IP Multipathing you must choose two network devices of the same type, e.g. two ethernet NICs. Those NICs are placed into an IPMP group through the use of the command ifconfig(1M). Usually this is done by placing the appropriate ifconfig parameters into files named /etc/hostname.<NIC-instance>, e.g. /etc/hostname.bge0.

An IPMP group is associated with an IP address. Packets leaving any NIC in the group have a source address of the IPMP group. Packets with a destination address of the IPMP group can enter through either NIC, depending on the state of the NICs in the group.

Delegating network configuration to a Container requires use of the new IP Instances feature. It's easy to create a zone that uses this feature, making this an "exclusive-IP zone." One new line in zonecfg(1M) will do it:

zonecfg:twilight> set ip-type=exclusive
Of course, you'll need at least two network devices in the IPMP group. Using IP Instances will dedicate these two NICs to this Container exclusively. Also, the Container will need direct access to the two network devices. Configuring all of that looks like this:
global# zonecfg -z twilight
zonecfg:twilight> create
zonecfg:twilight> set zonepath=/zones/roots/twilight
zonecfg:twilight> set ip-type=exclusive
zonecfg:twilight> add net
zonecfg:twilight:net> set physical=bge1
zonecfg:twilight:net> end
zonecfg:twilight> add net
zonecfg:twilight:net> set physical=bge2
zonecfg:twilight:net> end
zonecfg:twilight> add device
zonecfg:twilight:device> set match=/dev/net/bge1
zonecfg:twilight:device> end
zonecfg:twilight> add device
zonecfg:twilight:device> set match=/dev/net/bge2
zonecfg:twilight:device> end
zonecfg:twilight> exit
As usual, the Container must be installed and booted with zoneadm(1M):
global# zoneadm -z twilight install
global# zoneadm -z twilight boot
Now you can log in to the Container's console and answer the usual configuration questions:
global# zlogin -C twilight
<answer questions>
<the zone automatically reboots>

After the Container reboots, you can configure IPMP. There are two methods. One uses link-based failure detection and one uses probe-based failure detection.

Link-based detection requires the use of a NIC which supports this feature. Some NICs that support this are hme, eri, ce, ge, bge, qfe and vnet (part of Sun's Logical Domains). They are able to detect failure of the link immediately and report that failure to Solaris. Solaris can then take appropriate steps to ensure that network traffic continues to flow on the remaining NIC(s).

Other NICs do not support this link-based failure detection, and must use probe-based detection. This method uses ICMP packets ("pings") from the NICs in the IPMP group to detect failure of a NIC. This requires one IP address per NIC, in addition to the IP address of the group.

Regardless of the method used, configuration can be accomplished manually or via files /etc/hostname.<NIC-instance>. First I'll describe the manual method.

Link-based Detection

Using link-based detection is easiest. The commands to configure IPMP look like these, whether they're run in an exclusive-IP zone for itself, or in the global zone, for its NICs and for NICs used by shared-IP Containers:
# ifconfig bge1 plumb
# ifconfig bge1 twilight group ipmp0 up
# ifconfig bge2 plumb
# ifconfig bge2 group ipmp0 up
Note that those commands only achieve the desired network configuration until the next time that Solaris boots. To configure Solaris to do the same thing when it next boots, you must put the same configuration information into configuration files. Inserting those parameters into configuration files is also easy:
/etc/hostname.bge1:
twilight group ipmp0 up

/etc/hostname.bge2:
group ipmp0 up
Those two files will be used to configure networking the next time that Solaris boots. Of course, an IP address entry for twilight is required in /etc/inet/hosts.

If you have entered the ifconfig commands directly, you are finished. You can test your IPMP group with the if_mpadm command, which can be run in the global zone, to test an IPMP group in the global zone, or can be run in an exclusive-IP zone, to test one of its groups:

# ifconfig -a
...
bge1: flags=201000843 mtu 1500 index 4
        inet 129.152.2.72 netmask ffff0000 broadcast 129.152.255.255
        groupname ipmp0
        ether 0:14:4f:f8:9:1d
bge2: flags=201000843 mtu 1500 index 5
        inet 0.0.0.0 netmask ff000000
        groupname ipmp0
        ether 0:14:4f:fb:ca:b
...

# if_mpadm -d bge1
# ifconfig -a
...
bge1: flags=289000842 mtu 0 index 4
        inet 0.0.0.0 netmask 0
        groupname ipmp0
        ether 0:14:4f:f8:9:1d
bge2: flags=201000843 mtu 1500 index 5
        inet 0.0.0.0 netmask ff000000
        groupname ipmp0
        ether 0:14:4f:fb:ca:b
bge2:1: flags=201000843 mtu 1500 index 5
        inet 129.152.2.72 netmask ffff0000 broadcast 129.152.255.255
...

# if_mpadm -r bge1
# ifconfig -a
...
bge1: flags=201000843 mtu 1500 index 4
        inet 129.152.2.72 netmask ffff0000 broadcast 129.152.255.255
        groupname ipmp0
        ether 0:14:4f:f8:9:1d
bge2: flags=201000843 mtu 1500 index 5
        inet 0.0.0.0 netmask ff000000
        groupname ipmp0
        ether 0:14:4f:fb:ca:b
...

If you are using link-based detection, that's all there is to it!
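To make the failover seen in the ifconfig output above easier to follow, here is a small Python model of an IPMP group. It is only an illustrative sketch, not how Solaris implements IPMP: the IpmpGroup class and its method names are made up for this example, and the only values taken from the output above are the interface names, the group name and the address 129.152.2.72.

```python
class IpmpGroup:
    """Toy model of an IPMP group: one data address, several NICs."""

    def __init__(self, name, nics, data_address):
        self.name = name
        self.online = {nic: True for nic in nics}   # NIC name -> usable?
        self.data_address = data_address
        self.home = nics[0]     # NIC the data address was configured on
        self.owner = nics[0]    # NIC currently hosting the data address

    def detach(self, nic):
        """Model 'if_mpadm -d <nic>': take a NIC offline and fail over."""
        self.online[nic] = False
        if self.owner == nic:
            survivors = [n for n, up in self.online.items() if up]
            if not survivors:
                raise RuntimeError("no working NIC left in group " + self.name)
            self.owner = survivors[0]

    def reattach(self, nic):
        """Model 'if_mpadm -r <nic>': bring a NIC back and fail back."""
        self.online[nic] = True
        if nic == self.home:
            self.owner = nic


group = IpmpGroup("ipmp0", ["bge1", "bge2"], "129.152.2.72")
group.detach("bge1")
after_detach = group.owner      # which NIC hosts the address after detaching bge1
group.reattach("bge1")
after_reattach = group.owner    # which NIC hosts it after reattaching bge1
print(after_detach, after_reattach)
```

In this model the address moves to bge2 while bge1 is detached and returns to bge1 afterwards, which mirrors the bge2:1 entry appearing and disappearing in the output above.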

Probe-based detection

As mentioned above, using probe-based detection requires more IP addresses:

/etc/hostname.bge1:
twilight netmask + broadcast + group ipmp0 up addif twilight-test-bge1 \
deprecated -failover netmask + broadcast + up
/etc/hostname.bge2:
twilight-test-bge2 deprecated -failover netmask + broadcast + group ipmp0 up
Three entries for hostname and IP address pairs will, of course, be needed in /etc/inet/hosts.
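As a quick check on the address count, this Python sketch lists the hostnames that would need entries in /etc/inet/hosts for a probe-based group. The function name is hypothetical, and the "-test-" naming pattern is simply the one used in the files above, not a Solaris requirement.

```python
def ipmp_hostnames(data_name, nics):
    """Return the hostnames needed: one for the group, one test name per NIC."""
    test_names = [f"{data_name}-test-{nic}" for nic in nics]
    return [data_name] + test_names


names = ipmp_hostnames("twilight", ["bge1", "bge2"])
print(names)
```

With two NICs that gives three names, and each additional NIC in the group adds one more test address.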

All that's left is a reboot of the Container. If a reboot is not practical at this time, you can accomplish the same effect by using ifconfig(1M) commands:

twilight# ifconfig bge1 plumb
twilight# ifconfig bge1 twilight netmask + broadcast + group ipmp0 up addif \
twilight-test-bge1 deprecated -failover netmask + broadcast + up
twilight# ifconfig bge2 plumb
twilight# ifconfig bge2 twilight-test-bge2 deprecated -failover netmask + \
broadcast + group ipmp0 up

Conclusion

Whether link-based failure detection or probe-based failure detection is used, we have a Container with these network properties:

  1. Two network interfaces
  2. Automatic failover between the two NICs
You can expand on this with more NICs to achieve even more resiliency or even greater bandwidth.
Posted at 01:50PM Mar 21, 2008 by Jeffrey Victor in Solaris 10 Containers  |  Comments[5]

Tuesday Feb 05, 2008
New Containers Guide
Recently Version 2.0 of the German document "Solaris Container Leitfaden" was released. This 93-page manual (Leitfaden=manual/book) includes best practices and cookbooks about Containers. You can find it at http://blogs.sun.com/solarium/entry/solaris_container_leitfaden_2_0. There is discussion of a translation into English. If that would be useful to you, please add a comment.
Posted at 11:08AM Feb 05, 2008 by Jeffrey Victor in Solaris 10 Containers  |  Comments[8]

Tuesday Oct 16, 2007
Solaris Training at LISA'07

It's time for a "shameless plug"...

If you would like to develop deeper Solaris skills, LISA'07 offers some excellent opportunities. LISA is a conference organized by Usenix, and is intended for Large Installation System Administrators. This year, LISA will be held in Dallas, Texas, November 11-16. It includes vendor exhibits, training sessions and invited talks. This year the keynote address will be delivered by John Strassner, Motorola Fellow and Vice President, and is entitled "Autonomic Administration: HAL 9000 Meets Gene Roddenberry."

Many tutorials will be available, including four full-day sessions focusing on Solaris.

My session covers the concepts, uses and administrative interfaces of Solaris Resource Management, focusing on Solaris Containers and Projects.

Early-bird registration ends this Friday, October 19 and saves $Hundreds compared to the Procrastinator's Rate.

Posted at 10:41AM Oct 16, 2007 by Jeffrey Victor in Solaris 10 Containers  |  Comments[1]

Wednesday Sep 26, 2007
Interview about Containers Deployment
Wondering what all the Solaris Containers excitement is about? Virtualization Strategy Magazine interviewed Mike Sink from Kichler Lighting recently. You can listen to the podcast.
Posted at 10:26AM Sep 26, 2007 by Jeffrey Victor in Solaris 10 Containers  |  Comments[1]

Monday Sep 24, 2007
First Launch
I finally launched my LOC Vulcanite rocket for the first time. The results were outstanding.

I blogged about my Vulcanite earlier this year. This rocket is 53" tall (4.4 ft, 135 cm) and weighs 32 oz (2 pounds, about 1 kg) before adding a motor. I painted it orange and black to make it more visible against blue sky or light clouds.

My goals for this rocket include:

  1. determining that it is flight worthy
  2. obtaining my Level 1 certification
  3. gaining experience with high-power motors
  4. flying a rocket to one mile (1.6 km) altitude
  5. if it seems that the rocket will survive Mach 1, attempt to do so
On the day I first launched it, I achieved (1), (2) and (3). The first launch, and L1 certification attempt, was on an AeroTech H73J. This motor weighs 10 ounces when ready for launch, and is about 6" (15 cm) long. It provides 16 pounds of force at liftoff, sufficient to launch this rocket easily, but not so much that I have far to walk if it decides to become a "cruise missile" by turning and flying horizontally.

The results were gratifying.
(When I take pictures of a launch, I press the shutter as soon as I see any vertical movement, which resulted in a well-composed picture. At least it did this time...)

According to the on-board altimeter I added, it flew to 1,584 feet (480 m). More importantly, it flew almost perfectly straight up, and the 24-inch parachute returned it safely to Earth not far away from the launch rail. However, it seems that the delay I chose - the time before the parachute is ejected - was not long enough. With the correct delay, the rocket would have flown higher.

Beaming with success, I decided that the next launch would begin to test the limits of this rocket. I chose an I218R - an 8-inch (20 cm) motor with almost twice the total impulse of the previous motor. (Think of total impulse as thrust multiplied by the time the motor burns, i.e. the total push the motor delivers.) Even though I knew it would fly much higher, the wind was very light that day, so I didn't expect to walk far to recover it.

With this motor, the Vulcanite flew to 4,469 feet (1.35 km)! Also impressive was its maximum speed: over 500 MPH (800 km/h). You can see that in the picture to the right: I have an itchy shutter finger, but the rocket launched so fast I missed it entirely!

Unfortunately, although the nose cone ejected properly, the parachute never came out. The two ends of the rocket, connected by an elastic cord, fell over 4,000 feet to the ground. Fortunately, the launch area was an empty corn field with large clods of dirt which had been softened by rain the day before. The only damage was a partial crack in one plywood fin. A little sanding, some new epoxy, and it should fly again.

To one mile?

Posted at 01:00PM Sep 24, 2007 by Jeffrey Victor in Rocketry  |  Comments[1]

Tuesday Sep 11, 2007
Get Yer Solaris 10 8/07!
Availability of the newest update to Solaris 10 was officially announced today. It includes these improvements, among others:

Details are available at docs.sun.com. You can download the software at http://www.sun.com/software/solaris/get.jsp.

Posted at 03:18PM Sep 11, 2007 by Jeffrey Victor in General  | 

Wednesday Sep 05, 2007
New Zones Features
On September 4, 2007, Solaris 10 8/07 became available. You can download it - free of charge - at: http://www.sun.com/software/solaris/. Just click on "Get Software" and follow the instructions.

This update to Solaris 10 has many new features. Of those, many enhance Solaris Containers either directly or indirectly. This update brings the most important changes to Containers since they were introduced in March of 2005. A brief introduction to them seems appropriate, but first a review of the previous update.

Solaris 10 11/06 added four features to Containers. One of them is called "configurable privileges" and allows the platform administrator to tailor the abilities of a Container to the needs of its application. I blogged about configurable privileges before, so I won't say any more here.

At least as important as that feature was the new ability to move (also called 'migrate') a Container from one Solaris 10 computer to another. This uses the 'detach' and 'attach' sub-commands to zoneadm(1M).

Other, minor new features, included:

New Features in Solaris 10 8/07 that Enhance Containers

New Resource Management Features

Solaris 10 8/07 has improved the resource management features of Containers. Some of these are new resource management features and some are improvements to the user interface. First I will describe three new "RM" features.

Earlier releases of Solaris 10 included the Resource Capping Daemon. This tool enabled you to place a 'soft cap' on the amount of RAM (physical memory) that an application, user or group of users could use. Excess usage would be detected by rcapd. When it did, physical memory pages owned by that entity would be paged out until the memory usage decreased below the cap.

Although it was possible to apply this tool to a zone, it was cumbersome and required cooperation from the administrator of the Container. In other words, the root user of a capped Container could change the cap. This made it inappropriate for potentially hostile environments, including service providers.

Solaris 10 8/07 enables the platform administrator to set a physical memory cap on a Container using an enhanced version of rcapd. Cooperation of the Container's administrator is not necessary - only the platform administrator can enable or disable this service or modify the caps. Further, usage has been greatly simplified to the following syntax:

global# zonecfg -z myzone
zonecfg:myzone> add capped-memory
zonecfg:myzone:capped-memory> set physical=500m
zonecfg:myzone:capped-memory> end
zonecfg:myzone> exit
The next time the Container boots, this cap (500MB of RAM) will be applied to it. The cap can also be modified while the Container is running, with:
global# rcapadm -z myzone -m 600m
Because this cap does not reserve RAM, you can over-subscribe RAM usage. The only drawback is the possibility of paging.
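A rough way to see how far a system is over-subscribed is to add up the caps and compare the total with installed RAM. The Python sketch below does just that. The zone names other than myzone and the 1024MB of RAM are assumptions made for the example, and the function is not part of any Solaris tool.

```python
def oversubscription(physical_ram_mb, caps_mb):
    """Return the sum of per-zone physical memory caps and its ratio to RAM."""
    total = sum(caps_mb.values())
    return total, total / physical_ram_mb


# 500 MB matches the zonecfg example above; zone2 and zone3 are hypothetical.
caps = {"myzone": 500, "zone2": 500, "zone3": 500}
total_mb, ratio = oversubscription(1024, caps)   # 1024 MB of RAM is an assumption
print(total_mb, round(ratio, 2))
```

A ratio above 1.0 means the caps add up to more than the RAM in the system, so paging becomes possible if every Container approaches its cap at the same time.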

For more details, see the online documentation.

Virtual memory (i.e. swap space) can also be capped. This is a 'hard cap.' In a Container which has a swap cap, an attempt by a process to allocate more VM than is allowed will fail. (If you are familiar with memory allocation in C: malloc() will fail and set errno to ENOMEM.)
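The difference between a soft cap and a hard cap can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Solaris code: the SwapCap class is invented for illustration, and it only mimics the "allocation is refused with ENOMEM" behavior described above.

```python
import errno


class SwapCap:
    """Toy model of a hard cap: requests beyond the limit are refused."""

    def __init__(self, limit_bytes):
        self.limit = limit_bytes
        self.used = 0

    def allocate(self, nbytes):
        # A hard cap never lets usage exceed the limit, even briefly.
        if self.used + nbytes > self.limit:
            raise OSError(errno.ENOMEM, "swap cap exceeded")
        self.used += nbytes


cap = SwapCap(1 * 1024**3)          # a 1 GB cap, as in 'set swap=1g'
cap.allocate(768 * 1024**2)         # fits under the cap
try:
    cap.allocate(512 * 1024**2)     # would push usage past 1 GB
    denied = False
except OSError as exc:
    denied = exc.errno == errno.ENOMEM
print(denied, cap.used)
```

Note that the failed request leaves usage unchanged. With the physical memory cap, by contrast, usage can exceed the cap for a while until rcapd pages memory out.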

The syntax is very similar to the physical memory cap:

global# zonecfg -z myzone
zonecfg:myzone> add capped-memory
zonecfg:myzone:capped-memory> set swap=1g
zonecfg:myzone:capped-memory> end
zonecfg:myzone> exit
This limit can also be changed for a running Container:
global# prctl -n zone.max-swap -v 2g -t privileged -r -e deny -i zone myzone
Just as with the physical memory cap, if you want to change the setting both for a running Container and for the next time it boots, you must use zonecfg for the stored configuration and also prctl (swap cap) or rcapadm (physical memory cap) for the running Container.

The third new memory cap is locked memory. This is the amount of physical memory that a Container can lock down, i.e. prevent from being paged out. By default a Container now has the proc_lock_memory privilege, so it is wise to set this cap for all Containers.

Here is an example:

global# zonecfg -z myzone
zonecfg:myzone> add capped-memory
zonecfg:myzone:capped-memory> set locked=100m
zonecfg:myzone:capped-memory> end
zonecfg:myzone> exit

Simplified Resource Management Features

Dedicated CPUs

Many existing resource management features have a new, simplified user interface. For example, the "dedicated-cpu" resource re-uses the existing Dynamic Resource Pools features. But instead of needing many commands to configure them, configuration can be as simple as:

global# zonecfg -z myzone
zonecfg:myzone> add dedicated-cpu
zonecfg:myzone:dedicated-cpu> set ncpus=1-3
zonecfg:myzone:dedicated-cpu> end
zonecfg:myzone> exit
After using that command, when that Container boots, Solaris:
  1. removes a CPU from the default pool
  2. assigns that CPU to a newly created temporary pool
  3. associates that Container with that pool, i.e. only schedules that Container's processes on that CPU
Further, if the load on that CPU exceeds a default threshold and another CPU can be moved from another pool, Solaris will do that, up to the maximum configured amount of three CPUs. Finally, when the Container is stopped, the temporary pool is destroyed and its CPU(s) are placed back in the default pool.
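The life cycle of that temporary pool can be pictured with a small Python model. Treat it as a sketch under several assumptions: the CpuPools class is hypothetical, the 0.8 load threshold is invented (the real default is not given here), CPUs are only borrowed from the default pool, and the model assumes the default pool must always keep at least one CPU.

```python
class CpuPools:
    """Toy model of a default pool plus temporary per-zone pools."""

    def __init__(self, total_cpus):
        self.default = total_cpus
        self.temp = {}   # zone name -> [current, minimum, maximum] CPU counts

    def boot(self, zone, minimum, maximum):
        # Assumption: the default pool must keep at least one CPU.
        if self.default - minimum < 1:
            raise RuntimeError("not enough CPUs for " + zone)
        self.default -= minimum
        self.temp[zone] = [minimum, minimum, maximum]

    def grow_if_busy(self, zone, load, threshold=0.8):
        # threshold=0.8 is a made-up value for this sketch.
        current, _, maximum = self.temp[zone]
        if load > threshold and current < maximum and self.default > 1:
            self.default -= 1
            self.temp[zone][0] += 1

    def halt(self, zone):
        # The temporary pool is destroyed and its CPUs go back to the default pool.
        current, _, _ = self.temp.pop(zone)
        self.default += current


pools = CpuPools(total_cpus=4)
pools.boot("myzone", 1, 3)                  # ncpus=1-3
for _ in range(3):
    pools.grow_if_busy("myzone", load=0.95)
during = (pools.default, pools.temp["myzone"][0])
pools.halt("myzone")
print(during, pools.default)
```

On this hypothetical 4-CPU system the zone's pool grows from one CPU to its maximum of three and stops there, and all four CPUs are back in the default pool once the zone halts.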

Also, four existing project resource controls were applied to Containers:

global# zonecfg -z myzone
zonecfg:myzone> set max-shm-memory=100m
zonecfg:myzone> set max-shm-ids=100
zonecfg:myzone> set max-msg-ids=100
zonecfg:myzone> set max-sem-ids=100
zonecfg:myzone> exit
Fair Share Scheduler

A commonly used method to prevent "CPU hogs" from impacting other workloads is to assign a number of CPU shares to each workload, or to each zone. The relative number of shares assigned per zone guarantees a relative minimum amount of CPU power. This is less wasteful than dedicating a CPU to a Container that will not completely utilize the dedicated CPU(s).
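The arithmetic behind shares is simple enough to show in Python. This is a minimal sketch, assuming every zone is busy and competing for CPU at the same time. The zone names webzone and dbzone and their share values are made up, and the function is not a Solaris interface.

```python
def fss_minimum_fractions(shares):
    """Map each zone to its guaranteed minimum fraction of CPU power."""
    total = sum(shares.values())
    return {zone: n / total for zone, n in shares.items()}


# Hypothetical zones and share counts; only 'myzone' appears in this post.
shares = {"myzone": 100, "webzone": 200, "dbzone": 100}
fractions = fss_minimum_fractions(shares)
for zone, fraction in fractions.items():
    print(f"{zone}: at least {fraction:.0%} of CPU when every zone is busy")
```

Only the ratio between the numbers matters: shares of 1, 2 and 1 would give the same result. A zone that is idle does not use its portion, so the busy zones can consume more than their minimum.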

Several steps were needed to configure this in the past. Solaris 10 8/07 simplifies this greatly: now just two steps are needed. The system must use FSS as the default scheduler. This command tells the system to use FSS as the default scheduler the next time it boots.

global# dispadmin -d FSS
Also, the Container must be assigned some shares:
global# zonecfg -z myzone
zonecfg:myzone> set cpu-shares=100
zonecfg:myzone> exit
Shared Memory Accounting

One feature simplification is not a reduced number of commands, but reduced complexity in resource monitoring. Prior to Solaris 10 8/07, the accounting of shared memory pages had an unfortunate subtlety. If two processes in a Container shared some memory, per-Container summaries counted the shared memory usage once for every process that was sharing the memory. It would appear that a Container was using more memory than it really was.

This was changed in 8/07. Now, in the per-Container usage section of prstat and similar tools, shared memory pages are only counted once per Container.
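The two ways of counting can be compared with a short Python sketch. The page names and the two-process Container are invented for the example, and the functions only illustrate the counting rule, not how prstat actually gathers its data.

```python
def usage_old(processes):
    """Pre-8/07 style: shared pages are added again for every process."""
    return sum(len(p["private"]) + len(p["shared"]) for p in processes)


def usage_new(processes):
    """8/07 style: each shared page is counted once per Container."""
    private = sum(len(p["private"]) for p in processes)
    shared = set()
    for p in processes:
        shared |= p["shared"]
    return private + len(shared)


# Two hypothetical processes in one Container sharing three pages.
procs = [
    {"private": {"a1", "a2"}, "shared": {"s1", "s2", "s3"}},
    {"private": {"b1"}, "shared": {"s1", "s2", "s3"}},
]
old_pages = usage_old(procs)
new_pages = usage_new(procs)
print(old_pages, new_pages)
```

The gap between the two totals grows with every additional process that attaches to the same shared memory, which is why the old summaries could badly overstate the usage of a database Container.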

Global Zone Resource Management

Solaris 10 8/07 adds the ability to persistently assign resource controls to the global zone and its processes. These controls can be applied:

Example:
global# zonecfg -z global
zonecfg:global> set cpu-shares=100
zonecfg:global> set scheduling-class=FSS
zonecfg:global> exit
Use those features with caution. For example, assigning a physical memory cap of 100MB to the global zone will surely cause problems...

New Boot Arguments

The following boot arguments can now be used:

Argument or Option    Meaning
-s                    Boot to the single-user milestone
-m <milestone>        Boot to the specified milestone
-i </path/to/init>    Boot the specified program as 'init'. This is only useful with branded zones.

Allowed syntaxes include:

global# zoneadm -z myzone boot -- -s
global# zoneadm -z yourzone reboot -- -i /sbin/myinit
ozone# reboot -- -m verbose
In addition, these boot arguments can be stored with zonecfg, for later boots.
global# zonecfg -z myzone
zonecfg:myzone> set bootargs="-m verbose"
zonecfg:myzone> exit

Configurable Privileges

Of the existing three DTrace privileges, dtrace_proc and dtrace_user can now be assigned to a Container. This allows the use of DTrace from within a Container. Of course, even the root user in a Container is still not allowed to view or modify kernel data, but DTrace can be used in a Container to look at system call information and profiling data for user processes.

Also, the privilege proc_priocntl can be added to a Container to enable the root user of that Container to change the scheduling class of its processes.

IP Instances

This is a new feature that allows a Container to have exclusive access to one or more network interfaces. No other Container, even the global zone, can send or receive packets on that NIC.

This also allows a Container to control its own network configuration, including routing, IP Filter, the ability to be a DHCP client, and others. The syntax is simple:

global# zonecfg -z myzone
zonecfg:myzone> set ip-type=exclusive
zonecfg:myzone> add net
zonecfg:myzone:net> set physical=bge1
zonecfg:myzone:net> end
zonecfg:myzone> exit

IP Filter Improvements

Some network architectures call for two systems to communicate via a firewall box or other piece of network equipment. It is often desirable to create two Containers that communicate via an external device, for similar reasons. Unfortunately, prior to Solaris 10 8/07 that was not possible. In 8/07 the global zone administrator can configure such a network architecture with the existing IP Filter commands.

Upgrading and Patching Containers with Live Upgrade

Solaris 10 8/07 adds the ability to use Live Upgrade tools on a system with Containers. This makes it possible to apply an update to a zoned system, e.g. updating from Solaris 10 11/06 to Solaris 10 8/07. It also drastically reduces the downtime necessary to apply some patches.

The latter ability requires more explanation. An existing challenge in the maintenance of zones is patching - each zone must be patched when a patch is applied. If the patch must be applied while the system is down, the downtime can be significant.

Fortunately, Live Upgrade can create an Alternate Boot Environment (ABE) and the ABE can be patched while the Original Boot Environment (OBE) is still running its Containers and their applications. After the patches have been applied, the system can be re-booted into the ABE. Downtime is limited to the time it takes to re-boot the system.

An additional benefit can be seen if there is a problem with the patch and that particular application environment. Instead of backing out the patch, the system can be re-booted into the OBE while the problem is investigated.

Branded Zones (Branded Containers)

Sometimes it would be useful to run an application in a Container, but the application is not yet available for Solaris, or is not available for the version of Solaris that is being run. To run an application like that, perhaps a special Solaris environment could be created that only runs applications for that version of Solaris, or for that operating system.

Solaris 10 8/07 contains a new framework called Branded Zones. This framework enables the creation and installation of Containers that are not the default 'native' type of Containers, but have been tailored to run 'non-native' applications.

Solaris Containers for Linux Applications

The first brand to be integrated into Solaris 10 is the brand called 'lx'. This brand is intended for x86 applications which run well on CentOS 3 or Red Hat Enterprise Linux 3. This brand is specific to x86 computers. The name of this feature is Solaris Containers for Linux Applications.

Conclusion

This was only a brief introduction to these many new and improved features. Details are available in the usual places, including http://docs.sun.com, http://sun.com/bigadmin, and http://www.sun.com/software/solaris/utilization.jsp.

Posted at 12:25PM Sep 05, 2007 by Jeffrey Victor in Solaris 10 Containers  |  Comments[2]