Wednesday Sep 26, 2007
Wednesday Sep 05, 2007
This update to Solaris 10 has many new features. Of those, many enhance Solaris Containers either directly or indirectly. This update brings the most important changes to Containers since they were introduced in March of 2005. A brief introduction to them seems appropriate, but first a review of the previous update.
Solaris 10 11/06 added four features to Containers. One of them is called "configurable privileges" and allows the platform administrator to tailor the abilities of a Container to the needs of its application. I blogged about configurable privileges before, so I won't say any more here.
At least as important as that feature was the new ability to move (also called 'migrate') a Container from one Solaris 10 computer to another. This uses the 'detach' and 'attach' sub-commands to zoneadm(1M).
Other, minor new features included:
Earlier releases of Solaris 10 included the Resource Capping Daemon. This tool enabled you to place a 'soft cap' on the amount of RAM (physical memory) that an application, user or group of users could use. Excess usage would be detected by rcapd. When it did, physical memory pages owned by that entity would be paged out until the memory usage decreased below the cap.
Although it was possible to apply this tool to a zone, it was cumbersome and required cooperation from the administrator of the Container. In other words, the root user of a capped Container could change the cap. This made it inappropriate for potentially hostile environments, including service providers.
Solaris 10 8/07 enables the platform administrator to set a physical memory cap on a Container using an enhanced version of rcapd. Cooperation of the Container's administrator is not necessary - only the platform administrator can enable or disable this service or modify the caps. Further, usage has been greatly simplified to the following syntax:
    global# zonecfg -z myzone
    zonecfg:myzone> add capped-memory
    zonecfg:myzone:capped-memory> set physical=500m
    zonecfg:myzone:capped-memory> end
    zonecfg:myzone> exit

The next time the Container boots, this cap (500MB of RAM) will be applied to it. The cap can also be modified while the Container is running, with:
    global# rcapadm -z myzone -m 600m

Because this cap does not reserve RAM, you can over-subscribe RAM usage. The only drawback is the possibility of paging.
For more details, see the online documentation.
Virtual memory (i.e. swap space) can also be capped. This is a 'hard cap.' In a Container which has a swap cap, an attempt by a process to allocate more VM than is allowed will fail. (If you are familiar with system calls: malloc() will fail with ENOMEM.)
The syntax is very similar to the physical memory cap:
    global# zonecfg -z myzone
    zonecfg:myzone> add capped-memory
    zonecfg:myzone:capped-memory> set swap=1g
    zonecfg:myzone:capped-memory> end
    zonecfg:myzone> exit

This limit can also be changed for a running Container:
    global# prctl -n zone.max-swap -v 2g -t privileged -r -e deny -i zone myzone

Just as with the physical memory cap, if you want to change the setting for a running Container and for the next time it boots, you must use zonecfg and prctl or rcapadm.
The third new memory cap is locked memory. This is the amount of physical memory that a Container can lock down, i.e. prevent from being paged out. By default a Container now has the proc_lock_memory privilege, so it is wise to set this cap for all Containers.
Here is an example:
    global# zonecfg -z myzone
    zonecfg:myzone> add capped-memory
    zonecfg:myzone:capped-memory> set locked=100m
    zonecfg:myzone:capped-memory> end
    zonecfg:myzone> exit
Many existing resource management features have a new, simplified user interface. For example, "dedicated-cpus" re-use the existing Dynamic Resource Pools features. But instead of needing many commands to configure them, configuration can be as simple as:
    global# zonecfg -z myzone
    zonecfg:myzone> add dedicated-cpu
    zonecfg:myzone:dedicated-cpu> set ncpus=1-3
    zonecfg:myzone:dedicated-cpu> end
    zonecfg:myzone> exit

After using that command, when that Container boots, Solaris:
Also, three existing project resource controls were applied to Containers:
    global# zonecfg -z myzone
    zonecfg:myzone> set max-shm-memory=100m
    zonecfg:myzone> set max-shm-ids=100
    zonecfg:myzone> set max-msg-ids=100
    zonecfg:myzone> set max-sem-ids=100
    zonecfg:myzone> exit

Fair Share Scheduler
A commonly used method to prevent "CPU hogs" from impacting other workloads is to assign a number of CPU shares to each workload, or to each zone. The relative number of shares assigned per zone guarantees a relative minimum amount of CPU power. This is less wasteful than dedicating a CPU to a Container that will not completely utilize the dedicated CPU(s).
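To make the share arithmetic concrete, here is a minimal sketch. The zone names and share counts are hypothetical, and the percentages are only the minimum each zone is guaranteed while every zone is competing for the same CPUs; capacity left idle by one zone remains available to the others.

```shell
# share_pct SHARES TOTAL_SHARES
# Prints the guaranteed minimum CPU percentage (integer, rounded down).
share_pct() {
    echo $(( $1 * 100 / $2 ))
}

# Hypothetical zones and share counts, written as name:shares.
total=$(( 100 + 50 + 50 ))
for entry in myzone:100 webzone:50 dbzone:50; do
    zone=${entry%%:*}      # text before the colon
    shares=${entry##*:}    # text after the colon
    echo "$zone: at least $(share_pct "$shares" "$total")% of CPU under full contention"
done
```

Only the ratio matters: giving the three zones 2, 1 and 1 shares would produce the same guarantees as 100, 50 and 50.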
Several steps were needed to configure this in the past. Solaris 10 8/07 simplifies this greatly: now just two steps are needed. First, the system must use FSS as the default scheduler. This command makes FSS the default scheduler the next time the system boots:
    global# dispadmin -d FSS

Also, the Container must be assigned some shares:
    global# zonecfg -z myzone
    zonecfg:myzone> set cpu-shares=100
    zonecfg:myzone> exit

Shared Memory Accounting
One feature simplification is not a reduced number of commands, but reduced complexity in resource monitoring. Prior to Solaris 10 8/07, the accounting of shared memory pages had an unfortunate subtlety. If two processes in a Container shared some memory, per-Container summaries counted the shared memory usage once for every process that was sharing the memory. It would appear that a Container was using more memory than it really was.
This was changed in 8/07. Now, in the per-Container usage section of prstat and similar tools, shared memory pages are only counted once per Container.
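A simplified model of the difference, with made-up numbers: two processes in one Container, each with 40 MB of private memory, both mapping the same 100 MB shared segment. The function names are illustrative and this is not how prstat computes its figures internally; it only shows why the old summary overstated usage.

```shell
# Arguments for both functions: NPROCS PRIVATE_MB SHARED_MB

# Old behavior: the shared segment is counted once per process.
zone_usage_old() { echo $(( $1 * ($2 + $3) )); }

# 8/07 behavior: the shared segment is counted once per Container.
zone_usage_new() { echo $(( $1 * $2 + $3 )); }

echo "pre-8/07 per-Container total: $(zone_usage_old 2 40 100) MB"
echo "8/07 per-Container total:     $(zone_usage_new 2 40 100) MB"
```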
    global# zonecfg -z global
    zonecfg:global> set cpu-shares=100
    zonecfg:global> set scheduling-class=FSS
    zonecfg:global> exit

Use those features with caution. For example, assigning a physical memory cap of 100MB to the global zone will surely cause problems...
| Argument or Option | Meaning |
|---|---|
| -s | Boot to the single-user milestone |
| -m <milestone> | Boot to the specified milestone |
| -i </path/to/init> | Boot the specified program as 'init'. This is only useful with branded zones. |
Allowed syntaxes include:
    global# zoneadm -z myzone boot -- -s
    global# zoneadm -z yourzone reboot -- -i /sbin/myinit
    ozone# reboot -- -m verbose

In addition, these boot arguments can be stored with zonecfg, for later boots.
    global# zonecfg -z myzone
    zonecfg:myzone> set bootargs="-m verbose"
    zonecfg:myzone> exit
Also, the privilege proc_priocntl can be added to a Container to enable the root user of that Container to change the scheduling class of its processes.
This also allows a Container to control its own network configuration, including routing, IP Filter, the ability to be a DHCP client, and others. The syntax is simple:
    global# zonecfg -z myzone
    zonecfg:myzone> set ip-type=exclusive
    zonecfg:myzone> add net
    zonecfg:myzone:net> set physical=bge1
    zonecfg:myzone:net> end
    zonecfg:myzone> exit
The latter ability requires more explanation. An existing challenge in the maintenance of zones is patching - each zone must be patched when a patch is applied. If the patch must be applied while the system is down, the downtime can be significant.
Fortunately, Live Upgrade can create an Alternate Boot Environment (ABE) and the ABE can be patched while the Original Boot Environment (OBE) is still running its Containers and their applications. After the patches have been applied, the system can be re-booted into the ABE. Downtime is limited to the time it takes to re-boot the system.
An additional benefit can be seen if there is a problem with the patch and that particular application environment. Instead of backing out the patch, the system can be re-booted into the OBE while the problem is investigated.
Solaris 10 8/07 contains a new framework called Branded Zones. This framework enables the creation and installation of Containers that are not the default 'native' type of Containers, but have been tailored to run 'non-native' applications.
This was only a brief introduction to these many new and improved features. Details are available in the usual places, including http://docs.sun.com, http://sun.com/bigadmin, and http://www.sun.com/software/solaris/utilization.jsp.
Tuesday Jul 24, 2007
Solaris Containers for Linux Applications is the first implementation in the BrandZ framework, and runs Red Hat and CentOS applications. In addition, non-Sun members of the OpenSolaris community have begun contributing new code to BrandZ. Albert Lee has demonstrated the ability to create a Debian zone.
Also, Wei Shen is leading an effort to enable 64-bit apps to run in Linux-branded Containers.
Thursday Jul 12, 2007
Solaris Containers (aka Zones) is a virtualization tool that has other powerful, but less well known uses. These rely on a unique combination of features:
By default, Solaris Containers are more secure than general-purpose operating systems in many ways. For example, even the root user of a Container with a default configuration cannot modify the Container's operating system programs. That limitation prevents trojan horse attacks which replace those programs. Also, a process running in a Container cannot directly modify any kernel data, nor can it modify kernel modules like device drivers. Glenn Brunette created an excellent slide deck that describes the multiple layers of security in Solaris 10, of which Containers can be one layer.
Even considering that level of security, the ability to selectively remove Solaris privileges can be used to further tighten a zone's security boundary. In addition, the ability to disable network services prevents almost all network-based attacks. This is very difficult to accomplish in most operating systems without making the system unusable or unmanageable.
The combination of those abilities and the resource controls that are part of Containers' functionality enables you to configure an application environment that can do little more than fulfill the role you choose for it.
This blog entry describes a method that can be used to slightly expand a Container's abilities, and then tighten the security boundary snugly around the Container's intended application.
Imagine that you want to run an application on a Solaris system, but the workload(s) running on this system should not be directly attached to the Internet. Further, imagine that the application needs an accurate sense of time. Today this can be done by properly configuring a firewall to allow the use of an NTP client. But now there's another way... (If this concept sounds familiar, it is because this idea has been mentioned before here and here.)
To achieve the same goal without a firewall, you could use two Solaris "virtual environments" (zones): one that has "normal" behavior, for the application, and one that has the ability to change the system's clock, but has been made extremely secure by meeting the following requirements:
Any zone can be configured to have access to one or more network ports (NICs). Further, OpenSolaris build 57 and newer builds, and the next update to Solaris 10, enable a zone to have exclusive access to a NIC, further isolating the network activity of different zones. This feature is called IP Instances and will be mentioned again a bit later.

A zone has its own SMF (Service Management Facility) instance. Most of the services managed by SMF can be disabled if you are limiting the abilities of a zone. The zone that will manage the time clock can be configured so that it does not respond to any network connection requests by disabling all non-essential services.

Also, Solaris Configurable Privileges enables us to remove unnecessary privileges from the zone, and add the one non-default privilege it needs: sys_time. That privilege is needed in order to use the stime(2) system call.
Here is the configuration for the zone when I initially created it:
    global# zonecfg -z timelord
    zonecfg:timelord> create
    zonecfg:timelord> set zonepath=/zones/roots/timelord
    zonecfg:timelord> exit
After the zone has been booted and halted once, disabling services in Solaris is easy - the svcadm(1M) command does that. Through experimentation I found that this script disabled all of the network services - and some non-network services, too - but left enough services running that the zone would boot and NTP client software would run. Note that this is less important starting with Solaris 10 11/06: new installations of Solaris 10 will offer the choice to install "Secure By Default" with almost all network services turned off.
To use that script, I booted the zone and logged into it from the global zone - something you can do with zlogin(1) even if the zone does not have access to a NIC. Then I copied the script from the global zone into the non-global zone. A secure method to do this is: as the root user of the global zone, create a directory in <zonepath>/root/tmp, change its permissions to prevent access by any user other than root, and then copy the script into that directory. All of that allowed the script to be run by the root user of the non-global zone. Those steps can be accomplished with these commands:
    global# mkdir /zones/roots/timelord/root/tmp/ntpscript
    global# chmod 700 /zones/roots/timelord/root/tmp/ntpscript
    global# cp ntp-disable-services /zones/roots/timelord/root/tmp/ntpscript
    global# zlogin timelord
    timelord# chmod 700 /tmp/ntpscript/ntp-disable-services
    timelord# /tmp/ntpscript/ntp-disable-services
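The global-zone half of those steps could be wrapped in a small helper, sketched below. The function name stage_script and the directory name ntpscript are illustrative choices, not part of Solaris. The sketch assumes it is run as root in the global zone and that <zonepath>/root/tmp already exists.

```shell
# stage_script ZONEPATH SCRIPT
# Copy SCRIPT into a root-only directory under ZONEPATH/root/tmp so that
# only the root user of the zone can read or run it.
stage_script() {
    zonepath=$1
    script=$2
    dest="$zonepath/root/tmp/ntpscript"

    # Create the directory with a restrictive umask so that it is never
    # briefly accessible to other users, then set the mode explicitly.
    ( umask 077 && mkdir -p "$dest" ) || return 1
    chmod 700 "$dest" || return 1

    cp "$script" "$dest/" || return 1
    chmod 700 "$dest/$(basename "$script")"
}

# Example call (paths taken from the text above):
#   stage_script /zones/roots/timelord ntp-disable-services
```

Creating the directory under umask 077 is a small hardening step beyond the mkdir-then-chmod sequence shown above; the end result is the same.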
Now we have a zone that only starts the services needed to boot the zone and run NTP. Incidentally, many other commands will still work, but they don't need any additional privileges.
The next step is to gather the minimum list of Solaris privileges needed by the reduced set of services. Fortunately, a tool has been developed that helps you determine the minimum necessary set of privileges: privdebug.
Here is a sample use of privdebug, which was started just before booting the zone, and stopped after the zone finished booting:
    global# ./privdebug.pl -z timelord
    STAT PRIV
    USED sys_mount
    USED sys_mount
    USED sys_mount
    USED sys_mount
    USED sys_mount
    USED proc_exec
    USED proc_fork
    USED proc_exec
    USED proc_exec
    USED proc_fork
    USED contract_event
    USED contract_event
    <many lines deleted>
    ^C
    global#

Running that output through sort(1) and uniq(1) summarizes the list of privileges needed to boot the zone and our minimal Solaris services. Limiting a zone to a small set of privileges requires using the zonecfg command:
    global# zonecfg -z timelord
    zonecfg:timelord> set limitpriv=file_chown,file_dac_read,file_dac_write,file_owner,proc_exec,proc_fork,proc_info,proc_session,proc_setid,proc_taskid,sys_admin,sys_mount,sys_resource
    zonecfg:timelord> exit

At this point the zone is configured without unnecessary privileges and without network services. Next we must discover the privileges needed to run our application. Our first attempt to run the application may succeed. If that happens, there is no need to change the list of privileges that the zone has. If the attempt fails, we can determine the missing privilege(s) with privdebug.
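The sort(1) and uniq(1) summarization step mentioned earlier could look something like the sketch below. It assumes the two-column "STAT PRIV" layout shown in the privdebug sample; the function name summarize_privs is made up for illustration, and `sort -u` stands in for `sort | uniq`.

```shell
# summarize_privs
# Read privdebug-style lines on stdin and print the distinct privilege names
# as one comma-separated list, ready to paste after "set limitpriv=".
summarize_privs() {
    # Keep only USED/NEED rows (this skips the "STAT PRIV" header),
    # take the privilege column, de-duplicate, and join with commas.
    awk '$1 == "USED" || $1 == "NEED" { print $2 }' | sort -u | paste -s -d, -
}

# A few lines from the sample output, fed in by hand:
printf '%s\n' 'STAT PRIV' 'USED sys_mount' 'USED sys_mount' \
    'USED proc_exec' 'USED proc_fork' 'USED proc_exec' \
    'USED contract_event' | summarize_privs
```

In practice the input would be a saved copy of the privdebug output rather than a printf. The resulting list still needs review before use, since a boot-time trace only shows the privileges that were exercised during that particular run.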
For this example I will use ntpdate(1M) to synchronize the system's time clock with time servers on the Internet. In order for ntpdate to run, it needs network access, which must be enabled with zonecfg. When adding a network port, I increased zone isolation with a new feature in OpenSolaris called IP Instances. Use of this feature is not required, but it does improve network isolation and network configuration flexibility. You can choose to ignore this feature if you are using a version of Solaris 10 which does not offer it, or if you do not want to dedicate a NIC to this purpose.
To use IP Instances, I added the following parameters via zonecfg:
    global# zonecfg -z timelord
    zonecfg:timelord> set ip-type=exclusive
    zonecfg:timelord> add net
    zonecfg:timelord:net> set physical=bge1
    zonecfg:timelord:net> end
    zonecfg:timelord>
    zonecfg:timelord> exit
    global#

Setting ip-type=exclusive quietly adds the net_rawaccess privilege and the new sys_ip_config privilege to the zone's limit set. This happens whenever the zone boots. These privileges are required in exclusive-IP zones.
We can assign a static address to the zone with the usual methods of configuring IP addresses on Solaris systems. For example, you could boot the zone, login to it, and enter the following command:
    timelord# echo "192.168.1.11/24" > /etc/hostname.bge1

However, because the root user of the global zone can access any of the zone's files, you can do the same thing without booting the zone by using this command instead:
    global# echo "192.168.1.11/24" > /zones/roots/timelord/root/etc/hostname.bge1
With network access in place, we can discover the list of privileges necessary to run the NTP client. First boot the zone:
    global# zoneadm -z timelord boot

After the zone boots, in one window run the privdebug script, and then in another window run the NTP client in the NTP zone:
First window:

    global# ./privdebug.pl -z timelord
    STAT PRIV
    USED proc_fork
    USED proc_exec
    USED proc_fork
    USED proc_exec
    NEED sys_time
    ^C
    global#

Second window:

    global# zlogin timelord
    timelord# ntpdate -u <list of NTP server IP addresses>
    16 May 13:12:27 ntpdate[24560]: Can't adjust the time of day: Not owner
    timelord#
That output shows us that the privilege 'sys_time' is the only additional one needed to enable the zone to set the system time clock using ntpdate(1M).
Again we use zonecfg to modify the zone's privileges:
    global# zonecfg -z timelord
    zonecfg:timelord> set limitpriv=file_chown,file_dac_read,file_dac_write,file_owner,proc_exec,proc_fork,proc_info,proc_session,proc_setid,proc_taskid,sys_admin,sys_mount,sys_resource,sys_time
    zonecfg:timelord> exit
While isolating the zone, why not also limit the amount of resources that it can consume? If the zone is operating normally the use of resource management features is unnecessary, but they are easy to configure and their use in this situation could be valuable. These limits could reduce or eliminate the effects of a hypothetical bug in ntpdate which might cause a memory leak or other unnecessary use of resources.
Capping the amount of resources which can be consumed by the zone is also another layer of security in this environment. Resource constraints can reduce or eliminate risks associated with a denial of service attack. Note that the use of these features is not necessary. Their use is shown for completeness, to demonstrate what is possible.
A few quick tests with rcapstat(1) showed that the zone needed less than 50MB of memory to do its job. A cap on locked memory further minimized the zone's abilities without causing a problem for NTP. As with IP Instances, these features are available in OpenSolaris and will be in the next update to Solaris 10.
    global# zonecfg -z timelord
    zonecfg:timelord> add capped-memory
    zonecfg:timelord:capped-memory> set physical=50m
    zonecfg:timelord:capped-memory> set swap=50m
    zonecfg:timelord:capped-memory> set locked=20m
    zonecfg:timelord:capped-memory> end
    zonecfg:timelord> set scheduling-class=FSS
    zonecfg:timelord> set cpu-shares=1
    zonecfg:timelord> set max-lwps=200
    zonecfg:timelord> exit
    global#
Assigning one share to the zone prevents the zone from using too much CPU power and impacting other workloads. It also guarantees that other workloads will not prevent this zone from getting access to the CPU. Capping the number of threads (lwps) limits the ability to use up a fixed resource: process table slots. That limit is probably not necessary given the strict memory caps, but it can't hurt.
Now that we have 'shrink-wrapped' the security boundary even more tightly than the default, we're ready to use this zone.
    global# zoneadm -z timelord boot
    global# zlogin timelord
    timelord# ntpdate
    16 May 14:40:35 ntpdate[25070]: adjust time server offset -0.394755 sec

The output of ntpdate shows that it was able to contact an NTP server and adjust this system's time clock by almost 0.4 seconds.
Experience with Solaris privileges can allow you to further tighten the security boundary. For example, if you want to prevent the zone from changing its own host name, you could remove the sys_admin privilege from the zone's limit set. Doing so, and then rebooting the zone, would allow you to demonstrate this:
    timelord# hostname drwho
    hostname: error in setting name: Not owner
    timelord#

What privilege is needed to use the hostname(1M) command?
    timelord# ppriv -e -D hostname drwho
    hostname[4231]: missing privilege "sys_admin" (euid = 0, syscall = 139) needed at systeminfo+0x139
    hostname: error in setting name: Not owner
Before disabling services, I ran "netstat -a" on another zone which had just been created. It showed a list of 13 ports to which services were listening, including ssh and sunrpc services. After hardening the zone 'timelord' by disabling unneeded services, "netstat -a" doesn't show any open ports.
In order to further evaluate the security of the configuration described above, Nessus was used to evaluate the possible attack vectors. It did not find any security weaknesses.
What else can be secured using this method? Typical Unix services like sendmail and applications like databases are ideal candidates. What application do you want to secure?
Thanks to Glenn Brunette for assistance with security techniques and to Bob Bownes for providing a test platform and assistance with Nessus.
Thursday May 10, 2007
No, not that DOS.
I'm referring to Denial-of-Service.
A team at Clarkson University including a professor and several students recently performed some interesting experiments. They wanted to determine how server virtualization solutions handled a guest VM which performed a denial-of-service attack on the whole system. This knowledge could be useful when virtualizing guests that you don't trust. It gives you a chance to put away the good silver.
They tested VMware Workstation, Xen, OpenVZ, and Solaris Containers. (It's a shame that they didn't test VMware ESX. VMware Workstation and ESX are very different technologies. Therefore, it is not safe to assume that the paper's conclusions regarding VMware Workstation apply to ESX.) After reading the paper, my conclusion for Solaris Containers is "they have non-default resource management controls to contain DoS attacks, and it's important to enable those controls."
Fortunately, with the next update to Solaris 10 (due this summer) those controls are much easier to use. For example, the configuration parameters used in the paper, and shown below, limit a Container's use of physical memory, virtual memory, and amount of physical memory which can be locked so that it doesn't get paged out:
    add capped-memory
    set physical=128M
    set swap=512M
    set locked=64M
    end

Further, the following parameters limit the number of execution threads that the Container can use, turn on the fair-share scheduler and assign a quantity of shares for this Container:
    set max-lwps=175
    set scheduling-class=FSS
    set cpu-shares=10

All of those parameters are set using the zonecfg(1M) command. One benefit of the centralization of these control parameters is that they move with a Container when it is moved to another system.
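One way to apply all of those settings in a single step is to collect them in a command file and hand it to zonecfg with its -f option. The sketch below only writes the file and prints the command that would apply it; the file location and the zone name myzone are assumptions for illustration.

```shell
# write_caps_file FILE
# Write the resource-control settings quoted from the paper as a
# zonecfg command file.
write_caps_file() {
    cat > "$1" <<'EOF'
add capped-memory
set physical=128M
set swap=512M
set locked=64M
end
set max-lwps=175
set scheduling-class=FSS
set cpu-shares=10
EOF
}

# Hypothetical location for the command file.
cmdfile=${TMPDIR:-/tmp}/zone-caps.cfg
write_caps_file "$cmdfile"

# On a Solaris host, the platform administrator could then run:
echo "zonecfg -z myzone -f $cmdfile"
```

Keeping the settings in a file also makes it easy to apply the same limits to several Containers.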
I partly disagree with the authors' statement that these controls are complex to configure. The syntax is simple - and a significant improvement over previous versions - and an experienced Unix admin can determine appropriate values for them without too much effort. Also, a GUI is available for those who don't like commands: the Solaris Container Manager. On the other hand, managing these controls does require Solaris administration experience, and there are no default values. It is important to use these features in order to protect well-behaved workloads from misbehaving workloads.
It also is a shame that the hardware used for the tests was a desktop computer with limited physical resources. For example it had only one processor. Because multi-core processors are becoming the norm, it would be valuable to perform the same tests on a multi-core system. The virtualization software would be stressed in ways which were not demonstrated. I suspect that Containers would handle that situation very well, for two reasons:
Also, the test system did not have multiple NICs. The version of Solaris that was used includes a new feature called IP Instances. This feature allows a Container to be given exclusive access to a particular NIC. No process outside that Container can access that NIC. Of course, multiple NICs are required to use that feature...
The paper Quantifying the Performance Isolation Properties of Virtualization Systems will be delivered at the ACM's Workshop on Experimental Computer Science.
Wednesday Apr 11, 2007
Sometimes I wonder too much. This is one of those times.
Solaris 10 11/06 introduced the ability to migrate a Solaris non-global zone from one computer to another. Use of this feature is supported for two computers which have the same CPU type and substantially similar package sets and patch levels. (Note that non-global zones are also called simply 'zones' or, more officially, 'Solaris Containers.')
But I wondered... what would happen if you migrated a zone from a SPARC system to an x86/x64 system (or vice versa)? Would it work?
Theoretically, it depends on how hardware-dependent Solaris is. With a few minor exceptions, Solaris has only one source-code base, compiled for each hardware architecture on which it runs. The exceptions are things like device drivers and other components which operate hardware directly. But none of those components are part of a zone... (eerie foreshadowing music plays in the background)
Of course, programs compiled for one architecture won't run on another one. If a zone contains a binary program and you move the zone to a computer with a different CPU type, that program will not run. I wondered: do zones include binary programs?
The answer to that question is "it depends." A sparse-root zone, which is the default type, does not include binary programs except for a few in /etc/fs, /etc/lp/alerts and /etc/security/lib which are no longer used and didn't belong there in the first place. In fact, when a zone is not running, it is just a bunch of configuration files of these types:
In addition, when a zone is booted, a few loopback mounts (see lofs(7FS)) are created from the global zone into the non-global zone. They include directories like /usr and /sbin - the directories that actually contain the Solaris programs. Those loopback mounts make all of the operating system's programs available to a zone's processes when the zone is running.
Although the information about the mount points moves with a migrating (sparse-root) zone, the contents of those mount points don't move with the zone... (there's that music again)
On the other hand, a whole-root zone contains its own copy of almost all files in a Solaris instance, including all of the binary programs. Because of that, a whole-root zone cannot be moved to a system with a different CPU type.
To test the migration of a sparse-root zone across CPU types, I created a zone on a SPARC system and used the steps shown in my "How to Move a Container" guide to move it to an x86 system. Note that in step 2 of the section "Move the Container", pax(1) is used to ensure that there are no endian issues.
The original zone had this configuration on an Ultra 30 workstation:
    sparc-global# zonecfg -z bennu
    zonecfg:bennu> create
    zonecfg:bennu> set zonepath=/zones/roots/bennu
    zonecfg:bennu> add net
    zonecfg:bennu:net> set physical=hme0
    zonecfg:bennu:net> set address=192.168.0.31
    zonecfg:bennu:net> end
    zonecfg:bennu> exit
When configuring the new zone, you must specify any hardware differences. In my case, the NIC on the original system was hme0. On the destination system (a Toshiba Tecra M2) it was e1000g0. I chose to keep the same IP address for simplicity. After moving the archive file to the Tecra and unpacking it into /zones/roots/phoenix, it was time to configure the new zone. The zonecfg session for the new zone looked like this:
    x86-global# zonecfg -z phoenix
    zonecfg:phoenix> create -a /zones/roots/phoenix
    zonecfg:phoenix> select net physical=hme0
    zonecfg:phoenix:net> set physical=e1000g0
    zonecfg:phoenix:net> end
    zonecfg:phoenix> exit
By specifying the change in hardware, the appropriate actions are implemented by zoneadm when the zone boots.
The zoneadm(1M) command is used to attach a zone's detached files to their new computer as a new zone. When used to attach a zone, the zoneadm(1M) command compares the zone's package and patch information - generated when detaching the zone - to the package and patch information of the new host for the zone. Unfortunately (for this situation) patch numbers are different for SPARC and x86 systems. As you might guess, attaching a zone which was first created on a SPARC system, to an x86 system, caused zoneadm to emit numerous complaints, including:
These packages installed on the source system are inconsistent with this system:
(SPARC-specific packages)
...
These packages installed on this system were not installed on the source system:
(x86-specific packages)
...
These patches installed on the source system are inconsistent with this system:
118367-04: not installed
(other SPARC-specific patches)
...
These patches installed on this system were not installed on the source system:
118668-10
(other x86-specific patches)
...
If zoneadm detects sufficient differences in packages and patches, it does not attach the zone. Fortunately, for situations like this, when you know what you are doing (or pretend that you do...) and are willing to create a possibly unsupported configuration, the 'attach' sub-command to zoneadm has its own -F flag. The use of that flag tells zoneadm to attach the zone even if there are package and/or patch inconsistencies.
After forcing the attachment, the zone boots correctly. It uses programs in the loopback-mounted file systems /usr, /lib and /sbin. Other applications could be loopback-mounted into /opt as long as that loopback mount is modified, if necessary, when the zone is attached to the new system.
My goals were:
Tuesday Apr 03, 2007
Here is Yet Another Creative Use of Zones:
Overcoming some obstacles that developers face when using Solaris Containers (aka Zones), Doug Scott documented a method of building a zone which will never be patched from the global zone. In other words, when a patch is applied to the global zone, it will not be applied to a zone built using this method, even if the patch is for a package which is marked ALLZONES=true.
Normally, a package with that parameter setting will require that the package be installed in all zones, and patched consistently in all zones. Branded zones, also called 'non-native zones,' are exempt from that rule. Branded zones allow you to create a zone which will run applications meant for another operating system or operating system version. The first official brand is 'lx'. An lx-branded zone can run most Linux applications.
Note that this method would not be supported by Sun for the following reasons:
Also, note that a zone built like that will no longer benefit from one of the key advantages of zones: management simplicity. You must figure out which patches must be applied to a cbe-branded zone.
However, if those don't bother you, or if you want to learn more about how zones really work, take a look: http://www.opensolaris.org/os/project/xfce/building_xfce/brandzbuild/
Thursday Mar 29, 2007
Did you know that Solaris Containers has the largest HCL of any server virtualization solution?
Here are three examples:
Is that metric relevant? Many factors should affect your virtualization choice. One of them is hardware choice: "does my choice of server virtualization technology limit my choice of hardware platform?"
The data points above show sufficient choice in commodity hardware for most people, but Containers maximizes your choice, and only Containers is supported on multiple hardware architectures.
Thursday Mar 22, 2007
Two previous blogs described my quest to create and boot 500 zones on one system as efficiently as possible, given my hardware constraints. But my original goal was testing the sanity of the limit of 8,191 zones per Solaris instance. Is the limit too low, or absurdly high? Running 500 zones on a sufficiently large system seemed reasonable if the application load was sufficiently small per zone. How about 1,000 zones?
Modifying my scripts to create the 501st through 1,000th zones was simple enough. The creation of 500 zones went very smoothly. Booting 1,000 zones seemed too easy...until somewhere in the 600's. Further zones didn't boot, or booted into administrative mode.
Several possible obstacles occurred to me, but a quick check of Richard and Jim's new Solaris Internals edition helped me find the maximum number of processes currently allowed on the system. The value was a bit over 16,000. And those 600+ zones were using them all up. A short entry in the global zone's /etc/system file increased the maximum number of processes to 25,000:
set max_nprocs=25000
Unfettered by a limit on the number of concurrent processes, I re-booted all the zones. More than 900 booted, but the same behavior returned: many zones did not boot properly. The running zones were not using all 25,000 PID slots. To re-diagnose the problem I first verified that I could create 25,000 processes with a "limited fork bomb." I was temporarily stumped until a conversation I had with some students in my LISA'06 class "Managing Resources with Solaris 10 Containers." One of them had experienced a problem on a very large Sun computer that was running hundreds of applications, though they weren't using Containers.
They found that they were being limited by the amount of software thread (LWP) stack space in the kernel. LWP stack pages are one of the portions of kernel memory that are pageable. Space for pageable kernel memory is allocated when the system boots and cannot be re-sized while the kernel is running.
The default size depends on the hardware architecture. For 64-bit x86 systems the default is 2GB. The kernel tunable which controls this is segkpsize, which represents the number of kernel memory pages that are pageable. When these pages are all in use, new LWPs (threads) cannot be created.
With over 900 zones running, prstat(1M) showed over 77,000 LWPs in use. To test my guess that segkpsize was limiting my ability to boot 1,000 zones, I added the following line to /etc/system and re-booted:
    set segkpsize=1048576

This doubles the amount of pageable kernel memory to 4GB on AMD64 systems. With that, booting my 1,000 zones was boring, as it should be.
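The arithmetic behind that value can be checked with a few lines of shell. This sketch assumes the 4 KB base page size of x86/AMD64 systems; the function name is illustrative, and the default page count is derived from the 2GB default quoted above rather than read from a running kernel.

```shell
# pages_to_gib PAGES PAGE_SIZE_BYTES
# Convert a segkpsize-style page count into whole gigabytes.
pages_to_gib() {
    echo $(( $1 * $2 / 1073741824 ))
}

page_size=4096    # assumed base page size on x86/AMD64

echo "default: $(pages_to_gib 524288 "$page_size") GB"
echo "tuned:   $(pages_to_gib 1048576 "$page_size") GB"
```

On a different architecture or page size, the same page count would correspond to a different amount of memory, so the value should be recomputed rather than copied.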
Final statistics for 1,000 running zones included:
Conclusions: