Marvell Yukon ethernet and xVM
When xVM was integrated into Nevada in build 75, I immediately tried it on my laptop (a Toshiba M3), only to find out that it didn't work because xVM requires GLD version 3 network drivers. My particular type of M3 unfortunately has a Marvell Yukon gigabit ethernet adapter, and the skge driver from SysKonnect that I had been using for the past years is not a GLD v3 driver.
While looking for a more recent version of skge (hoping for GLD v3 support in skge), I came across the myk driver written by Masayuki Murayama. This driver can be compiled as a GLD v3 driver and a quick look in the install script showed that the PCI ID of my Yukon chip was supported by myk. To compile a GLD v3 version of the driver you'll need the driver sources and a recent copy of the ON sources (for the required GLD header files).
$ gzcat myk-2.5.0.tar.gz | tar xf -
$ cd myk-2.5.0
$ rm Makefile.config
$ ln -s Makefile.config_gld3 Makefile.config
$ vi Makefile.config
Change -I to point to where you keep the ON sources:
#
# Common configuration infomations for all platforms
#
DRV = myk
include version
DFLAGS = -DDEBUG -DDEBUG_LEVEL=0 -DGEM_DEBUG_LEVEL=0 \
-DTX_BUF_SIZE=256 -DRX_BUF_SIZE=256 \
-I /export/home/ml93401/ws/onnv-gate/usr/src/uts/common change appropriately
#
CFGFLAGS = -UCONFIG_OO \
-DGEM_CONFIG_POLLING \
-DGEM_CONFIG_VLAN -UCONFIG_HW_VLAN \
-DGEM_CONFIG_CKSUM_OFFLOAD \
-DGEM_CONFIG_GLDv3 -DSOLARIS10
LDFLAGS += -dy -N misc/mac -N drv/ip
Build and install the driver:
$ make
$ su
# ./adddrv.sh
After a reboot we should have a GLD v3 driver usable for xVM (if the type is not 'legacy' it is a GLD v3 driver):
# dladm show-link
myk0 type: non-vlan mtu: 1500 device: myk0
iwi0 type: non-vlan mtu: 1500 device: iwi0
Creating a DomU now works fine (as expected). If only I had more memory in this laptop...
opensolaris
( Dec 24 2007, 02:12:10 PM CET )
Running recent Nevada builds on a VIA EPIA system
At home I use a VIA EPIA system to serve NFS for my other systems. Since I didn't want another noisy system near my desk, I chose a fanless 600 MHz VIA C3 motherboard. This system has been happily running Nevada (and Solaris 10 before that) for quite some time. As it was running an ancient build (snv_48), I decided to upgrade it to some more recent bits.
While trying to install build 71 some weeks ago, I ran into 6591195 (segvn_init() may return before checking HAT_SHARED_REGIONS support), where the system panicked with the message "No shared region support on x86". Luckily the fix for that went into snv_72, so when snv_73 became available I had another go. This time the system got a little further before it fell over:
init(1M) exited on fatal signal 9: restarting automatically
init(1M) exited on fatal signal 9: restarting automatically
init(1M) exited on fatal signal 9: restarting automatically
Some searching turned up 6572151 (snv boot failure since snv_66), which is a dup of 6332924 (snv_24 /usr/ccs/bin/as adds new HWCAP tags to previously untagged object). The problem is that libc.so is now built with HWCAP tags that specify that SSE is required, while it is not required per se (if SSE is available it will be used, otherwise it will not be used).
However, since the tag says that SSE is required, a system without SSE support will no longer work. The first consumer of libc.so after boot is /sbin/init. It will fail because libc.so requires SSE, die a horrible death and get restarted (and restarted, and restarted, ...). Running isainfo on my EPIA system shows that it indeed does not have the SSE capability:
$ isainfo -v
32-bit i386 applications
ahf mmx cx8 tsc fpu
CR 6332924 is currently in the 'Fix in progress' state, so there is no build which includes the fix yet. There is a workaround though: elfedit(1). As the name suggests, elfedit can be used to edit ELF files, and that is what I did. By removing the SSE HWCAP tag from libc.so, I was able to get my EPIA to work with build 73.
Here is what I did to make it work: since we obviously can't easily update libc.so on the DVD, I used LiveUpgrade to create a second boot environment for build 73:
# lucreate -c nv_48 -n nv_73 -m /:/dev/dsk/c0d0s3:ufs
# luupgrade -n nv_73 -u -s /mnt      # DVD mounted on /mnt
Next, I mounted the build 73 BE to access the build 73 copy of /lib/libc.so
# lumount -n nv_73
elfedit is available from snv_75 onwards so I copied libc.so to another system running a recent nightly build to remove the SSE tag:
$ file /tmp/libc.so
/tmp/libc.so: ELF 32-bit LSB dynamic lib 80386 Version 1 [SSE CX8 FPU], dynamically linked, not stripped, no debugging information available
$ elfedit -e 'cap:hw1 -and -cmp sse' /tmp/libc.so
$ file /tmp/libc.so
/tmp/libc.so: ELF 32-bit LSB dynamic lib 80386 Version 1 [CX8 FPU], dynamically linked, not stripped, no debugging information available
I copied the modified libc.so back and activated the nv_73 boot environment:
# luumount nv_73
# luactivate nv_73
# init 6
Success!
(Ali Bahrami, I owe you a beer!)
opensolaris
( Sep 23 2007, 03:33:25 PM CEST )
Resource control observability using kstats
One of the things I sometimes miss when using resource controls is a simple way to see the current usage of a particular resource control by a project or zone. While finding out the limit for the rctl is no problem (for that we have prctl(1)), getting the actual usage requires work and implementation knowledge.
For instance, we could get the amount of System V shared memory used by a project using ipcs -Jam and some parsing of its output. Or fire up mdb(1) and look up the value for kpd_shmmax in the project's kproject_t struct. And, if we wanted to get the usage of another resource control (say the number of lwps), we'd need to use another tool (prstat -LJc) or know that the number of lwps is kept in the kpj_nlwps member. Hardly usable for more than the occasional peek. Moreover, relying on kernel implementation details such as these structure members is highly inadvisable as they may change in the future (they probably won't, but they are not stable interfaces so don't rely on them).
The addition of the swap and locked memory resource controls by PSARC 2006/598 Swap resource control; locked memory RM introduced a number of kstats for observability:
caps:{zoneid}:swapresv_zone_{zoneid}
caps:{zoneid}:lockedmem_zone_{zoneid}
caps:{zoneid}:lockedmem_project_{projid}
These kstats have a 'value' statistic for the current limit and a 'usage' statistic that holds the current usage:
$ kstat -c zone_caps -n swapresv_zone_0
module: caps instance: 0
name: swapresv_zone_0 class: zone_caps
crtime 0
snaptime 102512.50351337
usage 532168704
value 18446744073709551615
zonename global
Exposing these values as kstats gives us exactly what is needed: a simple, well-defined method to get the limit and usage for a resource control.
To satisfy my curiosity and to see what changes would be needed, I spent some evenings creating a prototype that adds kstats for all project.* and zone.* resource controls. The following extra kstats are available in the prototype:
caps:{projid}:contracts_project_{projid}
caps:{projid}:msgids_project_{projid}
caps:{zoneid}:msgids_zone_{zoneid}
caps:{projid}:nlwps_project_{projid}
caps:{zoneid}:nlwps_zone_{zoneid}
caps:{projid}:ntasks_project_{projid}
caps:{projid}:semids_project_{projid}
caps:{zoneid}:semids_zone_{zoneid}
caps:{projid}:shmids_project_{projid}
caps:{zoneid}:shmids_zone_{zoneid}
caps:{projid}:shmmem_project_{projid}
caps:{zoneid}:shmmem_zone_{zoneid}
Getting a list of the current usage of all resource controls is now as simple as typing:
$ kstat -p caps:::usage
caps:0:contracts_project_0:usage 33
caps:0:contracts_project_1:usage 2
caps:0:contracts_project_101:usage 0
caps:0:cryptomem_project_0:usage 0
...
caps:5:nlwps_project_0:usage 108
caps:5:nlwps_zone_5:usage 108
caps:5:ntasks_project_0:usage 15
caps:5:semids_project_0:usage 0
caps:5:semids_zone_5:usage 0
caps:5:shmids_project_0:usage 1
caps:5:shmids_zone_5:usage 1
caps:5:shmmem_project_0:usage 172032
caps:5:shmmem_zone_5:usage 172032
caps:5:swapresv_zone_5:usage 95178752
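Because kstat -p prints one "module:instance:name:statistic value" pair per line, picking out a single resource control takes only a few lines of shell. The following is a minimal sketch, not part of the prototype: caps_usage is a hypothetical helper name, the sample lines are copied from the output above, and the nlwps and shmmem kstats only exist with the prototype installed. On Solaris you may need nawk instead of awk for the -v option.

```shell
#!/bin/sh
# Print "<kstat name> <usage>" for every caps kstat whose name starts
# with the given prefix (for example "nlwps_"). Reads the output of
# `kstat -p caps:::usage` from stdin.
caps_usage() {
    awk -v prefix="$1" '
        {
            split($1, key, ":")              # key[3] is the kstat name
            if (index(key[3], prefix) == 1)
                print key[3], $2
        }'
}

# Sample lines copied from the listing above; on a live system you
# would run: kstat -p caps:::usage | caps_usage nlwps_
sample='caps:5:nlwps_project_0:usage 108
caps:5:nlwps_zone_5:usage 108
caps:5:ntasks_project_0:usage 15
caps:5:shmmem_zone_5:usage 172032'

printf '%s\n' "$sample" | caps_usage nlwps_
```

awk splits fields on any whitespace, so it does not matter whether the statistic and its value are separated by a tab or by spaces.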
And now that we have the numbers as kstats, we can use any tool to massage the numbers into a form that suits us. The screenshot below is from a hacked up version of one of the JKstat demo programs and shows a graph of the number of LWPs in all projects and zones during boot and shutdown of a Zone.
T: OpenSolaris Solaris ( Sep 12 2007, 09:29:14 PM CEST )
System V IPC resource controls for Zones
Some weeks ago, I put back my code for 6306668 (RFE: there need to be zone limits for project-based system V resource controls). The fix went into Nevada build 48, which is now available as Solaris Express 10/06.
Without such zone limits for project-based System V IPC resource controls, a non-global zone administrator could possibly starve other zones by consuming inordinate amounts of System V IPC resources. Particularly in cases where the non-global zone administrator cannot be trusted (either by malice or lack of knowledge and understanding of the impact of his actions) this can be an issue.
The existing zone.* resource controls have been extended with four new resource controls:
- zone.max-shm-memory - the total amount of shared memory allowed for a zone, expressed as a number of bytes.
- zone.max-shm-ids - the maximum number of shared memory IDs allowed for a zone, expressed as an integer.
- zone.max-sem-ids - the maximum number of semaphore IDs allowed for a zone, expressed as an integer.
- zone.max-msg-ids - the maximum number of message queue IDs allowed for a zone, expressed as an integer.
These resource controls give the global zone administrator the ability to limit the total consumption of System V IPC resources by processes in a zone. The non-global zone administrator is still able to control the allocation of System V IPC resources inside the zone using the existing project.* resource controls. So regardless of the limits that a non-global zone administrator sets on projects in the zone, the total amount of IPC resources used by the zone can never exceed the limit set by the global zone administrator.
Setting these resource controls is done in the usual way using zonecfg(1M):
$ zonecfg -z aap
zonecfg:aap> add rctl
zonecfg:aap:rctl> set name=zone.max-shm-memory
zonecfg:aap:rctl> add value (priv=privileged,limit=1073741824,action=deny)
zonecfg:aap:rctl> end
zonecfg:aap> exit
The limit will be in effect after booting the zone. Adding or changing one of these resource controls on a running zone without rebooting can be done using prctl(1).
One thing to note is that for compatibility reasons there are no default privileged limits on these resource controls, only a system limit. Having a default privileged limit could break existing configurations because up to now there was no limit at the zone level. Therefore, adding a limit to a running zone requires you to use the -t privileged option to add the privileged limit.
To add a 1 GB limit to a running zone you would use:
prctl -n zone.max-shm-memory -t privileged -v 1073741824 -i zone aap
Once the privileged limit is present, changing the limit to 2 GB would be done like this:
prctl -n zone.max-shm-memory -r -v 2147483648 -i zone aap
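Since zone.max-shm-memory is expressed in bytes, it is easy to get a digit wrong when typing the limit by hand. A small sketch of how the value could be computed instead; gb_to_bytes and shm_limit_cmd are hypothetical helper names, and the second function only prints the prctl command line rather than running it:

```shell
#!/bin/sh
# Convert a whole number of gigabytes (1 GB = 2^30 bytes) to bytes.
gb_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Print (do not run) the prctl command that replaces the privileged
# zone.max-shm-memory limit. $1 = zone name, $2 = new limit in GB.
shm_limit_cmd() {
    echo "prctl -n zone.max-shm-memory -r -v $(gb_to_bytes "$2") -i zone $1"
}

gb_to_bytes 1           # prints 1073741824
shm_limit_cmd aap 2     # prints the 2 GB prctl command used above
```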
T: OpenSolaris Zones
( Oct 25 2006, 11:40:10 AM CEST )
First Dutch OpenSolaris User Group meeting
The first meeting of the Dutch OpenSolaris User Group will be held next Thursday (October 26th) at the Sun office in Amersfoort. This meeting starts at 19:30 and will feature the following speakers:
- Bart Muijzer, local NLOSUG host, will introduce the NLOSUG, share some ideas, and try to get discussions going.
- Casper Dik, resident guru, and OpenSolaris Community Advisory Board (CAB) member, will introduce OpenSolaris and be on hand for questions.
- Darren Moffat, guest guru, will speak on OpenSolaris development in general and one project in particular: encryption for ZFS.
- Remco Fugers will introduce the Open Solaris Starter Kit.
More information is here.
See you there!
( Oct 19 2006, 06:58:43 PM CEST )
Faster zone provisioning using zoneadm clone
In a recent thread on zones-discuss@opensolaris.org about creating zones in parallel to reduce the time it takes to provision multiple zones, it was suggested that the new zoneadm clone subcommand could be of help. The zoneadm clone subcommand (available from build 33 onwards) copies an installed and configured zone. Cloning a zone is faster than installing a zone, but how much faster? To find out I did some quick experiments creating and cloning both whole root and sparse root zones on a V480:
Creating a whole root zone:
# zonecfg -z zone1
zone1: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:zone1> create -b
zonecfg:zone1> set zonepath=/zones/zone1
zonecfg:zone1> exit
# time zoneadm -z zone1 install
Preparing to install zone.
Creating list of files to copy from the global zone.
Copying <123834> files to the zone.
Initializing zone product registry.
Determining zone package initialization order.
Preparing to initialize <986> packages on the zone.
Initialized <986> packages on zone.
Zone is initialized.
Installation of these packages generated errors:
The file contains a log of the zone installation.
real 13m40.647s
user 2m49.840s
sys 4m43.221s
Cloning a whole root zone:
# zonecfg -z zone1 export|sed -e 's/zone1/zone2/'|zonecfg -z zone2
zone2: No such zone configured
Use 'create' to begin configuring a new zone.
# time zoneadm -z zone2 clone zone1
Cloning zonepath /zones/zone1...
real 8m4.615s
user 0m9.780s
sys 2m18.334s
For the whole root zone, cloning is about 1.7 times as fast as a regular install.
Creating a sparse root zone:
# zonecfg -z zone3
zone3: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:zone3> create
zonecfg:zone3> set zonepath=/zones/zone3
zonecfg:zone3> exit
# time zoneadm -z zone3 install
Preparing to install zone.
Creating list of files to copy from the global zone.
Copying <2535> files to the zone.
Initializing zone product registry.
Determining zone package initialization order.
Preparing to initialize <986> packages on the zone.
Initialized <986> packages on zone.
Zone is initialized.
Installation of these packages generated errors:
The file contains a log of the zone installation.
real 6m3.227s
user 1m45.902s
sys 2m47.717s
Cloning a sparse root zone:
# zonecfg -z zone3 export|sed -e 's/zone3/zone4/'|zonecfg -z zone4
zone4: No such zone configured
Use 'create' to begin configuring a new zone.
# time zoneadm -z zone4 clone zone3
Cloning zonepath /zones/zone3...
real 0m11.535s
user 0m0.706s
sys 0m6.440s
For the sparse root zone, cloning is more than thirty times faster than installing!
So if you need to provision multiple zones of a certain configuration, zoneadm clone is clearly the way to go.
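The speedup figures can be checked from the 'real' times reported by time. A minimal sketch, with to_seconds and speedup as hypothetical helper names and the four timings copied from the runs above:

```shell
#!/bin/sh
# Convert a time(1)-style value such as "13m40.647s" to seconds.
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Print how many times faster the second run was than the first.
# $1 = slower elapsed time, $2 = faster elapsed time.
speedup() {
    awk -v slow="$(to_seconds "$1")" -v fast="$(to_seconds "$2")" \
        'BEGIN { printf "%.1f\n", slow / fast }'
}

speedup 13m40.647s 8m4.615s     # whole root: install vs. clone
speedup 6m3.227s 0m11.535s      # sparse root: install vs. clone
```

With the timings above this prints 1.7 for the whole root zone and 31.5 for the sparse root zone.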
Note that the current clone operation does not (yet) take advantage of ZFS. To see what ZFS can do for zone cloning, have a look at Mike Gerdts' blog: Zone created in 0.922 seconds. Goodness indeed.
T: OpenSolaris Zones
( Mar 18 2006, 07:12:17 PM CET )
Monitoring zone boot and shutdown using DTrace
Several people have expressed a desire for a way to monitor zone state transitions such as zone boot or shutdown events. Currently there is no way to get notified when a zone is booted or shut down. One way would be to run zoneadm list -p at regular intervals and parse the output, but this has some drawbacks that make this solution less than ideal:
- it is inefficient because you are polling for events,
- you will probably start at least two processes for each polling cycle (zoneadm(1M) and nawk(1)),
- more importantly, you could miss transitions if your polling interval is too large. Since a zone reboot might take only seconds, you would need to poll often in order not to miss a state change.
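For comparison, the polling approach could be sketched roughly as follows. This is only an illustration: zone_states and report_changes are hypothetical helper names, the two snapshots are made-up sample data, and the sketch assumes the colon-separated zoneadm list -p format with the zone name in the second field and the state in the third. On Solaris you may need nawk instead of awk for the -v option.

```shell
#!/bin/sh
# Reduce `zoneadm list -cp` style output to sorted "name state" lines.
zone_states() {
    awk -F: '{ print $2, $3 }' | sort
}

# Compare two "name state" snapshots and report every zone whose state
# differs. $1 = previous snapshot, $2 = current snapshot.
report_changes() {
    printf '%s\n' "$2" | while read -r name state; do
        old=$(printf '%s\n' "$1" | awk -v n="$name" '$1 == n { print $2 }')
        if [ "$old" != "$state" ]; then
            echo "Zone $name: ${old:-unknown} -> $state"
        fi
    done
}

# Made-up sample snapshots; a real poller would capture
# `zoneadm list -cp | zone_states` every few seconds instead.
before=$(printf '%s\n' '0:global:running:/' '-:aap:installed:/zones/aap' | zone_states)
after=$(printf '%s\n' '0:global:running:/' '1:aap:running:/zones/aap' | zone_states)

report_changes "$before" "$after"
```

Even this small loop starts several processes per cycle, and a zone that reboots between two snapshots goes unnoticed, which are exactly the drawbacks listed above.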
A better, much more efficient solution can be built using DTrace, the 'Swiss Army knife of system observability'. As mentioned in this message on the DTrace forum, the zone_boot() function looks like a promising way to get notifications when a zone is booted. Listing all FBT probes with the string 'zone_' in their name (dtrace -l fbt|grep zone_) turns up another interesting function: zone_shutdown(). To verify that these probes are fired when a zone is either booted or shut down, let's enable both probes:
# dtrace -n 'fbt:genunix:zone_boot:entry, fbt:genunix:zone_shutdown:entry {}'
dtrace: description 'fbt:genunix:zone_boot:entry, fbt:genunix:zone_shutdown:entry ' matched 2 probes
When zoneadm -z zone1 boot is executed we see that the zone_boot:entry probe fires:
CPU ID FUNCTION:NAME
0 6722 zone_boot:entry
The zone_shutdown:entry probe fires when the zone is shut down (either by zoneadm -z zone1 halt or using init 0 from within the zone):
0 6726 zone_shutdown:entry
This gives us the basic 'plumbing' for the monitoring script. By instrumenting the zone_boot() and zone_shutdown() functions with the FBT provider we can wait for zone boot and shutdown with almost zero overhead. Now what is left is finding out the name of the zone that was booted or shut down. This requires some knowledge of the implementation and access to the source (anyone interested can take a look at the source after OpenSolaris is launched, so stay tuned).
A quick look at the source shows that we can get the zone name by instrumenting a third function, zone_find_all_by_id(), that is called by both zone_boot() and zone_shutdown(). This function returns a pointer to a zone_t structure (defined in /usr/include/sys/zone.h). The DTrace script below uses a common DTrace idiom: in the :entry probe we set a thread-local variable trace that is used as a predicate in the :return probes (the :return probes have the information we're after). The FBT provider :return probe stores the function return value in args[1] so we can access the zone name as args[1]->zone_name in fbt:genunix:zone_find_all_by_id:return and save it for later use in fbt:genunix:zone_boot:return and fbt:genunix:zone_shutdown:return.
#!/usr/sbin/dtrace -qs

self string name;

fbt:genunix:zone_boot:entry
{
    self->trace = 1;
}

fbt:genunix:zone_boot:return
/self->trace && args[1] == 0/
{
    printf("Zone %s booted\n", self->name);
    self->trace = 0;
    self->name = 0;
}

fbt:genunix:zone_shutdown:entry
{
    self->trace = 1;
}

fbt:genunix:zone_shutdown:return
/self->trace && args[1] == 0/
{
    printf("Zone %s shutdown\n", self->name);
    self->trace = 0;
    self->name = 0;
}

fbt:genunix:zone_find_all_by_id:return
/self->trace/
{
    self->name = stringof(args[1]->zone_name);
}
Starting the script and booting and shutting down some Zones gives the following result:
# ./zonemon.d
Zone aap booted
Zone noot booted
Zone noot shutdown
Zone noot booted
Zone aap shutdown
So there you have it, a simple DTrace script that will efficiently wait for zone boot and shutdown events. Enjoy.
Technorati Tag: Solaris
Technorati Tag: DTrace
( May 25 2005, 12:30:38 PM CEST )
Dynamic Zone changes
This subject has come up several times in the last two weeks, so it might be a good opportunity to finally start using my blog.
When talking to a colleague about Zones he said: 'I have been looking at Zones and while they are cool, they are also "static". To add an extra file system to a running zone I have to restart the zone.' Well, as it happens, this is not required. You can dynamically add a file system to a running zone. Here's how:
The current configuration of the running zone looks like this:
# zonecfg -z zone1 info
zonepath: /export/zones/zone1
autoboot: true
pool: large
inherit-pkg-dir:
dir: /lib
inherit-pkg-dir:
dir: /platform
inherit-pkg-dir:
dir: /sbin
inherit-pkg-dir:
dir: /usr
net:
address: 129.159.206.38/26
physical: hme0
rctl:
name: zone.cpu-shares
value: (priv=privileged,limit=10,action=none)
Adding a new UFS file system to this zone would entail the following: create a new file system in the global zone, add an fs resource to the zone configuration and restart the zone to re-read the configuration.
global # newfs /dev/md/rdsk/d100
newfs: construct a new file system /dev/md/rdsk/d100: (y/n)? y
Warning: 1280 sector(s) in last cylinder unallocated
/dev/md/rdsk/d100: 1024000 sectors in 712 cylinders of 15 tracks, 96 sectors
500.0MB in 45 cyl groups (16 c/g, 11.25MB/g, 5440 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
32, 23168, 46304, 69440, 92576, 115712, 138848, 161984, 185120, 208256,
806720, 829856, 852992, 876128, 899264, 922400, 945536, 968672, 991808,
1014944,
global # zonecfg -z zone1
zonecfg:zone1> add fs
zonecfg:zone1:fs> set dir=/u01
zonecfg:zone1:fs> set special=/dev/md/dsk/d100
zonecfg:zone1:fs> set raw=/dev/md/rdsk/d100
zonecfg:zone1:fs> set type=ufs
zonecfg:zone1:fs> end
zonecfg:zone1> exit
At this point we could reboot the zone and have the new file system mounted during zone boot. However, there is no need to restart the zone because the file system can be mounted into the running zone from the global zone. The only thing we have to do now is add the mountpoint in the zone ourselves:
global # mkdir /export/zones/zone1/root/u01
global # mount /dev/md/dsk/d100 /export/zones/zone1/root/u01
Note that there is an extra /root/ component in the path to the file system. Inside the zone we see that the new file system has appeared:
zone1 # df -h
Filesystem   size   used  avail capacity  Mounted on
/             15G   3.3G    11G    23%    /
/dev          15G   3.3G    11G    23%    /dev
/lib          15G   3.3G    11G    23%    /lib
/platform     15G   3.3G    11G    23%    /platform
/sbin         15G   3.3G    11G    23%    /sbin
/usr          15G   3.3G    11G    23%    /usr
proc           0K     0K     0K     0%    /proc
ctfs           0K     0K     0K     0%    /system/contract
swap         8.3G   264K   8.3G     1%    /etc/svc/volatile
mnttab         0K     0K     0K     0%    /etc/mnttab
fd             0K     0K     0K     0%    /dev/fd
swap         8.3G     0K   8.3G     0%    /tmp
swap         8.3G    32K   8.3G     1%    /var/run
/u01         469M   1.0M   421M     1%    /u01
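The extra /root/ component in the mount path is easy to forget. A tiny sketch of how the global-zone path is put together; zone_mount_point is a hypothetical helper name, and the zonepath and directory are the ones used in this example:

```shell
#!/bin/sh
# Build the global-zone path for a directory inside a zone.
# $1 = zonepath of the zone, $2 = absolute path as seen inside the zone.
zone_mount_point() {
    echo "$1/root$2"
}

zone_mount_point /export/zones/zone1 /u01   # prints /export/zones/zone1/root/u01
```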
But wait, there's more. The same 'magic' can be applied to add an extra network interface to a running zone. Instead of adding a net resource to the zone configuration and then rebooting the zone, we add the net resource to the zone configuration (to make the change persistent) and then use ifconfig(1M) from the global zone to add the network interface dynamically.
global # zonecfg -z zone1
zonecfg:zone1> add net
zonecfg:zone1:net> set physical=hme0
zonecfg:zone1:net> set address=192.168.1.13/24
zonecfg:zone1:net> end
zonecfg:zone1> exit
global # ifconfig hme0 addif 192.168.1.13 netmask + broadcast + zone zone1 up
Created new logical interface hme0:3
Setting netmask of hme0:3 to 255.255.255.0
The key point here is the 'zone' option of ifconfig. Running ifconfig -a inside the zone shows that we now have the extra network interface. And without having to reboot the zone!
zone1 # ifconfig -a
lo0:5: flags=2001000849 mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
hme0:2: flags=1000843 mtu 1500 index 2
        inet 129.159.206.38 netmask ffffffc0 broadcast 129.159.206.63
hme0:3: flags=1000843 mtu 1500 index 2
        inet 192.168.1.13 netmask ffffff00 broadcast 192.168.1.255
There are more things that can be changed dynamically such as resource controls and pool binding. I'll leave that for another blog entry.
So: Zones are cool and dynamic!
( Mar 10 2005, 09:30:50 AM CET )


