A million VMs?
"To complete the project, Sandia utilized its Albuquerque-based 4,480-node Dell high-performance computer cluster, known as Thunderbird. To arrive at the one million Linux kernel figure, Sandia's researchers ran one kernel in each of 250 VMs and coupled those with the 4,480 physical machines on Thunderbird."
4480 machines, with 250 VMs each. While we don't have 4480 machines available, I thought I'd go for 250 Solaris domUs (VM instances) on one Solaris dom0 (controlling domain), as a proof of concept. If my fellow nerds here in New Mexico can do it, so can I!
One of the machines in our test lab has 40G of memory and 4x4 CPU cores, so it sounded like a good dom0 candidate. I wanted to be safe with dom0 memory, so I pinned that at 4G. To play it safe, I also wanted a small domU. I picked (Open)Solaris build 105, simply because I knew that would run in 128M of memory, in 32bit mode (32bit mode was only used because it saves a bit of memory; 64bit required 16M more per domU). This would leave enough space to add a few more domUs, should things work out. I also decided not to configure networking in the domUs, simply because there weren't enough IP addresses available in the lab network. This is just a proof of concept after all.
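The memory budget can be checked with a bit of shell arithmetic. This is only a rough sketch using the figures above; it ignores hypervisor and per-domain overhead, so the real headroom will be somewhat smaller than what it prints.

```shell
#!/bin/sh
# Rough memory budget, in megabytes. All figures come from the text;
# Xen's own overhead is deliberately ignored.
total=`expr 40 \* 1024`     # 40G in the machine
dom0=`expr 4 \* 1024`       # dom0 pinned at 4G
domu=128                    # memory per 32bit domU
count=250                   # number of domUs

used=`expr $dom0 + $count \* $domu`
free=`expr $total - $used`
extra=`expr $free / $domu`
echo "used=${used}M free=${free}M extra_domUs=$extra"
# prints: used=36096M free=4864M extra_domUs=38
```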
How to set things up? I wanted to set this up quickly, so I preferred to have the backing storage on the local disk, not on NFS. A basic install of Solaris plus swap space can be done in 1.5G. I picked 2G as sufficient for an install.
The first issue was that no more than 400G of disk space was left on this system, so 250 * 2G = 500G of raw backing disk wasn't going to work. No problem, ZFS to the rescue. Obviously all domUs would be virtually identical, so cloned ZFS volumes are a perfect match.
After doing an install of a paravirtualized Solaris domU on a ZFS volume, I took a snapshot and then cloned it 250 times:
# zfs list -t snapshot
NAME USED AVAIL REFER MOUNTPOINT
rpool/ROOT/snv_115@bfu-116-configured 181M - 7.24G -
tank/disk0@init 0 - 973M -
#!/bin/sh
i=1
while [ $i -le 250 ]
do
    echo "cloning instance $i"
    zfs clone tank/disk0@init tank/disk$i
    i=`expr $i + 1`
done
exit 0
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
[...]
tank 22.7G 379G 25K /export
tank/disk0 2.95G 381G 973M -
tank/disk1 0 379G 973M -
tank/disk10 0 379G 973M -
tank/disk100 0 379G 973M -
tank/disk101 0 379G 973M -
tank/disk102 0 379G 973M -
tank/disk103 0 379G 973M -
tank/disk104 0 379G 973M -
tank/disk105 0 379G 973M -
tank/disk106 0 379G 973M -
tank/disk107 0 379G 973M -
tank/disk108 0 379G 973M -
tank/disk109 0 379G 973M -
tank/disk11 0 379G 973M -
tank/disk110 0 379G 973M -
tank/disk111 0 379G 973M -
tank/disk112 0 379G 973M -
tank/disk113 0 379G 973M -
tank/disk114 0 379G 973M -
tank/disk115 0 379G 973M -
tank/disk116 0 379G 973M -
tank/disk117 0 379G 973M -
tank/disk118 0 379G 973M -
tank/disk119 0 379G 973M -
[etc]
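The listing above skips how tank/disk0 and its snapshot were created. A sketch of the whole sequence might look like the following, assuming a 2G volume as chosen earlier. The `run` wrapper and the `DRYRUN` variable are illustrative helpers, not part of the original script; by default the script only prints the commands it would run.

```shell
#!/bin/sh
# Print commands by default; set DRYRUN=no to actually execute them.
DRYRUN=${DRYRUN:-yes}
run() {
    if [ "$DRYRUN" = "yes" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Master volume (2G is the size assumed from the text).
run zfs create -V 2G tank/disk0
# ... install the paravirtualized domU onto tank/disk0 here ...

# Freeze the installed image, then clone it once per guest.
run zfs snapshot tank/disk0@init
i=1
while [ $i -le 250 ]
do
    run zfs clone tank/disk0@init tank/disk$i
    i=`expr $i + 1`
done
```

Clones share all unmodified blocks with the snapshot, which is why each one shows 0 in the USED column right after creation.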
That was easy enough. Now to create all the domains, using this script:
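#!/bin/sh
# Create one domain per cloned volume from a generated SXP config.
i=1
while [ $i -le 250 ]
do
    echo "creating VM $i"
    # Hex form of the instance number, available to the template.
    hex=`printf "%02x" $i`
    echo "(
[template sxp file contents with variables]
)" > temp$i.sxp
    xm new -F temp$i.sxp
    rm -f temp$i.sxp
    i=`expr $i + 1`
done
exit 0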
I used a raw SXP config file, because I noticed that doing it via libvirt would strip the explicit kernel and ramdisk options that are needed to boot the domU 32bit, not 64bit. That's an item to be dealt with later.
That all took less than an hour to set up, and all domains were ready to go:
ginsberg# virsh list --all
Id Name State
----------------------------------
0 Domain-0 running
- spv0 shut off
- spv1 shut off
- spv10 shut off
- spv100 shut off
- spv101 shut off
- spv102 shut off
- spv103 shut off
- spv104 shut off
- spv105 shut off
- spv106 shut off
- spv107 shut off
- spv108 shut off
- spv109 shut off
- spv11 shut off
- spv110 shut off
- spv111 shut off
- spv112 shut off
- spv113 shut off
- spv114 shut off
- spv115 shut off
- spv116 shut off
- spv117 shut off
- spv118 shut off
- spv119 shut off
[etc]
So I started a loop to get them all going. After 25 started domains, the hypervisor complained:
(xVM) Cannot handle page request order 0!
Hm... was enough memory available?
(xVM) Physical memory information:
(xVM) Xen heap: 12kB free
[...]
Ah, Xen had run out of heap space. The Xen heap is gone in Xen 3.4, so it wouldn't have this issue. However, I was doing this on our current 3.3-based bits (we have 3.4 lined up for later this year). So, this needed to be worked around. 256M of heap space should be enough (instead of the 16M default). So, I rebooted the system with xenheap_megabytes=256 in the Xen command line, and started firing up domains again. After a while, I noticed an error message:
Unable to open tty /dev/pts/86: No such file or directory
So I stopped the loop to see what was going on. It quickly turned out that xenconsoled was running out of file descriptors. So I upped its limit using plimit(1), and restarted the loop. Domains were happily starting up, until about the 127th domain:
panic[cpu2]/thread=ffffff000a861c60: No available IRQ to bind to: increase NR_IRQS!
ffffff000a8618f0 unix:alloc_irq+158 ()
ffffff000a861910 unix:ec_bind_evtchn_to_irq+2e ()
ffffff000a861950 unix:xvdi_bind_evtchn+a3 ()
ffffff000a8619e0 xdb:xdb_bindto_frontend+206 ()
ffffff000a861a30 xdb:xdb_start_connect+ae ()
ffffff000a861a80 xdb:xdb_oe_state_change+99 ()
ffffff000a861af0 genunix:ndi_event_run_callbacks+96 ()
ffffff000a861b20 xpvd:xpvd_post_event+24 ()
ffffff000a861b50 genunix:ndi_post_event+2d ()
ffffff000a861ba0 unix:i_xvdi_oestate_handler+94 ()
ffffff000a861c40 genunix:taskq_thread+1b7 ()
ffffff000a861c50 unix:thread_start+8 ()
Oops. Ok, I bumped NR_IRQ (to be more precise, NR_DYNIRQ), and recompiled a dom0 kernel. Obviously, dom0 shouldn't fail that way when it runs out of virtual IRQ space, but that's an issue that can be addressed later.
The system was updated, restarted, and I restarted the loop again. This time, success! All domUs were active, and I could access all of their consoles.
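For reference, a start loop with the xenconsoled workaround folded in could look roughly like this. Several things here are assumptions: the post shows neither the loop nor the limit that was chosen, so `virsh start` and the value 4096 are example picks, and `run`/`DRYRUN` are illustrative helpers that make the script print commands instead of executing them.

```shell
#!/bin/sh
# Print commands by default; set DRYRUN=no to actually execute them.
DRYRUN=${DRYRUN:-yes}
run() {
    if [ "$DRYRUN" = "yes" ]; then echo "$@"; else "$@"; fi
}

# Raise the soft and hard file descriptor limits of xenconsoled,
# which needs descriptors for every guest console it serves.
NOFILE=4096                       # example value, an assumption
if [ "$DRYRUN" = "yes" ]; then
    pid=PID                       # placeholder in dry-run mode
else
    pid=`pgrep -x xenconsoled`
fi
run plimit -n $NOFILE,$NOFILE $pid

# Start the cloned guests one by one.
i=1
while [ $i -le 250 ]
do
    echo "starting instance $i"
    run virsh start spv$i
    i=`expr $i + 1`
done
```

A limit set with plimit(1) applies only to the running process, so it would have to be reapplied whenever xenconsoled restarts.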
xentop - 10:15:12 Xen 3.3.2-rc1-xvm-debu
252 domains: 1 running, 251 blocked, 0 paused, 0 crashed, 0 dying, 0 shutdown
Mem: 41942204k total, 37902624k used, 4039580k free CPUs: 16 @ 2933MHz
NAME STATE CPU(sec) CPU(%) MEM(k) MEM(%) MAXMEM(k) MAXMEM(%) VCPUS
NETS NETTX(k) NETRX(k) VBDS VBD_OO VBD_RD VBD_WR SSID
Domain-0 -----r 6501 39.5 4194304 10.0 no limit n/a 16
0 0 0 0 0 0 0 0
spv0 --b--- 21 0.3 131072 0.3 131072 0.3 1
1 0 0 1 0 0 0 0
spv1 --b--- 21 0.3 131072 0.3 131072 0.3 1
1 0 0 1 0 0 0 0
# xm list
Name ID Mem VCPUs State Time(s)
Domain-0 0 4096 16 r----- 6508.4
spv0 1 128 1 -b---- 21.5
spv1 2 128 1 -b---- 21.2
spv10 11 128 1 -b---- 23.4
spv100 101 128 1 -b---- 18.0
spv101 102 128 1 -b---- 17.7
spv102 103 128 1 -b---- 17.8
spv103 104 128 1 -b---- 17.6
spv104 105 128 1 -b---- 19.3
spv105 106 128 1 -b---- 18.1
spv106 107 128 1 -b---- 18.6
spv107 108 128 1 -b---- 18.1
spv108 109 128 1 -b---- 19.6
spv109 110 128 1 -b---- 19.2
spv11 12 128 1 -b---- 21.2
spv110 111 128 1 -b---- 18.9
spv111 112 128 1 -b---- 18.0
spv112 113 128 1 -b---- 17.7
spv113 114 128 1 -b---- 19.6
spv114 115 128 1 -b---- 17.8
spv115 116 128 1 -b---- 17.9
spv116 117 128 1 -b---- 19.9
spv117 118 128 1 -b---- 18.4
spv118 119 128 1 -b---- 18.5
spv119 120 128 1 -b---- 17.8
spv12 13 128 1 -b---- 22.7
spv120 121 128 1 -b---- 19.3
spv121 122 128 1 -b---- 18.0
spv122 123 128 1 -b---- 17.6
[..you get the idea]
=================
# virsh console spv250
v3.3.2-rc1-xvm-debu chgset 'Wed Jul 29 08:09:08 2009 -0700 18433:844795afdcb4'
SunOS Release 5.11 Version snv_105 32-bit
Copyright 1983-2008 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
Hostname: spv
Reading ZFS config: done.
spv console login:
spv console login:
spv console login:
spv console login:
spv console login: root
Password:
Last login: Tue Jul 28 15:22:26 on console
Jul 29 10:12:19 spv login: ROOT LOGIN /dev/console
Sun Microsystems Inc. SunOS 5.11 snv_105 November 2008
# w
10:12am up 3 min(s), 1 user, load average: 0.06, 0.10, 0.04
User tty login@ idle JCPU PCPU what
root console 10:12am -sh
Just for fun, I started up some more domUs, as there was memory left for about 30 more. But, there seems to be a limit (8bit limit perhaps) somewhere in the system:
starting instance 256
error: Failed to start domain spv256
error: POST operation failed: xend_post: error from xen daemon: (xend.err 'Device 0 (vif) could not be connected. Backend device not found.')
syseventconfd[100801]: process 225108 exited with status 1
Hmm, well, that's another item to be looked at.
All in all, this didn't take long to set up, and the domUs were running just fine. The job was made a whole lot easier by ZFS, too. The system has been running all these domUs for about half a day, and I've poked around in them a bit, without any problems.
Now I just need 4,480 of these machines..
UPDATE: The last limit that I mentioned was actually an error in my script; it generated a bogus MAC address for the new guest. However, there is currently a limit of 256 guests, because of the value of EVTCHNDRV_DEFAULT_NCLONES, which is 256.
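The post does not show the MAC template, so the exact cause is a guess, but a bug of this kind is easy to reproduce if the hex value from the creation script ends up in a single octet: `printf "%02x"` pads to two digits without truncating, so guest 256 produces three. The sketch below shows that, plus one way to spread the guest number over the last two octets. The `mac_for` function is hypothetical; 00:16:3e is the OUI reserved for Xen guests.

```shell
#!/bin/sh
# %02x sets a minimum width, not a maximum.
printf "%02x\n" 250    # fa
printf "%02x\n" 256    # 100, which is not a valid MAC octet

# Hypothetical helper: encode guest number $1 in the last two octets.
mac_for() {
    hi=`expr $1 / 256`
    lo=`expr $1 % 256`
    printf "00:16:3e:00:%02x:%02x\n" $hi $lo
}

mac_for 250    # 00:16:3e:00:00:fa
mac_for 256    # 00:16:3e:00:01:00
```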

Hi. Discovered this blog post upon running against the NR_IRQ panic (after 60 or so guests; they need a couple of NICs each which probably spends them faster). OpenSolaris Dom0.
Any chance of giving a few more details and/or hints on how to go about increasing the limit?
Posted by Teemu Voipio on August 17, 2009 at 02:21 PM CEST #