Thursday Mar 22, 2007
Thursday Mar 22, 2007
Two previous blogs described my quest to create and boot 500 zones on one system as efficiently as possible, given my hardware constraints. But my original goal was testing the sanity of the limit of 8,191 zones per Solaris instance. Is the limit too low, or absurdly high? Running 500 zones on a sufficiently large system seemed reasonable if the application load was sufficiently small per zone. How about 1,000 zones?
Modifying my scripts to create the 501st through 1,000th zones was simple enough. The creation of 500 zones went very smoothly. Booting 1,000 zones seemed too easy...until somewhere in the 600's. Further zones didn't boot, or booted into administrative mode.
Several possible obstacles occurred to me, but a quick check of Richard and Jim's new Solaris Internals edition helped me find the maximum number of processes currently allowed on the system. The value was a bit over 16,000. And those 600+ zones were using them all up. A short entry in the global zone's /etc/system file increased the maximum number of processes to 25,000:
set max_nprocs=25000
Unfettered by a limit on the number of concurrent processes, I re-booted all the zones. More then 900 booted, but the same behavior returned: many zones did not boot properly. The running zones were not using all 25,000 PID slots. To re-diagnose the problem I first verified that I could create 25,000 processes with a "limited fork bomb." I was temporarily stumped until a conversation I had with some students in my LISA'06 class "Managing Resources with Solaris 10 Containers." One of them had experienced a problem on a very large Sun computer that was running hundreds of applications, though they weren't using Containers.
They found that they were being limited by the amount of software thread (LWP) stack space in the kernel. LWP stack pages are one of the portions of kernel memory that are pageable. Space for pageable kernel memory is allocated when the system boots and cannot be re-sized while the kernel is running.
The default size depends on the hardware architecture. For 64-bit x86 systems the default is 2GB. The kernel tunable which controls this is segkpsize, which represents the number of kernel memory pages that are pageable. When these pages are all in use, new LWPs (threads) cannot be created.
With over 900 zones running, prstat(1M) showed over 77,000 LWPs in use. To test my guess that segkpsize was limiting my ability to boot 1,000 zones, I added the following line to /etc/system and re-booted:
set segkpsize=1048576This doubles the amount of pageable kernel memory to 4GB on AMD64 systems. With that, booting my 1,000 zones was boring, as it should be.
Final statistics for 1,000
running zones included:
Conclusions: