Largepages and other exciting things...
Time for some more thrilling technical content on the hottest blog straight outta Aldershot... So, anyway, I'm looking at a few calls that basically go along the lines of "my database takes a long time to start up once the domain has been up for some time."
The problem seems database-independent. A truss -D shows that the problem lies with the shmat() system call. The shmat() call is taking many minutes per GB where it should really only take seconds. The problem is only seen on large boxes with many GB of RAM. There is no hang or panic, but it is very inconvenient for customers, particularly when they are failing over a database in an HA setup.
So what's up with that? A combination of lockstat and live savecores led us to look into the use of large pages. Huh? Well, it turns out that when we use ISM (Intimate Shared Memory) as our shared memory type (rather than just 'regular' shared memory) by passing SHM_SHARE_MMU as a flag to shmat(), the kernel uses the largest page size that it can. On current UltraSPARC implementations, the largest page size is 4MB.
The reasoning presumably (I didn't write the code) is that:
a) Shared memory is most heavily used by databases
b) Databases use a LOT of memory, typically in contiguous chunks
c) Using large pages is a good fit for the above
But like everything, it doesn't come for free. The issue at hand is that to use large pages, we need 4MB of free contiguous physical memory for each 4MB page. At boot time, which is when most databases start, this is not a problem. But after a while memory fills up with many, many 8K pages, which is what most other allocations are made up of.
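To put some rough numbers on that, here is a toy Python model (definitely not how the Solaris kernel does it). It assumes a 4MB page has to sit on a 4MB-aligned boundary, so it is built from exactly 512 neighbouring 8K pages, and it simply counts how many aligned 4MB regions have no 8K page in use. The names free_large_pages and demo are made up for this illustration.

```python
SMALL_PAGE = 8 * 1024          # 8K base page size, as in the post
LARGE_PAGE = 4 * 1024 * 1024   # 4MB large page size, as in the post
PAGES_PER_LARGE = LARGE_PAGE // SMALL_PAGE  # 512 small pages per large page


def free_large_pages(total_small_pages, used_small_pages):
    """Count aligned 4MB regions in which every 8K page is free.

    Assumption: a large page must start on a 4MB boundary, so a single
    used 8K page anywhere in a region rules that whole region out.
    """
    total_regions = total_small_pages // PAGES_PER_LARGE
    # Map each used small page to the 4MB region that contains it.
    spoiled = {
        page // PAGES_PER_LARGE
        for page in used_small_pages
        if 0 <= page < total_regions * PAGES_PER_LARGE
    }
    return total_regions - len(spoiled)


def demo():
    """Compare a freshly booted 1GB model with a fragmented one."""
    total = (1024 * 1024 * 1024) // SMALL_PAGE  # 8K pages in 1GB
    fresh = free_large_pages(total, set())
    # Worst case: one stray 8K page in use in every 4MB region.
    scattered = set(range(0, total, PAGES_PER_LARGE))
    aged = free_large_pages(total, scattered)
    return fresh, aged


if __name__ == "__main__":
    fresh, aged = demo()
    print("free 4MB pages just after boot:", fresh)
    print("free 4MB pages when fragmented:", aged)
```

In the fragmented case only 256 of the 131072 small pages are in use (2MB out of 1GB), yet not a single 4MB page can be handed out without first moving or freeing something. Real memory is rarely that unlucky, but it shows why the amount of free memory alone tells you little about how many large pages are available.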
So I did a bit of reading, and found that this is a well-known computer science problem... This paper, which I saw a while back on Val Henson's blog, explains the rationale behind large pages and the pitfalls.
I should say that the algorithms which Solaris uses work extremely well in most cases. We're working on fixing this pathological case as I speak... (Sep 08 2004, 05:04:58 PM BST)

