Saturday August 06, 2005
I want to understand my system
About a year ago I joined the NUMA project, which was adding support for hierarchical locality groups (HLS). Solaris already understood two-level memory hierarchies, in which a piece of memory is either local or remote to a given CPU. The project elaborated on this concept and introduced more fine-grained distance information. Now the scheduler and the VM subsystem could discover which memory or CPU is closest, which is a bit further away, which is further still, and so on. This is especially important for AMD Opteron machines with their HyperTransport architecture. For example, on a 4-way AMD system it may take up to two hops for a CPU to reach memory on another node.
This whole notion of locality is expressed in an abstraction called a locality group, or lgroup for short. An lgroup is just a set of resources (memory and CPUs) that are not too far from each other. Lgroups form a hierarchy, with the leaves containing end-node resources and the root lgroup at the top containing all the system resources. Jonathan wrote a very good introduction explaining all the details, and it has some very cool pictures, too. I strongly recommend reading it if you are interested in the way Solaris deals with NUMA challenges.
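As a rough illustration of the hierarchy idea, here is a minimal Perl sketch that walks from a leaf lgroup up to the root, so that each step exposes a wider set of resources. The data structure, the `path_to_root` helper, and the CPU sets are assumptions made up for this example; they are not how the kernel or the lgroup library represents lgroups.

```perl
use strict;
use warnings;

# Illustrative data only (hypothetical): each lgroup records its parent
# and the CPUs it contains.
my %lgrp = (
    0 => { parent => undef, cpus => [0, 1, 2, 3] },   # root: all resources
    5 => { parent => 0,     cpus => [0, 1, 2] },      # intermediate: nearby resources
    1 => { parent => 5,     cpus => [0] },            # leaf: local resources
);

# Walk from an lgroup towards the root, returning lgroup ids nearest-first.
sub path_to_root {
    my ($id) = @_;
    my @path;
    while (defined $id) {
        push @path, $id;
        $id = $lgrp{$id}{parent};
    }
    return @path;
}

for my $id (path_to_root(1)) {
    print "lgroup $id: CPUs @{ $lgrp{$id}{cpus} }\n";
}
```

Starting from leaf 1, the walk visits lgroups 1, 5 and 0 in that order, and the CPU set grows at every step. That is the sense in which the hierarchy answers "what is closest, and what is a bit further away".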
I came from different towns and villages of Solaris and didn't know anything about these locality groups. Some of you know the feeling of joining a completely new project where everyone speaks a foreign language. To get a handle on the terminology and the basic abstractions, I started by improving the little MDB support available for displaying lgroups. It lets you take a quick look at all lgroups, and the output looks like this (on a 4-CPU AMD box):
> ::lgrp
LGRPID             ADDR           PARENT PLATHAND #CPU CPUS
     0 fffffffffbc1ccc0                0  DEFAULT    0
     1 fffffffffbc0a2a0 fffffffffbc0a380        0    1 0
     2 fffffffffbc0a310 fffffffffbc0a3f0        1    1 1
     3 fffffffffbc0a380 fffffffffbc1ccc0     NULL    0
     4 fffffffffbc0a3f0 fffffffffbc1ccc0     NULL    0
     5 fffffffffbc0a460 fffffffffbc0a4d0        2    1 2
     6 fffffffffbc0a4d0 fffffffffbc1ccc0     NULL    0
     7 fffffffffbc0a540 fffffffffbc0a5b0        3    1 3
     8 fffffffffbc0a5b0 fffffffffbc1ccc0     NULL    0
You can get a bit more detail:
{
    lgrp_id = 0
    lgrp_latency = 0x86
    lgrp_plathand = 0xbabecafe
    lgrp_parent = 0
    lgrp_reserved1 = 0
    lgrp_childcnt = 0x4
    lgrp_children = 0x158
    lgrp_leaves = 0xa6
    lgrp_set = [ 0xa6, 0x6 ]
    lgrp_mnodes = 0x3
    lgrp_nmnodes = 0x2
    lgrp_reserved2 = 0
    lgrp_cpu = 0
    lgrp_cpucnt = 0
    lgrp_chipcnt = 0
    lgrp_chips = 0
    lgrp_kstat = 0
}
{
    lgrp_id = 0x1
    lgrp_latency = 0x42
    lgrp_plathand = 0
    lgrp_parent = lgrp_space+0xe0
    lgrp_reserved1 = 0
    lgrp_childcnt = 0
    lgrp_children = 0
    lgrp_leaves = 0x2
    lgrp_set = [ 0x2, 0x2 ]
    lgrp_mnodes = 0x1
    lgrp_nmnodes = 0x1
    lgrp_reserved2 = 0
    lgrp_cpu = cpus
    lgrp_cpucnt = 0x1
    lgrp_chipcnt = 0x1
    lgrp_chips = cpu0_chip
    lgrp_kstat = 0xffffffff839d35e0
}
...
You get the idea. It is quite useful for engineers who have to debug this stuff, but a bit obscure for most other people. So I was wondering how I could see these lgroups in a simple manner without resorting to the kernel debugger. Luckily, Jonathan had created a very useful library that provides all the information needed at user level. The only thing missing was an application that displayed this information in human-readable form. I had enough motivation to write such a program, but I wanted a simple tool to experiment with, since I didn't yet know what it was going to display, and Perl is a good tool for prototyping. The remaining missing piece was the glue code needed to access the C library from a Perl application. This was a perfect chance to dive into the underworld of XS and XSUB, and I have described that experience before.
It turned out that the glue code in this case was relatively easy, once I figured out how to get around some of the h2xs roadblocks. After that the Perl part was really easy: the first prototype was about a page of Perl code. For example, the following piece of code produced the full list of lgroups in the system, or in any subtree:
sub lgrp_lgrps($;$)
{
    my $cookie = shift;
    my $root = shift;
    $root = lgrp_root($cookie) unless defined $root;
    return unless defined $root;
    my @children = lgrp_children($cookie, $root);
    my @result;
    # Concatenate root with the subtrees of all its children.
    # Every subtree is obtained by calling lgrp_lgrps recursively with each of
    # the children as the argument.
    @result = @children ?
        ($root, map {lgrp_lgrps($cookie, $_)} @children) :
        ($root);
    return (wantarray ? @result : scalar @result);
}
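To see what this traversal returns without a Solaris machine at hand, the sketch below runs a condensed copy of it against stand-in versions of `lgrp_root` and `lgrp_children`. The stand-ins, the `%children` hash, and the empty `$cookie` are assumptions made for this example only; the real functions obtain their answers from the lgroup library.

```perl
use strict;
use warnings;

# Stand-ins for the library calls, backed by a plain hash (hypothetical data).
my %children = (0 => [5, 6], 5 => [1], 6 => [2]);
sub lgrp_root     { return 0 }
sub lgrp_children { my ($cookie, $id) = @_; return @{ $children{$id} || [] } }

# Condensed copy of the traversal shown above (prototype omitted).
sub lgrp_lgrps {
    my ($cookie, $root) = @_;
    $root = lgrp_root($cookie) unless defined $root;
    return unless defined $root;
    my @result = ($root,
                  map { lgrp_lgrps($cookie, $_) } lgrp_children($cookie, $root));
    return wantarray ? @result : scalar @result;
}

my $cookie = {};                       # placeholder; unused by the stand-ins
my @all    = lgrp_lgrps($cookie);      # whole hierarchy, parents before children
my @sub    = lgrp_lgrps($cookie, 5);   # subtree rooted at lgroup 5
my $count  = lgrp_lgrps($cookie);      # scalar context: number of lgroups

print "all: @all\nsubtree: @sub\ncount: $count\n";
```

The list comes back in pre-order (each lgroup before its descendants), and calling the function in scalar context gives the number of lgroups instead of the list.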
Once I had the glue code and the initial prototype, it was relatively easy to write a small program that displayed the system lgroup hierarchy in a nice form. Here is a different look at the same hierarchy as you saw above:
$ lgrpinfo -Ta
0
|-- 5
|   CPUs: 0-2
|   Lgroup resources: 1 2 3 (CPU); 1 2 3 (memory)
|   `-- 1
|       CPU: 0
|       Memory: installed 2048 Mb, allocated 182 Mb, free 1866 Mb
|-- 6
|   CPUs: 0 1 3
|   Lgroup resources: 1 2 4 (CPU); 1 2 4 (memory)
|   `-- 2
|       CPU: 1
|       Memory: installed 1599 Mb, allocated 25 Mb, free 1574 Mb
|-- 7
|   CPUs: 0 2 3
|   Lgroup resources: 1 3 4 (CPU); 1 3 4 (memory)
|   `-- 3
|       CPU: 2
|       Memory: installed 2048 Mb, allocated 131 Mb, free 1917 Mb
`-- 8
    CPUs: 1-3
    Lgroup resources: 2 3 4 (CPU); 2 3 4 (memory)
    `-- 4
        CPU: 3
        Memory: installed 2048 Mb, allocated 26 Mb, free 2022 Mb
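A tree layout like this one can be produced with a short recursive routine. The following is only a sketch of the general technique, not the code lgrpinfo actually uses; the `%children` hash and the `tree_lines` helper are hypothetical names introduced for the example.

```perl
use strict;
use warnings;

# Hypothetical hierarchy (id => list of child ids); not real system data.
my %children = (0 => [5, 8], 5 => [1], 8 => [4]);

# Return the lines of an indented tree below $id. $prefix is the
# indentation inherited from the ancestors.
sub tree_lines {
    my ($id, $prefix) = @_;
    my @kids = @{ $children{$id} || [] };
    my @lines;
    for my $i (0 .. $#kids) {
        my $last = ($i == $#kids);
        # The last child gets a corner, the others a tee.
        push @lines, $prefix . ($last ? '`-- ' : '|-- ') . $kids[$i];
        # Below a last child the vertical bar is dropped from the prefix.
        push @lines, tree_lines($kids[$i], $prefix . ($last ? '    ' : '|   '));
    }
    return @lines;
}

print "$_\n" for 0, tree_lines(0, '');
```

The only subtle part is the prefix: every level appends either a bar plus padding or plain padding, depending on whether more siblings follow at that level.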
It turned out that the Perl glue code had another application - it could be used to write useful tests in Perl for the HLS implementation. Here is a simple example:
######################################################################
# Each lgrp other than root should have a single parent and
# root should have no parents.
$fail = 0;
foreach my $l (lgrp_lgrps($c)) {
    next if $l == $root;
    my (@parents) = $c->parents($l) or
        diag("lgrp_parents: $!");
    my $nparents = @parents;
    $fail++ unless $nparents == 1;
}
is($fail, 0, 'All non-root lgrps should have single parent');
@parents = $c->parents($root);
ok(!@parents, 'root should have no parents');
Once the tool was ready, Steve Lau wrote a set of tests for it (also in Perl).
And now you can play with it as well: the Perl module and the resulting lgrpinfo command are now available on the OpenSolaris web site. They are also available through CPAN.