Weblog

Saturday August 06, 2005

I want to understand my system

About a year ago I joined the NUMA project, which was adding support for hierarchical locality groups (HLS). Solaris already understood two-level memory hierarchies, in which a piece of memory is either local or remote to a given CPU. The project elaborated on this concept and introduced more fine-grained distance information. Now the scheduler and the VM subsystem could discover which memory or CPU is closest, which is a bit further away, which is further still, and so on. This is especially important for AMD Opteron machines with their HyperTransport architecture. For example, on a 4-way AMD system it may take up to two hops for a CPU to reach memory on another node.
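To make the "up to two hops" remark concrete, here is a minimal sketch that counts hops with a breadth-first search. The square topology (each node linked to two neighbours) and the node numbers are assumptions made for illustration, not something read from real hardware.

```perl
use strict;
use warnings;

# Assumed topology: four nodes linked in a square (1-2, 1-3, 2-4, 3-4).
my %links = (1 => [2, 3], 2 => [1, 4], 3 => [1, 4], 4 => [2, 3]);

# Breadth-first search returning the hop count from $start to every node.
sub hops_from {
    my ($start) = @_;
    my %hops  = ($start => 0);
    my @queue = ($start);
    while (defined(my $node = shift @queue)) {
        for my $next (@{ $links{$node} }) {
            next if exists $hops{$next};      # already reached by a shorter path
            $hops{$next} = $hops{$node} + 1;
            push @queue, $next;
        }
    }
    return \%hops;
}

for my $node (sort { $a <=> $b } keys %links) {
    my $hops = hops_from($node);
    my @others = grep { $_ != $node } sort { $a <=> $b } keys %$hops;
    print "node $node: ",
        join(', ', map { "node $_ is $hops->{$_} hop(s) away" } @others), "\n";
}
```

In this assumed layout every node has two neighbours one hop away and one node, the diagonally opposite one, two hops away.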

This notion of locality is expressed in an abstraction called a locality group, or lgroup for short. An lgroup is just a set of resources (memory and CPUs) that are not too far from each other. Lgroups form a hierarchy, with the leaves containing end-node resources and the root lgroup at the top containing all the system resources. Jonathan wrote a very good introduction explaining all the details, and it has some very cool pictures, too. I strongly recommend reading it if you are interested in the way Solaris deals with NUMA challenges.

I came from different towns and villages of Solaris and didn't know anything about these locality groups. Some of you know the feeling when you join a completely new project and everyone speaks a foreign language. To get a handle on the terminology and the basic abstractions, I started by improving the little MDB support available for displaying lgroups. It basically allows you to have a quick look at all lgroups, and the output looks like this (on a 4-CPU AMD box):


> ::lgrp
   LGRPID             ADDR           PARENT         PLATHAND      #CPU      CPUS
        0 fffffffffbc1ccc0                0          DEFAULT         0
        1 fffffffffbc0a2a0 fffffffffbc0a380                0         1      0
        2 fffffffffbc0a310 fffffffffbc0a3f0                1         1      1
        3 fffffffffbc0a380 fffffffffbc1ccc0             NULL         0
        4 fffffffffbc0a3f0 fffffffffbc1ccc0             NULL         0
        5 fffffffffbc0a460 fffffffffbc0a4d0                2         1      2
        6 fffffffffbc0a4d0 fffffffffbc1ccc0             NULL         0
        7 fffffffffbc0a540 fffffffffbc0a5b0                3         1      3
        8 fffffffffbc0a5b0 fffffffffbc1ccc0             NULL         0
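
The PARENT column appears to hold the address of the parent lgroup, so the hierarchy can be recovered by matching it against the ADDR column. The sketch below does that for the rows shown above. The rows are copied by hand into a Perl array (an assumption made for illustration; nothing here talks to MDB).

```perl
use strict;
use warnings;

# Rows copied from the ::lgrp output: lgroup ID, address, parent address.
my @rows = (
    [0, 'fffffffffbc1ccc0', '0'],
    [1, 'fffffffffbc0a2a0', 'fffffffffbc0a380'],
    [2, 'fffffffffbc0a310', 'fffffffffbc0a3f0'],
    [3, 'fffffffffbc0a380', 'fffffffffbc1ccc0'],
    [4, 'fffffffffbc0a3f0', 'fffffffffbc1ccc0'],
    [5, 'fffffffffbc0a460', 'fffffffffbc0a4d0'],
    [6, 'fffffffffbc0a4d0', 'fffffffffbc1ccc0'],
    [7, 'fffffffffbc0a540', 'fffffffffbc0a5b0'],
    [8, 'fffffffffbc0a5b0', 'fffffffffbc1ccc0'],
);

# Map each address back to its lgroup ID.
my %id_of = map { $_->[1] => $_->[0] } @rows;

# Build parent ID -> list of child IDs.
my %children;
for my $row (@rows) {
    my ($id, $addr, $parent) = @$row;
    next unless exists $id_of{$parent};   # the root's parent matches no address
    push @{ $children{ $id_of{$parent} } }, $id;
}

for my $id (sort { $a <=> $b } keys %children) {
    print "lgroup $id: children @{ $children{$id} }\n";
}
```

Read this way, the root (lgroup 0) has four children (3, 4, 6 and 8), and each of those has a single leaf lgroup holding one CPU.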

You can get a bit more detail:


{
    lgrp_id = 0
    lgrp_latency = 0x86
    lgrp_plathand = 0xbabecafe
    lgrp_parent = 0
    lgrp_reserved1 = 0
    lgrp_childcnt = 0x4
    lgrp_children = 0x158
    lgrp_leaves = 0xa6
    lgrp_set = [ 0xa6, 0x6 ]
    lgrp_mnodes = 0x3
    lgrp_nmnodes = 0x2
    lgrp_reserved2 = 0
    lgrp_cpu = 0
    lgrp_cpucnt = 0
    lgrp_chipcnt = 0
    lgrp_chips = 0
    lgrp_kstat = 0
}
{
    lgrp_id = 0x1
    lgrp_latency = 0x42
    lgrp_plathand = 0
    lgrp_parent = lgrp_space+0xe0
    lgrp_reserved1 = 0
    lgrp_childcnt = 0
    lgrp_children = 0
    lgrp_leaves = 0x2
    lgrp_set = [ 0x2, 0x2 ]
    lgrp_mnodes = 0x1
    lgrp_nmnodes = 0x1
    lgrp_reserved2 = 0
    lgrp_cpu = cpus
    lgrp_cpucnt = 0x1
    lgrp_chipcnt = 0x1
    lgrp_chips = cpu0_chip
    lgrp_kstat = 0xffffffff839d35e0
}
...

You get the idea. It is quite useful for engineers who have to debug this stuff, but a bit obscure for most other people. So I wondered how I could see these lgroups in a simple manner without resorting to the kernel debugger. Luckily, Jonathan had created a very useful library that provides all the information needed at the user level. The only thing missing was an application that displayed this information in human-readable form. I had enough motivation to write such a program, but I wanted a simple tool to play with, since I didn't yet know what it was going to display, and Perl is a good language for prototyping. The remaining missing part was the glue code needed to access the C library from a Perl application. This was a perfect chance to dive into the underworld of XS and XSUB, and I have described that experience before.

It turned out that the glue code in this case was relatively easy, once I figured out how to get around some of the h2xs roadblocks. After that the Perl part was really easy - the first prototype was about a page of Perl code. For example, the following piece of code produces the full list of lgroups in the system, or in any subtree:

sub lgrp_lgrps($;$)
{
    my $cookie = shift;
    my $root = shift;

    $root = lgrp_root($cookie) unless defined $root;
    return unless defined $root;

    my @children = lgrp_children($cookie, $root);
    my @result;

    # Concatenate the root with the subtree of each child.
    # Each subtree is obtained by calling lgrp_lgrps recursively with
    # the child as the argument.
    @result = @children ?
      ($root, map {lgrp_lgrps($cookie, $_)} @children) :
      ($root);
    return (wantarray ? @result : scalar @result);
}
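
To see what the function returns without a Solaris box at hand, here is a minimal sketch that runs the same recursion against made-up data. The %tree hash and the stand-in versions of lgrp_root and lgrp_children are hypothetical replacements for the real library calls, and the cookie is just a dummy string.

```perl
use strict;
use warnings;

# Hypothetical stand-in data: parent lgroup ID -> child IDs.
my %tree = (0 => [5, 6, 7, 8], 5 => [1], 6 => [2], 7 => [3], 8 => [4]);

# Stand-ins for the library calls; the cookie is ignored here.
sub lgrp_root     { return 0 }
sub lgrp_children { my ($cookie, $lgrp) = @_; return @{ $tree{$lgrp} || [] } }

# Forward declaration so the recursive call below sees the prototype.
sub lgrp_lgrps($;$);

sub lgrp_lgrps($;$)
{
    my $cookie = shift;
    my $root = shift;

    $root = lgrp_root($cookie) unless defined $root;
    return unless defined $root;

    # The root followed by the subtree of each child, depth first.
    my @children = lgrp_children($cookie, $root);
    my @result = ($root, map { lgrp_lgrps($cookie, $_) } @children);
    return (wantarray ? @result : scalar @result);
}

my $cookie  = 'fake';
my @all     = lgrp_lgrps($cookie);       # list context: every lgroup ID
my $count   = lgrp_lgrps($cookie);       # scalar context: number of lgroups
my @subtree = lgrp_lgrps($cookie, 5);    # only the subtree rooted at lgroup 5

print "all: @all\ncount: $count\nsubtree of 5: @subtree\n";
```

The wantarray check is what lets the same call answer both "which lgroups are there?" and "how many are there?", depending on the context it is called in.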

Once I had the glue code and the initial prototype, it was relatively easy to write a small program that displayed the system lgroup hierarchy in a nice form. Here is a different look at the same hierarchy as you saw above:

  $ lgrpinfo -Ta
  0
  |-- 5
  |   CPUs: 0-2
  |   Lgroup resources:  1 2 3 (CPU); 1 2 3 (memory)
  |   `-- 1
  |       CPU: 0
  |       Memory: installed 2048 Mb,  allocated 182 Mb,  free 1866 Mb
  |-- 6
  |   CPUs: 0 1 3
  |   Lgroup resources:  1 2 4 (CPU); 1 2 4 (memory)
  |   `-- 2
  |       CPU: 1
  |       Memory: installed 1599 Mb,  allocated 25 Mb,  free 1574 Mb
  |-- 7
  |   CPUs: 0 2 3
  |   Lgroup resources:  1 3 4 (CPU); 1 3 4 (memory)
  |   `-- 3
  |       CPU: 2
  |       Memory: installed 2048 Mb,  allocated 131 Mb,  free 1917 Mb
  `-- 8
      CPUs: 1-3
      Lgroup resources:  2 3 4 (CPU); 2 3 4 (memory)
      `-- 4
          CPU: 3
          Memory: installed 2048 Mb,  allocated 26 Mb,  free 2022 Mb
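
The tree drawing itself takes only a few lines. The following is a minimal sketch of one way to produce the connectors, not the actual lgrpinfo code. The %children hash is hypothetical stand-in data mirroring the listing above, and the CPU and memory details are left out.

```perl
use strict;
use warnings;

# Hypothetical stand-in data: parent lgroup ID -> child IDs.
my %children = (0 => [5, 6, 7, 8], 5 => [1], 6 => [2], 7 => [3], 8 => [4]);

# Return one line per descendant of $lgrp, each with its tree connector.
sub tree_lines {
    my ($lgrp, $prefix) = @_;
    my @kids = @{ $children{$lgrp} || [] };   # copy, so shift leaves %children intact
    my @lines;
    while (defined(my $kid = shift @kids)) {
        my $last = !@kids;                    # the last child gets the closing connector
        push @lines, $prefix . ($last ? '`-- ' : '|-- ') . $kid;
        push @lines, tree_lines($kid, $prefix . ($last ? '    ' : '|   '));
    }
    return @lines;
}

my @lines = (0, tree_lines(0, ''));
print "$_\n" for @lines;
```

The only trick is that the prefix handed down to the next level depends on whether the current lgroup is the last child: a vertical bar continues under every child except the last one.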

It turned out that the Perl glue code had another application - it could be used to write useful tests in Perl for the HLS implementation. Here is a simple example:

######################################################################
# Each lgrp other than root should have a single parent and
# root should have no parents.
$fail = 0;
foreach my $l (lgrp_lgrps($c)) {
    next if $l == $root;
    my (@parents) = $c->parents($l) or
        diag("lgrp_parents: $!");
    my $nparents = @parents;
    $fail++ unless $nparents == 1;
}
is($fail, 0, 'All non-root lgrps should have a single parent');
@parents = $c->parents($root);
ok(!@parents, 'root should have no parents');
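
The same check can be tried out on any machine by running it against made-up data. This is a minimal sketch using Test::More: the %parents hash is a hypothetical stand-in for what the library would report, so it exercises the shape of the test rather than the HLS implementation.

```perl
use strict;
use warnings;
use Test::More;

# Hypothetical stand-in data: lgroup ID -> list of parent IDs.
my %parents = (
    0 => [],
    5 => [0], 6 => [0], 7 => [0], 8 => [0],
    1 => [5], 2 => [6], 3 => [7], 4 => [8],
);
my $root = 0;

# Count lgroups, other than the root, that do not have exactly one parent.
my $fail = grep { $_ != $root && @{ $parents{$_} } != 1 } keys %parents;

is($fail, 0, 'All non-root lgrps should have a single parent');
ok(!@{ $parents{$root} }, 'root should have no parents');

done_testing();
```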

Once the tool was ready, Steve Lau wrote a set of tests for it (also in Perl).

And now you can play with it as well - the Perl module and the resulting lgrpinfo command are now available on the OpenSolaris web site. They are also available through CPAN.



( Aug 06 2005, 12:28:08 AM PDT )
