Tuesday August 23, 2005
Here I discuss a new set of tools that give developers and users insight into the NUMA characteristics of their systems and applications, along with explicit control knobs to affect NUMA-related properties of their applications.
While software developers are very busy developing more and more "features", hardware designers are very successful in sustaining Moore's Law and offsetting the impact of these features. Two orthogonal ways to improve system performance are
While both approaches are quite successful, increasing the number of CPUs proved to be the only way to dramatically improve the overall system performance.
Combining many individual CPUs together to form a computer system requires special hardware to coordinate these CPUs and other system components (memory, caches, I/O, etc). The coordination with the memory subsystem is especially important because
As a result, without careful design the fastest CPU in the world may spend its days waiting for data from memory. The problem is usually solved by introducing various levels of hardware caches which keep popular memory areas closer to the data-hungry CPU circuits.
Another problem in multiprocessor systems is sharing the memory bus that connects memory banks to the CPUs. Imagine a wide highway connecting several major cities. It may be a breeze to go from one city to another at non-peak times, but everyone knows what happens at rush hour when everyone hurries to their destination. The fast highway becomes a slow-crawling mess. The same thing may happen in a traditional Symmetric Multiprocessing (SMP) system with many CPUs sharing the memory bus. The usual way to solve this problem is to partition the system into smaller nodes where only the inter-node traffic requires access to the shared interconnect while data access within the node uses a private bus. This architecture is very similar to a big network where hubs provide fast access within a LAN and routers connect LANs together while containing all local traffic within the LAN. The major property of such partitioned systems is that accessing data within the node is noticeably faster than accessing data outside the node. Such partitioned systems got the name NUMA, which stands for Non-Uniform Memory Access.
Remember that CPUs try to get most of their data from caches rather than directly from memory, so special hardware components are added to ensure that two CPUs on different nodes see the same piece of data even if the data is in the L1 caches of these CPUs. In technical terms this hardware provides cache coherence, and most NUMA systems include such cache-coherence protocols, so they are sometimes called ccNUMA systems.
Up until recently NUMA systems were primarily designed by big computer vendors developing really powerful computers. The list includes, but is not limited to
But recently AMD introduced the NUMA Opteron and Athlon64 platform, which brings NUMA technology right to my desktop, thanks to its HyperTransport design. As a result, NUMA-awareness became part of mainstream OS design.
As you can see from the description above, data movement within a node is much more efficient than inter-node data movement (that's why the whole thing is called Non-Uniform). Since applications usually don't care about low-level hardware architecture, it is up to the OS to make sure that application data stays close enough to its CPU. NUMA-aware OSes try hard to keep application data close to the CPUs that massage this data.
The terminology in this area is not very well established and while some implementations use the term node, Solaris uses the concept of locality group or "lgroup" to refer to the collection of nearby resources. Locality groups form a hierarchy with each level representing the degree of "closeness" of resources. I recommend that you take your time to read a nice overview of NUMA implementation on Solaris which contains many explanations of Solaris-specific implementation details.
While the OS usually does a good job of keeping data and CPUs close to each other, sometimes its good intentions go wrong and the result is quite the opposite: almost all data accesses for some applications become remote, severely affecting their performance. This is often the case in scientific software when one thread allocates and prepares a large chunk of memory and then other threads use it extensively. The OS doesn't know that the thread that allocated the memory is not the one that is going to use it, so it often keeps the memory close to the allocating thread instead of the thread that is actually using it. Of course, ideally the OS should detect such an imbalance and automatically rearrange things, but in practice this problem is very hard. In such cases application developers or users may help the OS correct the situation by providing additional hints. To do this effectively, users need to understand the NUMA properties of their systems and get some insight into how the OS placed threads and their memory across CPU and memory boards. To correct things, users also need tools to modify such placement. Software developers can use the Locality Group API.
The rest of this discussion will focus on such tools for Solaris operating environment. I assume that you checked the NUMA Observability tools page on the OpenSolaris site and downloaded the tools. We will discuss how to use
lgrpinfo(1): tool for displaying the lgroup hierarchy
plgrp(1): proc tool for observing and affecting lgroup affinities
pmadvise(1): proc tool for applying advice with madvise(3C)
pmap(1): extension option to display the lgroup containing the physical memory backing a given virtual address in the specified process
Solaris::Lgrp: Perl module which gives Perl scripts full access to the lgroup API
Suppose that you are not happy with the performance of your favorite application
on Solaris and want to improve it by playing with the memory placement
optimization knobs. The first thing you would want to know is whether all this
discussion is relevant for you in the first place i.e. whether your system has
NUMA properties. Running lgrpinfo without any arguments answers
this question:
str2: $ lgrpinfo
lgroup 0 (root):
children: none
CPUs: 0 2
Memory: installed 512 Mb, allocated 275 Mb, free 237 Mb
Lgroup resources: 0 (CPU); 0 (memory)
This machine has only one (root) lgroup and is a UMA system.
Not very interesting for our discussion. This is a 2-CPU Ultra 60 system. Let's take a look at another system:
sark: $ lgrpinfo
lgroup 0 (root):
children: 1 2 3
CPUs: 0-3 8-11 16-19
Memory: installed 24576 Mb, allocated 4608 Mb, free 19968 Mb
lgroup 1 (leaf):
children: none, parent: 0
CPUs: 0-3
Memory: installed 8192 Mb, allocated 4247 Mb, free 3945 Mb
lgroup 2 (leaf):
children: none, parent: 0
CPUs: 8-11
Memory: installed 8192 Mb, allocated 194 Mb, free 7998 Mb
lgroup 3 (leaf):
children: none, parent: 0
CPUs: 16-19
Memory: installed 8192 Mb, allocated 167 Mb, free 8025 Mb
This is a NUMA system with three nodes, each having 8 Gb of memory and 4 CPUs. Now let us take a look at a 4-CPU AMD Opteron box:
gears: $ lgrpinfo
lgroup 0 (root):
children: 3 4 6 8
CPUs: 0-3
Memory: installed 3647 Mb, allocated 394 Mb, free 3253 Mb
Lgroup resources: 1 2 5 7 (CPU); 1 2 (memory)
lgroup 1 (leaf):
children: none, parent: 3
CPU: 0
Memory: installed 2048 Mb, allocated 289 Mb, free 1759 Mb
Lgroup resources: 1 (CPU); 1 (memory)
lgroup 2 (leaf):
children: none, parent: 4
CPU: 1
Memory: installed 1599 Mb, allocated 104 Mb, free 1495 Mb
Lgroup resources: 2 (CPU); 2 (memory)
lgroup 3 (intermediate):
children: 1, parent: 0
CPUs: 0-2
Memory: installed 3647 Mb, allocated 394 Mb, free 3253 Mb
Lgroup resources: 1 2 5 (CPU); 1 2 (memory)
lgroup 4 (intermediate):
children: 2, parent: 0
CPUs: 0 1 3
Memory: installed 3647 Mb, allocated 394 Mb, free 3253 Mb
Lgroup resources: 1 2 7 (CPU); 1 2 (memory)
lgroup 5 (leaf):
children: none, parent: 6
CPU: 2
Lgroup resources: 5 (CPU);
lgroup 6 (intermediate):
children: 5, parent: 0
CPUs: 0 2 3
Memory: installed 2048 Mb, allocated 289 Mb, free 1759 Mb
Lgroup resources: 1 5 7 (CPU); 1 (memory)
lgroup 7 (leaf):
children: none, parent: 8
CPU: 3
Lgroup resources: 7 (CPU);
lgroup 8 (intermediate):
children: 7, parent: 0
CPUs: 1-3
Memory: installed 1599 Mb, allocated 104 Mb, free 1495 Mb
Lgroup resources: 2 5 7 (CPU); 2 (memory)
We can see that it has 4 leaf nodes, 4 intermediate nodes and a root node. Pretty much what we would expect from a 4-way AMD Opteron. Adding some options we can add some eye candy to the output:
gears: $ lgrpinfo -Ta
0
|-- 3
|   CPUs: 0-2
|   Lgroup resources: 1 2 5 (CPU); 1 2 (memory)
|   `-- 1
|       CPU: 0
|       Memory: installed 2048 Mb, allocated 291 Mb, free 1757 Mb
|-- 4
|   CPUs: 0 1 3
|   Lgroup resources: 1 2 7 (CPU); 1 2 (memory)
|   `-- 2
|       CPU: 1
|       Memory: installed 1599 Mb, allocated 103 Mb, free 1496 Mb
|-- 6
|   CPUs: 0 2 3
|   Lgroup resources: 1 5 7 (CPU); 1 (memory)
|   `-- 5
|       CPU: 2
`-- 8
    CPUs: 1-3
    Lgroup resources: 2 5 7 (CPU); 2 (memory)
    `-- 7
        CPU: 3
This output shows the lgroup hierarchy in a more obvious way. We can immediately see that this system is rather interesting - CPU 0 has 2Gb of local memory, CPU 1 has 1.5 Gb of local memory and CPUs 2 and 3 have no local memory at all. Suppose that an application is running on CPU 2 and needs a page of memory. Since there is no local memory available, the system will walk up the hierarchy and will try to allocate memory from lgroup 6. Here is the description of lgroup 6:
|-- 6
|   CPUs: 0 2 3
|   Lgroup resources: 1 5 7 (CPU); 1 (memory)
|   `-- 5
|       CPU: 2
We can see that it will try to get memory from lgroup 1, which corresponds to the node containing CPU 0. Similarly, an application homed to CPU 3 will usually get its memory from lgroup 2, which corresponds to the node with CPU 1. This all makes perfect sense if you look at the AMD Opteron topology:
2----3
|    |
|    |
0----1
This picture shows that CPU 2 is closest to CPUs 0 and 3 and CPU 3 is closest to 2 and 1.
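The walk up the hierarchy described above can be mimicked with a small table-driven model. This is only a sketch: the `lgrp_model` structure, the `gears` table (copied by hand from the lgrpinfo output) and the `nearest_memory` helper are all hypothetical names, and picking the lowest-numbered memory lgroup at a level is an arbitrary simplification, not the actual Solaris allocation policy.

```c
#include <assert.h>
#include <stddef.h>

#define NLGRPS  9
#define NO_LGRP (-1)

/* Hypothetical snapshot of one lgroup as reported by lgrpinfo. */
struct lgrp_model {
    int parent;             /* parent lgroup ID, NO_LGRP for the root */
    unsigned mem_resources; /* bit i set: leaf lgroup i provides memory */
};

/* Values transcribed from the lgrpinfo output for "gears". */
static const struct lgrp_model gears[NLGRPS] = {
    [0] = { NO_LGRP, (1u << 1) | (1u << 2) },
    [1] = { 3, 1u << 1 },
    [2] = { 4, 1u << 2 },
    [3] = { 0, (1u << 1) | (1u << 2) },
    [4] = { 0, (1u << 1) | (1u << 2) },
    [5] = { 6, 0 },         /* CPU 2: no local memory */
    [6] = { 0, 1u << 1 },
    [7] = { 8, 0 },         /* CPU 3: no local memory */
    [8] = { 0, 1u << 2 },
};

/*
 * Start at the home lgroup and climb toward the root until a level with
 * memory resources is found. Returns the lowest-numbered memory lgroup
 * at that level, or NO_LGRP if no level has memory.
 */
static int
nearest_memory(const struct lgrp_model *tbl, int home)
{
    assert(home >= 0 && home < NLGRPS);

    for (int l = home; l != NO_LGRP; l = tbl[l].parent) {
        unsigned set = tbl[l].mem_resources;

        for (int i = 0; set != 0; i++, set >>= 1) {
            if (set & 1u)
                return (i);
        }
    }
    return (NO_LGRP);
}
```

With this table, a thread homed in lgroup 5 (CPU 2) resolves to lgroup 1 and a thread homed in lgroup 7 (CPU 3) resolves to lgroup 2, which matches the reasoning in the text.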
These topology considerations mean that if a thread is homed to CPU 0
and the load on this CPU becomes too high, the OS will try to migrate the thread
to the closest CPU - in this case either 2 or 1. This way the OS tries
to keep migrated threads still close to their home memory. We can see this from
the lgrpinfo output above:
|-- 3
|   CPUs: 0-2
|   Lgroup resources: 1 2 5 (CPU); 1 2 (memory)
|   `-- 1
|       CPU: 0
|       Memory: installed 2048 Mb, allocated 291 Mb, free 1757 Mb
CPU 0 belongs to lgroup 1, which has lgroup 3 as its parent. Lgroup 3 has lgroups 1, 2 and 5 as its CPU resources. Lgroup 1 contains CPU 0 itself, lgroup 2 contains CPU 1 and lgroup 5 contains CPU 2.
Armed with this information we can predict that it is best to run a memory-intensive application on CPU 0 or CPU 1 (which is good enough if the working set fits the 1.5 Gb available on CPU 1). A CPU-intensive application without much memory needs may do well on CPUs 2 and 3.
The lgrpinfo command has a useful -l option which
shows the latency for memory access between different lgroups. Here is
an example from the same 4-CPU AMD Opteron machine:
gears: $ lgrpinfo -l
Lgroup latencies:
      0   1   2   3   4   5   6   7   8
0   135 135 135 135 135   - 135   - 135
1    99  66  99  99  99   -  66   -  99
2    99  99  66  99  99   -  99   -  66
3   135  99 135  99 135   -  99   - 135
4   135 135  99 135  99   - 135   -  99
5   135  99 135 135 135   -  99   - 135
6   135 135 135 135 135   -  99   - 135
7   135 135  99 135 135   - 135   -  99
8   135 135 135 135 135   - 135   -  99
We can see that remote access takes about 1.5 times as long as local access (99 versus 66). The absolute values in the table are not very interesting, but the relative values provide a good approximation of the access times. This data is collected by the system during boot.
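To make the relative cost explicit, the table entries can be turned into percentages of the local figure. This is a minimal sketch: the enum names and the `latency_percent` helper are made up for illustration, and reading 66 as "local", 99 as "next level up" and 135 as "root level" is my interpretation of the table, not something lgrpinfo states.

```c
#include <assert.h>

/* Values copied from the lgrpinfo -l table (units are arbitrary). */
enum { LAT_LOCAL = 66, LAT_NEXT_LEVEL = 99, LAT_ROOT = 135 };

/* Return remote latency as a percentage of local, rounded to nearest. */
static int
latency_percent(int remote, int local)
{
    assert(local > 0);
    return ((remote * 100 + local / 2) / local);
}
```

With these numbers, `latency_percent(LAT_NEXT_LEVEL, LAT_LOCAL)` gives 150 and `latency_percent(LAT_ROOT, LAT_LOCAL)` gives 205, so the root-level figure is roughly twice the local one.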
When a new thread is created by the system, it examines load averages on each
CPU and assigns the thread a home lgroup (which is almost always a leaf lgroup).
The scheduler tries to run the thread at its home lgroup or as close to it as
possible. A thread will migrate from its home lgroup if there is too
much imbalance between CPU loads. Note that a thread that is bound to a
specific CPU will always run on this CPU. A thread bound to a processor set will
run on its home lgroup or a closest CPU within the processor set. You can
determine the home lgroup of a process or a thread by calling lgrp_home(3LGRP)
function within an application. Or you can determine the home lgroup of every
thread in a process using the
plgrp utility. Here, for example, we figure out that
sendmail(1M) runs in lgroup 7:
# plgrp -G `pgrep sendmail`
7
# lgrpinfo 7
lgroup 7 (leaf):
children: none, parent: 8
CPU: 3
Lgroup resources: 7 (CPU);
We can see that sendmail is running on CPU 3 which doesn't have any attached memory. If we are concerned about its performance, we can manually move it close to its memory (remember from our previous discussion that the memory is allocated from lgroup 2):
# plgrp -S 2 `pgrep sendmail`
# plgrp -G `pgrep sendmail`
2
What if a process has several threads? For example, for
automountd(1M) we get:
# plgrp -G `pgrep automountd`
7
2
5
We can get information about specific threads:
# plgrp -vG `pgrep automountd`
100473/1: 7
100473/2: 2
100473/4899: 5
Here 100473 is the pid of the process and the numbers 1, 2 and 4899 are thread IDs. They are useful because we can move each thread individually to a new home:
# plgrp -S 1 100473/1
# plgrp -vG `pgrep automountd`
100473/1: 1
100473/2: 2
Or move all of them at once:
# plgrp -S 1 `pgrep automountd`
# plgrp -vG `pgrep automountd`
100473/1: 1
100473/2: 1
When we move a process or a thread to a new home, all its belongings (memory)
stay at its old home. Only new memory allocations use the memory from the new
home. The first question we may ask is what the current memory allocation
looks like. Grabbing the modified version of the pmap(1)
command from the toolkit, we can easily determine where each memory segment is
allocated, using the new -L option:
$ plgrp -G `pgrep emacs`
2
$ pmap -L `pgrep emacs`
07FEB000 372K rwx-- 2 [ stack ]
08050000 16K r-x-- 2 /opt/csw/bin/emacs-21.3
08054000 20K r-x-- 1 /opt/csw/bin/emacs-21.3
0838E000 1652K rwx-- 2 /opt/csw/bin/emacs-21.3
0852B000 2900K rwx-- 2 [ heap ]
08800000 3124K rwx-- 2 [ heap ]
BF960000 4K rwx-- 2 [ anon ]
BF970000 24K r-x-- 1 /lib/libm.so.2
BF976000 4K r-x-- 2 /lib/libm.so.2
...
BF9DE000 4K rwxs- 2 [ anon ]
BF9F0000 12K rwx-- 2 [ anon ]
BF9F4000 8K rwx-- 2 [ anon ]
BFA00000 4K rwx-- 2 [ anon ]
BFA10000 4K rwx-- 2 [ anon ]
BFA20000 128K r-x-- 1 /lib/libc.so.1
BFA42000 28K r-x-- 1 /lib/libc.so.1
BFA49000 16K r-x-- 2 /lib/libc.so.1
...
We can see that most of the text segment and all of the heap are allocated from the home lgroup. The segments for the library code (libm and libc) are spread around - the library code is shared and its allocation depends on which process allocated it first.
How can we control the memory placement? Application writers can use the madvise(3C)
library call to control it from the application itself. It is especially useful
for applications which allocate and initialize memory in one thread and use it
in another. The user can also use the madv.so.1
library to apply memory advice before starting the application. For an already
running application, a developer can use the pmadvise(1)
tool from the toolkit. The following example shows how we can move Emacs to a
new home and advise it to take along the memory it uses:
$ plgrp -S 1 `pgrep emacs`
$ pmadvise -o heap=access_lwp,stack=access_lwp `pgrep emacs`
... Play with Emacs a bit ...
$ pmap -L `pgrep emacs` | egrep '(heap|stack)'
08017000 188K rwx-- 1 [ stack ]
08046000 8K rwx-- 2 [ stack ]
0852B000 32K rwx-- 1 [ heap ]
08534000 76K rwx-- 1 [ heap ]
08549000 4K rwx-- 1 [ heap ]
0854B000 20K rwx-- 1 [ heap ]
08551000 16K rwx-- 1 [ heap ]
...
Almost all the stack and all the heap migrated to a new home! We successfully completed the move with just a few commands!
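For the in-application route via madvise(3C) mentioned earlier, a minimal sketch might look like the following. The helper name `alloc_for_next_toucher` is hypothetical. `MADV_ACCESS_LWP` is the Solaris advice that corresponds to the `access_lwp` keyword used with pmadvise; the `#ifdef` is an assumption that lets the same code build on systems without it, where the hint is simply skipped.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* expose MAP_ANONYMOUS and madvise on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_ANONYMOUS
#define MAP_ANONYMOUS MAP_ANON
#endif

/*
 * Allocate len bytes of anonymous memory and, where supported, tell the
 * system that the next thread (LWP) to touch the range will be its
 * heaviest user. Returns NULL if the mapping fails.
 */
static void *
alloc_for_next_toucher(size_t len)
{
    void *p;

    assert(len > 0);
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return (NULL);
#ifdef MADV_ACCESS_LWP
    /* Advice is only a hint, so a failure here is deliberately ignored. */
    (void) madvise(p, len, MADV_ACCESS_LWP);
#endif
    return (p);
}
```

The idea is that the allocating thread calls this and hands the buffer to a worker thread without touching it, so the worker's first access decides where the pages land. Whether that pays off depends on the workload, so it is worth checking the result with pmap -L.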
Of course, we can't talk about Solaris without mentioning the almighty DTrace. We have all the tools to control long-running applications, but how can we enforce the policy automatically whenever an application starts? DTrace can solve this problem, of course! Let us create the following D program and call it papply.d:
#!/usr/sbin/dtrace -qws
/* We use -w to allow destructive actions */
/*
* When the process reaches main(), stop it, re-home and restart again.
*/
pid$target::main:entry
{
    stop();
    system("plgrp -S 1 %d", $target);
    system("pmadvise -o heap=access_lwp,stack=access_lwp %d", $target);
    system("prun %d", $target);
}
Now we can do things like
$ papply.d -c emacs
Unfortunately, this doesn't work since plgrp
can't deal with a process which is already traced,
although pmap and pmadvise can if the
-F option is supplied. The next
version of plgrp will include a
-F option which will fix the problem. Stay tuned!
In my
earlier entry I discussed the issues with flow control. This time I want to
look at a module which actually looks at the data being passed and
converts all output to upper case letters. Basically, it does the same thing
as the tr a-z A-Z command, but using STREAMS. The work is done by the
upmod_upcase() function, which
examines all M_DATA blocks in the mblk chain and converts every character to upper
case:
#define islower(x) (((unsigned)(x) >= 'a') && ((unsigned)(x) <= 'z'))
#define isupper(x) (((unsigned)(x) >= 'A') && ((unsigned)(x) <= 'Z'))
#define toupper(x) (isupper(x) ? (x) : (unsigned)(x) - 'a' + 'A')
/*
* Convert all ASCII chars in data blocks to upper case
*/
static mblk_t *
upmod_upcase(mblk_t *passed_mp)
{
    mblk_t *mp = passed_mp;

    for (; mp != NULL; mp = mp->b_cont) {
        if ((DB_TYPE(mp) == M_DATA) && (MBLKL(mp) > 0)) {
            unsigned char *p;

            for (p = mp->b_rptr; p < mp->b_wptr; p++)
                if (islower(*p))
                    *p = toupper(*p);
        }
    }
    return (passed_mp);
}
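The same loop can be exercised outside the kernel with a mock message block. This is a sketch for experimenting only: `struct fake_mblk` and `fake_upcase` are hypothetical stand-ins, with an `is_data` flag replacing the `DB_TYPE(mp) == M_DATA` check, and none of it is part of the STREAMS API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for the kernel's mblk_t. */
struct fake_mblk {
    struct fake_mblk *b_cont; /* next block of the same message */
    unsigned char *b_rptr;    /* first unread byte */
    unsigned char *b_wptr;    /* one past the last written byte */
    int is_data;              /* replaces DB_TYPE(mp) == M_DATA */
};

/*
 * Walk the b_cont chain and upper-case ASCII letters in data blocks.
 * Only bytes in [b_rptr, b_wptr) are touched; non-data blocks are skipped.
 */
static struct fake_mblk *
fake_upcase(struct fake_mblk *passed_mp)
{
    for (struct fake_mblk *mp = passed_mp; mp != NULL; mp = mp->b_cont) {
        assert(mp->b_rptr <= mp->b_wptr);
        if (!mp->is_data)
            continue;
        for (unsigned char *p = mp->b_rptr; p < mp->b_wptr; p++) {
            if (*p >= 'a' && *p <= 'z')
                *p = (unsigned char)(*p - 'a' + 'A');
        }
    }
    return (passed_mp);
}
```

Chaining a data block, a non-data block and another data block shows the two properties that matter: bytes before the read pointer stay untouched, and blocks that are not data pass through unchanged.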
The DB_TYPE(mp) macro simply returns the
mp->b_datap->db_type value and MBLKL(mp) is the
amount of data between the read and write pointers. These macros, together
with some other useful definitions, are defined in the
sys/strsun.h file:
#define DB_BASE(mp)     ((mp)->b_datap->db_base)
#define DB_LIM(mp)      ((mp)->b_datap->db_lim)
#define DB_REF(mp)      ((mp)->b_datap->db_ref)
#define DB_TYPE(mp)     ((mp)->b_datap->db_type)
#define MBLKL(mp)       ((mp)->b_wptr - (mp)->b_rptr)
#define MBLKSIZE(mp)    ((mp)->b_datap->db_lim - (mp)->b_datap->db_base)
#define MBLKHEAD(mp)    ((mp)->b_rptr - (mp)->b_datap->db_base)
#define MBLKTAIL(mp)    ((mp)->b_datap->db_lim - (mp)->b_wptr)
#define MBLKIN(mp, off, len) (((off) <= MBLKL(mp)) && \
    (((mp)->b_rptr + (off) + (len)) <= (mp)->b_wptr))
Now we can modify the read-side put procedure to call
upmod_upcase for each mblock seen on input.
static void
upmodrput(queue_t *q, mblk_t *mp)
{
    upmodput(q, upmod_upcase(mp));
}
Here is the full example:
/*
 * This example demonstrates a minimum STREAMS module that honors flow control.
 * It converts all data bytes on the read side to the upper case.
 */

/*
 * Required include files.
 */
#include
#include
#include
#include
#include
#include
#include

/*
 * Function prototypes.
 */
static int upmodopen(queue_t *, dev_t *, int, int, cred_t *);
static int upmodclose(queue_t *);
static void upmodput(queue_t *, mblk_t *);
static void upmodrput(queue_t *, mblk_t *);
static void upmodsrv(queue_t *);
static mblk_t *upmod_upcase(mblk_t *mp);

/*
 * Module linkage data
 */
static struct module_info upmod_minfo = {
    2,          /* mi_idnum */
    "upmod",    /* mi_idname */
    0,          /* mi_minpsz */
    INFPSZ,     /* mi_maxpsz */
    4096,       /* mi_hiwat */
    512         /* mi_lowat */
};

static struct qinit upmod_rinit = {
    (int (*)())upmodput,    /* qi_putp */
    (int (*)())upmodsrv,    /* qi_srvp */
    upmodopen,              /* qi_qopen */
    upmodclose,             /* qi_qclose */
    NULL,                   /* qi_qadmin */
    &upmod_minfo,           /* qi_minfo */
};

static struct qinit upmod_winit = {
    (int (*)())upmodrput,   /* qi_putp */
    (int (*)())upmodsrv,    /* qi_srvp */
    NULL,                   /* qi_qopen */
    NULL,                   /* qi_qclose */
    NULL,                   /* qi_qadmin */
    &upmod_minfo,           /* qi_minfo */
};

static struct streamtab upmod_info = {
    &upmod_rinit,           /* st_rdinit */
    &upmod_winit,           /* st_wrinit */
};

static struct fmodsw fsw = {
    "upmod",
    &upmod_info,
    D_MP | D_MTPERQ
};

/*
 * Module linkage information for the kernel.
 */
struct mod_ops mod_strmodops;

static struct modlstrmod modlstrmod = {
    &mod_strmodops, "Example up-through module 1.0", &fsw
};

static struct modlinkage modlinkage = {
    MODREV_1, (void *)&modlstrmod, NULL
};

/*
 * Standard module entry points.
 */
int
_init(void)
{
    return (mod_install(&modlinkage));
}

int
_fini(void)
{
    return (mod_remove(&modlinkage));
}

int
_info(struct modinfo *modinfop)
{
    return (mod_info(&modlinkage, modinfop));
}

/*
 * Actual module code.
 */

/*
 * STREAMS entry points.
 */

/* ARGSUSED */
static int
upmodopen(queue_t *rq, dev_t *dev, int oflag, int sflag, cred_t *crp)
{
    if (sflag != MODOPEN)
        return (EINVAL);

    /* Prevent duplicate opens */
    if (rq->q_ptr != NULL)
        return (0);

    rq->q_ptr = WR(rq)->q_ptr = (void *)1;
    qprocson(rq);
    /*
     * At this point module is linked in the STREAM and can send/receive
     * messages. Its put/service procedures may execute at any time.
     */
    return (0);
}

static int
upmodclose(queue_t *rq)
{
    qprocsoff(rq);
    rq->q_ptr = WR(rq)->q_ptr = NULL;
    /*
     * At this point module is disconnected from the STREAM and can
     * no longer receive messages. Its put or service procedures are not
     * running.
     */
    return (0);
}

/*
 * Support routines.
 */

/* Put procedure */
static void
upmodput(queue_t *q, mblk_t *mp)
{
    /*
     * If the message is a high-priority message or there is no flow control
     * and there are no messages in the queue already, pass it forward,
     * otherwise queue.
     */
    if (queclass(mp) == QPCTL ||
        ((q->q_first == NULL) && canputnext(q)))
        putnext(q, mp);
    else
        (void) putq(q, mp);
}

/*
 * Support routines.
 */
static void
upmodrput(queue_t *q, mblk_t *mp)
{
    upmodput(q, upmod_upcase(mp));
}

/* Read/write side service routine */
static void
upmodsrv(queue_t *q)
{
    mblk_t *mp;

    /*
     * Get messages from the service queue and pass them forward until flow
     * controlled.
     */
    while ((mp = getq(q)) != NULL) {
        if (canputnext(q)) {
            putnext(q, mp);
        } else {
            (void) putbq(q, mp);
            break;
        }
    }
}

#ifndef islower
#define islower(x) (((unsigned)(x) >= 'a') && ((unsigned)(x) <= 'z'))
#endif
#ifndef isupper
#define isupper(x) (((unsigned)(x) >= 'A') && ((unsigned)(x) <= 'Z'))
#endif
#ifndef toupper
#define toupper(x) (isupper(x) ? (x) : (unsigned)(x) - 'a' + 'A')
#endif

/*
 * Convert all ASCII chars in data blocks to upper case
 */
static mblk_t *
upmod_upcase(mblk_t *passed_mp)
{
    mblk_t *mp = passed_mp;

    for (; mp != NULL; mp = mp->b_cont) {
        if ((DB_TYPE(mp) == M_DATA) && (MBLKL(mp) > 0)) {
            uchar_t *p;

            for (p = mp->b_rptr; p < mp->b_wptr; p++)
                if (islower(*p))
                    *p = toupper(*p);
        }
    }
    return (passed_mp);
}
Let us save it in file upmod.c, and compile it:
$ /usr/sfw/bin/gcc -c -m64 -o upmod.o -I/usr/include \
-O -D_KERNEL -D_SYSCALL32 -D_SYSCALL32_IMPL upmod.c
$ ld -r -o upmod upmod.o
Now we can install it. For example, on a SPARC system:
$ su
# cp upmod /kernel/strmod/sparcv9
# exit
$ strchg -h upmod
$ UPTIME
4:53PM UP 35 DAY(S), 4:38, 2 USERS, LOAD AVERAGE: 0.01, 0.01, 0.00
USER TTY LOGIN@ IDLE JCPU PCPU WHAT
USER PTS/1 8AUG0510DAYS 8 BASH
USER PTS/2 4:11PM 12 1 W
$ STRCHG -P
$ w
4:55pm up 35 day(s), 4:39, 2 users, load average: 0.00, 0.01, 0.00
User tty login@ idle JCPU PCPU what
user pts/1 8Aug0510days 8 bash
user pts/2 4:11pm 12 1 w
The output reminds me of the old Russian-made ES 1045 mainframes (Soviet clones of the IBM 360/370 series) which could only output in all caps. Interestingly, in those days Russian-made computers usually used capital letters for English and lowercase letters for Russian. This was a precursor of the KOI8-R encoding.
I want to understand my system
About a year ago I joined the NUMA project, which was providing support for hierarchical locality groups (HLS). Solaris already understood two-level memory hierarchies, where a piece of memory can be local or remote to a certain CPU. The project elaborated on this concept and brought more fine-grained distance information. Now the scheduler and the VM subsystem could discover which memory or CPU is closest, which is a bit further away, which is further still, and so on. This is especially important for AMD Opteron machines with their HyperTransport architecture. For example, on a 4-way AMD system it may take up to two hops for a CPU to reach memory on another node.
All this notion of locality is expressed in an abstraction called a locality group, or lgroup for short. An lgroup is just a set of resources (memory and CPUs) that are not too far from each other. Lgroups form a hierarchy with leaves containing end-node resources and the root lgroup at the top containing all the system resources. Jonathan wrote a very good introduction explaining all the details and it has some very cool pictures, too. I strongly recommend reading it if you are interested in the way Solaris deals with NUMA challenges.
I came from different towns and villages of Solaris and didn't know anything about these locality groups. Some of you know the feeling when you join a completely new project and everyone speaks a foreign language. To get a handle on the terminology and the basic abstractions I started with improving the little MDB support available for displaying lgroups. It basically allows you to have a quick look at all lgroups, and the output looks like this (on a 4-CPU AMD box):
> ::lgrp
LGRPID             ADDR           PARENT PLATHAND #CPU CPUS
     0 fffffffffbc1ccc0                0  DEFAULT    0
     1 fffffffffbc0a2a0 fffffffffbc0a380        0    1 0
     2 fffffffffbc0a310 fffffffffbc0a3f0        1    1 1
     3 fffffffffbc0a380 fffffffffbc1ccc0     NULL    0
     4 fffffffffbc0a3f0 fffffffffbc1ccc0     NULL    0
     5 fffffffffbc0a460 fffffffffbc0a4d0        2    1 2
     6 fffffffffbc0a4d0 fffffffffbc1ccc0     NULL    0
     7 fffffffffbc0a540 fffffffffbc0a5b0        3    1 3
     8 fffffffffbc0a5b0 fffffffffbc1ccc0     NULL    0
You can get a bit more detail:
{
lgrp_id = 0
lgrp_latency = 0x86
lgrp_plathand = 0xbabecafe
lgrp_parent = 0
lgrp_reserved1 = 0
lgrp_childcnt = 0x4
lgrp_children = 0x158
lgrp_leaves = 0xa6
lgrp_set = [ 0xa6, 0x6 ]
lgrp_mnodes = 0x3
lgrp_nmnodes = 0x2
lgrp_reserved2 = 0
lgrp_cpu = 0
lgrp_cpucnt = 0
lgrp_chipcnt = 0
lgrp_chips = 0
lgrp_kstat = 0
}
{
lgrp_id = 0x1
lgrp_latency = 0x42
lgrp_plathand = 0
lgrp_parent = lgrp_space+0xe0
lgrp_reserved1 = 0
lgrp_childcnt = 0
lgrp_children = 0
lgrp_leaves = 0x2
lgrp_set = [ 0x2, 0x2 ]
lgrp_mnodes = 0x1
lgrp_nmnodes = 0x1
lgrp_reserved2 = 0
lgrp_cpu = cpus
lgrp_cpucnt = 0x1
lgrp_chipcnt = 0x1
lgrp_chips = cpu0_chip
lgrp_kstat = 0xffffffff839d35e0
}
...
You get the idea. It is quite useful for engineers who have to debug this stuff, but a bit obscure for most other people. So I was wondering how I could see these lgroups in a simple manner without resorting to the kernel debugger. Luckily Jonathan created a very useful library which provides all the information needed at the user level. The only thing missing was an actual application displaying this information in human-readable form. I had enough motivation to write such a program, but I really wanted a simple tool to play with since I didn't know what it was going to display, and Perl was a good tool for playing with prototypes. The only missing part was the glue code needed to access the C library from a Perl application. This was a perfect chance to dive into the underworld of XS and XSUB, and I described my experience before.
It turned out that the glue code in this case was relatively easy, once I figured out how to get around some of the h2xs roadblocks. After that the Perl part was really easy - the first prototype was about a page of Perl code. For example, the following piece of code produced the full list of lgroups in the system, or any subtree:
sub lgrp_lgrps($;$)
{
    my $cookie = shift;
    my $root = shift;
    $root = lgrp_root($cookie) unless defined $root;
    return unless defined $root;
    my @children = lgrp_children($cookie, $root);
    my @result;

    # Concatenate root with subtrees for every children.
    # Every subtree is obtained by calling lgrp_lgrps recursively with each of
    # the children as the argument.
    @result = @children ?
        ($root, map {lgrp_lgrps($cookie, $_)} @children) :
        ($root);
    return (wantarray ? @result : scalar @result);
}
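The same pre-order walk is easy to express in C over a static table. This is a sketch for illustration only: the `lgrp_node` structure, the `tree` table (filled in by hand from the parent pointers in the ::lgrp output above) and the `lgrp_subtree` function are hypothetical, and a real program would obtain children from the lgroup library rather than a hard-coded array.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 4
#define NLGRPS       9

/* Hypothetical child table for the 4-CPU AMD box shown earlier. */
struct lgrp_node {
    int nchildren;
    int children[MAX_CHILDREN];
};

static const struct lgrp_node tree[NLGRPS] = {
    [0] = { 4, { 3, 4, 6, 8 } },
    [3] = { 1, { 1 } },
    [4] = { 1, { 2 } },
    [6] = { 1, { 5 } },
    [8] = { 1, { 7 } },
    /* lgroups 1, 2, 5 and 7 are leaves: zero-initialized, no children */
};

/*
 * Write root followed by all of its descendants to out[] in pre-order,
 * mirroring the recursion in the Perl lgrp_lgrps(). Returns the number
 * of IDs written and never writes more than cap entries.
 */
static size_t
lgrp_subtree(const struct lgrp_node *t, int root, int *out, size_t cap)
{
    size_t n = 0;

    assert(out != NULL);
    if (cap == 0)
        return (0);
    out[n++] = root;
    for (int i = 0; i < t[root].nchildren; i++)
        n += lgrp_subtree(t, t[root].children[i], out + n, cap - n);
    return (n);
}
```

Starting from lgroup 0 this yields 0 3 1 4 2 6 5 8 7, and starting from lgroup 6 it yields 6 5, which is the "any subtree" case handled by the optional second argument in the Perl version.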
Once I had the glue code and the initial prototype it was relatively easy
to write a small program
that displayed the system lgroup hierarchy in a nice form. Here is a
different look at the same hierarchy as you saw above:
$ lgrpinfo -Ta
0
|-- 5
|   CPUs: 0-2
|   Lgroup resources: 1 2 3 (CPU); 1 2 3 (memory)
|   `-- 1
|       CPU: 0
|       Memory: installed 2048 Mb, allocated 182 Mb, free 1866 Mb
|-- 6
|   CPUs: 0 1 3
|   Lgroup resources: 1 2 4 (CPU); 1 2 4 (memory)
|   `-- 2
|       CPU: 1
|       Memory: installed 1599 Mb, allocated 25 Mb, free 1574 Mb
|-- 7
|   CPUs: 0 2 3
|   Lgroup resources: 1 3 4 (CPU); 1 3 4 (memory)
|   `-- 3
|       CPU: 2
|       Memory: installed 2048 Mb, allocated 131 Mb, free 1917 Mb
`-- 8
    CPUs: 1-3
    Lgroup resources: 2 3 4 (CPU); 2 3 4 (memory)
    `-- 4
        CPU: 3
        Memory: installed 2048 Mb, allocated 26 Mb, free 2022 Mb
It turned out that the Perl glue code had another application - it could be used to write useful tests in Perl for the HLS implementation. Here is a simple example:
######################################################################
# Each lgrp other than root should have a single parent and
# root should have no parents.
$fail = 0;
foreach my $l (lgrp_lgrps($c)) {
    next if $l == $root;
    my (@parents) = $c->parents($l) or
        diag("lgrp_parents: $!");
    my $nparents = @parents;
    $fail++ unless $nparents == 1;
}
is($fail, 0, 'All non-root lgrps should have single parent');
@parents = $c->parents($root);
ok(!@parents, 'root should have no parents');
Once the tool was ready, Steve Lau wrote a set of tests for it (also in Perl).
And now you can play with it as well - the Perl module and the resulting lgrpinfo command are now available on the OpenSolaris web site. They are also available through the CPAN network.