Tuesday August 23, 2005

NUMA Observability Tools

Observing and improving memory placement

Abstract

Here I discuss a new set of tools that give developers and users insight into the NUMA characteristics of their systems and applications, along with explicit control knobs to affect the NUMA-related properties of their applications.

Introduction and terminology

While software developers are busy adding more and more "features", hardware designers have been very successful in sustaining Moore's Law and offsetting the impact of these features. Two orthogonal ways to improve system performance are

  1. Increasing processor speed (improving the productivity)
  2. Increasing the number of CPUs (increasing the amount of resources dedicated to a task).

While both approaches are quite successful, increasing the number of CPUs has proved to be the only way to dramatically improve overall system performance.

Combining many individual CPUs together to form a computer system requires special hardware to coordinate these CPUs and other system components (memory, caches, I/O, etc). The coordination with the memory subsystem is especially important because

As a result, without careful design the fastest CPU in the world may spend its days waiting for data from memory. The problem is usually solved by introducing various levels of hardware caches which keep popular memory areas closer to the data-hungry CPU circuits.

Another problem in multiprocessor systems is sharing the memory bus that connects memory banks to the CPUs. Imagine a wide highway connecting several major cities. It may be a breeze to go from one city to another at off-peak times, but everyone knows what happens at rush hour when everyone hurries to their destination: the fast highway becomes a slow-crawling mess. The same thing may happen in a traditional Symmetric Multiprocessing (SMP) system with many CPUs sharing the memory bus. The usual way to solve this problem is to partition the system into smaller nodes, where only the inter-node traffic requires access to the shared interconnect while data access within a node uses a private bus. This architecture is very similar to a big network where hubs provide fast access within a LAN and routers connect LANs together, while keeping all local traffic within the LAN. The major property of such partitioned systems is that accessing data within the node is noticeably faster than accessing data outside the node. Such partitioned systems got the name NUMA, which stands for Non-Uniform Memory Access.

Remember that CPUs try to get most of their data from caches rather than directly from memory, so special hardware components are added to ensure that two CPUs on different nodes see the same piece of data even if the data is in the L1 caches of these CPUs. In technical terms this hardware provides cache coherence. Most NUMA systems include such cache-coherence protocols, so they are sometimes called ccNUMA systems.

Until recently NUMA systems were primarily designed by big computer vendors developing really powerful computers. The list includes, but is not limited to

But recently AMD introduced the NUMA Opteron and Athlon64 platform, which brings NUMA technology right to my desktop thanks to its HyperTransport design. As a result, NUMA awareness became a mainstream concern of OS design.

As you can see from the description above, data movement within a node is much more efficient than inter-node data movement (that's why the whole thing is called Non-Uniform). Since applications usually don't care about low-level hardware architectures, it is up to the OS to make sure that application data stays close enough to its CPU. NUMA-aware OSes try hard to keep application data close to the CPUs that process this data.

The terminology in this area is not very well established and while some implementations use the term node, Solaris uses the concept of locality group or "lgroup" to refer to the collection of nearby resources. Locality groups form a hierarchy with each level representing the degree of "closeness" of resources. I recommend that you take your time to read a nice overview of NUMA implementation on Solaris which contains many explanations of Solaris-specific implementation details.

While the OS usually does a good job of keeping data and CPUs close to each other, sometimes its good intentions go wrong and the result is quite the opposite: almost all data accesses for some application become remote, severely affecting its performance. This is often the case in scientific software, where one thread allocates and prepares a large chunk of memory and then other threads use it extensively. The OS doesn't know that the thread that allocated the memory is not the one that is going to use it, so it often keeps the memory close to the allocating thread instead of the threads that actually use it. Ideally the OS would detect such an imbalance and automatically rearrange things, but in practice this problem is very hard. In such cases application developers or users can help the OS correct the situation by providing additional hints. To do this effectively, users need to understand the NUMA properties of their systems and get some insight into how the OS placed threads and their memory across CPU and memory boards. To correct things, users also need tools to modify such placement. Software developers can use the Locality Group API.
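
To make the "one thread allocates, other threads use" problem concrete, here is a toy model in Python. It assumes a simple first-touch rule (a page is placed in the home lgroup of the thread that first touches it), which is a simplification for illustration and not a description of the real Solaris allocator. The function and variable names are made up for this sketch.

```python
def remote_fraction(page_home, accesses):
    """Fraction of accesses that hit a page placed in another lgroup.

    page_home: dict mapping page -> lgroup where the page was placed
    accesses:  list of (home lgroup of the accessing thread, page) pairs
    """
    if not accesses:
        return 0.0
    remote = sum(1 for lgrp, page in accesses if page_home[page] != lgrp)
    return remote / len(accesses)


# Hypothetical setup: a thread homed in lgroup 1 initializes 8 pages, so
# under the assumed first-touch rule all of them are placed in lgroup 1.
pages = range(8)
first_touch = {page: 1 for page in pages}

# Worker threads homed in lgroups 1, 2 and 3 then each read every page once.
accesses = [(lgrp, page) for lgrp in (1, 2, 3) for page in pages]
print(remote_fraction(first_touch, accesses))      # 16 of 24 accesses are remote

# If each page were placed in the lgroup of the only worker that uses it,
# no access would be remote.
partitioned = {page: 1 + page % 3 for page in pages}
own_accesses = [(partitioned[page], page) for page in pages]
print(remote_fraction(partitioned, own_accesses))
```

The gap between the two numbers is the kind of imbalance that the placement hints discussed in this article are meant to correct.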

The rest of this discussion will focus on such tools for the Solaris operating environment. I assume that you checked the NUMA Observability tools page on the OpenSolaris site and downloaded the tools. We will discuss how to use lgrpinfo, plgrp, pmap and pmadvise.

System information

Suppose that you are not happy with the performance of your favorite application on Solaris and want to improve it by playing with the memory placement optimization knobs. The first thing you would want to know is whether this discussion is relevant for you at all, i.e. whether your system has NUMA properties. Running lgrpinfo without any arguments answers this question:


str2: $ lgrpinfo 
lgroup 0 (root):
        children: none
        CPUs: 0 2
        Memory: installed 512 Mb,  allocated 275 Mb,  free 237 Mb
        Lgroup resources:  0 (CPU); 0 (memory)

This machine has only one (root) lgroup and is a UMA system.

Not very interesting for our discussion. This is a 2-CPU Ultra 60 system. Let's take a look at another system:


sark: $ lgrpinfo
lgroup 0 (root):
        children: 1 2 3
        CPUs: 0-3 8-11 16-19
        Memory: installed 24576 Mb,  allocated 4608 Mb,  free 19968 Mb
lgroup 1 (leaf):
        children: none, parent: 0
        CPUs: 0-3
        Memory: installed 8192 Mb,  allocated 4247 Mb,  free 3945 Mb
lgroup 2 (leaf):
        children: none, parent: 0
        CPUs: 8-11
        Memory: installed 8192 Mb,  allocated 194 Mb,  free 7998 Mb
lgroup 3 (leaf):
        children: none, parent: 0
        CPUs: 16-19
        Memory: installed 8192 Mb,  allocated 167 Mb,  free 8025 Mb

This is a NUMA system with three nodes, each having 8 Gb of memory and 4 CPUs. Now let us take a look at a 4-CPU AMD Opteron box:


gears: $ lgrpinfo
lgroup 0 (root):
        children: 3 4 6 8
        CPUs: 0-3
        Memory: installed 3647 Mb,  allocated 394 Mb,  free 3253 Mb
        Lgroup resources:  1 2 5 7 (CPU); 1 2 (memory)
lgroup 1 (leaf):
        children: none, parent: 3
        CPU: 0
        Memory: installed 2048 Mb,  allocated 289 Mb,  free 1759 Mb
        Lgroup resources:  1 (CPU); 1 (memory)
lgroup 2 (leaf):
        children: none, parent: 4
        CPU: 1
        Memory: installed 1599 Mb,  allocated 104 Mb,  free 1495 Mb
        Lgroup resources:  2 (CPU); 2 (memory)
lgroup 3 (intermediate):
        children: 1, parent: 0
        CPUs: 0-2
        Memory: installed 3647 Mb,  allocated 394 Mb,  free 3253 Mb
        Lgroup resources:  1 2 5 (CPU); 1 2 (memory)
lgroup 4 (intermediate):
        children: 2, parent: 0
        CPUs: 0 1 3
        Memory: installed 3647 Mb,  allocated 394 Mb,  free 3253 Mb
        Lgroup resources:  1 2 7 (CPU); 1 2 (memory)
lgroup 5 (leaf):
        children: none, parent: 6
        CPU: 2
        Lgroup resources:  5 (CPU);
lgroup 6 (intermediate):
        children: 5, parent: 0
        CPUs: 0 2 3
        Memory: installed 2048 Mb,  allocated 289 Mb,  free 1759 Mb
        Lgroup resources:  1 5 7 (CPU); 1 (memory)
lgroup 7 (leaf):
        children: none, parent: 8
        CPU: 3
        Lgroup resources:  7 (CPU);
lgroup 8 (intermediate):
        children: 7, parent: 0
        CPUs: 1-3
        Memory: installed 1599 Mb,  allocated 104 Mb,  free 1495 Mb
        Lgroup resources:  2 5 7 (CPU); 2 (memory)

We can see that it has 4 leaf nodes, 4 intermediate nodes and a root node. Pretty much what we would expect from a 4-way AMD Opteron. With a few options we can add some eye candy to the output:


gears: $ lgrpinfo -Ta
0
|-- 3
|   CPUs: 0-2
|   Lgroup resources:  1 2 5 (CPU); 1 2 (memory)
|   `-- 1
|       CPU: 0
|       Memory: installed 2048 Mb,  allocated 291 Mb,  free 1757 Mb
|-- 4
|   CPUs: 0 1 3
|   Lgroup resources:  1 2 7 (CPU); 1 2 (memory)
|   `-- 2
|       CPU: 1
|       Memory: installed 1599 Mb,  allocated 103 Mb,  free 1496 Mb
|-- 6
|   CPUs: 0 2 3
|   Lgroup resources:  1 5 7 (CPU); 1 (memory)
|   `-- 5
|       CPU: 2
`-- 8
    CPUs: 1-3
    Lgroup resources:  2 5 7 (CPU); 2 (memory)
    `-- 7
        CPU: 3

This output shows the lgroup hierarchy in a more obvious way. We can immediately see that this system is rather interesting: CPU 0 has 2 Gb of local memory, CPU 1 has 1.5 Gb of local memory and CPUs 2 and 3 have no local memory at all. Suppose that an application is running on CPU 2 and needs a page of memory. Since there is no local memory available, the system will walk up the hierarchy and will try to allocate memory from lgroup 6. Here is the description of lgroup 6:


|-- 6
|   CPUs: 0 2 3
|   Lgroup resources:  1 5 7 (CPU); 1 (memory)
|   `-- 5
|       CPU: 2

We can see that it will try to get memory from lgroup 1, which corresponds to the node containing CPU 0. Similarly, an application homed to CPU 3 will usually get its memory from lgroup 2, which corresponds to the node with CPU 1. This all makes perfect sense if you look at the AMD Opteron topology:

    2----3
    |    |
    |    |
    0----1

This picture shows that CPU 2 is closest to CPUs 0 and 3 and CPU 3 is closest to 2 and 1.
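
The same picture can be written down as a list of links, and the nearest neighbours of each CPU read off from it. This is only a small Python sketch of the diagram above; the `LINKS` list and the `neighbors` helper are names invented for the illustration.

```python
# Links between the four nodes, exactly as drawn in the picture:
# 2----3 on top, 0----1 below, with vertical links 2-0 and 3-1.
LINKS = [(0, 1), (0, 2), (1, 3), (2, 3)]


def neighbors(cpu):
    """Return the CPUs that are one link away from `cpu`."""
    near = set()
    for a, b in LINKS:
        if a == cpu:
            near.add(b)
        elif b == cpu:
            near.add(a)
    return sorted(near)


for cpu in range(4):
    print(cpu, neighbors(cpu))
```

Each CPU together with its two neighbours gives the CPU list of the corresponding intermediate lgroup in the lgrpinfo output (for example CPUs 0, 2 and 3 for lgroup 6, the parent of the leaf holding CPU 2).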

These topology considerations mean that if a thread is running homed to CPU 0 and the load on this CPU becomes too high, the OS will try to migrate the thread to the closest CPU - in this case either 2 or 1. This way the OS tries to keep migrated threads still close to their home memory. We can see this from the lgrpinfo output above:


|-- 3
|   CPUs: 0-2
|   Lgroup resources:  1 2 5 (CPU); 1 2 (memory)
|   `-- 1
|       CPU: 0
|       Memory: installed 2048 Mb,  allocated 291 Mb,  free 1757 Mb

CPU 0 belongs to lgroup 1, which has lgroup 3 as its parent. Lgroup 3 has lgroups 1, 2 and 5 as its CPU resources: lgroup 1 contains CPU 0 itself, lgroup 2 contains CPU 1 and lgroup 5 contains CPU 2.

Armed with this information we can predict that it is best to run a memory-intensive application on CPU 0 or CPU 1 (the latter is good enough if the working set fits in the 1.5 Gb available on CPU 1). A CPU-intensive application without much memory needs may do well on CPUs 2 and 3.
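
The "walk up the hierarchy until you find memory" rule described earlier for CPUs 2 and 3 can be sketched in a few lines of Python. The two tables are copied by hand from the lgrpinfo output for this Opteron box. The walk itself is a simplified model of the behaviour described in the text, not the kernel's actual allocation code, and `memory_sources` is a made-up helper name.

```python
# Parent of each lgroup, from the "parent:" fields of the lgrpinfo output.
PARENT = {0: None, 1: 3, 2: 4, 3: 0, 4: 0, 5: 6, 6: 0, 7: 8, 8: 0}

# Memory resources of each lgroup, from the "Lgroup resources" lines.
MEMORY = {0: [1, 2], 1: [1], 2: [2], 3: [1, 2], 4: [1, 2],
          5: [], 6: [1], 7: [], 8: [2]}


def memory_sources(home):
    """Walk up from `home` to the first lgroup that has memory resources.

    Returns (lgroup where memory was found, list of leaf lgroups supplying it).
    """
    lgrp = home
    while lgrp is not None:
        if MEMORY[lgrp]:
            return lgrp, MEMORY[lgrp]
        lgrp = PARENT[lgrp]
    return None, []


for home in (1, 2, 5, 7):
    print(home, memory_sources(home))
```

For the two memory-less leaves this reproduces the reasoning above: a thread homed in lgroup 5 (CPU 2) is served from lgroup 1 via lgroup 6, and a thread homed in lgroup 7 (CPU 3) is served from lgroup 2 via lgroup 8.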

The lgrpinfo command has a useful -l option which shows the latency for memory access between different lgroups. Here is an example from the same 4-CPU AMD Opteron machine:


gears: $ lgrpinfo -l

Lgroup latencies:
      0      1      2      3      4      5      6      7      8
0     135    135    135    135    135    -      135    -      135
1     99     66     99     99     99     -      66     -      99
2     99     99     66     99     99     -      99     -      66
3     135    99     135    99     135    -      99     -      135
4     135    135    99     135    99     -      135    -      99
5     135    99     135    135    135    -      99     -      135
6     135    135    135    135    135    -      99     -      135
7     135    135    99     135    135    -      135    -      99
8     135    135    135    135    135    -      135    -      99

We can see that remote access takes about 1.5 times as long as local access (99 versus 66 for the leaf lgroups that have memory). The absolute values in the table are not very interesting, but the relative values provide a good approximation of the access times. This data is collected by the system during boot.
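
The ratio can be checked directly from the table. The following Python sketch parses the text of the latency table shown above (pasted in as a string) and compares a local entry with a remote one. The parsing assumes exactly the layout printed here, and `parse_latencies` is a name made up for the example.

```python
TABLE = """\
      0      1      2      3      4      5      6      7      8
0     135    135    135    135    135    -      135    -      135
1     99     66     99     99     99     -      66     -      99
2     99     99     66     99     99     -      99     -      66
3     135    99     135    99     135    -      99     -      135
4     135    135    99     135    99     -      135    -      99
5     135    99     135    135    135    -      99     -      135
6     135    135    135    135    135    -      99     -      135
7     135    135    99     135    135    -      135    -      99
8     135    135    135    135    135    -      135    -      99
"""


def parse_latencies(text):
    """Return {(row lgroup, column lgroup): latency}, skipping '-' entries."""
    lines = text.strip().splitlines()
    cols = [int(c) for c in lines[0].split()]
    table = {}
    for line in lines[1:]:
        fields = line.split()
        row = int(fields[0])
        for col, value in zip(cols, fields[1:]):
            if value != "-":
                table[(row, col)] = int(value)
    return table


lat = parse_latencies(TABLE)
local = lat[(1, 1)]     # lgroup 1 against itself
remote = lat[(1, 2)]    # lgroup 1 against lgroup 2
print(local, remote, remote / local)
```

Columns 5 and 7 are skipped entirely because those lgroups have no memory of their own.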

Home lgroups

When a new thread is created, the system examines load averages on each CPU and assigns the thread a home lgroup (which is almost always a leaf lgroup). The scheduler tries to run the thread in its home lgroup or as close to it as possible. A thread will migrate from its home lgroup if there is too much imbalance between CPU loads. Note that a thread that is bound to a specific CPU will always run on this CPU. A thread bound to a processor set will run in its home lgroup or on the closest CPU within the processor set. You can determine the home lgroup of a process or a thread by calling the lgrp_home(3LGRP) function within an application, or you can determine the home lgroup of every thread in a process using the plgrp utility. Here, for example, we figure out that sendmail(1M) runs in lgroup 7:


# plgrp -G `pgrep sendmail`
7
# lgrpinfo 7
lgroup 7 (leaf):
        children: none, parent: 8
        CPU: 3
        Lgroup resources:  7 (CPU);

We can see that sendmail is running on CPU 3 which doesn't have any attached memory. If we are concerned about its performance, we can manually move it close to its memory (remember from our previous discussion that the memory is allocated from lgroup 2):


# plgrp -S 2 `pgrep sendmail`
# plgrp -G `pgrep sendmail`
2

What if a process has several threads? For example, for automountd(1M) we get:


# plgrp -G `pgrep automountd`
7
2
5

We can get information about specific threads:


# plgrp -vG `pgrep automountd`
100473/1: 7
100473/2: 2
100473/4899: 5

Here 100473 is the pid of the process and the numbers 1, 2 and 4899 are thread IDs. They are useful because we can move each thread individually to a new home:


# plgrp -S 1 100473/1
# plgrp -vG `pgrep automountd`
100473/1: 1
100473/2: 2

Or move all of them at once:


# plgrp -S 1 `pgrep automountd`
# plgrp -vG `pgrep automountd`
100473/1: 1
100473/2: 1
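
When a process has many threads, the verbose plgrp output shown earlier in this section is easier to digest once it is grouped by lgroup. Below is a small Python sketch that parses lines in the `pid/thread: lgroup` form seen above. It assumes the output contains only such lines, and `parse_plgrp` is a hypothetical helper, not part of the toolkit.

```python
def parse_plgrp(text):
    """Parse lines such as '100473/4899: 5' into {(pid, thread id): lgroup}."""
    homes = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        ident, lgrp = line.split(":")
        pid, tid = ident.split("/")
        homes[(int(pid), int(tid))] = int(lgrp)
    return homes


# Sample copied from the "plgrp -vG" output for automountd above.
SAMPLE = """\
100473/1: 7
100473/2: 2
100473/4899: 5
"""

homes = parse_plgrp(SAMPLE)

# Group thread IDs by their home lgroup.
by_lgroup = {}
for (pid, tid), lgrp in homes.items():
    by_lgroup.setdefault(lgrp, []).append(tid)

print(by_lgroup)
```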

Memory Placement

When we move a process or a thread to a new home, all its belongings (memory) stay at the old home. Only new memory allocations use memory from the new home. The first question we may ask is what the current memory allocation looks like. Grabbing the modified version of the pmap(1) command from the toolkit, we can easily determine where each memory segment is allocated using the new -L option:


$ plgrp -G `pgrep emacs`
2
$ pmap -L `pgrep emacs`
07FEB000     372K rwx--   2   [ stack ]
08050000      16K r-x--   2 /opt/csw/bin/emacs-21.3
08054000      20K r-x--   1 /opt/csw/bin/emacs-21.3
0838E000    1652K rwx--   2 /opt/csw/bin/emacs-21.3
0852B000    2900K rwx--   2   [ heap ]
08800000    3124K rwx--   2   [ heap ]
BF960000       4K rwx--   2   [ anon ]
BF970000      24K r-x--   1 /lib/libm.so.2
BF976000       4K r-x--   2 /lib/libm.so.2
...
BF9DE000       4K rwxs-   2   [ anon ]
BF9F0000      12K rwx--   2   [ anon ]
BF9F4000       8K rwx--   2   [ anon ]
BFA00000       4K rwx--   2   [ anon ]
BFA10000       4K rwx--   2   [ anon ]
BFA20000     128K r-x--   1 /lib/libc.so.1
BFA42000      28K r-x--   1 /lib/libc.so.1
BFA49000      16K r-x--   2 /lib/libc.so.1
...

We can see that most of the text segment and all of the heap are allocated from the home lgroup. The segments for the library code (libm and libc) are spread around: the library code is shared and its placement depends on which process allocated it first.
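
For a large process it helps to total the mapping sizes per lgroup instead of reading the listing line by line. Here is a minimal Python sketch that does this for text in the `pmap -L` layout shown above (address, size in K, permissions, lgroup, mapping name). The column layout is assumed from this one example, and `kb_per_lgroup` is a made-up helper name.

```python
from collections import defaultdict


def kb_per_lgroup(text):
    """Sum mapping sizes (in Kb) per lgroup for pmap -L style lines."""
    totals = defaultdict(int)
    for line in text.splitlines():
        fields = line.split(None, 4)
        # Skip "..." and anything else that is not a mapping line.
        if (len(fields) < 5 or not fields[1].endswith("K")
                or not fields[3].isdigit()):
            continue
        totals[int(fields[3])] += int(fields[1][:-1])
    return dict(totals)


# A few lines copied from the Emacs listing above.
SAMPLE = """\
07FEB000     372K rwx--   2   [ stack ]
08050000      16K r-x--   2 /opt/csw/bin/emacs-21.3
08054000      20K r-x--   1 /opt/csw/bin/emacs-21.3
0838E000    1652K rwx--   2 /opt/csw/bin/emacs-21.3
0852B000    2900K rwx--   2   [ heap ]
08800000    3124K rwx--   2   [ heap ]
...
BFA20000     128K r-x--   1 /lib/libc.so.1
"""

print(kb_per_lgroup(SAMPLE))
```

On this sample almost everything sits in lgroup 2, the home lgroup of the process, with only a small amount of shared text in lgroup 1.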

How can we control the memory placement? Application writers can use the madvise(3C) library call to control it from the application itself. It is especially useful for applications which allocate and initialize memory in one thread and use it in another. The user can also use the madv.so.1 library to apply memory advice before starting the application. For an already running application, a developer can use the pmadvise(1) tool from the toolkit. The following example shows how we can move Emacs to a new home and advise it to take along the memory it uses:


$ plgrp -S 1 `pgrep emacs`
$ pmadvise -o heap=access_lwp,stack=access_lwp `pgrep emacs`
... Play with Emacs a bit ...
$ pmap -L `pgrep emacs` | egrep '(heap|stack)'
08017000     188K rwx--   1   [ stack ]
08046000       8K rwx--   2   [ stack ]
0852B000      32K rwx--   1   [ heap ]
08534000      76K rwx--   1   [ heap ]
08549000       4K rwx--   1   [ heap ]
0854B000      20K rwx--   1   [ heap ]
08551000      16K rwx--   1   [ heap ]
...

Almost all the stack and all the heap migrated to a new home! We successfully completed the move with just a few commands!

Where is DTrace?

Of course, we can't talk about Solaris without mentioning the almighty DTrace. We have all the tools to control long-running applications, but how can we enforce the policy automatically whenever an application starts? DTrace can solve this problem, of course! Let us create the following D program and call it papply.d:

#!/usr/sbin/dtrace -qws
/* We use -w to allow destructive actions */

/*
 * When the process reaches main(), stop it, re-home and restart again.
 */
pid$target::main:entry
{
	stop();
	system("plgrp -S 1 %d", $target);
	system("pmadvise -o heap=access_lwp,stack=access_lwp %d", $target);
	system("prun %d", $target);
}

Now we can do things like


$ papply.d -c emacs

Unfortunately, this doesn't work since plgrp can't deal with a process which is already traced, although pmap and pmadvise can if the -F option is supplied. The next version of plgrp will include a -F option which will fix the problem. Stay tuned!



( Aug 23 2005, 11:57:42 PM PDT )
