Following NUMA.2 and NUMA.1, here's one about how OpenSolaris represents NUMA architectures.

Topology Representation

Quick recap.

A NUMA machine is composed of nodes, each containing some kind of hardware resource: processors, memory, devices. These nodes are connected by an interconnect that lets each node access every other node's memory transparently, forming a single shared memory space. Every node can access the entire memory space, but remote accesses have to go through the interconnect, so accessing local memory is faster than accessing remote memory.

The OS needs to be aware of this situation and know exactly where, in physical memory, one node ends and the next begins. OpenSolaris uses a kernel abstraction called locality groups, or simply lgroups, to represent sets of resources within some distance of each other. Lgroups are created during system boot and form a latency topology used by the scheduler/dispatcher and the VM subsystems to allocate resources properly. This topology is hierarchical, with the lowest latencies at the leaves and the highest access latencies at the root. Let's look at an example:

Figures: (a) 4-node machine (ring); (b) leaf lgroups

The example above shows that a four-node machine with a ring topology will have four lgroups, one for each node. These are leaf lgroups: they sit at the lowest level of the hierarchy because they represent only local accesses.

Remote access latency increases with how far you're going. Reaching a node one hop away costs one thing; reaching the furthest node costs another. The system creates intermediate lgroups to represent, well, intermediate distances (in this case, one hop away). It also creates a root lgroup, which contains all the resources in the system and represents the highest level of latency.
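To put numbers on "how far you're going", here is a minimal sketch that counts hops on a ring and groups node pairs by distance. The function names and the choice to number nodes 1 to 4 are my own assumptions for illustration; nothing here is OpenSolaris code.

```python
def ring_hops(a, b, n):
    """Hops between nodes a and b on an n-node ring, going the shorter way."""
    d = abs(a - b) % n
    return min(d, n - d)


def latency_levels(n):
    """Group every unordered pair of nodes (numbered 1..n) by hop count."""
    levels = {}
    for a in range(1, n + 1):
        for b in range(a, n + 1):
            levels.setdefault(ring_hops(a, b, n), []).append((a, b))
    return levels


levels = latency_levels(4)
for hops in sorted(levels):
    # 0 hops = local access, 1 hop = neighbor, 2 hops = furthest node
    print(hops, levels[hops])
```

On a four-node ring this gives three distinct distances (0, 1 and 2 hops), which is why three levels of lgroups are enough to describe it.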

Figures: (c) intermediate lgroup around node 0; (d) root lgroup

Figure (c) shows the intermediate lgroup 5 formed around node 1; it contains lgroups 1, 2 and 4.
This might seem a bit confusing without looking at the whole topology, and how the system uses it.

Figure: (e) lgroup topology

This topology is hierarchical, as mentioned earlier. We have the lowest latencies at the bottom, and the highest at the top. The scheduler/dispatcher and the VM subsystems consult this topology to move threads around and to allocate memory, respectively.
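As a rough model of that hierarchy, the sketch below builds the lgroups for a four-node ring as plain sets of leaf IDs. The numbering is an assumption on my part: leaves are 1 to 4, the one-hop lgroups are 5 to 8 (so lgroup 5 covers leaves 1, 2 and 4, matching the example), and the root gets ID 0. The kernel's real data structures look nothing like this; it is only meant to show who contains whom.

```python
def build_ring_lgroups(n=4):
    """Return {lgroup_id: set of leaf ids it covers} for an n-node ring.

    IDs are illustrative: 0 = root, 1..n = leaves, n+1..2n = one-hop lgroups.
    """
    leaves = list(range(1, n + 1))
    lgroups = {0: set(leaves)}              # root: every resource in the system
    for leaf in leaves:
        lgroups[leaf] = {leaf}              # leaf: local resources only
    for i, leaf in enumerate(leaves):
        left = leaves[(i - 1) % n]
        right = leaves[(i + 1) % n]
        lgroups[n + leaf] = {left, leaf, right}   # node plus its two neighbors
    return lgroups


lgroups = build_ring_lgroups(4)
# Print from the bottom of the hierarchy (smallest sets) to the top.
for lgrp_id, members in sorted(lgroups.items(), key=lambda kv: (len(kv[1]), kv[0])):
    print(lgrp_id, sorted(members))
```

The bigger the set, the higher the lgroup sits in the hierarchy and the higher the worst-case latency inside it.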

The system will try to allocate resources (CPU, memory) in the lgroup where the thread is located. If it can't get the resources there, it moves up the hierarchy and considers the next closest resources. So it first considers the "neighboring" nodes, and if that still doesn't work, it considers the entire system's resources, represented by the root lgroup.
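Here is a minimal sketch of that search order for memory on a four-node ring. Everything in it is hypothetical (the `allocate` and `search_levels` names, the page counts, the first-fit choice within a level); it only illustrates the "home first, then neighbors, then everything" walk, not the actual OpenSolaris allocator.

```python
def search_levels(home, n=4):
    """Leaf sets to try, closest first: home leaf, one-hop lgroup, root."""
    leaves = list(range(1, n + 1))
    i = leaves.index(home)
    neighbors = {leaves[(i - 1) % n], home, leaves[(i + 1) % n]}
    return [{home}, neighbors, set(leaves)]


def allocate(home, pages, free_pages):
    """Take `pages` from the closest leaf that has enough free memory.

    Returns the leaf id that satisfied the request, or None if no leaf can.
    """
    tried = set()
    for level in search_levels(home, len(free_pages)):
        for leaf in sorted(level - tried):      # only leaves new at this level
            if free_pages[leaf] >= pages:
                free_pages[leaf] -= pages
                return leaf
        tried |= level
    return None


# Hypothetical free pages per leaf lgroup; the thread's home lgroup is 1.
free = {1: 0, 2: 0, 3: 64, 4: 16}
print(allocate(1, 8, free))    # home is empty, neighbor 4 can serve it
print(allocate(1, 32, free))   # neighbors are too small, falls back to the root
print(allocate(1, 100, free))  # nobody has that much
```

Each step up the hierarchy widens the set of candidates, so a request is only served from a far node when everything closer has been ruled out.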

The idea behind this course of action is to keep threads and their resources as close together as possible. This results in lower access times and takes advantage of cache warmth when possible.

Next time, I'll write about load balancing and processor partitions, and how they fit into all of this.

Comments:

But is Solaris and Sun HW to the point where it is possible to connect two servers via the NUMAlink(TM) cable, stick a NUMA router in one of them, configure one as master and one as slave, then be able to press the power button and have both systems power on and come up as one, running one single OS image?

For comparison: sgi had this technology working over 15 years ago!

Posted by UX-admin on March 25, 2008 at 05:05 AM PDT #

AFAIK, it's not possible to use the NUMALink cable with Sun HW - at least not out of the box.
Solaris will pick it up as long as the machine (NUMALink'ed) presents itself as a NUMA system according to standards, and not through a proprietary interface/solution.

Interesting question though. Do you know if Linux picks it up by default or does it require specific modules for it?

Posted by Rafael on March 25, 2008 at 03:55 PM PDT #

It's very clear and easy to understand. Thanks!
Keep going!

Posted by adam on March 29, 2008 at 08:47 PM PDT #

This blog copyright 2009 by rv