Monday Mar 24, 2008

Following NUMA.2 and NUMA.1, here's one about how OpenSolaris represents NUMA architectures.

Topology Representation

Quick recap.

A NUMA machine is composed of nodes that contain some kind of hardware resource: processors, memory, devices. These nodes are connected through an interconnect that allows each node to access every other node's memory transparently, forming a single shared memory space. Every node can access the entire memory space, but because a node has to go through the interconnect to reach memory on another node, accessing local memory is faster than accessing remote memory.

The OS needs to be aware of this situation and know exactly where, in physical memory, one node ends and another begins. OpenSolaris uses a kernel abstraction called locality groups - or simply lgroups - to represent sets of resources within some distance of each other. Lgroups are created during system boot and form a latency topology used by the scheduler/dispatcher and the VM subsystems to allocate resources properly. This topology is hierarchical, with the lowest access latencies at the leaves and the highest at the root. Let's look at an example:

(a) 4 node machine (ring) (b) leaf lgroups

The example above shows that a four node machine with a ring topology will have four lgroups, one for each node. These are leaf lgroups. They sit at the lowest level of the hierarchy because they represent only local accesses.

Remote access latency increases with distance. Reaching a node one hop away costs less than reaching the furthest node. The system creates intermediate lgroups to represent these intermediate distances (in this case, one hop away). It also creates a root lgroup, which contains all the resources in the system and represents the highest latency.

(c) intermediate lgroup around node 0 (d) root lgroup

Figure (c) shows the intermediate lgroup 5 formed around node 1; it contains lgroups 1, 2 and 4.
This might seem a bit confusing without looking at the whole topology and how the system uses it.

(e) lgroup topology

This topology is hierarchical, as mentioned earlier. We have the lowest latencies at the bottom, and the highest at the top. The scheduler/dispatcher and the VM subsystems consult this topology to move threads around and to allocate memory, respectively.
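To make the hierarchy concrete, here's a toy Python model of the four node ring. It is a sketch, not kernel code: the function names (ring_hops, build_lgroups), the numbering (root as 0, leaves 1 to 4, intermediates 5 to 8) and the assumption of one intermediate lgroup per node are all made up for illustration.

```python
def ring_hops(a, b, n):
    """Hop count between leaves a and b on a ring of n nodes."""
    d = abs(a - b)
    return min(d, n - d)


def build_lgroups(n_nodes):
    """Return {lgroup id: set of leaf lgroup ids it covers}.

    Assumed numbering: 0 is the root, 1..n are the leaves and
    n+1..2n are the one-hop intermediate lgroups.
    """
    leaves = list(range(1, n_nodes + 1))
    lgroups = {0: set(leaves)}            # root covers every leaf
    for leaf in leaves:
        lgroups[leaf] = {leaf}            # a leaf covers only itself
    next_id = n_nodes + 1
    for leaf in leaves:
        # Intermediate lgroup: the leaf plus everything one hop away.
        lgroups[next_id] = {
            other for other in leaves
            if ring_hops(leaf, other, n_nodes) <= 1
        }
        next_id += 1
    return lgroups


topology = build_lgroups(4)
for lgrp_id, members in sorted(topology.items()):
    print(lgrp_id, sorted(members))
```

With this numbering, lgroup 5 comes out covering leaves 1, 2 and 4, which matches figure (c).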

The system will try to allocate resources (CPU, memory) at the lgroup in which the thread is located. If it can't get the resources there, it will move up on the hierarchy and consider the next closest resources. So it will first consider the "neighboring" nodes and if that still doesn't work, it will consider the entire system's resources - represented by the root lgroup.
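The "try local, then move up" search can be sketched in a few lines of Python. This is only an illustration of the idea: SEARCH_ORDER, allocate and the per-leaf free page counts are hypothetical, and the real kernel logic is more involved.

```python
# Hypothetical search order for a thread whose home is leaf lgroup 1:
# its own leaf first, then the one-hop intermediate lgroup, then the root.
SEARCH_ORDER = [
    ("leaf 1", [1]),
    ("intermediate 5", [1, 2, 4]),
    ("root", [1, 2, 3, 4]),
]


def allocate(free_pages, pages_needed, search_order=SEARCH_ORDER):
    """Take pages from the closest leaf that has enough of them.

    Returns (name of the lgroup level that satisfied the request, leaf id),
    or None if no leaf in the whole system can satisfy it.
    """
    for name, leaves in search_order:
        for leaf in leaves:
            if free_pages[leaf] >= pages_needed:
                free_pages[leaf] -= pages_needed
                return name, leaf
    return None


# Made-up free page counts per leaf lgroup; leaf 1 (home) is exhausted.
free = {1: 0, 2: 0, 3: 10, 4: 5}
first = allocate(free, 4)     # satisfied one hop away, on leaf 4
second = allocate(free, 4)    # neighbors are now short, falls back to root
third = allocate(free, 100)   # nothing in the system is big enough
print(first, second, third)
```

Each request only widens the search when the closer level fails, so memory stays as near to the thread's home lgroup as the current free lists allow.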

The idea behind this course of action is to keep threads and their resources as close together as possible. This results in lower access times and takes advantage of cache warmth when possible.

Next time, I'll write about load balancing and processor partitions, and how they fit into all of this.

Thursday Mar 20, 2008

I spent a(nother) few hours last week studying lgroup partition loads - the not so famous lpl's - and after an email to the OpenSolaris Performance Community I got a couple of very good replies. So I decided to write it up here and add some documentation to this. You can check out the original thread here.

(a) lpl's are the intersection of lgroups and CPU partitions. When you have CPU partitions cutting across lgroups, you need to constrain the set of CPUs where a thread can run but also consider the lgrp's CPUs that are within that constrained set.

(b) lpl's are used to determine the load average of lgroups, which is used by the dispatcher to load balance threads around.

(c) the number of lpl's an lgrp can have associated is limited only by the number of logical CPUs it contains. Leaf lgroups will have associated leaf lpl's. Just as the lpl is the result of intersecting an lgroup with a CPU partition, the lpl hierarchy is the result of intersecting a CPU partition with the lgrp hierarchy.

(d) lpl topology is created by lgroup code and triggered by processor set instantiation / destruction (and by creation of the default CPU partition).
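The intersection in (a) and (c) is easy to picture with sets. Here's a minimal Python sketch, assuming made-up CPU ids, a hypothetical partition named "pset1" and a helper called build_lpls; none of these names come from the kernel.

```python
# Hypothetical machine: four leaf lgroups with two CPUs each.
LGROUP_CPUS = {1: {0, 1}, 2: {2, 3}, 3: {4, 5}, 4: {6, 7}}

# Hypothetical CPU partitions; "pset1" cuts across lgroups 2 and 3.
PARTITION_CPUS = {"default": {0, 1, 2, 6, 7}, "pset1": {3, 4, 5}}


def build_lpls(lgroup_cpus, partition_cpus):
    """Return {(partition, lgroup): CPUs the two have in common}.

    An lpl is only recorded where the intersection is non-empty.
    """
    lpls = {}
    for part, pcpus in partition_cpus.items():
        for lgrp, lcpus in lgroup_cpus.items():
            shared = lcpus & pcpus
            if shared:
                lpls[(part, lgrp)] = shared
    return lpls


lpls = build_lpls(LGROUP_CPUS, PARTITION_CPUS)
for key in sorted(lpls):
    print(key, sorted(lpls[key]))
```

In this toy setup a thread bound to "pset1" with lgroup 2 as its home only sees CPU 3 there, because CPU 2 belongs to the default partition.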

lgroups and processor sets are key concepts when scheduling and dispatching threads, but they can very easily be orthogonal to each other. Processor sets cutting across lgroups could make the entire MPO (Memory Placement Optimization) effort go away by making threads run on completely different nodes. This would add latency to memory access operations and bring performance down. lpl's make sure that doesn't happen. "Possible orthogonality" is a good way to describe it, IMO, since a partition may or may not span across lgroups.

This blog copyright 2009 by rv