Monday Mar 24, 2008

Following NUMA.2 and NUMA.1, here's one about how OpenSolaris represents NUMA architectures.

Topology Representation

Quick recap.

A NUMA machine is composed of nodes that contain some kind of hardware resource: processors, memory, devices.. These nodes are connected through an interconnect hardware that allows each node to access every other node's memory transparently, forming a single shared memory space. Every node can access the entire memory space, but because they have to go through the interconnect to access remote ones, accessing local memory is faster than accessing remote memory.

The OS needs to be aware of this situation and know exactly where, in physical memory, a node ends and another one begins. OpenSolaris uses a kernel abstraction called locality groups - or simply lgroups - to represent sets of resources within some distance of each other. Lgroups are created during system boot and form a latency topology used by the scheduler/dispatcher and the VM subsystems to allocate resources properly. This topology is hierarchical, containing lower latencies at the leafs and higher access latencies at the root. Let's look at an example:

(a) 4 node machine (ring)(b) leaf lgroups

The example above show that a four node machine with a ring topology will have four lgroups, one for each node. These are leaf lgroups, they will be at the lowest level of the hierarchy because they represent only local accesses.

Remote access will increase latencies depending on how far you're going. If it's nodes one hop away, it's one thing. If we're going to the furthest node, it's another. The system creates intermediate lgroups to represent local and well, intermediate distances (in this case, one hop away). It also creates a root lgroup, that contains all the resources in the system, and represents the highest level of latencies.

(c) intermediate lgroup around node 0(d) root lgroup

Figure (c) shows the intermediate lgroup 5 formed around node 1, it contains lgroups 1, 2 and 4.
This might seem a bit confusing without looking at the whole topology, and how the system uses it.

(e) lgroup topology

This topology is hierarchical, as mentioned earlier. We have the lowest latencies at the bottom, and the highest at the top. The scheduler/dispatcher and the VM subsystems consult this topology to move threads around and to allocate memory, respectively.

The system will try to allocate resources (CPU, memory) at the lgroup in which the thread is located. If it can't get the resources there, it will move up on the hierarchy and consider the next closest resources. So it will first consider the "neighboring" nodes and if that still doesn't work, it will consider the entire system's resources - represented by the root lgroup.

The idea behind this coarse of action is to maintain threads and their resources as close as possible. This results in lower access times and takes advantage of cache warmth when possible.

Next time, I'll write about load balancing and processor partitions, and how they fit into all of these.

Monday Jan 21, 2008

So, following NUMA.1, here's the second part of the series about NUMA architectures.

Node Affinity

NUMA systems have a shared memory space composed of every node's individual memory. The total physical memory is the sum of each node's individual memory, which is consumed by the operating system as it allocates space for itself and user applications.

Every process has its own address space, composed of virtual addresses that are mapped to physical ones. This means, among other things, that a process has no idea of where, in physical memory, it is allocated. It could be contiguous at a single node or spread out among every node.

In the latter, a process would have different latency times when accessing its own memory positions. For instance, if a process is created at the first node and, as it grows, starts to allocate memory at the second node. Accessing this newly allocated memory means going through the interconnect, which takes longer than accessing local memory positions and causes app performance to decrease.

Ideally, we'd like to avoid remote memory accesses. In other words, allocate memory at the node in which the process was originally created so that its entire address space sits at a single node. Or if not possible, try to allocate memory at the nearest node to minimize access times.

By doing this, the system is respecting the process' affinity to its original node.

Node affinity is important both for single and multi threaded processes. The earlier example used a process with a single execution flow (thread). For multi threaded apps, the story is a little bit different.

If your MT app creates n threads to execute independently, we would like to have each thread running at a different processing unit and fully utilize the system's nodes - as long as the current system load allows us.

But if those threads rely heavily on synchronization or data sharing, spreading them around the system will increase remote accesses. In such case, we'd like to have a balance between local accesses and available cpu power. Again, it's important to maintain node affinity but load balancing is also something to be on the lookout for.

Next time, I'll write about how OpenSolaris and Linux represent the system topology according to latency times and how the scheduler and the VM subsystems optimize for node latency.

Monday Nov 26, 2007

Starting this post, I'll write a series about NUMA architectures and what operating systems do to support and optimize app performance on these machines. I'll focus on OpenSolaris, but will also write about Linux as I've studied a bit about its features.

For starters, what's NUMA and why should I care ?

NUMA stands for Non Uniform Memory Access. It's a multiprocessor architecture that evolved from the Symmetric MultiProcessor (SMP) model.

With SMP's, we have multiple processors or cores and a single memory bank. Every memory access goes through a single bus, and access times are the same throughout the system - a uniform memory access model, or UMA.

As new cores or processors are added, the single bus becomes a bottleneck and saturates quickly. So it's a model that doesn't scale in performance as you add processing power.

The idea with NUMA is to overcome this bottleneck by grouping processors and memory in nodes, interconnected by a bus - or an interconnect. Resulting in a much more scalable architecture.

However, this arrangement of processors and nodes within different physical distances of one another causes non uniform memory access (NUMA) times throughout the system. A processor accessing local memory will get the information in less time than accessing a remote memory position because it won't need to travel across the interconnect. This characteristic is known as the NUMA factor, the ratio between local and remote access times.

SMPNUMA

All physical memory is setup so processors see a single shared address space, transparently to the operating system and the user. Since each processor has private cache and the entire system shares memory, it's necessary to guarantee cache coherence among nodes. The first NUMA machines out there didn't implement this coherence in hardware, it was the programmer's job to do so. But because it added a lot of complexity to the software, that layer went to the hardware, resulting in ccNUMA - cache coherent NUMA. Nowadays, the terms NUMA and ccNUMA are used interchangeably, as no hardware manufacturer builds non cache coherent NUMA systems anymore.

You should care because these machines are becoming cheaper and more commonly adopted. If you work with parallel programming, you might come in contact with one of these. Learning how the OS supports the architecture, how different memory access times affects the performance of your application and what to do about that, are definitely of interest.

This looks a lot like a cluster within a single box, right ?

Well, there are two BIG advantages over a cluster:
1. Lower latencies when going through the interconnect.
2. With clusters, parallel applications communicate through message passing - that's send(3SOCKET) and recv(3SOCKET) calls - as each node has a separate address space. A NUMA machine has a shared memory space, so it's reading and writing memory positions as most developers are used to. No need to rewrite all your apps to a network programming paradigm.

This blog copyright 2009 by rv