Monday Mar 24, 2008

Following NUMA.2 and NUMA.1, here's one about how OpenSolaris represents NUMA architectures.

Topology Representation

Quick recap.

A NUMA machine is composed of nodes that contain some kind of hardware resource: processors, memory, devices.. These nodes are connected through an interconnect hardware that allows each node to access every other node's memory transparently, forming a single shared memory space. Every node can access the entire memory space, but because they have to go through the interconnect to access remote ones, accessing local memory is faster than accessing remote memory.

The OS needs to be aware of this situation and know exactly where, in physical memory, a node ends and another one begins. OpenSolaris uses a kernel abstraction called locality groups - or simply lgroups - to represent sets of resources within some distance of each other. Lgroups are created during system boot and form a latency topology used by the scheduler/dispatcher and the VM subsystems to allocate resources properly. This topology is hierarchical, containing lower latencies at the leafs and higher access latencies at the root. Let's look at an example:

(a) 4 node machine (ring)(b) leaf lgroups

The example above show that a four node machine with a ring topology will have four lgroups, one for each node. These are leaf lgroups, they will be at the lowest level of the hierarchy because they represent only local accesses.

Remote access will increase latencies depending on how far you're going. If it's nodes one hop away, it's one thing. If we're going to the furthest node, it's another. The system creates intermediate lgroups to represent local and well, intermediate distances (in this case, one hop away). It also creates a root lgroup, that contains all the resources in the system, and represents the highest level of latencies.

(c) intermediate lgroup around node 0(d) root lgroup

Figure (c) shows the intermediate lgroup 5 formed around node 1, it contains lgroups 1, 2 and 4.
This might seem a bit confusing without looking at the whole topology, and how the system uses it.

(e) lgroup topology

This topology is hierarchical, as mentioned earlier. We have the lowest latencies at the bottom, and the highest at the top. The scheduler/dispatcher and the VM subsystems consult this topology to move threads around and to allocate memory, respectively.

The system will try to allocate resources (CPU, memory) at the lgroup in which the thread is located. If it can't get the resources there, it will move up on the hierarchy and consider the next closest resources. So it will first consider the "neighboring" nodes and if that still doesn't work, it will consider the entire system's resources - represented by the root lgroup.

The idea behind this coarse of action is to maintain threads and their resources as close as possible. This results in lower access times and takes advantage of cache warmth when possible.

Next time, I'll write about load balancing and processor partitions, and how they fit into all of these.

Thursday Mar 20, 2008

I spent a(nother) few hours last week studying lgroup partition loads - the not so famous lpl's - and after an email to the OpenSolaris Performance Community I got a couple of very good replies. So I decided to write it up here and add some documentation to this. You can check out the original thread here.

(a) lpl's are the intersection of lgroups and CPU partitions. When you have CPU partitions cutting across lgroups, you need to constrain the set of CPUs where a thread can run but also consider the lgrp's CPUs that are within that constrained set.

(b) lpl's are used to determine the load average of lgroups, which is used by the dispatcher to load balance threads around.

(c) the number of lpls an lgrp can have associated is limited only by the number of logical CPUs it contains. Leaf lgroups will have associated leaf lpls. Just as the lpl is the result of intersecting an lgroup with a CPU partition, the lpl hierarchy is the result of intersecting a cpu partition with the lgrp hierarchy.

(d) lpl topology is created by lgroup code and triggered by processor set instantiation / destruction (and by creation of the default CPU partition)

lgroups and processor sets are key concepts when scheduling and dispatching threads, but they can very easily be orthogonal to each other. Processor sets cutting across lgroups could make the entire MPO effort go away by making threads run in completely different nodes. This would add latency to memory access operations and bring performance down. lpl's make sure that doesn't happen. "Possible orthogonality" is a good way to describe it, IMO, since a partition may or may not span across lgroups.

Monday Jan 21, 2008

So, following NUMA.1, here's the second part of the series about NUMA architectures.

Node Affinity

NUMA systems have a shared memory space composed of every node's individual memory. The total physical memory is the sum of each node's individual memory, which is consumed by the operating system as it allocates space for itself and user applications.

Every process has its own address space, composed of virtual addresses that are mapped to physical ones. This means, among other things, that a process has no idea of where, in physical memory, it is allocated. It could be contiguous at a single node or spread out among every node.

In the latter, a process would have different latency times when accessing its own memory positions. For instance, if a process is created at the first node and, as it grows, starts to allocate memory at the second node. Accessing this newly allocated memory means going through the interconnect, which takes longer than accessing local memory positions and causes app performance to decrease.

Ideally, we'd like to avoid remote memory accesses. In other words, allocate memory at the node in which the process was originally created so that its entire address space sits at a single node. Or if not possible, try to allocate memory at the nearest node to minimize access times.

By doing this, the system is respecting the process' affinity to its original node.

Node affinity is important both for single and multi threaded processes. The earlier example used a process with a single execution flow (thread). For multi threaded apps, the story is a little bit different.

If your MT app creates n threads to execute independently, we would like to have each thread running at a different processing unit and fully utilize the system's nodes - as long as the current system load allows us.

But if those threads rely heavily on synchronization or data sharing, spreading them around the system will increase remote accesses. In such case, we'd like to have a balance between local accesses and available cpu power. Again, it's important to maintain node affinity but load balancing is also something to be on the lookout for.

Next time, I'll write about how OpenSolaris and Linux represent the system topology according to latency times and how the scheduler and the VM subsystems optimize for node latency.

Monday Nov 26, 2007

Starting this post, I'll write a series about NUMA architectures and what operating systems do to support and optimize app performance on these machines. I'll focus on OpenSolaris, but will also write about Linux as I've studied a bit about its features.

For starters, what's NUMA and why should I care ?

NUMA stands for Non Uniform Memory Access. It's a multiprocessor architecture that evolved from the Symmetric MultiProcessor (SMP) model.

With SMP's, we have multiple processors or cores and a single memory bank. Every memory access goes through a single bus, and access times are the same throughout the system - a uniform memory access model, or UMA.

As new cores or processors are added, the single bus becomes a bottleneck and saturates quickly. So it's a model that doesn't scale in performance as you add processing power.

The idea with NUMA is to overcome this bottleneck by grouping processors and memory in nodes, interconnected by a bus - or an interconnect. Resulting in a much more scalable architecture.

However, this arrangement of processors and nodes within different physical distances of one another causes non uniform memory access (NUMA) times throughout the system. A processor accessing local memory will get the information in less time than accessing a remote memory position because it won't need to travel across the interconnect. This characteristic is known as the NUMA factor, the ratio between local and remote access times.

SMPNUMA

All physical memory is setup so processors see a single shared address space, transparently to the operating system and the user. Since each processor has private cache and the entire system shares memory, it's necessary to guarantee cache coherence among nodes. The first NUMA machines out there didn't implement this coherence in hardware, it was the programmer's job to do so. But because it added a lot of complexity to the software, that layer went to the hardware, resulting in ccNUMA - cache coherent NUMA. Nowadays, the terms NUMA and ccNUMA are used interchangeably, as no hardware manufacturer builds non cache coherent NUMA systems anymore.

You should care because these machines are becoming cheaper and more commonly adopted. If you work with parallel programming, you might come in contact with one of these. Learning how the OS supports the architecture, how different memory access times affects the performance of your application and what to do about that, are definitely of interest.

This looks a lot like a cluster within a single box, right ?

Well, there are two BIG advantages over a cluster:
1. Lower latencies when going through the interconnect.
2. With clusters, parallel applications communicate through message passing - that's send(3SOCKET) and recv(3SOCKET) calls - as each node has a separate address space. A NUMA machine has a shared memory space, so it's reading and writing memory positions as most developers are used to. No need to rewrite all your apps to a network programming paradigm.

Monday Jul 02, 2007

So, I'm doing my graduation project about NUMA architectures and what OpenSolaris and Linux do to support them. I'm halfway through it, getting to the implementation part of the project.
Here's the current book pile residing at my desk, it started lower, got a little bigger with some other OS books that I had to return to the library, and it's been stable (except for the occasional accident) for the last couple of weeks.


Starting at the top:
   "Just Enough UNIX", Paul K. Anderson
   "Solaris 9 Administration Guide", Paul Watters
   "Concurrent Programming in Java: Design Principles and Patterns", Doug Lea
   "Design Patterns", the Gang of Four one
   "Operating Systems: Internals and Design Principles", Stallings
   "Linux Kernel Internals", Beck and a bunch of equally intelligent folks
   "Solaris Internals", Richard McDougall and Jim Mauro
   "Linux Kernel Development", Robert Love (from Novell)
   "UNIX Systems for Moderns Architectures", Schimmel
   "Pthreads Programming", Nichols et al
   "Linux Programming", Neil Matthew and Richard Stones
   "UNIX Internals", Vahalia

Check out a couple of paragraphs from the VM part of the monography, feel free to comment on it. It's what I'm calling a "release candidate for a draft".

"Usually, UNIX processes on 32bit architectures can theoretically occupy as much as 4Gb of space. However, memory is a scarce resource and the system shares it between all running processes. So when it is time to bring a new process to memory, the system begins by bringing a few necessary pieces of the program. This portion that resides in memory is called the resident set of the process. If the processor encounters a memory reference to a part of the process that is not currently loaded, it blocks the execution of such process and brings that data to memory from disk [STA 05]."
...
"UNIX systems use both virtual memory and paging as memory management techniques. We now have processes divided into pages, being loaded into memory frames as these pages are needed during execution. Such method of bringing pages as they're necessary is called demand paging."

I'm currently finishing the MPO/liblgrp and libnuma part of the paper. Once that's done, it's just a matter of polishing the whole thing and making it more readable. It gets difficult to explain even the simpler concepts with a new perspective, focusing on what the project is about.

But I'm getting there.

This blog copyright 2009 by rv