NUMA.1
Starting with this post, I'll write a series about NUMA architectures and what operating systems do to support them and optimize application performance on these machines. I'll focus on OpenSolaris, but will also cover Linux, as I've studied some of its features.
For starters, what's NUMA and why should I care?
NUMA stands for Non-Uniform Memory Access. It's a multiprocessor architecture that evolved from the Symmetric MultiProcessor (SMP) model.
With SMPs, we have multiple processors or cores and a single memory bank. Every memory access goes through a single bus, and access times are the same throughout the system - a uniform memory access model, or UMA.
As new cores or processors are added, the single bus becomes a bottleneck and saturates quickly. So it's a model that doesn't scale in performance as you add processing power.
The idea with NUMA is to overcome this bottleneck by grouping processors and memory into nodes, connected to one another by an interconnect. The result is a much more scalable architecture.
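To make the scaling argument concrete, here is a toy model rather than a measurement. The function names, the bandwidth figures and the four-cores-per-node layout are all assumptions made up for illustration, and the NUMA side optimistically pretends every access is local, so the interconnect is ignored.

```python
def smp_throughput(cores, demand_per_core, bus_bandwidth):
    """Aggregate memory throughput when every core shares one bus."""
    return min(cores * demand_per_core, bus_bandwidth)


def numa_throughput(cores, demand_per_core, node_bandwidth, cores_per_node):
    """Aggregate throughput when each node has its own path to local memory.

    Simplification (assumption): all accesses are local, so the
    interconnect never becomes a limit.
    """
    full_nodes, leftover = divmod(cores, cores_per_node)
    total = full_nodes * min(cores_per_node * demand_per_core, node_bandwidth)
    # A partially filled node still gets a whole node's bandwidth.
    total += min(leftover * demand_per_core, node_bandwidth)
    return total


if __name__ == "__main__":
    # Hypothetical numbers: each core wants 2.0 units, each bus delivers 10.0.
    for cores in (2, 4, 8, 16):
        print(cores,
              smp_throughput(cores, 2.0, 10.0),
              numa_throughput(cores, 2.0, 10.0, cores_per_node=4))
```

In this sketch the SMP curve flattens as soon as total demand reaches the single bus's capacity, while the NUMA curve keeps growing with every node added. Real machines fall somewhere in between, since remote accesses do cost something.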
However, arranging processors and memory in nodes at different physical distances from one another causes non-uniform memory access (NUMA) times throughout the system. A processor gets data from its local memory in less time than from a remote memory location, because a local access doesn't need to travel across the interconnect. This difference is quantified by the NUMA factor, the ratio of remote to local access time.
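As a quick worked example, the snippet below computes a NUMA factor and the average access time for a given share of local accesses. It takes the factor as remote time divided by local time, so values above 1 mean remote memory is slower. The latencies are hypothetical round numbers, not figures from any real machine, and the function names are my own.

```python
def numa_factor(local_ns, remote_ns):
    """Ratio of remote to local access time (1.0 would mean uniform access)."""
    return remote_ns / local_ns


def average_latency(local_ns, remote_ns, local_fraction):
    """Expected access time when local_fraction of accesses hit local memory."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


if __name__ == "__main__":
    local_ns, remote_ns = 100.0, 150.0  # hypothetical latencies
    print("NUMA factor:", numa_factor(local_ns, remote_ns))
    for fraction in (1.0, 0.75, 0.5):
        print(fraction, average_latency(local_ns, remote_ns, fraction))
```

The second function is the reason memory placement matters: the more of an application's accesses stay local, the closer its average latency gets to the local figure.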
[Figure: SMP (left) and NUMA (right)]
All physical memory is set up so that processors see a single shared address space, transparently to the operating system and the user. Since each processor has a private cache and the entire system shares memory, cache coherence has to be guaranteed across nodes. The first NUMA machines didn't implement this coherence in hardware; it was the programmer's job to do so. Because that added a lot of complexity to the software, the coherence layer moved into the hardware, resulting in ccNUMA - cache coherent NUMA. Nowadays, the terms NUMA and ccNUMA are used interchangeably, as no hardware manufacturer builds non cache coherent NUMA systems anymore.
You should care because these machines are becoming cheaper and more commonly adopted. If you work with parallel programming, you might come in contact with one of them. Learning how the OS supports the architecture, how different memory access times affect the performance of your application, and what to do about it is definitely of interest.
This looks a lot like a cluster within a single box, right?
Well, there are two BIG advantages over a cluster:
1. Lower latencies when going through the interconnect.
2. With clusters, parallel applications communicate through message passing - that's send(3SOCKET) and recv(3SOCKET) calls - as each node has a separate address space. A NUMA machine has a shared memory space, so communication is just reading and writing memory locations, as most developers are used to. No need to rewrite all your apps to a network programming paradigm.
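To show the difference in programming style from the second point, here is a minimal sketch with the same sum computed both ways. It's only an illustration: threads in one process stand in for both cases, the function names are invented, and a real cluster worker would live on another machine and be reached over a network socket rather than a local `socketpair`.

```python
import socket
import threading


def sum_shared_memory(values):
    """Workers write partial sums straight into a list everyone can see."""
    partial = [0, 0]
    half = len(values) // 2

    def worker(slot, chunk):
        partial[slot] = sum(chunk)  # an ordinary memory write

    threads = [
        threading.Thread(target=worker, args=(0, values[:half])),
        threading.Thread(target=worker, args=(1, values[half:])),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return partial[0] + partial[1]


def sum_message_passing(values):
    """The worker returns its result as bytes over a socket instead."""
    parent_end, worker_end = socket.socketpair()

    def worker(chunk):
        worker_end.sendall(str(sum(chunk)).encode())
        worker_end.close()  # signals end of message to the reader

    t = threading.Thread(target=worker, args=(values,))
    t.start()
    data = b""
    while True:
        piece = parent_end.recv(64)
        if not piece:  # empty read: the worker closed its end
            break
        data += piece
    t.join()
    parent_end.close()
    return int(data.decode())


if __name__ == "__main__":
    numbers = list(range(10))
    print(sum_shared_memory(numbers), sum_message_passing(numbers))
```

Even in this tiny case the message passing version needs serialization, a read loop and an end-of-message convention, none of which the shared memory version has to think about.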

