Monday July 11, 2005

Most people probably want a 'scalable' system when they buy a computer. 'Scalable' can mean a whole bunch of different things, but for this discussion I take it to mean two things: the system can support increases in workload while maintaining the same transaction response time, and doubling the system configuration (the resources) lets it support twice the workload.
Scalability of a system is not really about what is there in the system, but rather what is 'not' there. Scalability becomes limited or constrained when something gets in the way. Less is more. Good scalable design is about either avoiding the things that can get in the way, or explicitly designing around them when they are unavoidable. Examples of things that have limited system scalability in the past include:
The classic description of how this kind of bottleneck limits scalability is Amdahl's Law. However, computer hardware and software vendors have spent many years developing various techniques that let them build and deploy large systems in which these serialising bottlenecks are kept small enough that Amdahl's Law is not the limiting factor. So, generally, today's systems are more balanced designs that scale well within their stated capacity.
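For readers who have not met it, Amdahl's Law says that if a fraction s of the work cannot be spread across processors, the best speedup on N processors is 1 / (s + (1 - s) / N). Here is a minimal sketch of that formula. The 5% serial fraction is an assumption picked purely for illustration, not a measurement of any real system.

```python
def amdahl_speedup(serial_fraction: float, processors: int) -> float:
    """Speedup predicted by Amdahl's Law.

    serial_fraction is the share of the work (0.0 to 1.0) that cannot
    be run in parallel; processors is the number of CPUs applied.
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    # Illustrative assumption: 5% of the work is serial.
    for n in (1, 2, 4, 8, 16, 64):
        print(f"{n:3d} processors -> speedup {amdahl_speedup(0.05, n):.2f}")
```

However many processors are added, the speedup in this model can never exceed 1 / s, which is why removing the serial part matters more than adding hardware.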
With today's generation of computers, it is actually the memory sub-system that limits scalability. This is because processors have become at least 1,000 times faster in the past 10 years, whereas memory has probably become only 100 times faster. It is the ratio between these two that is important, not the absolute size of the increases. The other end of the computer system, the network, where it connects to the outside world and where work comes from and results go back to, has also become a lot faster over the past 10 years. We have moved from 10 Mb/sec through 100 Mb/sec to 1 Gb/sec (Gigabit Ethernet) as standard for many networks.
Modern processors work internally at multiples of the external system bus frequency – anything from 4 to 10 times would be possible. So a 2 GHz processor may be interfaced to a 400 MHz bus, for a multiple of 5. This already shows that any memory access is going to waste multiple internal cycles of the processor. On top of this, modern memory sub-systems do not respond with the data within a single system cycle. It is several system cycles after being given an address that they respond with the data.
How does this affect a CPU? If a CPU has an internal clock speed of 1 GHz, then one CPU cycle is 1 ns (nanosecond). If the total time to obtain data from memory is 100 ns, then the CPU is idle for most of those 100 CPU cycles. (It is not totally idle, as modern CPUs use an internal pipeline of sub-tasks to execute each instruction in parts. Stages of the pipeline already executing other instructions will be able to finish them during the memory access.)
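The arithmetic above is simple enough to put into a few lines. This sketch uses the figures already quoted (a 2 GHz CPU on a 400 MHz bus, and a 1 GHz CPU facing a 100 ns memory access). The function names are my own, and the model ignores the pipeline overlap mentioned above, so treat the result as an upper bound on wasted cycles.

```python
def bus_multiplier(cpu_hz: float, bus_hz: float) -> float:
    """How many internal CPU cycles pass per external bus cycle."""
    return cpu_hz / bus_hz


def cycle_time_ns(clock_hz: float) -> float:
    """Length of one clock cycle in nanoseconds."""
    return 1e9 / clock_hz


def stalled_cycles(cpu_hz: float, memory_latency_ns: float) -> float:
    """CPU cycles that elapse while one memory access completes.

    Ignores any overlap with instructions already in the pipeline.
    """
    return memory_latency_ns / cycle_time_ns(cpu_hz)


if __name__ == "__main__":
    print("bus multiplier:", bus_multiplier(2e9, 400e6))
    print("cycles per 100 ns access at 1 GHz:", stalled_cycles(1e9, 100.0))
    print("cycles per 100 ns access at 2 GHz:", stalled_cycles(2e9, 100.0))
```

Note that a faster CPU with the same memory latency simply waits for more cycles, which is the imbalance this post is about.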
So in modern computers we end up with the situation that often the CPU is idle, wasting cycles waiting for data from memory, and that it is the memory sub-system that is the limitation to how the system scales as more work engines (CPUs) are added to the system. Clearly there is an imbalance between a processor's ability to do work and how quickly the work can be supplied, so that the rest of the system spends time waiting on memory. This is where the key focus of good, balanced system design should be.
A good system design principle is therefore to hide this difference between real memory speed and CPU speed, so that the impact on the CPU of the much slower memory is minimised. Most CPUs today have areas of silicon dedicated to this memory interface, doing things to try and offset the relative cost of memory access. This is where you will find things like data and instruction caches running at the same speed as the CPU, branch prediction, pre-fetch buffers and write-behind buffers. Many of these are aimed at getting the data before the CPU needs it, which is not always possible due to the variations in how programs behave.
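To get a feel for how much a cache can hide, here is a rough sketch using a simple weighted average of cache and memory access times. This is a standard textbook simplification, not something taken from any particular CPU, and the hit rates and latencies (1 ns cache, 100 ns memory) are assumptions chosen to match the 1 GHz example above.

```python
def average_access_ns(hit_rate: float, cache_ns: float, memory_ns: float) -> float:
    """Average time per memory access for a single-level cache.

    hit_rate is the fraction of accesses served by the cache (0.0 to 1.0).
    A hit costs cache_ns; a miss costs memory_ns in total.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cache_ns + (1.0 - hit_rate) * memory_ns


if __name__ == "__main__":
    # Assumed figures: cache at CPU speed (1 ns), main memory at 100 ns.
    for hit_rate in (0.0, 0.90, 0.95, 0.99):
        avg = average_access_ns(hit_rate, cache_ns=1.0, memory_ns=100.0)
        print(f"hit rate {hit_rate:.0%} -> average access {avg:.2f} ns")
```

Even with a very high hit rate, the occasional miss still dominates the average, which is why programs with unpredictable access patterns suffer most.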
Sun's future Niagara processor takes a new approach to this 'memory speed hiding' principle, by having four threads co-exist within the CPU's execution core at the same time. The CPU will only ever be executing one of these threads at any moment in time, as other current CPUs do. However, when the currently executing thread needs an external memory access, the CPU simply switches to another thread while this is happening. Thus the delay incurred by one thread's memory access is actively used to execute instructions of another thread. This has a number of benefits:
As someone who spends a lot of time concerned with the performance of computer systems, and the actual performance achieved by customers with their applications running on real hardware, I think the Niagara processor looks like a great win-win deal. It uses a simpler CPU execution core design, has a zero-cost switch between threads, hides memory access times, and increases overall system throughput and utilisation. And it does so with less hardware (just one processor) than current systems.
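The latency-hiding idea can be illustrated with a toy model. This is only a sketch under assumptions of my own, not a description of how the Niagara pipeline actually works: every thread computes for a fixed number of cycles, then waits a fixed number of cycles on memory, and switching threads costs nothing. The 25/75 split between compute and stall cycles is invented for illustration.

```python
def core_utilisation(threads: int, compute_cycles: int, stall_cycles: int,
                     total_cycles: int = 100_000) -> float:
    """Fraction of cycles in which one core executes an instruction.

    Toy model: each thread runs compute_cycles, then waits stall_cycles
    on memory. The core runs one thread at a time and switches, at zero
    cost, to another ready thread whenever the current one stalls.
    """
    if threads < 1 or compute_cycles < 1 or stall_cycles < 0:
        raise ValueError("need threads >= 1, compute_cycles >= 1, stall_cycles >= 0")
    ready_at = [0] * threads                 # first cycle each thread may run again
    remaining = [compute_cycles] * threads   # compute cycles left before next stall
    busy = 0
    current = 0
    for cycle in range(total_cycles):
        # Look for a ready thread, preferring the one already running.
        chosen = None
        for offset in range(threads):
            candidate = (current + offset) % threads
            if ready_at[candidate] <= cycle:
                chosen = candidate
                break
        if chosen is None:
            continue                         # every thread is waiting on memory
        busy += 1
        remaining[chosen] -= 1
        current = chosen
        if remaining[chosen] == 0:
            # Thread issues a memory access and cannot run until it returns.
            ready_at[chosen] = cycle + 1 + stall_cycles
            remaining[chosen] = compute_cycles
            current = (chosen + 1) % threads
    return busy / total_cycles


if __name__ == "__main__":
    for n in (1, 2, 4):
        util = core_utilisation(n, compute_cycles=25, stall_cycles=75)
        print(f"{n} thread(s): core busy {util:.0%} of the time")
```

In this model a single thread leaves the core idle whenever it waits on memory, while enough threads can cover each other's waits entirely. Real workloads will not be this regular, so the numbers show the principle rather than predict performance.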
I believe that once Sun ships an actual system using the CPU that results from Niagara, we will see radically different behaviour profiles from systems and their applications. We will have to learn to interpret CPU utilisation and application throughput in different ways. An existing application could behave differently on a Niagara-based system, and achieve a greater throughput, yet with only a single CPU. In this case, less is truly more.
As the saying goes – “May you live in interesting times”.
( Jul 11 2005, 02:31:19 PM BST )