Todd Jobson's Blog Reflections
Processors and Performance : Chips, MIPS, and Sizing blips..
The following post is a close approximation of an article that I published in this month's Sun "Technocrat" (November 2007) issue. Hopefully you'll enjoy this discussion of the past and present relationship of CPUs and architecture to system performance.
In today's fast-paced world of ever-increasing demands for system throughput, discussion and expectations typically hinge on the same topic: CPU performance. This article examines CPUs and system architecture as they relate to performance and capacity planning as a whole. Building on our last discussion, "The Many Flavors of System Latency ...", we extend the context to focus on past and present competing aspects of system/CPU architecture, including a brief history of how we arrived at today's competitive landscape. (The photo to the left is of a Sun T1 "CoolThreads" 8-core, 32-thread CPU.)
CISC vs. RISC
Going back to the early days of microprocessor design (and also likely a familiar topic from your Computer Science Bachelor's curriculum), you will recall much conversation, speculation, and competition surrounding two competing approaches to CPU architecture : CISC vs. RISC.
In a nutshell, CISC (Complex Instruction Set Computer) designs emphasize the use of "complex" instructions within the HW to minimize the amount of Assembly code (SW) required. Other benefits are that compilers don't need to be as complex, and that CISC CPUs require less RAM to store instructions. However, a complex instruction sometimes takes more than one processor clock cycle to complete.
RISC (Reduced Instruction Set Computer) designs, as the name implies, offer a reduced number of simple instructions that can each complete execution within a single clock cycle, but a complex operation such as "multiply" might require multiple instructions. The RISC approach has the nice side-effect of requiring less chip "footprint" (in # of transistors, etc.), leaving more die space for registers, while typically offering streamlined execution within the same window of execution time as CISC counterparts. Modern compilers (such as Sun's Studio 12 line) offer significant performance benefits that should not be overlooked, and are especially critical when developing and compiling binaries on RISC-based architectures (Sun has seen performance gains of 200+ % when using the latest releases of the Sun Studio compilers vs. generic "gcc" compiled code).
Today, the market is still split into more or less the same 2 camps, now divided along the lines of "x86" compatible CPUs (modern CISC designs from Intel and AMD) and the modern RISC competition from Sun (SPARC) and IBM (Power), while HP and DEC (the latter now gone altogether) have stepped back from RISC manufacturing.
Is there more to Moore's Law ?
For 30+ years, we have seen processor performance double roughly every 2 years, in line with Moore's Law, along with a movement from uniprocessor systems to vertically scalable multi-processor systems, better known in the Unix world as SMP (Symmetric MultiProcessor) systems. It's funny that Gordon Moore's original claim, made in a 1965 article in Electronics magazine, concerned Integrated Circuit component counts doubling every year. He only loosely predicted that the trend might continue until 1975, and was uncertain that any projection beyond that could be made (in 1975 he revised his prediction to "doubling every 2 years", which has held relatively steady ever since). Since past trends in IC / CPU transistor counts have closely correlated with CPU performance, Moore's Law came to be associated with performance.
To accommodate this rapid growth in processor performance, the number of transistors within a CPU has been just one of the key characteristics that have climbed (along with clock speeds, etc.). Amazing as it may sound, 45nm (nanometer) fabrication is expected from Intel and others in 2008 (when little more than 10 years ago we had 500nm manufacturing). Even at the levels we were at in 2004 with 90nm manufacturing, a transistor (about 50nm across) was roughly 1/2 the diameter of an influenza virus particle! Given that we are approaching atomic dimensions of gate thickness (around 1nm), the pace of transistor density and clock frequency increases (using current designs) appears to be approaching the physical limits of manufacturing. In addition to these concerns, power and related heat issues have become very pronounced in today's "green" computing campaigns.
Luckily, Sun has dealt a hand worthy of industry recognition, with CPU innovations that keep us progressing forward. These address more than simple "transistor counts": the focus is the efficiency of moving an SMP-like architecture "onto" a piece of silicon, hence a system on a chip (which is essentially what we have with the T2).
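To get a feel for what "doubling every 2 years" compounds to, here is a minimal sketch of the arithmetic. The function names are my own, and the doubling period is simply a parameter, so the original 1-year claim can be compared with the 2-year one.

```python
def growth_factor(years, doubling_period_years=2.0):
    """Multiplier after `years` if a quantity doubles every period."""
    return 2 ** (years / doubling_period_years)


def projected_count(initial_count, years, doubling_period_years=2.0):
    """Project a count forward under a constant doubling period."""
    return initial_count * growth_factor(years, doubling_period_years)


if __name__ == "__main__":
    # Compare the 1-year and 2-year doubling assumptions side by side.
    for years in (2, 10, 20, 30):
        yearly = growth_factor(years, doubling_period_years=1.0)
        biennial = growth_factor(years, doubling_period_years=2.0)
        print(f"{years:>2} years: x{yearly:,.0f} (1-year) vs x{biennial:,.0f} (2-year)")
```

The gap between the two assumptions grows quickly, which is why the revision to a 2-year period matters so much over a 30 year span.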
How many ways can you weave those THREADS ?
SMP systems required an Operating System kernel with a means of fairly sharing CPU resources among the processes awaiting execution (via priority-based scheduling classes). This capability within Unix for "time slicing" between "runnable" processes and the available kernel and physical CPU resources hinges on the kernel dispatcher associating processes with "light weight processes" (LWPs), which in turn are bound to kernel "threads" in order to run within the "context" of physical processor registers and an execution pipeline (aka HW threads). For the most part, this is how today's Solaris OS functions, along with the appropriate dose of preemption and locking mechanisms.
Over the past decade, along with the increased demands of internet traffic (and associated application workloads), applications have gradually become better able to scale vertically within systems, primarily through the use of SW multi-threading. Within a multi-threaded application, many threads of execution can run simultaneously across the available CPUs in a system, allowing an application to scale as close to "linearly" as possible (doubling application throughput as the # of available CPUs doubles).
THREAD (of Execution) :
- noun One of many software "threads" of execution that can be processed simultaneously on a computer system.
LINEAR SCALABILITY :
- noun The ability to increase system performance (throughput) at the same rate that resources (CPUs, ..) are added.
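One way to put a number on how close a workload comes to scaling at the same rate that resources are added is to divide the throughput actually measured by the throughput that perfect scaling would predict. The sketch below is only an illustration; the function name and the sample figures are made up.

```python
def scaling_efficiency(base_throughput, base_cpus, scaled_throughput, scaled_cpus):
    """Return measured throughput as a fraction of ideal linear scaling.

    1.0 means throughput grew at exactly the same rate as the CPU count.
    """
    if base_cpus <= 0 or scaled_cpus <= 0 or base_throughput <= 0:
        raise ValueError("CPU counts and base throughput must be positive")
    ideal_throughput = base_throughput * (scaled_cpus / base_cpus)
    return scaled_throughput / ideal_throughput


if __name__ == "__main__":
    # Hypothetical measurements: 1000 tx/s on 4 CPUs, 1500 tx/s on 8 CPUs.
    print(scaling_efficiency(1000, 4, 1500, 8))
```

A result well below 1.0 after adding CPUs is usually a hint to go looking for contention (locks, memory latency, I/O) rather than for more hardware.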
CPU's and Memory : UMA and NUMA
Modern computing systems also offer a "shared memory" model that has very specific performance and latency characteristics, depending upon the system design (physical system interconnect type [bus/crossbar], memory management controllers, proximity to physical cache/RAM, etc.), the Operating System kernel memory management (memory mgmt libraries, JVM garbage collection, etc.), as well as the application SW execution characteristics and memory requirements (cache hit ratios [Instruction vs. Data Cache], TLB miss characteristics, RAM requirements, ..).
Among the modern types of vertically scalable parallel computer architectures available for multiprocessor systems, one of the most common high performance designs has become NUMA/ccNUMA (Non Uniform Memory Access / cache coherent NUMA). Sun's E25K systems fall within this category, as do most large systems that offer vertical scalability across many independent CPU / Memory boards, acting together and communicating across a shared system interconnect (backplane/centerplane/crossbar).
Previously, memory sat in local proximity to the CPUs, and the latency of accessing it was uniform and predictable. Once systems grew beyond a single system (CPU/memory) board, UMA (Uniform Memory Access) could no longer be guaranteed. One common performance issue that must be addressed within large NUMA environments is the aggregate impact of physical memory proximity alongside the gap between processor speed and memory latency (see below). Solaris addressed this issue (with a Solaris 9 update) by introducing MPO (Memory Placement Optimization), which places memory physically closer to the CPU using it, to minimize the additional cross-interconnect memory latency (this couples Solaris with the underlying HW and with cache coherency within ccNUMA architectures). Note that other optimizations have been used to address large Sun Enterprise system centerplane latencies, including kernel cage splitting, removal (if DR isn't required), and the introduction of S10 enhancements. (busstat can be used to diagnose these issues.)
Lucky for us, we are quickly approaching system architectures that once again allow for UMA designs, offering memory access predictability (with the T2 and related offerings moving forward, where the density of CPU cores/threads brings greater multithreaded capacity within a small footprint). However, the current state of the industry isn't quite as lucky if you look across the bow at the competition, and within our client production environments.
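The effect of memory placement on a NUMA system can be approximated as a weighted average of local and remote access latency. This is only a back-of-the-envelope sketch: the function name is mine, and the 100ns / 300ns figures are hypothetical placeholders rather than measurements from any Sun system.

```python
def average_latency_ns(local_fraction, local_ns, remote_ns):
    """Average memory latency given the share of accesses served locally."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


if __name__ == "__main__":
    LOCAL_NS = 100.0   # hypothetical latency to memory on the same board
    REMOTE_NS = 300.0  # hypothetical latency across the interconnect
    for fraction in (1.0, 0.75, 0.5, 0.25):
        latency = average_latency_ns(fraction, LOCAL_NS, REMOTE_NS)
        print(f"{fraction:.0%} local -> {latency:.0f} ns average")
```

Under these assumed numbers, every accesses moved from remote to local memory pulls the average down, which is the intuition behind placing memory near the CPU that uses it.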
The growing CPU - Memory gap ...
As you can see from the diagram to the left, over the past several years microprocessor design has moved rapidly, doubling the performance and clock speed of CPUs roughly every 2 years, while the same increases have NOT been matched in the memory arena. The "wait" time that threads of execution incur as a result, coupled with the additional latency for signals to travel greater distances to reach memory across the system interconnect (memory that is not physically local to a CPU, but on another CPU board or memory bank), can dramatically impact processor efficiency and overall system / application throughput. (This slide and the next are from http://www.OpenSparc.net )
Chip Multi-Processing + Hardware Multi-Threading = Chip Multi-Threading
Realizing the implications of the memory latency "lag" (and somewhat against the tide of relying upon ever-increasing CPU clock speeds), several years ago Sun made the decision to address this with the acquisition of Afara Websystems, to bolster its processor lineup and chart the industry's new course toward multi-core CPUs.
From the diagram on the left, it can be seen that Sun's adoption of both CMP, aka Chip level Multi-Processing (multiple cores per CPU, each with its own execution pipeline, as in the US-IV/+), and HMT, Hardware Multi-Threading (a multi-threaded execution pipeline within a core/CPU), addresses and nearly eliminates the issue of idle CPU cycles waiting on memory operations, by offering many physical threads of simultaneous execution within a CPU. This is the foundation of what Sun calls CMT (Chip Multi-Threading), which is reflected in Sun's CPU roadmap for the Niagara (T1 and T2) CPUs, the Rock CPU, and the Sun/Fujitsu Olympus CPU (as a follow-on to the US-IV+).
Why so much $cache ?
In order to further minimize system latency and ensure peak performance, modern architectures include Memory Management and CPU memory access mechanisms "on-chip", such as the L1 / L2 cache and MMU located in Sun's CMT products. Below is a block diagram of Sun's latest "T2" SPARC Core Architecture (based on the 64 bit SPARC V9 instruction set).
Critical components for optimal kernel CPU / memory performance :
- Level 1 (L1) Data Cache / Instruction Cache
- Level 2 (L2) Secondary Cache, on-chip and shared for Sun CMT processors
- Level 3 (L3) Cache <only Sun's US-IV+ CPU offers L3 cache on-chip>
- I-TLB (L1 / L2) Instruction Translation Lookaside Buffers
- D-TLB (L1 / L2) Data Translation Lookaside Buffers
- Buffer cache (filesystem kernel "page cache", taken from the "free list" of available RAM; this is a kernel structure)
Page size is closely tied to TLB behavior: x86 CPUs (Intel/AMD) offer 4K base pages, while the UltraSPARC I through IV... offer 8K / 64K / 512K / 4MB page sizes. (*monitor with pagesize, trapstat, cpustat, and pmap ..*)
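Why page size matters to the TLB can be seen from a simple "TLB reach" calculation: the amount of memory a TLB can map at once is its number of entries multiplied by the page size. The sketch below uses the UltraSPARC page sizes listed above; the entry count of 64 is a hypothetical value chosen for illustration, not the size of any particular CPU's TLB.

```python
KB = 1024
MB = 1024 * KB

# Page sizes as listed for the UltraSPARC family.
PAGE_SIZES = {"8K": 8 * KB, "64K": 64 * KB, "512K": 512 * KB, "4M": 4 * MB}


def tlb_reach_bytes(entries, page_size_bytes):
    """Memory covered by a TLB when every entry maps one page of this size."""
    return entries * page_size_bytes


if __name__ == "__main__":
    HYPOTHETICAL_ENTRIES = 64  # assumption for illustration only
    for name, size in PAGE_SIZES.items():
        reach = tlb_reach_bytes(HYPOTHETICAL_ENTRIES, size)
        print(f"{name:>4} pages -> {reach // KB:,} KB of reach")
```

With the same number of entries, larger pages cover far more memory before a TLB miss is taken, which is why a large-memory workload can benefit from larger page sizes.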
Examining the key attributes of your system's Workload as Requirements :
Once again, in order to select the appropriate architecture for a production deployment, the all-inclusive entity that we need to examine in its entirety is the "Application Environment", all of its subsystems, as well as individual / concurrent workload characteristics :
- Single Threaded vs. Multi-Threaded Applications (OLTP, DSS, HPC, Web/AppSvr, .. ?)
- Compute / HPC Intensity (MPI, horizontally scaled compute farm requirements, ..)
- Network I/O (long-lived connections vs. short-lived; large vs. small packets, # inbound RX pkts/sec ..)
- Memory Intensive Workload (shared / distributed memory req's, etc.)
- Storage Workload (R/W %'ages, Cache cfg, # Controller Interrupts/sec, # Files opened, shared FS, etc.)
- Integer vs. Floating Point Calculations (T1's are not well suited to FP workloads)
- 32 vs. 64 bit needs/benefits (address space needs beyond 4GB RAM ?)
- Do SLA/SLC reqmt's focus on Throughput, Bandwidth (IO/Net/Mem), Availability, and/or Response Times ?
Choosing the right CPU for your Workload :
Sun SPARC based CMP / CMT CPUs :
- US-IV+ : Well suited for very large vertically scaled configurations where the 32MB of L3 cache makes a big difference, such as DB servers (many of these environments have large single-threaded processes, including batch/OLTP). Sun's E25K systems scale up to 72 CPUs per single domain (144 cores).
- Sun / Fujitsu SPARC64 (US-VI Olympus) : Follow-on to the US-IV+ CPU (~1.5x the performance of a US-IV+). This CPU should have higher clock speeds starting at 2.15GHz, with an additional HW thread per core, but no L3 cache on-chip. Note that the upcoming CMT Rock chip from Sun in 2008 will fall within this segment.
- T1 (Niagara 1) : Well suited for small to medium sized multi-threaded workloads that don't have much / any FP processing. Best for : WebSvrs, AppSvrs, DNS, etc. Each single socket system offers up to 32 HW threads of execution.
- T2 (Niagara 2) : Well suited for medium to large multi-threaded workloads. These systems should also offer good general purpose computing performance, given the addition of an FGU per CPU core, along with built-in 10G Ethernet, etc. Each T5x20 system presently offers a single socket with up to 64 HW threads. Look for multi-socket systems based upon the T2 CPU in the not so distant future ;)
Sun's "World Class" T2 (Niagara 2) CPU :
At a glance, the new T2 CPU offering from Sun is a true "system on a chip" that lives up to its reputation as "the world's fastest processor" (the world record benchmarks listed further down in this article attest to that statement and show how Sun's latest CMT CPUs are changing the landscape of computing efficiency, part of the reason why Sun calls this "CoolThreads" and/or "Throughput" computing).
T2 CPU Highlights :
- 8 cores * 8 threads each = 64 threads of execution
- 65nm, initially running @ 1.4 GHz
- 8 Floating Point / Graphics Units (FGUs, one per core)
- on-chip crossbar providing : 180 GB/s R + 90 GB/s W
- built-in 2 * 10Gb Ethernet, MMU, encryption, etc.
The following table provides a high level comparison of Sun's T2 and T1 CPU's :
(For the complete Microprocessor Review report on the T2, click here)
(For further details comparing the T2 to other recent Sun CPU's for #transistors, etc., click here)
(For photos inside the new Sun T5x20 systems, based upon the T2 cpu, click here)
Solaris kernel (CPU related) Performance Metrics and Utilities :
The following table is only a "high level" sample of common metrics frequently used in Solaris CPU-related performance analysis. It is by no means a comprehensive list of the metrics available, but rather an introduction for those who aren't familiar with the essentials. An upcoming set of blogs will include much more detailed examples with command line output, along with discussion of the visibility available through kstat and DTrace.
Note: * vmstat reports system-wide statistics; mpstat, trapstat, intrstat and cpustat break their counts out per CPU; cputrack reports on an individual process (and its LWPs). *
| Metric | Description | Utility |
|---|---|---|
| Run Queue | Kernel threads runnable, but not executing (best if 0, or at most < # cores) | vmstat (r) |
| Blocked Kthr | Blocked kernel threads (typically identifies an I/O bottleneck; see also %wt, lockstat, ..) | vmstat (b) |
| System Calls | Number of system calls (calls made into the OS kernel, counting towards %Sys) | vmstat (sy) |
| Interrupts | Number of system interrupts per interval (interrupts have the highest priority on the system) | vmstat (in) |
| % CPU (U/S/I) | % CPU utilization (% User space / % System kernel / % Idle); as a rule of thumb, % User should be roughly 2x % Sys | vmstat |
| Cross Calls | Per-CPU cross-calls (for cross-processor interrupts and/or keeping virtual memory translations consistent across CPUs, e.g. MMU mappings and TLB entries) | mpstat (xcal) |
| CPU Interrupts | Per-CPU interrupts (also use intrstat, as well as lockstat for system correlation) | mpstat (intr) |
| Context Switches | Involuntary context switches (icsw, reflecting preemption) vs. all context switches (csw) | mpstat (csw / icsw) |
| CPU Migrations | Thread migrations off of one CPU and onto another, counted per CPU | mpstat (migr) |
| Shared Mutex | Mutex lock spin activity (per CPU); lockstat / plockstat give the best visibility of this activity | mpstat (smtx) |
| % CPU Waiting | % of a single CPU's time spent waiting (during the sampling interval). See also Blocked Kthr. | mpstat (wt) |
| Instr TLB Misses | MMU-related Instruction Translation Lookaside Buffer misses and the % of time spent handling them (see also pagesize, pmap, cpustat ..) | trapstat -t / -T |
| Data TLB Misses | MMU-related Data Translation Lookaside Buffer misses and the % of time spent handling them (page size is related here as well) | trapstat -t / -T |
| CPU Counters | Various CPU-specific HW event counters for a process (cache, instruction level, FP, TLB; see man cputrack for the counters available on your HW) | cputrack |
| CPU Counters | Various system-wide CPU event counters (see man cpustat for the counters available on your HW) | cpustat |
| Bus Statistics | Available system-specific bus device / instance counters & events (use busstat -l for your HW) | busstat |
| Kernel Statistics | ALL kernel statistics are available individually via kstat (module:instance:name:statistic) | kstat |
**NOTE: if you'd like to try a single Solaris utility that can run in minutes to automate the performance / workload correlation and reporting for you, take a look at sys_diag (or the README) if you haven't already done so. It includes both a high-level (vmstat, mpstat, iostat, netstat, kstat, ...) snapshot and analysis, and a Deep analysis mode which includes extended DTrace / dexplorer and lockstat probing. (All output is summarized and color coded in an HTML report header / Dashboard with a Table of Contents for analysis details.) **
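As a small illustration of how a few of the rules of thumb from the table might be automated, here is a hedged sketch that parses a single vmstat data line. It assumes the Solaris vmstat layout, in which the first three fields are r, b, w and the last three are us, sy, id. The function names are mine, the sample line is made up, and reading the User vs. Sys guideline as "User should be at least twice Sys" is my own interpretation.

```python
def parse_vmstat_line(line):
    """Extract the kthr (r, b, w) and cpu (us, sy, id) columns of one data line.

    Only the first three and last three fields are read, so the number of
    disk columns in between does not matter.
    """
    fields = line.split()
    if len(fields) < 6:
        raise ValueError("not a vmstat data line: %r" % line)
    r, b, w = (int(f) for f in fields[:3])
    us, sy, idle = (int(f) for f in fields[-3:])
    return {"r": r, "b": b, "w": w, "us": us, "sy": sy, "id": idle}


def rule_of_thumb_warnings(sample, cores):
    """Apply the rough guidelines from the metrics table to one sample."""
    warnings = []
    if sample["r"] >= cores:
        warnings.append("run queue r=%d is not below the core count %d"
                        % (sample["r"], cores))
    if sample["b"] > 0:
        warnings.append("b=%d blocked threads, check for an I/O bottleneck"
                        % sample["b"])
    if sample["us"] < 2 * sample["sy"]:
        warnings.append("%%usr=%d is less than twice %%sys=%d"
                        % (sample["us"], sample["sy"]))
    return warnings


# Made-up sample line for illustration only (not a real measurement).
SAMPLE = " 2 0 0 1034208 512344 12 45 0 0 0 0 0 0 0 0 0 412 1830 655 40 30 30"

if __name__ == "__main__":
    sample = parse_vmstat_line(SAMPLE)
    for warning in rule_of_thumb_warnings(sample, cores=4):
        print("WARNING:", warning)
```

A single sample proves little, so in practice you would feed this many intervals and look at trends, and always correlate with mpstat, iostat and lockstat before drawing conclusions.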
Common CPU Benchmarks and What they mean :
The following list provides a set of definitions and examples for some of today's most common independent (industry accepted) computing benchmarks.
- SPEC CPU2006 : CPU-intensive benchmark suite, stressing a system's processor, memory subsystem and compiler. SPEC designed CPU2006 to provide a comparative measure of compute-intensive performance across the widest practical range of hardware. This benchmark suite includes both the SPEC int_rate2006 and SPEC fp_rate2006 benchmark tests.
| | | IBM System p570 | HP ProLiant DL360 G5 |
|---|---|---|---|
| SPECint_rate2006 | 78.5 | 60.9 | 61.3 |
| SPECfp_rate2006 | 62.3 | 58 | 38.8 |
- SPEC jbb2005 : SPECjbb2005 (Java Business Benchmark) measures the performance of a Java implemented application tier (server-side Java). The benchmark is based on the order processing of a wholesale supplier application. The metrics given are number of SPECjbb2005 bops (Business Operations per Second) and SPECjbb2005 bops/JVM (bops per JVM instance).
| | | IBM p6 570 | HP 2660 | Dell 2950 |
|---|---|---|---|---|
| Space (RU) | 1 | 4 | 2 | 2 |
| Power Consumption (Watts) | 464 | 560 | 563 | 300 |
| Performance (BOPS/JVM) | 170,153 | 87,737 | 80,884 | 74,218 |
| Performance / Watt | 366.7 | 156.7 | 143.7 | 247.4 |
| SWaP | 366.7 | 39.2 | 71.8 | 123.7 |
- SPEC jAppServer2004 : SPECjAppServer2004 is the only industry-standard benchmark used for Java Enterprise Edition application servers. In addition to testing application server performance, it also tests the database performance of servers deployed to support the application tier.
| | | HP rx2660 | Dell 2900 | IBM p5+ 550 |
|---|---|---|---|---|
| Space (RU) | 1 | 2 | 5 | 4 |
| Power Consumption (Watts) | 338 | 559 | 350 | 770 |
| Performance (SPECjApp JOPS) | 2,000.92 | 874.17 | 652.95 | 1,197.51 |
| Performance / Watt | 5.2 | 1.6 | 1.9 | 1.6 |
| SWaP | 5.2 | 0.8 | 0.4 | 0.4 |
- Many other industry standard benchmark tests exist, a good number of which can be found at the SPEC benchmark site.
A word on SWaP
Given today's environmental concerns (global warming), not to mention power, cooling, and floor space costs, Sun has created the SWaP (Space, Watts and Performance) metric to compare relative performance while taking into account both "space" (Rack Units) and "power consumption" (Watts).
The calculation is :
SWaP = Performance (operations or transactions per interval) / [ Space (RU) x Power Consumption (Watts) ]
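As a quick worked example, the SWaP row of the SPECjbb2005 table above can be recomputed from its Space, Power and Performance rows. The sketch below does just that; the function name is mine, and the systems are labeled by column position because the first system's name does not appear in the table.

```python
def swap_score(performance, space_ru, power_watts):
    """SWaP = performance / (space in rack units x power in watts)."""
    if space_ru <= 0 or power_watts <= 0:
        raise ValueError("space and power must be positive")
    return performance / (space_ru * power_watts)


# (space RU, power Watts, BOPS/JVM) taken from the SPECjbb2005 table rows.
SPECJBB_COLUMNS = [
    (1, 464, 170153),
    (4, 560, 87737),
    (2, 563, 80884),
    (2, 300, 74218),
]

if __name__ == "__main__":
    for position, (space, watts, bops) in enumerate(SPECJBB_COLUMNS, start=1):
        score = swap_score(bops, space, watts)
        print(f"column {position}: SWaP = {score:.1f}")
```

Note how a 4 RU system with similar power draw ends up with a much lower SWaP than its raw performance per watt would suggest, since the rack space it occupies divides the score again.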
Sun Benchmarks and Comparative Methodology :
For the purpose of internal-only comparative benchmarking, Sun provides and maintains (internal-only) results for AMP v.2 and M-value benchmarks. As stated very clearly at both sites (from the URL's noted above) :
Over the past 12 years with Sun, on several occasions I've been brought into a mission critical production environment having performance issues just after going live. The most common causes include :
- NOT doing any type of actual Pre-Production Testing on the "target" configuration to be deployed, including :
- NOT doing any sort of formal POC (Proof of Concept) with the target configuration to be purchased or migrated to. A proof of concept is typically not an all-out formal benchmark effort, but would minimally give you the opportunity to conduct Functional Testing, in addition to some simulated "production like" load tests against the configuration, which could demonstrate that the staged target environment will handle a representative sample of "production like" workload.
- NOT doing formal benchmarking, using a copy of Production data, and simulating actual samples of the most active production workloads (DB queries, Client Access patterns, Network traffic, etc.) with a tool such as LoadRunner.
- While the reasons above are nearly always a factor, the cause that most frequently crops up in these scenarios is that the only "Sizing" done was to compare the "before" and "after" M-Values to generate quotes! I have encountered this even in mission critical environments, where NO pre-production staging or load testing was done! This is just WRONG. It's more than a bad practice, and possibly one that could get you fired if your (or your customer's) production environment goes down in flames after lots of $$ was spent! Push back if necessary!
** Single-Core to Multi-Core (CMT) M-Value ("on paper") Comparisons Should CAUTIOUSLY be evaluated !! **
Regarding the trend of migrating production environments from single-core to multi-core architectures, beware! Even though a generic benchmark test might show much higher #'s with fewer HW threads (and/or cores) than a current production configuration has, realize that there is a lot more to proper sizing and capacity planning (each configuration is unique) than is reflected in M/GHz (or in M-values)! Certain types of workload require a specific # of HW cores for their environment to perform optimally "on cpu" without much cpu/kernel contention (locking, high TLB misses, context switching, and/or cpu-migrations ..).
Hopefully this article has helped you reflect on the wide variety of CPU options available, as well as how they play such a significant role as the "cornerstone" of system architecture and the overall performance of our customers' production environments. Enjoy, and "let the chips fall (or rise) as they may"... :)
For more information regarding
Performance Analysis, Capacity Planning, and related Tools, see
Todd's Blog at : http://blogs.sun.com/toddjobson/category/Performance+and+Capacity+Planning
* Copyright 2007 Todd A. Jobson *
Posted at 07:33PM Nov 07, 2007 by tjobson in Performance and Capacity Planning
Wednesday Nov 07, 2007






