Darryl Gove's blog
Libraries (2)
Just updated the ld_dot script to include filter libraries. Added a profile for ssh logging into a system, rather than just showing the help message (click the image for the full size version).
Posted at 11:52AM May 20, 2009 by Darryl Gove in Sun |
Libraries
I was talking to Rod Evans about the diagnostic capabilities available in the runtime linker. These are available through the environment setting LD_DEBUG. The setting LD_DEBUG=files gives diagnostic information about which libraries were loaded by which other libraries. This is rather hard to interpret, and would look better as a graph. It's relatively easy to parse the output from LD_DEBUG into dot format. This script does the parsing. The full stesp to do this for the date command are:
$ LD_DEBUG=files date >ld_date 2>&1 $ ld_dot ld_date $ dot -Tpng -o date.png dot.dot
The lines in the graph represent which libraries use which other libraries. Solid lines indicate "needed" or hard links, the dotted lines represent lazy loading or dynamic loading (dlopen). The resulting graph looks like:
More complex commands like ssh pull in a larger set of libraries:
It is possible to use this on much larger applications. Unfortunately, the library dependencies tend to get very complex. This is the library map for staroffice.
Posted at 06:12AM May 20, 2009 by Darryl Gove in Sun | Comments[4]
Developer's Edge safari rerelease
We've just pushed a new version of The Developer's Edge to safari. The original version didn't show any of the text from each section of the book unless you logged into the safari site. The new version shows the snippet from each section even if you're not a subscriber.
I was pleased to see that the book is featured on the Sun Studio developer portal.
I'm also scheduled to give a second life presentation during JavaOne at 9am PST on the 3rd June.
Posted at 02:01PM May 19, 2009 by Darryl Gove in Sun |
The perils of strlen
Just been looking at an interesting bit of code. Here's a suitably benign version of it:
#include <string.h>
#include <stdio.h>
void main()
{
char string[50];
string[49]='\0';
int i;
int j=0;
for (i=0; i<strlen(string); i++)
{
if (string[i]=='1') {j=i;}
}
printf("%i\n",j);
}
Compiling this bit of code leads to a loop that looks like:
.L900000109:
/* 0x002c 12 */ cmp %i5,49
/* 0x0030 10 */ add %i3,1,%i3
/* 0x0034 12 */ move %icc,%i4,%i2
/* 0x0038 10 */ call strlen ! params = %o0 ! Result = %o0
/* 0x003c */ or %g0,%i1,%o0
/* 0x0040 */ add %i4,1,%i4
/* 0x0044 */ cmp %i4,%o0
/* 0x0048 */ bcs,a,pt %icc,.L900000109
/* 0x004c 12 */ ldsb [%i3],%i5
The problem being that for each character tested there's also a call to strlen! The reason for this is that the compiler cannot be sure what the call to strlen actually returns. The return value might depend on some external variable that could change as the loop progresses.
There's a lot of functions defined in the libraries that the compiler could optimise, if it was certain that it recognised them. The compiler flag that enables recognition of the "builtin" functions is -xbuiltin (which is included in -fast. This enables the compiler to do things like recognise calls to memcpy or memset and in some instances produce more optimal code. However, it doesn't recognise the call the strlen.
In terms of solving the problem, there are two approaches. The most portable approach is to hold the length of the string in a temporary variable:
int length=strlen(string); for (i=0; i<length; i++)
Another, less portable approach, is to use #pragma no_side_effect. This pragma means the return value of the function depends only on the parameters passed into the function. So the result of calling strlen only depends on the value of the constant string that is passed in. The modified code looks like:
#include <string.h>
#include <stdio.h>
#pragma no_side_effect(strlen)
void main()
{
char string[50];
string[49]='\0';
int i;
int j=0;
for (i=0; i<strlen(string); i++)
{
if (string[i]=='1') {j=i;}
}
printf("%i\n",j);
}
And more importantly, the resulting disassembly looks like:
.L900000109:
/* 0x0028 0 */ sub %i1,49,%o7
/* 0x002c 11 */ add %i3,1,%i3
/* 0x0030 0 */ sra %o7,0,%o5
/* 0x0034 13 */ movrz %o5,%i4,%i2
/* 0x0038 11 */ add %i4,1,%i4
/* 0x003c */ cmp %i4,%i5
/* 0x0040 */ bcs,a,pt %icc,.L900000109
/* 0x0044 13 */ ldsb [%i3],%i1
Posted at 02:15PM May 07, 2009 by Darryl Gove in Sun | Comments[2]
Computer organization and design
Just returned from Europe - customer visits and the OpenSPARC workshop in Brussels. Since I was doing a fair amount of air travel I took a number of books. With good timing, I'd just got a copy of Patterson & Hennessy's Computer Organisation and Design.
The book is certainly an interesting read. Although there are various ways you might read the book - or various lessons you might extract from it, I took it as basically a book that describes how to build a MIPS processor. Much of the work I do is at the hardware-software border, so I found it useful to read around the domain a bit more. The book is a relatively easy read, lots of detail (which I didn't need to memorise), and a nice pace that builds up from a simple core into a more complete implementation of a processor. The book has a chapter on multicore, but this was not treated to the same depth. There's also some material about GPUs, and that also didn't fit very well.
Posted at 01:28PM Apr 30, 2009 by Darryl Gove in Sun |
Compiling for Nehalem and other processors
First thing I'd suggest is to make sure that you're using a recent compiler. Practically that means Sun Studio 12 (which is quite old now), or the Sun Studio Express releases (aka Sun Studio 12 Update 1 Early Access). Obviously the latest features are only found in the latest compilers.
In terms of flags, the starting point should be -fast, you can later trim that if you need to remove floating point simplification, or are not happy with one or other of the flags that it includes.
The next flags to use are:
-xipo=2This flag enables crossfile optimisation and tracking of allocated memory. I find crossfile optimisation useful because it limits the impact of the source code structure on the final application.-xarch=sse4_2This flag is included in-fast(assuming that the build system supports it). However, if you later plan to fine tune the compiler flags, it's best to start off with it explicitly. This allows the compiler to use the SSE4 instruction set. Probably-xarch=sse2will be sufficient in most circumstances - it's a call depending on the system that the application will be deployed on.-xvector=simdThis flag tells the compiler to generate SIMD (single instruction multiple data) instructions - basically the combination of this and the architecture flag enables the compiler to generate applications that use SSE instructions. These instructions can lead to substantial performance gains in some floating point applications.-m64On x86 there's performance to be gained from using the 64-bit instruction set extensions. The code gets more registers and a better calling convention. These tend to outweigh the costs of the larger memory footprint of 64-bit applications.-xpagesize=2MThis tells the operating system to provide large pages to the application.- The other optimisation that I find very useful is profile feedback. This does complicate and lengthen the build process, but is often the most effective way of getting performance gains for codes dominated by branches and conditional code.
- The other flags to consider are the aliasing flags
-xalias_level=stdfor C,-xalias_level=compatiblefor C++, and-xrestrict. These flags do lead to performance gains, but require the developer to be comfortable that their code does conform to the requirements of the flags. (IMO, most code does.)
All this talk about flags should not be a replacement for what I consider to be the basic first step in optimising the performance of an application: to take a profile. Compiler flags tell the compiler how to do a good job of producing the code, but the compiler can't do much about the algorithms used. Profiling the application will often give a clue as to a way that the performance of the application can be improved by a change of algorithm - something that even the best compiler flags can't always do.
Posted at 04:41PM Apr 17, 2009 by Darryl Gove in Sun |
The Developer's Edge - Hardcopy updated
The cover art for The Developer's Edge has been updated. You can see this on the Amazon picture, which now agrees with the one on the left of this blog.
Posted at 12:00AM Apr 03, 2009 by Darryl Gove in Sun |
Strong prefetch
Looked at an interesting problem yesterday. One of the hot routines in an application was showing a lot of system time. The overall impact on the application was minor, but any unexpected system time is worth investigating.
Profiling under the performance analyzer indicated a pair of instructions, a prefetch followed by a load. The load had all the time attributed to it, but it's usual for the performance analyzer to attribute time to the instruction following the one that is slow. Something like (this is synthetic data, measured data later):
User System 0.5 0.5 prefetch [%o0],23 10.0 20.0 ld [%i1],%i2
So there are three things that sprang to mind:
- Misaligned memory accesses. This would probably be the most common cause of system time on memory operations. However, it didn't look like a perfect file since the system time would usually be reported on the instruction afterwards.
- TSB misses. The TLB is the on-chip structure that holds virtual to physical mappings. If the mapping is not in the TLB, then the mapping gets fetched from a software managed structure called the TSB. This is a TLB miss. However, the TSB has a capacity limit too, and it is possible for there to be a TSB miss where the mapping is fetched from memory and placed into the TSB. TLB misses are fast traps and don't get recorded under system time. However, TSB misses do cause the switch. This is quite a strong candidate, however trapstat -t didn't show much activity.
- The third option was the prefetch instruction.
A prefetch instruction requests data from memory in advance of the data being used. For many codes this results in a large gain in performance. There's a number of variants of prefetch, and their exact operation depends on the processor. The SPARC architecture 2007 on pages 300-303 gives an outline of the variants. There's prefetch to read, prefetch to write, prefetch for single use, prefetch for multiple use. The processor has the option to do something different depending on which particular prefetch was used to get the data.
The system I was running on was an UltraSPARC IV+. This introduced the strong prefetch variant.
The idea of the strong prefetch variant is that it will cause a TLB miss to be taken if the mapping of the requested address is not present in the TLB. This is useful for situations where TLB misses are frequently encountered and the prefetch needs to be able to deal with them. To explain why this is important, consider this example. I have a loop which strides through a large range of memory, the loop optimally takes only a few cycles to execute - so long as the data is resident in cache. It takes a couple of hundred cycles to fetch data from memory, so I end up fetching for eight iterations ahead (eight is the maximum number of outstanding prefetches on that processor). If the data resides on a new page, and prefetches do not cause TLB misses, then eight iterations will complete before a load or store touches the new page of memory and brings the mapping into the TLB. The eight prefetches that were issued for those iterations will have been dropped and all the iterations will see the full memory latency (a total of 200*8=1600 cycles of cost). However, if the prefetches are strong prefetches, then the first strong prefetch will cause the mapping to be brought into the TLB, and all the prefetches will hit in the TLB.
However, there's a corner case. A strong prefetch of a page that is not mapped will still cause the TLB miss, and the corresponding system time while the kernel figures drops the access to a non-existant page.
This was my suspicion. The code had a loop, and each iteration of the loop would prefetch the next item in the linked list. If the code had reached the end of the list, then it would be issuing a prefetch for address 0, which (of course) is not mapped.
It's relatively easy to provide a test case for this, just loop around issuing prefetches for address zero and see how long the code takes to run. Prefetches are defined in the header file <sun_prefetch.h>. So the first code looks like:
#include <sun_prefetch.h>
void main()
{
for (int i=0; i<1000000; i++)
{
sparc_prefetch_write_many(0);
}
}
Compiling and running this gives:
$ cc pref.c $ timex a.out real 2.59 user 1.24 sys 1.34
So the code has substantial system time - which fits with the profile of the application. Next thing to look at is the location of that system time:
$ collect a.out
$ er_print -metrics e.user:e.system -dis main test.1.er
...
Excl. Excl.
User CPU Sys. CPU
sec. sec.
...
0. 0. [?] 10b8c: bset 576, %o1 ! 0xf4240
0. 0. [?] 10b90: prefetch [0], #n_writes_strong
## 1.571 1.181 [?] 10b94: inc %i5
0. 0. [?] 10b98: cmp %i5, %o1
...
So the system time is recorded on the instruction following the prefetch.
The next step is to investigate ways to solve it. The <sun_prefetch.h> header file does not define any variants of weak prefetch. So it's necessary to write an inline template to provide the support:
.inline weak,4 prefetch [%o0],2 .end .inline strong,4 prefetch [%o0],22 .end
The identifier for many writes strong is #22, for many writes weak it is #2. The initial code uses strong since I want to validate that I get the same behaviour with my new code:
void weak(int*);
void strong(int*);
void main()
{
for (int i=0; i<1000000; i++)
{
strong(0);
}
}
Running this gives:
$ cc pref.c pref.il
$ collect a.out
$ er_print -metrics e.user:e.system -dis main test.1.er
...
Excl. Excl.
User CPU Sys. CPU
sec. sec.
...
0. 0. [?] 10b90: clr %o0
0. 0. [?] 10b94: nop
0. 0. [?] 10b98: prefetch [%o0], #n_writes_strong
## 1.761 1.051 [?] 10b9c: inc %i5
...
So that looks good. Replacing the strong prefetch with a weak prefetch:
void weak(int*);
void strong(int*);
void main()
{
for (int i=0; i<1000000; i++)
{
weak(0);
}
}
And then repeat the profiling experiment:
Excl. Excl.
User CPU Sys. CPU
sec. sec.
...
0. 0. [?] 10b90: clr %o0
0. 0. [?] 10b94: nop
0. 0. [?] 10b98: prefetch [%o0], #n_writes
0. 0. [?] 10b9c: inc %i5
...
Running the code under timex:
$ timex a.out real 0.01 user 0.00 sys 0.00
So that got rid of the system time. Of course with this change the prefetches issued will now be dropped if there is a TLB miss, and it is possible that could cause a slowdown for the loads in the loop - which is a trade-off. However, examining the application using pmap -xs <pid> indicated that the data resided on 4MB pages, so the number of TLB misses should be pretty low (this was confirmed by the trapstat data I gathered earlier).
One final comment is that this behaviour is platform dependent. Running the same strong prefetch code on an UltraSPARC T2, for example, shows no system time.
Posted at 11:03AM Apr 02, 2009 by Darryl Gove in Sun |
NUMA, binding, and OpenMP
One of my colleagues did an excellent bit of analysis recently, it pulls together a fair number of related topics, so I hope you'll find it interesting.
We'll start with NUMA. Non-Uniform Memory Access. This is in contrast to UMA - Uniform Memory Access. This relates to memory latency - how long does it take to get data from memory to the processor. If you take a single CPU box, the memory latency is basically a measurement of the wires between the processor and the memory chips, it typically is about 90ns, can be as low as 60ns. For a 3GHz chip this is from around 200 to 300 cycles, which is a fair length of time.
Suppose we add a second chip into the system. The memory latency increases because there's now a bunch of communication that needs to happen between the two chips. The communication consists of things like checking that more recent data is not in the cache of the other chip, co-ordinating access to the same memory bank, accessing memory that is controlled by the other processor. The upshot of all this is that memory latency increases. However, that's not all.
If you have two chips together with a bunch of memory, you can have various configurations. The most likely one is that each chip gets half the memory. If one chip has to access memory that the other chip owns, this is going to take longer than if the memory is attached to that chip. Typically you might find that local memory takes 90ns to access, and remote memory 120ns.
One way of dealing with this disparity is to interleave the memory, so one cacheline will be local, the next remote. Doing this you'll see an average memory latency of 105ns. Although the memory latency is longer than the optimal, there's nothing a programmer (or an operating system) can do about it.
However, those of use who care about performance will jump through hoops of fire to get that lower memory latency. Plus as the disparity in memory latency grows larger, it makes less and less sense to average the cost. Imagine a situation on a large multi-board system where the on-board memory latency might be 150ns, but the cross-board latency would be closer to 300ns (I should point out that I'm using top-of-the-head numbers for all latencies, I'm not measuring them on any systems). The impact of doing this averaging could be a substantial slow-down in performance for any application that doesn't fit into cache (which is most apps). (There are other reasons for not doing this, such as limiting the amount of traffic that needs to go across the various busses.)
So most systems with more than one CPU will see some element of NUMA. Solaris has contained some memory placement optimisations MPO since Solaris 9. These optimisations attempt to allocate memory locally to the processor that is running the application. OpenSolaris has the lgrpinfo command that provides an interface to see the levels of memory locality in the system.
Solaris will attempt to schedule threads so that they remain in their locality group - taking advantage of the local memory. Another way of controlling performance is to use binding to keep processes, or threads on a particular processor. This can be done through the pbind command. Processor sets can performance a similar job (as can zones, or even logical domains), or directly through processor_bind.
Binding can be a tricky thing to get right. For example in a system where there are multiple active users, it is quite possible to end up in a situation where one virtual processor is oversubscribed with processes, whilst another is completely idle. However, in situations where this level of control enables better performance then binding can be hugely helpful.
One situation where binding is commonly used is for running OpenMP programs. In fact, it is so common that the OpenMP library has built in support for binding through the environment variable SUNW_MP_PROCBIND. This variable enables the user to specify which threads are bound to which logical processors.
It is worth pointing out that binding does not just help memory locality issues. Another situation where binding helps is thread migrations. This is the situation where an interrupt, or another thread requires attention and this causes the thread currently running on the processor to be descheduled. In some situations the descheduled thread will get scheduled onto another virtual processor. In some instances that may be the correct decision. In other instances it may result in lower than expected performance because the data that the thread needs is still in the cache on the old processor, and also because the migration of that thread may cause a cascade of migrations of other threads.
The particular situation we hit was that one code when bound showed bimodal distributions of runtimes. It had a 50% chance of running fast or slow. We were using OpenMP as well as the SUNW_MP_PROCBIND environment variable, so in theory we'd controlled for everything. However, the program didn't hit the parallel section until after a few minutes of running, and examining what was happening using both pbind and also the Performance Analyzer indicated what the problem was.
The environment variable SUNW_MP_PROCBIND currently binds threads once the code reaches the parallel region. Until that point the process is unbound. Since the process is unbound, Solaris can schedule it to any available virtual CPU. During the unbound time, the process allocated the memory that it needed, and the MPO feature of Solaris ensured that the memory was allocated locally. I'm sure you can see where this is heading.... Now, once the code hit the parallel region, the binding occurred, and the main thread was bound to a particular locality group, half the time this group was the same group where it had been running before, and half the time it was a different locality group. If the locality group was the same, then memory would be local, otherwise memory would be remote (and performance slower).
We put together a LD_PRELOAD library to prove it. The following code has a parallel section in it which gets called during initialisation. This ensures that binding has already taken place by the time the master thread starts.
#include <stdio.h>
#pragma init(s)
void s()
{
#pragma omp parallel sections
{
#pragma omp section
{
printf("Init");
}
}
}
The code is compiled and used with:
$ cc -O -xopenmp -G -Kpic -o par.so par.c $ LD_PRELOAD=./par.so ./a.out
Posted at 12:08AM Apr 01, 2009 by Darryl Gove in Sun |
The Developer's Edge available in hardcopy
The Developer's Edge is now available as hardcopy!
It is being made available as print-on-demand. You can either order through the publisher Vervante, or you can order through Amazon.
However, I suggest you wait until next week before ordering as the current cover art is not the final version (you can play spot the difference between the image on the left and the one on the Vervante website). I'll post when it gets fixed. Of course, you can order the "limited-edition" version if you want 
I introduced the book in a previous post. I'll reproduce a bit more of the details in this post. The brief summary is:
The Developer's Edge: Selected Blog Posts and Articles focuses on articles in the following areas:
- Native language issues
- Performance and improving performance
- Specific features of the x86 and SPARC processors
- Tools that are available in the Solaris OS
The articles provide a broad overview on a topic, or an in-depth discussion. The texts should provide insights into new tools or new methods, or perhaps the opportunity to review a known domain in a new light.
You can get more details than this from the Safari site, either reading the preface or skimming the table of contents
I would love to hear feedback on this book, feel free to e-mail me directly, leave comments on amazon, or leave comments on this blog, or on the blogs of the other contributors.
Posted at 08:00AM Mar 27, 2009 by Darryl Gove in Sun |
When to add a membar (an example)
I was recently having a discussion on one of the OpenSolaris lists on the topic of when to use the volatile keyword, and when it was necessary to use membars.
So volatile is a clue to the compiler to load the variable from memory and immediately store it back to memory. What it does not do is to tell the hardware anything. So the application can perform the store, but that store may not be immediately visible to the rest of the system. Most of the time this is not a problem - so long as the store is visible to the processor on which the thread is executing it's fine. Variability of when the store is visible to other processors may also be fine. There is one clear situation where the ordering of store operations could be a problem - and that's unlocking mutexes.
The problem here is best illustrated by the following scenario. I lock some data structure, then store new values into it, then unlock the structure. Immediately another thread comes along and uses the values in that structure. Not an uncommon situation. Unlocking a mutex is often just a case of storing a value (of zero) into the mutex structure. And here's the potential problem. In some weaker ordering architectures there is no guarantee that other processors see the stores in the same order that they are performed. So if you have Store A followed by Store B it may be possible for other processors to observe the change in the value of B before they see the change in the value of A. In the case of mutex unlock, the store of B would be the action that unlocked the mutex, enabling other threads to access the variable A... and there could be problems if they see the old value of A.
The solution to this is to put a membar in before the store that unlocks the mutex. You can see this happening in the OpenSolaris code:
41 /*
42 * lock_clear(lp)
43 * - clear lock.
44 */
45 ENTRY(_lock_clear)
46 membar #LoadStore|#StoreStore
47 retl
48 clrb [%o0]
49 SET_SIZE(_lock_clear)
The membar ensures that all the pending stores are visible to other processors before the store that releases the lock becomes visible to them.
Posted at 12:00AM Mar 27, 2009 by Darryl Gove in Sun | Comments[2]
OpenSPARC workshop - Brussels
April 24th and 25th I'm going to be in Brussels rerunning the OpenSPARC workshop. The workshop leverages the material in the OpenSPARC Internals book, together with the OpenSPARC presentations. There's a poster with all the details (for those with acute eyesight, the poster on the left is from the December workshop!).
Posted at 09:08PM Mar 26, 2009 by Darryl Gove in Sun |
University of Washington Presentation
I was presenting at the University of Washington, Seattle, on Wednesday on Solaris and Sun Studio. The talk covers the tools that are available in Solaris and Sun Studio. This is my "Grand Unified" presentation, it covers tools, compilers, optimisation, parallelisation, and debug.
Posted at 02:08PM Mar 20, 2009 by Darryl Gove in Sun | Comments[2]
Always use the latest firmware
Steve Sistare has an excellent write up of a scaling issue that we hit last year. The issue was frustrating because all the tools seemed to indicate a healthy system, but we were just not getting the scaling that we expected. The solution, as Steve writes, was a firmware update. Which was great - the problem was solved - but frustrating because we could have just started by updating the firmware.... but it's not something you always think of!
Posted at 01:03PM Mar 20, 2009 by Darryl Gove in Sun |
Volatile and Mutexes
A rough-and-ready guide to when to use the volatile keyword and when to use mutexes.
When you declare a variable to be volatile it ensures that the compiler always loads that variable from memory and immediately stores it back to memory after any operation on it.
For example:
int flag;
while (flag){}
In the absence of the volatile keyword the compiler will optimise this to:
if (!flag) while (1) {}
[If the flag is not zero to start with then the compiler assumes that there's nothing that can make it zero, so there is no exit condition on the loop.]
Not all shared data needs to be declared volatile. Only if you want one thread to see the effect of another thread.
[Example, if one thread populates a buffer, and another thread will later read from that buffer, then you don't need to declare the contents of the buffer as volatile. The important word being later, if you expect the two threads to access the buffer at the same time, then you would probably need the volatile keyword]
Mutexes are there to ensure exclusive access:
You will typically need to use them if you are updating a variable, or if you are performing a complex operation that should appear 'atomic'
For example:
volatile int total; mutex_lock(); total+=5; mutex_unlock();
You need to do this to avoid a data race, where another thread could also be updating total:
Here's the situation without the mutex:
Thread 1 Thread 2 Read total Read total Add 5 Add 5 Write total Write total
So total would be incremented by 5 rather than 10.
An example of a complex operation would be:
mutex_lock(); My_account = My_account - bill; Their_account = Their_account + bill; mutex_unlock();
You could use two separate mutexes, but then there would be a state where the amount bill would have been removed from My_account, but not yet placed into Their_account (this may or may not be a problem).
Posted at 10:48AM Mar 20, 2009 by Darryl Gove in Sun |
Sun Studio 12 Update 1 Early Access programme
We've just released Sun Studio Express 03/09. We're also using this release to start the Sun Studio 12 Update 1 Early Access Programme.
There are a bunch of new features in Sun Studio 12 Update 1 - you should be familiar with these if you've already been using the Express releases, but if you're coming from Sun Studio 12, there's much to recommend the new release. Top of my list are the following:
- spot is finally integrated into the suite - please try it out and let me know how it works out for you.
- The Performance Analyzer has support for MPI profiling
- Full support for OpenMP 3.0
If you join the Early Access programme, we'll be listening out in case you hit any bugs, or have any suggestions for improvements. There's a forum for posting questions for the duration of the EA programme.
There's also a number of incentives for registering.
Posted at 11:00AM Mar 18, 2009 by Darryl Gove in Sun |
Welcome to "The Developer's Edge"
Late Sunday night I got news that my new book "The Developer's Edge" was available on Safari. This has been a project I've worked on for about six months - so a much quicker turnaround than "Solaris Application Programming".
The Developer's Edge is a collection of blog posts and articles. My contribution to the material is from either this blog, or the articles that I link from the left column. However, it also contains a lot of material that other people have written, and that I have found interesting. The original plan came out of some informal discussions with the docs folks around this time last year. We decided to try and capture some of the vast stream of material that goes out through the various sun.com websites (blogs.sun.com, developers.sun.com, wikis.sun.com etc.) - these are kind-of "Edge" publications - hence the title.
The book will also be available as print-on-demand. We're pushing the material through the channels as fast as we can, so the "launch" is asynchronous! I'll announce the print version when it happens.
Posted at 08:00AM Mar 17, 2009 by Darryl Gove in Sun | Comments[1]
Peak rate of window spill and fill traps
I've been looking at the performance of a code recently. Written in C++ using many threads. One of the things with C++ is that as a language it encourages developers to have lots of small routines. Small routines lead to many calls and returns; and in particular they lead to register window spills and fills. Read more about register windows. Anyway, I wondered what the peak rate of issuing register window spills and fills was?
I'm going to use some old code that I used to examine the cost of library calls a while back. The code uses recursion to reach a deep call depth. First off, I define a couple of library routines:
extern int jump2(int count);
int jump1(int count)
{
count--;
if (count==0)
{
return 1;
}
else
{
return 1+jump2(count);
}
}
and
int jump1(int count);
int jump2(int count)
{
count--;
if (count==0)
{
return 1;
}
else
{
return 1+jump1(count);
}
}
I can then turn these into libraries:
$ cc -O -G -o libjump1.so jump1.c $ cc -O -G -o libjump2.so jump2.c
Done in this way, both libraries have hanging dependencies:
$ ldd -d libjump1.so
symbol not found: jump2 (./libjump1.so)
So the main executable will have to resolve these. The main executable looks like:
#include <stdio.h>
#define RPT 100
#define SIZE 600
extern int jump1(int count);
int main()
{
int index,count,links,tmp;
tmp=jump1(100);
#pragma omp parallel for default(none) private(count,index,tmp,links)
for(links=1; links<10000; links++)
{
for (count=0; count<RPT; count++)
{
for (index=0;index<SIZE;index++)
{
tmp=jump1(links);
if (tmp!=links) {printf("mismatch\n");}
}
}
}
}
This needs to be compiled in such a way as to resolve the unresolved dependencies
$ cc -xopenmp=noopt -o par par.c -L. -R. -ljump1 -ljump2
Note that I'm breaking rules again by making the runtime linker look for the dependent libraries in the current directory rather than use $ORIGIN to locate them relative to the executable. Oops.
I'm also using OpenMP in the code. The directive tells the compiler to make the outer loop run in parallel over multiple processors. I picked 10,000 for the trip count so that the code would run for a bit of time, so I could look at the activity on the system. Also note that the outer loop defines the depth of the call stack, so this code will probably cause stack overflows at some point, if not before. Err...
I'm compiling with -xopenmp=noopt since I want the OpenMP directive to be recognised, but I don't want the compiler to use optimisation, since if the compiler saw the code it would probably eliminate most of it, and that would leave me with nothing much to test.
The first thing to test is whether this generates spill fill traps at all. So we run the application and use trapstat to look at trap activity:
vct name | cpu13 ---------------------------------------- ... 84 spill-user-32 | 3005899 ... c4 fill-user-32 | 3005779 ...
So on this 1.2GHz UltraSPARC T1 system, we're getting 3,000,000 traps/second. The generated code is pretty plain except for the save and restore instructions:
jump1()
21c: 9d e3 bf a0 save %sp, -96, %sp
220: 90 86 3f ff addcc %i0, -1, %o0
224: 12 40 00 04 bne,pn %icc,jump1+0x18 ! 0x234
228: 01 00 00 00 nop
22c: 81 c7 e0 08 ret
230: 91 e8 20 01 restore %g0, 1, %o0
234: 40 00 00 00 call jump2
238: 01 00 00 00 nop
23c: 81 c7 e0 08 ret
240: 91 ea 20 01 restore %o0, 1, %o0
So you can come up with an estimate of 300 ns/trap.
The reason for using OpenMP is to enable us to scale the number of active threads. Rerunning with 32 threads, by setting the environment variable OMP_NUM_THREADS to be 32, we get the following output from trapstat:
vct name | cpu21 cpu22 cpu23 cpu24 cpu25 cpu26 ------------------------+------------------------------------------------------- ... 84 spill-user-32 | 1024589 1028081 1027596 1174373 1029954 1028695 ... c4 fill-user-32 | 996739 989598 955669 1169058 1020349 1021877
So we're getting 1M traps per thread, with 32 threads running. Let's take a look at system activity using vmstat.
vmstat 1 kthr memory page disk faults cpu r b w swap free re mf pi po fr de sr s1 s2 s3 s4 in sy cs us sy id ... 0 0 0 64800040 504168 0 0 0 0 0 0 0 0 0 0 0 3022 427 812 100 0 0 0 0 0 64800040 504168 0 0 0 0 0 0 0 0 0 0 0 3020 428 797 100 0 0 0 0 0 64800040 504168 0 0 0 0 0 0 0 0 0 0 0 2945 457 760 100 0 0 0 0 0 64800040 504168 0 0 0 0 0 0 0 0 0 0 0 3147 429 1025 99 1 0 0 0 0 64800040 504168 0 15 0 0 0 0 0 0 0 0 0 3049 666 820 99 1 0 0 0 0 64800040 504168 0 1 0 0 0 0 0 0 0 0 0 3044 543 866 100 0 0 0 0 0 64800040 504168 0 0 0 0 0 0 0 0 0 0 0 3021 422 798 100 0 0
So there's no system time being recorded - the register spill and fill traps are fast traps, so that's not a surprise.
One final thing to look at is the instruction issue rate. We can use cpustat to do this:
2.009 6 tick 63117611 2.009 12 tick 69622769 2.009 7 tick 62118451 2.009 5 tick 64784126 2.009 0 tick 67341237 2.019 17 tick 62836527
As might be expected from the cost of each trap, and the sparse number of instructions between traps, the issue rate of the instructions is quite low. Each of the four threads on a core is issuing about 65M instructions per second. So the core is issuing about 260M instructions per second - that's about 20% of the peak issue rate for the core.
If this were a real application, what could be done? Well, obviously the trick would be to reduce the number of calls and returns. At a compiler level, that would using flags that enable inlining - so an optimisation level of at least -xO4; adding -xipo to get cross-file inlining; using -g0 in C++ rather than -g (which disables front-end inlining). At a more structural level, perhaps the way the application is broken into libraries might be changed so that routines that are frequently called could be inlined.
The other thing to bear in mind, is that this code was designed to max out the register window spill/fill traps. Most codes will get nowhere near this level of window spills and fills. Most codes will probably max out at about a tenth of this level, so the impact from register window spill fill traps at that point will be substantially reduced.
Posted at 03:31PM Mar 05, 2009 by Darryl Gove in Sun | Comments[2]
Reporting bugs
I figured I'd just write up some quick notes about how to report bugs, if you find them, against Sun Studio. There are various mechanisms.
- First off, there are formal support channels. These are really for the situations where you need a timely response, or are using an older version.
- The next place that can be useful are the Sun Studio forums. We do read these, and reply - they are very active. It can be useful to discuss problems here in case anyone has hit the same issue, or if you are uncertain whether the code or the compiler is at fault.
- There is also a formal bug/rfe reporting service. If the compiler dies with an error, this is a good place to file that. Before reporting a new bug or rfe you should first take a look to see if someone else has already reported it. There's also support for watching the status of bugs, or voting for your top three (although the idea of a favourite bug is a bit disturbing).
In terms of what is useful to include:
- Platform (chip, system type etc)
- OS version
- Compiler version
- Compiler flags/commandline
- symptoms etc
- Call stack [dbx - core; whereami]
If the problem causes the compiler to crash with an error, then to reproduce the problem we'll need the source file. Obviously this depends on your environment and header files etc. To produce a preprocessed source file remove any -o <file.o> from the command line and add -P. -P will produce a preprocessed file <file.i>. Check that compiling <file.i> produces the same error, and submit that with the bug report. Obviously, don't submit files that you don't want anyone else to see!
Well, I hope that you never need this info!
Posted at 01:47PM Mar 05, 2009 by Darryl Gove in Sun | Comments[1]
Relocation Errors
ld.so.1: prog: fatal: relocation error: file ./libfoo.so.1:
symbol bar: referenced symbol not found
Program prog uses libfoo.so.1, and that library has an unresolved dependency on the symbol bar. You can check for this problem using:
$ ldd -d prog
As outlined in the linker guide
Don't, what ever you do, solve it using LD_LIBRARY_PATH!
Posted at 11:56AM Mar 02, 2009 by Darryl Gove in Sun | Comments[9]
Downloading Sun Studio documentation
It's possible to download the entire set of documentation for Sun Studio. These are both the compiler manuals, plus some of the technical articles. There are some other bundles of docs which may also be worth having locally. The Solaris 10 Developer docs, the Solaris 10 Reference docs, and the Solaris 10 System Administrator docs.
Posted at 10:47AM Mar 02, 2009 by Darryl Gove in Sun |
Maximising application performance
I was asked to provide the material for the 2008-2009 Techdays session "Maximising Application Performance". I recorded this as a presentation back last year, and it's now available through SDN. The talk covers basic compiler and profiling material, and is a relatively short 37 minutes in duration.
Posted at 10:53AM Feb 25, 2009 by Darryl Gove in Sun | Comments[2]
OpenSolaris Bible
I've just finished reading The OpenSolaris Bible. At just over 1,000 pages it's very fortunate that I had a couple of plane journeys during which to read it. The book is six parts:
- Introduction to OpenSolaris
- Using OpenSolaris
- Files ystems, networking, and security
- Reliability, availability, and serviceability
- Virtualisation
- Developing and deploying on OpenSolaris
At its core the book is about using OpenSolaris. It has sections on developing and packaging applications for OpenSolaris, but these are brief in comparison to the rest of the content. There's a lot of content on things like setting up Zones, or configuring ZFS file systems. All of it really useful to have around as a reference. There were also some really nice tidbits of information sprinkled through the book (such as the reason for ypinit), which livened up the text.
Anyway, it's definitely something I'm glad to give shelf space to.
Posted at 04:24PM Feb 24, 2009 by Darryl Gove in Sun |
OSUM presentation: Multithreaded programming for CMT systems
My 8am PST presentation for OSUM seemed to go well. The slides from the presentation are available. The presentation can be streamed from the elliminate website..
There's a number of OSUM presentations available, the full list is on the site (registration required).
Posted at 12:06PM Jan 30, 2009 by Darryl Gove in Sun |
Tying the bell on the cat
Diane Meirowitz has finally written the document that many of us have either thought about writing, or wished that someone had already written. This is the document that maps gcc compiler flags to Sun Studio compiler flags.
Posted at 01:07PM Jan 28, 2009 by Darryl Gove in Sun |
OSUM presentation on multi-threaded coding
I'll be giving a presentation titled "Multi-threaded coding for CMT processors" to OSUM members next friday (8am PST). If you are an OSUM member you can read the details here. OSUM stands for Open Source University Meetup - the definition is
"OSUM (pronounced "awesome") is a global community of students that are passionate about Free and Open Source Software (FOSS) and how it is Changing (Y)Our World. We call it a "Meetup" to encourage collaboration between student groups to create an even stronger open source community.".
Posted at 03:57PM Jan 23, 2009 by Darryl Gove in Sun | Comments[3]
Out of memory in the Performance Analyzer
I've been working on an Analyzer experiment from a long running multithreaded application. Being MT I really needed to see the Timeline view to make sense of what was happening. However, when I switched to the Timeline I got a Java Out of Memory error (insufficient heap space).
Tracking this down, I used prstat to watch the Java application run and the memory footprint increase. I'd expected it to get to 4GB and die at that point, so I was rather surprised when the process was only consuming 1.1GB when the error occurred.
I looked at the commandline options for the Java process using pargs, and spotted the flag -Xmx1024m; which sets the max memory to be 1GB. Ok, found the culprit. You can use the -J
$ analyzer -J-Xmx4096m test.1.er
If you need more memory than that, you'll have to go to the 64-bit JVM, and allocate an appropriate amount of memory:
$ analyzer -J-d64 -J-Xmx8192m test.1.er
Posted at 03:17PM Jan 16, 2009 by Darryl Gove in Sun | Comments[2]
Debugging inline templates with dbx
Been working on inline templates to improve the performance on a couple of hot routines in a customer code. I've a couple of articles on this kind of work if you want to find out more details. There's an introductory article which covers the rules, and there's an article specifically talking about using VIS instructions.
Anyway, one of the most important things to do is to write a test harness, it's very easy to make a mistake and have the template not work for some particular situation. For these routines, one of my colleagues had already written a test harness. I ended up extending it to try a different corner case, and at that point discovered that my code no longer validated. The problem turned out to be a branch that should have been branch >= 2 and I'd coded branch != 2. The original test cases terminated with the value 2 at this point, but the new test I added ended up with the value 1, which still should have terminated, but the inline template as written didn't handle it correctly.
So I fired up dbx to take a look at what was going on:
$ cc -g test.c test.il $ dbx a.out Reading a.out Reading ld.so.1 Reading libc.so.1 (dbx) stop at 150 (dbx) run stopped in main at line 150 in file "test.c" 150 res1=campare(&buff1[j],buff2,i);
The stop at <line> command tells the debugger to stop at the problem line number (more details). However, the problem actually occurred when j was equal to 1. So I really should specify the break point better (more details).
(dbx) status *(2) stop at "mcmp-test-all.c":150 (dbx) delete 2 (dbx) stop at 150 -if j==1 (3) stop at "mcmp-test-all.c":150 -if j == 1 (dbx) run Running: a.out (process id 14983)
That got me to the point where the problem occurred. My initial thought was to step through the execution of the inline template using the nexti command. However, this is pretty inefficient:
(dbx) nexti stopped in main at 0x00011cfc 0x00011cfc: main+0x1394: sll %l0, 1, %l1 (dbx) nexti stopped in main at 0x00011d00 0x00011d00: main+0x1398: add %l3, %l1, %l0 (dbx) nexti stopped in main at 0x00011d04 0x00011d04: main+0x139c: ld [%fp - 1044], %l1
It could take quite a large number of instructions before I actually encountered the problem code. Plus each step takes three lines on screen. However, there's a tracei command which traces the execution at the assembly code level (more details).
(dbx) tracei next (dbx) cont 0x00011d08: main+0x13a0: mov %l0, %o0 0x00011d0c: main+0x13a4: mov %l2, %o1 0x00011d10: main+0x13a8: mov %l1, %o2 0x00011d14: main+0x13ac: nop
The output took me through the code, and knowing the code path I had expected, I could pretty easily see the branch that caused the code to diverge.
Posted at 02:26PM Dec 23, 2008 by Darryl Gove in Sun |
OpenSPARC Internals available on Amazon
OpenSPARC Internals is now available from Amazon. As well as print-on-demand from lulu, and as a free (after registration) download.
Posted at 03:19PM Dec 16, 2008 by Darryl Gove in Sun |
How to learn SPARC assembly language
Got a question this morning about how to learn SPARC assembly language. It's a topic that I cover briefly in my book, however, the coverage in the book was never meant to be complete. The text in my book is meant as a quick guide to reading SPARC (and x86) assembly, so that the later examples make some kind of sense. The basics are the instruction format:
[instruction] [source register 1], [source register 2], [destination register]
For example:
faddd %f0, %f2, %f4
Means:
%f4 = %f0 + %f2
The other thing to learn that's different about SPARC is the branch delay slot. Where the instruction placed after the branch is actually executed as part of the branch. This is different from x86 where a branch instruction is the delimiter of the block of code.
With those basics out the way, the next thing to do would be to take a look at the SPARC Architecture manual. Which is a very detailed reference to all the software visible implementation details.
Finally, I'd suggest just writing some simple codes, and profiling them using the Sun Studio Performance Analyzer. Use the disassembly view tab and the architecture manual to see how the instructions are used in practice.
Posted at 11:29AM Nov 13, 2008 by Darryl Gove in Sun | Comments[15]





