Darryl Gove's blog
Querying locality groups
Locality groups are a mechanism that provides Solaris with information about how the physical hardware is wired together. A locality group is a bunch of threads that share the same CPU or memory access characteristics. For example, a locality group might be all the threads on a single chip.
The command to display the locality group information is lgrpinfo, but this is not on Solaris 10. Here's an example of the output from that command:
% lgrpinfo
lgroup 0 (root):
Children: 1 2
CPUs: 0-7
Memory: installed 16G, allocated 3.8G, free 12G
Lgroup resources: 1 2 (CPU); 1 2 (memory)
Latency: 90
lgroup 1 (leaf):
Children: none, Parent: 0
CPUs: 0-3
Memory: installed 8.0G, allocated 1.8G, free 6.2G
Lgroup resources: 1 (CPU); 1 (memory)
Load: 0.263
Latency: 54
lgroup 2 (leaf):
Children: none, Parent: 0
CPUs: 4-7
Memory: installed 8.0G, allocated 2.0G, free 6.0G
Lgroup resources: 2 (CPU); 2 (memory)
Load: 0
Latency: 54
It is possible to access this programmatically:
#include <sys/lgrp_user.h>
#include <stdio.h>
#include <stdlib.h>

/* Print the CPUs directly contained in an lgroup, then recurse into its children. */
void explore(lgrp_cookie_t cookie, lgrp_id_t node, int level)
{
    printf("Lgroup level %i\n", level);

    /* Ask for the CPU count first, then fetch the CPU ids. */
    int ncpus = lgrp_cpus(cookie, node, 0, 0, LGRP_CONTENT_DIRECT);
    printf("CPUs: ");
    if (ncpus > 0)
    {
        processorid_t *cpus = (processorid_t*)calloc(ncpus, sizeof(processorid_t));
        lgrp_cpus(cookie, node, cpus, ncpus, LGRP_CONTENT_DIRECT);
        for (int i = 0; i < ncpus; i++)
        {
            printf("%i ", cpus[i]);
        }
        free(cpus); /* the original listing leaked this buffer */
    }
    printf("\n");

    /* Same two-step pattern for the child lgroups. */
    int nchildren = lgrp_children(cookie, node, 0, 0);
    if (nchildren > 0)
    {
        lgrp_id_t *children = (lgrp_id_t*)calloc(nchildren, sizeof(lgrp_id_t));
        lgrp_children(cookie, node, children, nchildren);
        for (int i = 0; i < nchildren; i++)
        {
            explore(cookie, children[i], level + 1);
        }
        free(children);
    }
}

int main(void)
{
    lgrp_cookie_t cookie = lgrp_init(LGRP_VIEW_CALLER);
    if (cookie == LGRP_COOKIE_NONE)
    {
        perror("lgrp_init");
        return 1;
    }
    lgrp_id_t node = lgrp_root(cookie);
    explore(cookie, node, 0);
    lgrp_fini(cookie);
    return 0;
}
Which provides the following output:
% cc local.c -llgrp
% ./a.out
Lgroup level 0
CPUs:
Lgroup level 1
CPUs: 0 1 2 3
Lgroup level 1
CPUs: 4 5 6 7
Posted at 12:09PM Sep 30, 2009 by Darryl Gove in Sun
Updated compiler flags article
Just updated the Selecting The Best Compiler Options article for the developer portal. Minor changes, mainly a bit more clarification on floating point optimisations.
Posted at 12:55PM Sep 28, 2009 by Darryl Gove in Sun
Haskell (GHC) on UltraSPARC T2
Ben Lippmeier gave an excellent presentation at the recent Haskell conference in Edinburgh on his work on porting the Glasgow Haskell Compiler (GHC) back to SPARC. A video of the talk is available.
Update: Link to slides
Posted at 09:37PM Sep 21, 2009 by Darryl Gove in Sun
Profiling scripts
If you try to use the Sun Studio Performance Analyzer on something that is not an executable, you'll end up with an error message:
$ collect kstat
Target `kstat' is not a valid ELF executable
The most reliable workaround for this that I've discovered is as follows. First of all, make up a shell script that executes the command passed into it:
$ more shell.sh
#!/bin/sh
$@
Then run the collect command as:
$ collect -F on /bin/sh shell.sh <script> <params>
The -F on is required so that collect follows forked processes; otherwise collect will just profile the top /bin/sh, which will do minimal work before forking off the actual command.
When loading the resulting experiment into the Analyzer you have to load all the descendant processes. You can do this by going to the filter dialog box and selecting all the processes, or you can take the easier route of placing en_desc on into the .er.rc file in your home directory. This tells the Analyzer to always load the descendant processes, which might make loading experiments slower, but guarantees that you actually load all the data, and not just the top-level code.
One other thing to note is that each new process can contribute wall and wait time, so the wall time shown in the analyzer can be a multiple of the actual wall time. To see this in action do:
$ collect -F on /bin/sh shell.sh shell.sh shell.sh shell.sh kstat
The wall time on this will be a multiple of the actual runtime because each shell script contributes wall time while it waits for the kstat command to complete.
Posted at 11:22AM Sep 21, 2009 by Darryl Gove in Sun
Performance tuning webcast
I wrote one of the TechDays 2008-2009 sessions on application performance tuning. Unfortunately, I never actually got to give it to a live audience, but I did get this version recorded. Thanks to the HPC Watercooler for pointing it out to me.
Posted at 12:33PM Sep 08, 2009 by Darryl Gove in Sun
Shoelaces
I was chatting to one of the kids' teachers this morning. Apparently she ends up tying shoelaces for a bunch of kids in the class every day. All of which reminded me of this alternative way of tying laces.
Posted at 09:10AM Sep 03, 2009 by Darryl Gove in Personal
Profiling a rate
Sometimes it's the rate of doing something that needs to be improved through optimisation, i.e. increasing the widgets per second of some application. I've just been looking at a code that estimated performance by counting the number of computations completed in a known constant length of time. The code was showing a performance regression, and I wanted to find out what changed. The analysis is kind of counterintuitive, so I thought I'd share an example with you.
Here's an example code that does a computation for a fixed length of time, in this case about 30 seconds:
#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>

/* Fixed-cost routine: always called with 100 iterations. */
double f1(int i)
{
    double t=0;
    while (i-->0) {t+=t;}
    return t;
}

/* Variable-cost routine: iteration count comes from the command line. */
double f2(int i)
{
    double t=0;
    while (i-->0) {t+=t;}
    return t;
}

int main(int argc,char**args)
{
    struct timeval now;
    long startsecs;
    long count=0;
    int vcount;
    if (argc!=2){ printf("Needs a number to be passed in\n"); exit(1);}
    vcount=atoi(args[1]);
    gettimeofday(&now,0);
    startsecs=now.tv_sec;
    /* Do as many units of work as fit into roughly 30 seconds. */
    do
    {
        f1(100);
        f2(vcount);
        count++;
        gettimeofday(&now,0);
    } while (now.tv_sec<startsecs+30);
    /* count and the duration are longs, so print them with %ld. */
    printf("Iterations %ld duration %ld rate %f\n",count, (long)(now.tv_sec-startsecs), 1.0*count/(now.tv_sec-startsecs));
    return 0;
}
The code takes a command-line parameter to indicate the number of iterations to do in function f2; function f1 always does 100 iterations.
If I compile and run this code under the Performance Analyzer with 50 and 70 as the command-line parameters, I get the following profiles:
| Description | 50 Iterations | 70 Iterations |
|---|---|---|
| Total time | 26.6s | 25.87s |
| f1 | 11.89s | 10.66s |
| gettimeofday | 9.9s | 8.76s |
| f2 | 4.53s | 6.09s |
| Main | 0.28s | 0.37s |
| Total iterations | 942,684 | 841,921 |
We can make the following observation: when we go from 70 down to 50 for the parameter passed to f2, we see a 12% gain in the total rate. This is to be expected, as the total number of iterations of the pair of loops in f1 and f2 reduces from 170 down to 150, which is roughly the same ~12% gain.
Where it gets counterintuitive is that for the run which achieves the higher rate, the time spent in the routines f1 and gettimeofday increases, by the same 12%. This is counterintuitive because increased time in a routine normally indicates that the routine is the one to be investigated, but for a 'rate' situation the opposite is true. These routines are being well behaved. The way to think about it is that each unit of work needs a smidgeon of time in both of these routines; if the number of units of work increases, then the absolute amount of time in these two routines needs to increase linearly with the increase in rate.
However, the time in routine f2 decreases as the rate increases. This is the routine which has been "improved" to get the better rate. The other thing to note is that the time went from ~6s to ~4.5s, but the rate went from ~842k to ~943k iterations, so the time per unit of work dropped further than the raw times suggest. This makes comparing the profiles of the two runs more tricky.
Note that Amdahl's law would still tell us that the routines that need to be optimised are the ones where the time is spent - so in one sense nothing has changed. But my particular scenario today is figuring out what has changed in the executable when compiled in two different ways that leads to the performance gain. In this context, I now know the routine, and I can dig into the assembly code to figure out the why.
Posted at 04:37PM Sep 02, 2009 by Darryl Gove in Sun


