Darryl Gove's blog

Wednesday Sep 30, 2009

Querying locality groups

Locality groups are a mechanism that gives Solaris information about how the physical hardware is wired together. A locality group is a set of hardware threads that share the same CPU or memory access characteristics. For example, a locality group might be all the threads on a single chip.

The command to display the locality group information is lgrpinfo, but it is not available on Solaris 10. Here's an example of the output from that command:

% lgrpinfo
lgroup 0 (root):
        Children: 1 2
        CPUs: 0-7
        Memory: installed 16G, allocated 3.8G, free 12G
        Lgroup resources: 1 2 (CPU); 1 2 (memory)
        Latency: 90
lgroup 1 (leaf):
        Children: none, Parent: 0
        CPUs: 0-3
        Memory: installed 8.0G, allocated 1.8G, free 6.2G
        Lgroup resources: 1 (CPU); 1 (memory)
        Load: 0.263
        Latency: 54
lgroup 2 (leaf):
        Children: none, Parent: 0
        CPUs: 4-7
        Memory: installed 8.0G, allocated 2.0G, free 6.0G
        Lgroup resources: 2 (CPU); 2 (memory)
        Load:    0
        Latency: 54

It is possible to access this programmatically:

#include <sys/lgrp_user.h>
#include <stdio.h>
#include <stdlib.h>

void explore(lgrp_cookie_t cookie,lgrp_id_t node,int level)
{
  printf("Lgroup level %i\n",level);
  
  /* A call with a null buffer returns the number of CPUs directly in this lgroup */
  int ncpus=lgrp_cpus(cookie,node,0,0,LGRP_CONTENT_DIRECT);
  
  printf("CPUs: ");
  if (ncpus>0)
  {
    processorid_t * cpus=(processorid_t*)calloc(ncpus,sizeof(processorid_t));
    lgrp_cpus(cookie,node,cpus,ncpus,LGRP_CONTENT_DIRECT);
    for(int i=0; i<ncpus; i++)
    {
      printf("%i ",cpus[i]);
    }
    free(cpus);
  }
  printf("\n");
  
  /* Same two-call pattern: count the children, then fetch their ids */
  int nchildren=lgrp_children(cookie,  node, 0,0);

  if (nchildren>0)
  {
    lgrp_id_t* children=(lgrp_id_t*)calloc(nchildren,sizeof(lgrp_id_t));
    lgrp_children(cookie,  node,children,nchildren);

    /* Recurse into each child lgroup */
    for (int i=0; i<nchildren; i++)
    {
      explore(cookie,children[i],level+1);
    }
    free(children);
  }
}

int main()
{
  lgrp_cookie_t cookie =lgrp_init(LGRP_VIEW_CALLER); 
  lgrp_id_t node = lgrp_root(cookie);
  explore(cookie,node,0);
  lgrp_fini(cookie);
  return 0;
}

Which provides the following output:

% cc local.c -llgrp
% ./a.out
Lgroup level 0
CPUs:
Lgroup level 1
CPUs: 0 1 2 3
Lgroup level 1
CPUs: 4 5 6 7

Monday Sep 28, 2009

Updated compiler flags article

Just updated the Selecting The Best Compiler Options article for the developer portal. Minor changes, mainly a bit more clarification on floating point optimisations.

Monday Sep 21, 2009

Haskell (GHC) on UltraSPARC T2

Ben Lippmeier gave an excellent presentation at the recent Haskell conference in Edinburgh on his work on porting the Glasgow Haskell Compiler (GHC) back to SPARC. A video of the talk is available.

Update: Link to slides

Profiling scripts

If you try to use the Sun Studio Performance Analyzer on something that is not an executable, you'll end up with an error message:

$ collect kstat
Target `kstat' is not a valid ELF executable

The most reliable workaround for this that I've discovered is as follows. First of all, make a shell script that executes the command passed into it:

$ more shell.sh
#!/bin/sh
"$@"

Then run the collect command as:

$ collect -F on /bin/sh shell.sh <script> <params>

The -F on is required so that collect follows forked processes; otherwise collect will just profile the top /bin/sh, which will do minimal work before forking off the actual command.

When loading the resulting experiment into the Analyzer you have to load all the descendant processes. You can do this by going to the filter dialog box and selecting all the processes, or you can take the easier route of placing en_desc on into your .er.rc file in your home directory (this will tell the analyzer to always load the descendant processes, which might make loading experiments slower, but will guarantee that you actually load all the data, and not just the top-level code).

One other thing to note is that each new process can contribute wall and wait time, so the wall time shown in the analyzer can be a multiple of the actual wall time. To see this in action do:

$ collect -F on /bin/sh shell.sh shell.sh shell.sh shell.sh kstat

The wall time on this will be a multiple of the actual runtime because each shell script contributes wall time while it waits for the kstat command to complete.

Tuesday Sep 08, 2009

Performance tuning webcast

I wrote one of the TechDays 2008-2009 sessions on application performance tuning. Unfortunately I never actually got to give it to a live audience, but I did get this version recorded. Thanks to the HPC Watercooler for pointing it out to me.

Thursday Sep 03, 2009

Shoelaces

I was chatting to one of the kids' teachers this morning; apparently she ends up tying shoelaces for a bunch of kids in the class every day. All of which reminded me of this alternative way of tying laces.

Wednesday Sep 02, 2009

Profiling a rate

Sometimes it's the rate of doing something which is the target that needs to be improved through optimisation, i.e. increasing the widgets per second of some application. I've just been looking at a code that estimated performance by counting the number of computations completed in a known, constant length of time. The code was showing a performance regression, and I wanted to find out what had changed. The analysis is kind of counterintuitive, so I thought I'd share an example with you.

Here's an example code that does a computation for a fixed length of time, in this case about 30 seconds:

#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>

double f1(int i)
{
  double t=0;
  while (i-->0) {t+=t;}
  return t;
}

double f2(int i)
{
  double t=0;
  while (i-->0) {t+=t;}
  return t;
}

int main(int argc,char**args)
{
  struct timeval now;
  long startsecs;
  long count=0;
  int vcount;
  if (argc!=2){ printf("Needs a number to be passed in\n"); exit(1);}
  vcount=atoi(args[1]);

  gettimeofday(&now,0);
  startsecs=now.tv_sec;

  /* Repeat the unit of work until about 30 seconds have elapsed */
  do
  {
    f1(100);
    f2(vcount);
    count++;
    gettimeofday(&now,0);
  } while (now.tv_sec<startsecs+30);

  /* count and the duration are longs, so print them with %ld */
  long duration=(long)(now.tv_sec-startsecs);
  printf("Iterations %ld duration %ld rate %f\n",count, duration, 1.0*count/duration);
  return 0;
}

   

The code takes a command line parameter to indicate the number of iterations to do in function f2; function f1 always does 100 iterations.

If I compile and run this code under the performance analyzer with 50 and 70 as the commandline parameters I get the following profile:

Description        50 Iterations   70 Iterations
Total time         26.6s           25.87s
f1                 11.89s          10.66s
gettimeofday       9.9s            8.76s
f2                 4.53s           6.09s
Main               0.28s           0.37s
Total iterations   942,684         841,921

We can make the following observation: when we go from 70 down to 50 for the parameter passed to f2, we see a 12% gain in the total rate. This is to be expected, as the total number of iterations of the pair of loops in f1 and f2 drops from 170 to 150, which corresponds to roughly the same gain (170/150 is about 1.13).

Where it gets counterintuitive is that for the run which achieves the higher rate, the time spent in the routines f1 and gettimeofday increases, by the same 12%. This is counterintuitive because increased time in a routine normally indicates that the routine is the one to be investigated, but for a 'rate' situation the opposite is true. These routines are well behaved. The way to think about it is that each unit of work needs a smidgeon of time in both of these routines; if the number of units of work increases, then the absolute amount of time in these two routines needs to increase linearly with the increase in rate.

However, the time in routine f2 decreases as the rate increases. This is the routine which has been "improved" to get the better rate. The other thing to note is that the time went from ~6s to ~4.5s, but the iteration count went from 841k to 942k, so the time per unit of work dropped further than the raw times suggest, which makes comparing the profiles of the two runs trickier.

Note that Amdahl's law would still tell us that the routines that need to be optimised are the ones where the time is spent - so in one sense nothing has changed. But my particular scenario today is figuring out what has changed in the executable when compiled in two different ways that leads to the performance gain. In this context, I now know the routine, and I can dig into the assembly code to figure out the why.
