fintanr's weblog

Archives: April 2005
















    20050427 Wednesday April 27, 2005

    Binary compatibility and application dependencies
    I see Oracle are echoing what many customers have been saying about Linux patches in this article over on The Register.

    "Speaking at the Software 2005 conference in Santa Clara, California, on Tuesday, Phillips told an audience of 1,000 chief executive officers: '[Customers] are bogged down by the dependencies between applications. Add a patch in Linux and five other things break. People ask: "Why can't you tell me what the dependencies are?"'"

    This is exactly why binary compatibility and interface stability are important. If you can't apply patches with confidence that your production systems will stay up, and that your applications will keep running on them, you can only be nervous. It is also why Sun invests a huge amount of time in testing patches.
    (2005-04-26 20:47:36.0)

    20050426 Tuesday April 26, 2005

    Sun's Performance Lifestyle: Automating Ourselves Into A Job
    I had originally planned to base my second posting [1] on Sun's Performance Lifestyle around the concept of testing software versus hardware, i.e. the dreaded I/O-bound benchmark. That article is part written, but unfortunately I haven't managed to schedule time on an available rig to put in some practical examples (real performance work obviously gets priority), so instead I decided to write a bit about our automation and how we actually run the benchmarks.

    We view automation as key to our job: it allows us to remove the mundane tasks and focus on the higher-value, and much more interesting, work of finding and root-causing performance issues.

    High Level Overview

    At a high level, the process of doing a benchmark run can be broken down into the following steps:
    • Install Rig
    • Install and configure required software
    • Run the benchmark
    • Collect the benchmark results
    Looks nice and straightforward, doesn't it? A few years ago I attended an amusing presentation by a colleague in Ireland entitled "A Simple Matter of Programming", which featured the "magic happens here" box that we have all encountered: the Architect hands over a beautiful design document that specs out everything bar the actual programming and implementation, and somewhere in the high-level system diagram sits the "magic happens here" box. Let's call this "A Simple Matter of Programming and Implementation".

    Installation

    The key to all of this is the automation of the installs. For Solaris we maintain a local JumpStart server on our lab network (with mirrors as needed on remote networks), and we also maintain images and install scripts for the various Linux clones of JumpStart (i.e. Kickstart etc. [2]) to use when needed. For Windows it's a gratuitous abuse of the dd command, plus some nifty automation that Nicky, Sean and a non-blogging member of the team have put together ;).

    I won't go into a JumpStart 101 tutorial here (although please ask if you would like to see something like this), but suffice it to say that we have scripted all of the various steps in setting up a rig.

    Once a system is installed, it copies over all the relevant benchmarks and software that it needs, reboots, and disconnects itself from our lab naming service. Where applicable we use the hosts file as our only naming service, as we don't want any external factors affecting our runs. All benchmarks which involve any form of network traffic (the vast majority of them) are run on private subnets.

    Execution

    To actually run the benchmarks we use a custom, home-grown harness that has evolved over the years. This has upsides and downsides: the idea behind the core of the harness is very straightforward, but its implementation is relatively complex and somewhat hardwired into our environment. To run a benchmark we go through the following steps:
    • Validate the config (i.e. make sure that everything we expect, such as network interfaces, relevant software, relevant disks and so on, is in place)
      • Install a custom kernel if applicable
      • Reboot
    • Do any initial configuration that's needed, such as building volumes
    • Apply the relevant system tunings. As mentioned before, we aim to keep our tunings as close to out-of-the-box Solaris as possible, so for most benchmarks this is a pretty small set: things such as ndd values, file system tunings where applicable, shared memory settings on images prior to Solaris 10, etc.
    • Apply any relevant software tunings, say increasing the threads for a webserver or upping cache sizes for directory servers.
    • Reboot the machine to ensure everything is clean (obviously things such as ndd tunings will be reset on reboot)
    • Start the actual benchmark run
      • newfs(1M) any filesystems that are going to be used by the benchmarks
      • Execute the benchmark
        • During execution, gather standard performance data, i.e. vmstat(1M), mpstat(1M) etc.
        • Gather custom data if required: hooks exist for calling tools such as lockstat(1M), custom dtrace(1M) scripts, or other custom scripts when requested
      • Collect the results and put them in a standard reporting format
      • Copy the results back to our main server
      • Reboot
    • Restore the system to its initial blank configuration
    • Lather, rinse and repeat as many times as is feasibly possible for the benchmark (the more results the better).
    The lather, rinse, repeat stage is quite important: we restore the system to a completely blank state in terms of tunings and then start all over again. There is one big reason for this: all benchmark runs have to be completely repeatable.
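    To make the control flow concrete, here is a minimal C sketch of a harness driver, assuming each stage can be modelled as a function that returns 0 on success. The stage names follow the list above, but the real harness is home-grown and unpublished, so `struct stage`, `run_once`, `run_repeated` and the stub stage are all hypothetical; a real stage would do actual work rather than bump a counter.

```c
#include <stddef.h>
#include <stdio.h>

/* One stage of a benchmark run; returns 0 on success. */
typedef int (*stage_fn)(void *ctx);

struct stage {
    const char *name;
    stage_fn run;
};

/* Hypothetical stub standing in for real work (checking interfaces,
   setting ndd values, newfs, the benchmark itself...).  It only counts
   how many stages have executed so the control flow can be exercised. */
static int stub_stage(void *ctx)
{
    int *executed = ctx;
    (*executed)++;
    return 0;
}

/* An abridged version of the stage list in the post. */
static const struct stage run_plan[] = {
    { "validate config",          stub_stage },
    { "initial configuration",    stub_stage },
    { "apply system tunings",     stub_stage },
    { "apply software tunings",   stub_stage },
    { "reboot",                   stub_stage },
    { "newfs and run benchmark",  stub_stage },
    { "collect and copy results", stub_stage },
    { "restore blank config",     stub_stage },
};
#define RUN_PLAN_LEN (sizeof run_plan / sizeof run_plan[0])

/* Execute the stages in order, stopping at the first failure so that a
   rig which fails validation never produces a result.  Returns the
   index of the failing stage, or -1 if every stage succeeded. */
int run_once(const struct stage *plan, size_t n, void *ctx)
{
    for (size_t i = 0; i < n; i++) {
        if (plan[i].run(ctx) != 0) {
            fprintf(stderr, "stage '%s' failed, aborting run\n",
                    plan[i].name);
            return (int)i;
        }
    }
    return -1;
}

/* Lather, rinse and repeat.  The plan's last stage restores the blank
   configuration, so in a real harness every iteration starts from the
   same state.  Returns the number of runs that completed cleanly. */
int run_repeated(const struct stage *plan, size_t n, void *ctx,
                 int iterations)
{
    int clean = 0;
    for (int i = 0; i < iterations; i++)
        if (run_once(plan, n, ctx) == -1)
            clean++;
    return clean;
}
```

    Stopping at the first failed stage is the design choice that matters here: a partially configured rig should produce no result at all rather than a noisy one.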

    Why So Hung Up On Being Repeatable?

    It's a question that we get every so often: why does everything have to be so repeatable? (Our process is fine-grained enough that, barring an application crashing, we can repeat each run on an OS instance with exactly the same pids for each process.) Put simply, it aids in debugging any problems we encounter. We have a few criteria before we log a bug:
    • The obvious one is that of "has performance degraded?"
    • What is the variance on the results?
    • Is the variance less than 0.5%, and less than the degradation?
    If results are noisy we do some statistical analysis on them to ensure that it is a valid degradation. At that point we log a bug.

    If we allowed the runs to vary a large amount, say by using a naming service that might go down during a run, or by doing multiple runs on the same box without rebooting, we would run a very high risk of introducing variance. That leads to too much noise in our results, and then we can't confidently log a bug. As you can imagine everyone is busy, so the last thing we want to do is log spurious performance bugs, and either waste our own time tracking them down or pass them on to a colleague in development and have them waste their time chasing a phantom problem.

    Let's take a simple example. Say I have a bunch of results from benchmark X, and we are only interested in metric Y. We see a drop-off of 0.7% in metric Y, but the variance in our results is 1.2%: the drop-off is within the margin of error for the run, so we can't log a bug. If everything is completely repeatable, we can first look at eliminating the cause of the variance, and then gather results to confirm whether we do indeed have a problem. If we can't repeat our experiments exactly, it's not possible to eliminate the variance, so we can't log a bug, and a potential drop-off in performance could reach you, the customer.
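    The bug-logging criteria above, and the worked example, can be written down as a small C sketch. The post doesn't say exactly how "variance" is computed, so this assumes it means the relative spread of repeated runs, (max - min) / mean as a percentage; the function names are hypothetical, and the extra statistical analysis mentioned for noisy results is not modelled.

```c
#include <stddef.h>

/* Arithmetic mean of n samples (n must be > 0). */
double mean(const double *x, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return sum / (double)n;
}

/* ASSUMPTION: "variance" is taken to mean the relative spread of
   repeated runs, (max - min) / mean, expressed as a percentage. */
double variance_percent(const double *x, size_t n)
{
    double lo = x[0], hi = x[0];
    for (size_t i = 1; i < n; i++) {
        if (x[i] < lo) lo = x[i];
        if (x[i] > hi) hi = x[i];
    }
    return (hi - lo) / mean(x, n) * 100.0;
}

/* Percentage drop from baseline to current for a higher-is-better
   metric; a positive value means performance degraded. */
double drop_percent(double baseline, double current)
{
    return (baseline - current) / baseline * 100.0;
}

/* The three criteria: performance has degraded, the variance is below
   0.5%, and the variance is smaller than the degradation itself. */
int can_log_bug(double drop, double variance)
{
    return drop > 0.0 && variance < 0.5 && variance < drop;
}
```

    With the numbers from the example, `can_log_bug(0.7, 1.2)` returns 0: a 1.2% variance fails both the 0.5% ceiling and the requirement that the variance be smaller than the drop.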

    From the opposite angle, that of the developer, if the problem is repeatable, and consistent, it makes his/her life an awful lot easier in trying to narrow it down (in most cases this is actually us, so we are making our own lives easier first), or alternatively it makes it a lot easier to put a fix through the exact same scenario.

    Pushing the Software To The Limit

    We aim at all times to push the system to the absolute limit without any I/O bottlenecks, paging and so on. We can't stress this enough. In practice this gives mpstat output with as close to 0 as possible in the idle column, and definitely 0 in the wait column, but with the columns still lined up. (Bryan has a great comment on this; I'm paraphrasing, but it's along the lines of "the tool was designed to report data with columns matching the titles, so if the columns aren't matching the titles, that's a pretty good indication that you have a problem".) So an mpstat from a sample rig may look like the following during an actual benchmark run.

    (I had to use a screenshot here, as it's possible that some browsers would throw off the formatting, and someone would say "but those columns aren't lined up". The mpstat here is from the tail end of a ramp-up on a benchmark.)
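    The saturation check described above can be sketched in C for a single per-CPU line of mpstat output. This assumes the 16-column layout given in the code comment and a caller-chosen idle threshold; `cpu_saturated` is a hypothetical helper, not part of any Solaris tool. A line that fails to parse as 16 integers is reported separately, in the spirit of Bryan's "columns matching the titles" remark.

```c
#include <stdio.h>

/* Parse one per-CPU line of mpstat output.
   ASSUMPTION: the 16-column layout
     CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
   Returns 1 if the CPU looks saturated in the sense used in the post
   (wt is 0 and idl is at most max_idle), 0 if it does not, and -1 if
   the line does not parse as 16 integers, e.g. a header line or output
   whose columns have run together. */
int cpu_saturated(const char *line, int max_idle)
{
    int f[16];
    int got = sscanf(line,
                     "%d %d %d %d %d %d %d %d %d %d %d %d %d %d %d %d",
                     &f[0], &f[1], &f[2], &f[3], &f[4], &f[5], &f[6],
                     &f[7], &f[8], &f[9], &f[10], &f[11], &f[12], &f[13],
                     &f[14], &f[15]);
    if (got != 16)
        return -1;

    int wt = f[14];
    int idl = f[15];
    return (wt == 0 && idl <= max_idle) ? 1 : 0;
}
```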

    Custom Kernels and Standing On the Shoulders of Giants

    We mentioned the PerformancePIT and Performance Self Test processes before. For both of these processes we install what in Sun parlance is known as a bfu (you will hear a lot more about bfu's when OpenSolaris comes out).

    Bill Sommerfeld has posted a bit more about bfu's, or more accurately about acr, a tool recently integrated into the Solaris Express gate for resolving bfu conflicts. Put simply, tools like this eliminate the need for us to have any manual interaction with custom kernels: they just work, which again allows us to focus on the higher value-add areas. (Ask anyone in Sun engineering if they have ever had to resolve bfu conflicts; grab a coffee before you ask though, or maybe a beer if you're at a BOF.)

    And Wrapped Around All Of This

    As you might guess, we don't go around looking for idle machines and installing benchmarks on them. Behind the scenes, our server runs a scheduler which puts new builds onto machines, makes sure idle machines are running benchmarks, allows us to reserve machines, and so on.
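    A scheduler of this kind could, in its simplest form, look something like the following C sketch. The post does not describe the real scheduler's data model, so the rig states, `struct rig` and `dispatch` are hypothetical; the sketch only shows the core idea of handing queued builds to idle machines while leaving reserved ones alone.

```c
#include <stddef.h>

/* Hypothetical machine states; the real scheduler's model is not
   described in the post. */
enum rig_state { RIG_IDLE, RIG_BUSY, RIG_RESERVED };

struct rig {
    const char *name;
    enum rig_state state;
    int build;             /* build being benchmarked, -1 if none */
};

/* Hand queued builds to idle rigs, first come first served.  Busy and
   reserved rigs are skipped, so a reservation keeps a machine out of
   the automated pool.  *next is the index of the next queued build and
   is advanced past every build dispatched.  Returns the number of
   builds dispatched. */
size_t dispatch(struct rig *rigs, size_t nrigs,
                const int *queue, size_t nqueue, size_t *next)
{
    size_t sent = 0;
    for (size_t i = 0; i < nrigs && *next < nqueue; i++) {
        if (rigs[i].state != RIG_IDLE)
            continue;
        rigs[i].state = RIG_BUSY;
        rigs[i].build = queue[(*next)++];
        sent++;
    }
    return sent;
}
```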

    You have probably also guessed that we don't look at every result that comes in. Again, we have automated this process as well, and we only look at results which are of interest to us: either big performance gains (is it a real gain, were we expecting it, and if not, what caused it?) or small performance drop-offs. If a drop-off is greater than 0.3% we start analyzing it, and if a gain is over 5% we look for what caused the jump. We invariably have a heads-up on any upcoming performance wins thanks to the PerformancePIT process; it is very rare that we have to analyze a big jump that hasn't gone through all of the proper development processes.
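    The triage thresholds quoted above (analyze drops over 0.3%, investigate gains over 5%) translate into a few lines of C. This is a sketch assuming a higher-is-better metric compared against a single baseline value; `triage_result` and the enum are hypothetical names.

```c
/* Possible outcomes when a new result arrives. */
enum triage { TRIAGE_IGNORE, TRIAGE_ANALYZE_DROP, TRIAGE_EXPLAIN_GAIN };

/* Classify a result against its baseline using the thresholds from the
   post: analyze drops of more than 0.3%, look into gains of more than
   5%.  Assumes a higher-is-better metric. */
enum triage triage_result(double baseline, double current)
{
    double change_pct = (current - baseline) / baseline * 100.0;

    if (change_pct < -0.3)
        return TRIAGE_ANALYZE_DROP;
    if (change_pct > 5.0)
        return TRIAGE_EXPLAIN_GAIN;
    return TRIAGE_IGNORE;
}
```

    The asymmetry is deliberate in the post's terms: a tiny loss is worth chasing, while a gain only needs explaining once it is large enough to be surprising.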

    Automating Ourselves Into A Job

    So why this title? What I have written about here is something that we don't even think about; it just happens. It may need the occasional nudge (but that's the scheduler more than anything else), but in general it just goes on in the background. If we had to do this work manually we would all become very bored, very quickly, so we automate it. We take the same approach with everything we encounter: if it can be done by a machine, get a machine to do it. There is always more work out there and new tech to play with, and in a place like Sun there is always something interesting to work on.


      [1] I was rather chuffed to see Sean and me mentioned on OSNews;
           got to admit it was a very, very pleasant surprise.
      [2] Before I get a mail saying Kickstart was around before JumpStart: it wasn't.
           JumpStart has been in existence since at least Solaris 2.4 (that's the earliest
           version I have encountered), while Kickstart first appeared around Red Hat 5.0, I believe,
           which would be around 1997 (please correct me if I'm wrong on this).

    (2005-04-25 21:46:28.0)

    Putting Trolls to Bed
    I just responded to this thread over on OSNews following the announcement of Solaris Express 4/05. As a rule I don't feed trolls, but I am genuinely so bored with seeing "opensolaris doesn't exist" posts that I felt I had to respond. The post is reproduced in its entirety below.

    <Start OSNews Post>

    Sigh,

    This has been hashed over repeatedly: it takes time to get an OS ready to be open sourced, and last time I checked no one had ever tried to open source something of the size or complexity of Solaris.

    There is code available already, DTrace was released and is downloadable from http://www.opensolaris.org, and it is one of the most advanced, if not the most advanced system for diagnosing performance problems on live systems ever developed [1]

    People are working feverishly on OpenSolaris at the moment, trying to make sure everything is right - that means a lot of reviews to ensure we release unencumbered code. We have no intention of throwing a bunch of code over a wall with no due diligence and forgetting about it; OpenSolaris is about further expanding an already large community, and letting everyone see what is in Solaris. You can choose to participate if you wish [2].

    It has been repeatedly stated that we are aiming towards Q2CY05; please look at the roadmap at http://www.opensolaris.org/roadmap/

    If you choose not to believe that we are going to open source Solaris, please feel free, it's your choice - we look forward to proving you wrong (and believe me, knowing that we are going to prove you, and all of the other naysayers, wrong is a nice feeling).

    On the other hand we are very grateful for the patience that most people are showing while we make sure we do this right.

    [1] Please review all of the rebuttals about what on Linux can replace DTrace at http://blogs.sun.com/roller/page/fintanr/20050306 before telling us about LTT, KProbes and OProfile.
    [2] http://blogs.sun.com/roller/page/jonathan/20050417

    <End OSNews Post>

    Now back to some real work. On code that is going to be opensourced very soon ;).

    (2005-04-25 16:03:38.0)

    20050420 Wednesday April 20, 2005

    Workaround for "FATAL: system is not bootable, boot command is disabled" on an OBP
    Poor error messages are a major source of annoyance. I hit this one today, for the first time in a few years. Background: a V210 was rather abruptly powered down and was feeling somewhat ill, so I logged onto the SC, got to my console, and typed boot as one does.

    {1} ok boot
    FATAL: system is not bootable, boot command is disabled
    
    Which is about as helpful as someone telling me the box is currently a brick, which I know already. Anyway, in case you happen to hit this, the fix/workaround is to set auto-boot? to false, reset the box, set auto-boot? back to true, and finally boot, as shown below.
    {1} ok setenv auto-boot? false
    auto-boot? =          false
    {1} ok reset-all
    
    SC Alert: Host System has Reset
    
    Sun Fire V210, No Keyboard
    Copyright 1998-2003 Sun Microsystems, Inc.  All rights reserved.
    OpenBoot 4.11.4, 4096 MB memory installed, Serial #xxxxxxxx.
    Ethernet address 0:3:ba:xx:xx:xx, Host ID: 83xxxxxx.
    
    {1} ok setenv auto-boot? true
    auto-boot? =          true
    {1} ok boot
    .......... lots of output ........
    volume management starting.
    The system is ready.
    
    xxxxxx console login:
    

    (2005-04-19 20:31:07.0)

    And Nicky V. is onboard....
    One of the other folks in my group, Nicky Veitch, has started his blog, joining Sean, Gleb and myself. Of immediate interest are his notes on filebench, a benchmark that will soon be released on SourceForge (Richard leads the design and development of filebench).

    I kinda like the current parking spot for his bike as well ;).
    (2005-04-19 19:55:57.0)

    20050419 Tuesday April 19, 2005

    libnjb on Solaris
    My mp3 player is a Creative Zen Xtra (I just don't drink enough Martini to own an iProduct [1]). Anyway I obviously would prefer to use Solaris rather than Windows for this little gadget, so I compiled up gnomad2 which requires libnjb.

    One of the issues compiling was a relatively common problem that I've encountered recently, that of types being defined for Linux but not for Solaris. The error in this case looks like

    [fintanr@tiresias libnjb-2.0] $ make
    cd src && make prefix=/usr/local
    /usr/bin/bash ..//libtool --mode=compile gcc    -I/usr/local/include -DHAVE_GETOPT_H 
    -DHAVE_LIBGEN_H -DHAVE_USLEEP -Wall -Wmissing-prototypes -c base.c
     gcc -I/usr/local/include -DHAVE_GETOPT_H -DHAVE_LIBGEN_H -DHAVE_USLEEP -Wall 
    -Wmissing-prototypes -c base.c  -fPIC -DPIC -o .libs/base.o
    In file included from base.c:9:
    libnjb.h:163: error: syntax error before "u_int8_t"
    libnjb.h:163: warning: no semicolon at end of struct or union
    libnjb.h:164: warning: type defaults to `int' in declaration of `usb_interface'
    libnjb.h:164: warning: data definition has no type or storage class
    
    Looking at libnjb.h, we see that the types are assumed to be those available on Linux. The fix for this is simple: just add the following to libnjb.h
    #ifdef __sun
    /* The Solaris headers used here provide the C99 uintN_t types but not
       the BSD-style u_intN_t names that libnjb.h expects, so map them. */
    #include <inttypes.h>
    #define u_int8_t uint8_t
    #define u_int16_t uint16_t
    #define u_int32_t uint32_t
    #define u_int64_t uint64_t
    #endif
    
    and away you go. A patch has already been submitted to the libnjb folks on SourceForge (the patch is actually for libnjb.h.in). If you're really interested, you can view it here.
    [1] Actually I like the fact that I can replace the battery myself in the Zen.
    (2005-04-19 03:35:53.0)

    20050418 Monday April 18, 2005

    perlgcc
    One of the more common complaints you hear from people regarding perl on Solaris is that you need the Sun C compiler (cc) to compile modules. This was addressed quite some time ago, but as I generally have cc available I never really noticed the issue. Or I didn't until about ten minutes ago.

    I need a local copy of Expect.pm for some stuff I'm working on, so I went off to compile IO::Tty, which the Expect.pm module requires. When I typed make I got the following:

    /usr/bin/perl /usr/perl5/5.8.4/lib/ExtUtils/xsubpp  -typemap 
    /usr/perl5/5.8.4/lib/ExtUtils/typemap  Tty.xs > Tty.xsc && mv Tty.xsc Tty.c
    cc -c    -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -D_TS_ERRNO -xO3 -xspace -xildoff    
    -DVERSION=\"1.02\"  -DXS_VERSION=\"1.02\" -KPIC 
    "-I/usr/perl5/5.8.4/lib/i86pc-solaris-64int/CORE"  -DHAVE_DEV_PTMX -DHAVE_GRANTPT 
    -DHAVE_PTSNAME -DHAVE_SIGACTION -DHAVE_STRLCPY -DHAVE_SYS_STROPTS_H -DHAVE_TERMIOS_H 
    -DHAVE_TERMIO_H -DHAVE_TTYNAME -DHAVE_UNLOCKPT Tty.c
    cc: unrecognized option `-KPIC'
    cc: language ildoff not recognized
    cc: Tty.c: linker input file unused because linking not done
    Running Mkbootstrap for IO::Tty ()
    chmod 644 Tty.bs
    rm -f blib/arch/auto/IO/Tty/Tty.so
    LD_RUN_PATH="" cc  -G Tty.o  -o blib/arch/auto/IO/Tty/Tty.so
    cc: Tty.o: No such file or directory
    cc: no input files
    *** Error code 1
    make: Fatal error: Command failed for target `blib/arch/auto/IO/Tty/Tty.so'
    
    Basically, what's happened here is that perl was compiled with the Sun compilers, and I'm trying to compile this module with gcc from /usr/sfw/bin, which doesn't understand Sun compiler flags such as -KPIC and -xildoff (you can take a look at /usr/perl5/5.8.4/lib/i86pc-solaris-64int/Config.pm for all the gory detail).

    The workaround is to use a handy little script bundled with Solaris called perlgcc, so rather than the normal perl Makefile.PL; make; make test; make install, you do:

    /usr/perl5/5.8.4/bin/perlgcc Makefile.PL; make
    
    and away you go. As an aside, -KPIC is the Sun compiler option for generating position-independent code (gcc's equivalent is -fPIC), and is dealt with in a lot more detail in the linkers and libraries documentation.
    (2005-04-17 22:51:03.0)

    20050407 Thursday April 07, 2005

    Richard McDougall and Jim Mauro on bsc
    I noticed that Richard McDougall has started a blog, joining his partner in crime Jim Mauro. These guys authored Solaris Internals, a book that everyone in our group uses on an almost daily basis. They are currently updating it for Solaris 10, and it will be an invaluable resource once it comes out. Point your aggregators at their blogs, highly recommended.

    In the meantime they have some teasers in the form of two presentations posted on the Solaris Internals site, well worth a read.
    (2005-04-07 03:45:37.0)

    Writing ISO CDs in Solaris
    A friend of mine, whom I got using Solaris 10 a few months ago, mailed today to ask me how to burn CDs on Solaris. He is generally a Mac user at home and uses Windows as a development platform at work (for J2EE apps), only deploying apps on Unix rather than developing them there, so he is used to GUI tools for tasks such as writing CDs.

    However, he has moved his home x86 box to Solaris 10, as he is using the Sun Java Enterprise System Application Server as his primary deployment platform these days, with NetBeans as his development environment, and he wanted to familiarise himself with Solaris 10 as an underlying OS. A thirty-second tutorial on writing CDs in Solaris was needed, so for those not familiar with how to do this I figured I would post a quick example here.

    The two commands that you need to read up on are cdrw(1) and mkisofs(8) [1], with a handy confirmation step involving lofiadm(1M) if you are so inclined (lofiadm requires root privileges).

    Okay, so first off create a directory with all the files you want to burn to CD, let's say /tmp/foo, and go from there:

    [fintanr@dhcp-ack03-200-118 tmp] $ mkisofs /tmp/foo > /tmp/foo.iso
    Total translation table size: 0
    Total rockridge attributes bytes: 0
    Total directory bytes: 0
    Path table size(bytes): 10
    Max brk space used 8000
    221 extents written (0 MB)
    
    Optional stage: confirm that it's all okay by mounting the ISO image (as root).
    Sun Microsystems Inc.   SunOS 5.10.1    snv_09  October 2007
    # lofiadm -a /tmp/foo.iso
    /dev/lofi/1
    # mount -F hsfs -o ro /dev/lofi/1 /mnt
    # cd /mnt
    # ls
    foo.zip
    #
    
    Okay, so this is all okay; let's burn the image to CD.
    [fintanr@dhcp-ack03-200-118 tmp] $ cdrw -i /tmp/foo.iso
    Looking for CD devices...
    Initializing device...done.
    Writing track 1...done.
    Finalizing (Can take several minutes)...done.
    
    And voila, all done. I believe there is a tool in GNOME for doing this as well, but I haven't checked.
    [1] Not sure how mkisofs ended up in section 8 of the manpages, but I'll try to find out.
    (2005-04-06 20:41:55.0)

    20050405 Tuesday April 05, 2005

    OpenSolaris CAB
    The OpenSolaris Community Advisory Board has been announced; check out the CAB page on opensolaris.org.

    Personally (but I would say this, wouldn't I) I think this is a fantastic line-up. Roy Fielding is an excellent choice as CAB chair (let's be fair here, Apache is probably the most high-profile piece of open source on the planet, and it definitely affects more people directly than any other piece of open source software). From a Sun perspective, Casper Dik is incredibly prolific on the various Solaris newsgroups and mailing lists (he is incredibly prolific internally as well; I don't know how he manages to get so much done), while Simon Phipps is a massive open source advocate (I had the opportunity to attend a presentation Simon gave in Dublin a few years ago, and he is excellent to listen to).

    The community-elected members are Rich Teer and Al Hooper. Rich's Solaris Systems Programming book is fantastic, and seeing as Al is originally from Dublin he can only be a good bloke ;). Pints are on me the next time you're back.

    Relevant blogs


    Alan Hargreaves has links to several media outlets and other blog postings over on his blog.
    (2005-04-04 18:43:23.0)

    20050401 Friday April 01, 2005

    Enabling Sun's Performance Lifestyle
    The recent article about Linux 2.6 being slower than 2.4, and Linus Torvalds calling for ongoing performance testing, gives our group a timely reason to explain in a lot more detail what we do. One of my colleagues, Sean McGrath, posted a little teaser yesterday, so I'll add to that today before we start into a more detailed, and technical, set of posts.

    So what exactly is it that we do? Our group provides the infrastructure to help the wider Sun community enable "Sun's Performance Lifestyle". We run a very large set of benchmarks, in a totally automated manner, on every build of every active train of Solaris, using components of Sun's middleware stack (Java Enterprise System) or ISV apps (Oracle, Tibco, Reuters etc.) where applicable. We also benchmark applications bundled in Solaris (i.e. Samba, Apache, Xorg), new Java builds, JES on Linux and more. We provide the same facilities to developers for work prior to integration, so that developers can make informed decisions regarding performance the whole way through the development cycle, rather than as an afterthought.

    The matrix we are currently running looks something like this (product and build type, then OS and architectures):

    Solaris, released internal builds
      • Solaris Express: Sparc, amd64, x86
      • Solaris 10 Update Trains (currently s10u1)
      • Solaris 9 Update Trains
      • Solaris Patch Trains
    Java Enterprise System, released internal builds
      • Solaris Express: Sparc, amd64, x86
      • Solaris 10 Update Train
      • Solaris 10 FCS
      • Solaris 9 Update Trains
    Solaris, development builds
      • Solaris Express: Sparc, amd64, x86
    Java
      • Solaris Express: Sparc, amd64, x86
      • Solaris 10 Update Train
      • Solaris 10 FCS
      • Solaris 9 Update Trains
      • Windows
      • Linux (Redhat, Suse)
    Userland products (JES etc), development builds
      • Solaris Express: Sparc, amd64, x86
      • Requested Solaris builds
    JES for Linux
      • Linux (Suse, Redhat): amd64, x86

    And coming soon - OpenSolaris (we are hyped about this).

    As for numbers, last month we ran over 50,000 benchmarks. Some are tiny, taking only seconds to run, and some take several days; it all depends on what you're benchmarking. Of course, as mentioned, this is all automated.

    A Brief Mention of Our Lab

    Our lab consists of about 800-odd machines, ranging from single-CPU SPARC, AMD64 and Intel boxes all the way up to 72-CPU E25Ks, and covering just about everything in between. I use "lab" as a virtual concept here: we have a large lab in Ireland, and then machines in Boston, Austin, Menlo Park etc. Alongside this we share time on machines with other groups dispersed all over the world. Let's just say a fully loaded 25K costs a lot, and while we could use it continuously, it makes more sense to share it. And of course some of the boxes currently under development are only available in very small numbers, so we have no option other than to share.

    As an aside my personal favourite rig at the moment is a fully loaded E6800, with several terabytes of disk space (6120 Fibre Channel Arrays) attached which we use predominantly for I/O benchmarks, the kind of I/O that enterprise customers are doing. And a second aside here, any version of Solaris that we install on this machine can also be installed on a single cpu Sun Blade 2500 - no recompiles, special patches or kernel hacks needed, it just works.

    Upstream Performance Work

    PerfPIT

    We provide a performance pre-integration test environment (PerfPIT), which every major project going into Solaris has to go through. Most groups use it at multiple stages during their project. Now let's put this in perspective: every major project that you hear about in Solaris (and a huge number of the ones that people are starting to write about) has to come through PerfPIT. So DTrace, Zones, Least Privilege, FireEngine and so on all did one or more PerfPIT runs before integrating into Solaris. And what does PerfPIT involve, you ask? Basically we run the exact same set of benchmarks as in our more downstream testing on two kernels: one with the changes, and nothing but the changes, and one without.

    Performance Self Test

    Further upstream we provide another version of PerfPIT, called Performance Self Test, which is a mechanism for the development community to test performance changes more informally. The key here is that this is simple to use, and we provide a standard environment. Developers go to a webpage, point it at their kernels, select the benchmarks they want to run and hit submit. Everything else happens automatically.

    The best example of this that I have seen was last year, when one developer in Sun was evaluating several new algorithms for a specific project. Rather than going through the PerfPIT process for each version, he just submitted multiple self-test requests and chose the best solution, without ever having to set up a benchmark, e-mail or phone anyone in the group, or do anything that distracted him from the task at hand. Amusingly, it wasn't until he had done his putback into Solaris that we realised what he had been doing. That's automation for you, though.

    Why do this?

    Simply put, you are not allowed to put your code back into the Solaris code base if it's going to slow it down. One of our main goals is to protect performance in Solaris, so when performance improves we move our baselines to the new high water mark, and no subsequent build is allowed to regress from this new baseline. There are very, very occasional exceptions to this, i.e. if a fix is required to prevent crashes and data corruption and cannot be implemented without causing a performance regression. In the whole three years of Solaris 10 development this occurred once, on one metric, and there were thousands upon thousands of putbacks.
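    The high-water-mark rule can be sketched as a baseline that only ratchets upward. This C sketch is hypothetical: `check_and_ratchet` and its `tolerance_pct` parameter are assumptions (a noise allowance so that run-to-run variation is not reported as a regression), and it assumes the result being checked is already a confirmed mean of repeated runs.

```c
/* Baseline that only moves upward ("high water mark").
   ASSUMPTION: `result` is already a confirmed mean of repeated runs, so
   a single noisy run does not raise the bar.  Returns 1 if the result
   regresses from the baseline by more than tolerance_pct, otherwise 0;
   an improvement becomes the new baseline. */
int check_and_ratchet(double *baseline, double result, double tolerance_pct)
{
    if (result > *baseline) {
        *baseline = result;          /* new high water mark */
        return 0;
    }
    double drop_pct = (*baseline - result) / *baseline * 100.0;
    return drop_pct > tolerance_pct;
}
```

    Note that once the baseline has moved to 102, a later result of 101 counts as a regression even though it is still better than the original baseline of 100; that is exactly the "not allowed to regress from this new baseline" rule.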

    It's due to the immense amount of work that our development colleagues have done, and are doing, on Solaris, and to our own work in providing practical support for them, that Solaris 10 screams. And it's getting faster every day. Our aim is to enable the best, not to catch problems after they have occurred.

    The Benchmarks

    The benchmarks themselves range from things such as SpecWeb to Kenbus to home-baked benchmarks for measuring things such as boot time and, finally and most importantly, real customer workloads (which we are always looking for: if you have an enterprise workload let us know, we are always interested in getting these in house).

    Bug hunting and analysis

    Obviously all of this work throws up some bugs, and we tend to be very methodical and exacting in our analysis of them, our aim being to narrow each one down to the exact lines of code that caused the problem rather than just saying "there's a problem here". There's quite a bit to this, so I'll leave further discussion for a separate post.

    What we don't do

    We don't provide benchmark numbers for release: we work on out-of-the-box Solaris, with as little tuning as possible, to create the most realistic customer environment. Our colleagues in Market Development Engineering and Strategic Applications Engineering are focussed on getting the numbers that you see in press releases, so you won't see us mentioning numbers here.

    What we are hyped about

    OpenSolaris - we are so hyped about this its beyond belief. The beta community is very active already, and as we get closer to the code being released completely into the wild we are getting ready to work with the OpenSolaris community - and we can't wait.

    About the group

    I guess I should say a little bit about our team. The group is pretty small: ten full-time engineers (Sean, Gleb and I are the current bloggers), two interns at any given time and one manager (or ex-engineer, as we prefer to call him ;) ). We are based in Ireland, although somewhat more geographically spread out than just the Dublin office, and most of us have been with Sun for five-plus years.

    Finally, keep an eye on our blogs, over the next few weeks we will go into a lot more detail about what we do, and how we do it.

    (2005-04-01 01:43:10.0)