
Tuesday April 26, 2005
Sun's Performance Lifestyle : Automating Ourselves Into A Job
I had originally planned to base my second posting [1] on Sun's Performance Lifestyle around
the concept of testing software versus hardware, i.e. the dreaded I/O-bound benchmark.
The article is part written, but unfortunately I haven't managed to schedule time on
an available rig to put in some practical examples (real performance work obviously
gets priority). So instead I decided to write a bit about our automation, and how we
actually run the benchmarks.
We view automation as being very much key to our job. It allows us to remove the
mundane tasks and focus on the higher, and much more interesting, value-add work: that
of finding and root causing performance issues.
High Level Overview
From a high level view the process of doing a benchmark run can be
broken down into the following steps:
- Install Rig
- Install and configure required software
- Run the benchmark
- Collect the benchmark results
Looks nice and straightforward, doesn't it? I attended an amusing presentation a few
years ago given by a colleague in Ireland entitled "A Simple Matter of Programming",
which featured that "magic happens here" box that we have all encountered (you know
the one: the Architect has handed over this beautiful design document that specs out
everything bar the actual programming and implementation, which is the "magic happens
here" box in the high level system diagram).
Let's call this "A Simple Matter of Programming and Implementation".
Installation
The key to all of this is the automation of the
installs. For Solaris we maintain a local jumpstart server on our lab network (with
mirrors as needed on remote networks), while we also maintain images and install
scripts for the various Linux clones of jumpstart (i.e. kickstart etc. [2]) to use when
needed. For Windows it's a gratuitous abuse of the dd command, and some nifty automation
that Nicky, Sean and a non-blogging member of the team have put together ;).
I won't go into a jumpstart 101 tutorial here (although please ask if you would like
to see something like this), but suffice it to say that we have all of the various
steps in setting up a rig scripted.
Once a system is installed it copies over all the relevant benchmarks and software
that it needs, reboots and disconnects itself from our lab naming service. We use
the hosts file as our only naming service where applicable, as we don't want any
external factors affecting our runs. All benchmarks which involve any form of network
traffic (the vast majority of the benchmarks) are run on private subnets.
Execution
Now to actually run the benchmarks we have a custom home-grown harness that has evolved
over the years. This has upsides and downsides. The upside is that the idea behind
the core of the harness is very straightforward; the downside is that its implementation
is relatively complex, and somewhat hardwired into our environment. To actually run a benchmark
we go through the following steps:
- Validate the config (i.e. make sure everything that we are expecting to be in place,
such as network interfaces, relevant software, relevant disks and so on, is in place)
- Install a custom kernel if applicable
- Reboot
- Do any initial configuration that's needed, things such as building volumes
- Apply the relevant system tunings. As mentioned before we aim to keep our tunings
as close to out of the box Solaris as possible, so for most benchmarks this is a
pretty small set of tunings, things such as ndd values, file system tunings where
applicable, shared memory settings on images prior to Solaris 10 etc.
- Apply any relevant software tunings, say increasing the threads for a webserver or
upping cache sizes for directory servers.
- Reboot the machine to ensure everything is clean (obviously things such as ndd
tunings will be reset on reboot)
- Start the actual benchmark run
- newfs(1M) any filesystems that are going to be used by the benchmarks
- Execute the benchmark
  - During execution gather standard performance data,
    i.e. vmstat(1M), mpstat(1M) etc.
  - Gather custom data if required. Hooks exist for calling tools such as
    lockstat(1M), custom dtrace(1M) scripts, or other custom scripts when requested
- Collect the results and put them in a standard reporting format
- Copy the results back to our main server
- Reboot
- Restore the system to its initial blank configuration
- Lather, rinse and repeat as many times as is feasible for the benchmark
(the more results the better).
The lather, rinse, repeat stage is quite important: we restore the system to a completely
blank state in terms of tunings and then start all over again. There is one big reason for
this: all benchmark runs have to be completely repeatable.
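The sequence above can be sketched as a fixed, ordered pipeline. This is only an illustration in Python, not the actual harness: the step names and the `execute_step` callback are hypothetical, chosen to mirror the list. The point it tries to show is that every repetition walks exactly the same steps in the same order and ends back at a blank configuration.

```python
from typing import Callable, List

# Hypothetical step names mirroring the list above. None of these names
# come from the real harness; they exist only for this sketch.
STEPS: List[str] = [
    "validate_config",
    "install_custom_kernel",
    "reboot",
    "initial_configuration",
    "apply_system_tunings",
    "apply_software_tunings",
    "reboot",
    "start_benchmark_run",
    "newfs_filesystems",
    "execute_benchmark",
    "collect_results",
    "copy_results_to_server",
    "reboot",
    "restore_blank_configuration",
]


def run_once(execute_step: Callable[[str], None]) -> List[str]:
    """Run every step in order and return the steps that were executed."""
    executed = []
    for step in STEPS:
        execute_step(step)  # an exception here aborts the whole run
        executed.append(step)
    return executed


def run_many(execute_step: Callable[[str], None], repetitions: int) -> List[List[str]]:
    """Lather, rinse, repeat: each repetition starts from the first step."""
    return [run_once(execute_step) for _ in range(repetitions)]


def dry_run(step: str) -> None:
    """Stand-in executor that does nothing; a real one would do the work."""


runs = run_many(dry_run, repetitions=3)
print(all(run == STEPS for run in runs))
```

Keeping the step list as plain data rather than branching logic is one way to make it obvious when two runs did not follow the same path.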
Why So Hung Up On Being Repeatable?
It's a question that we get every so often: why does everything have to be so repeatable? (Our
process is fine grained enough that, barring an application crashing, we can repeat each run
on an OS instance with exactly the same pids for each process.) Put simply, it is to aid in
debugging any problems we encounter. We have a couple of criteria before we log a bug:
- The obvious one is that of "has performance degraded?"
- What is the variance on the results?
- Is the variance less than 0.5%, and less than the degradation?
If results are noisy we do some statistical analysis on them to ensure that it is a
valid degradation. At that point we log a bug.
If we allowed the runs to vary a large amount, say by using a naming service that
might go down during a run or by doing multiple runs on the same box without rebooting, we would run
a very high risk of introducing variance, which then leads to having too much noise in
our results, and then we can't confidently log a bug. Now as you can imagine everyone is busy,
so the last thing we want to do is log spurious bugs about performance problems, and either
waste our own time tracking them down or pass them on to one of our colleagues in development
and have them waste their time tracking down a phantom problem.
Let's give a simple example. Say I have a bunch of results from benchmark X, and we are
just interested in metric Y from this benchmark. We see a performance drop off of
0.7% in metric Y, but the variance in our results is 1.2%: the drop off is within the margin
of error for the run, so we can't log a bug. If everything is completely repeatable we can first look at
eliminating the cause of the variance, and then gather results to confirm if we do indeed have
a problem. If we can't repeat our experiments exactly then we end up in a situation where it's
not possible to eliminate the variance and hence we can't log a bug, and a potential drop off
in performance could reach you, the customer.
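The criteria and the example above can be written down as a small check. This is a sketch under assumptions, not the analysis the team actually runs: "variance" is taken here to mean the run-to-run spread as a percentage of the mean (the coefficient of variation), higher values of the metric are assumed to be better, and all function names are made up for illustration.

```python
import statistics
from typing import Sequence


def percent_spread(samples: Sequence[float]) -> float:
    """Run-to-run spread as a percentage of the mean (an assumed definition)."""
    return 100.0 * statistics.stdev(samples) / statistics.mean(samples)


def percent_drop(baseline_mean: float, candidate_mean: float) -> float:
    """Positive when the candidate is slower, assuming higher is better."""
    return 100.0 * (baseline_mean - candidate_mean) / baseline_mean


def can_log_bug(drop_pct: float, variance_pct: float,
                max_variance_pct: float = 0.5) -> bool:
    """Apply the three criteria: degraded, quiet enough, and spread below the drop."""
    degraded = drop_pct > 0
    quiet = variance_pct < max_variance_pct
    outside_noise = variance_pct < drop_pct
    return degraded and quiet and outside_noise


# The example from the text: a 0.7% drop with 1.2% variance is inside the noise.
print(can_log_bug(drop_pct=0.7, variance_pct=1.2))
# The same drop once the variance has been driven down to 0.2%.
print(can_log_bug(drop_pct=0.7, variance_pct=0.2))
```

Noisy results would still go through further statistical analysis before anyone gives up on them, so a check like this is a first gate rather than the final word.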
From the opposite angle, that of the developer, if the problem is repeatable, and consistent, it
makes his/her life an awful lot easier in trying to narrow it down (in most cases this is
actually us, so we are making our own lives easier first), or alternatively it makes it a lot
easier to put a fix through the exact same scenario.
Pushing the Software To The Limit
We aim at all times to push the system to the absolute limit without any I/O bottlenecks, paging
etc. We can't stress this enough. In practice this gives mpstat output that has as close to 0 as possible
in the idle column, and definitely 0 in the wait column, but with the columns still lined up (Bryan
has a great comment on this; I'm paraphrasing here, but it's along the lines of "the tool was
designed to report data with columns matching the titles; if the columns aren't matching the
titles that's a pretty good indication that you have a problem"). So an mpstat from a sample
rig may look like the following during an actual benchmark run.
(I had to use a screenshot here, as it's possible that some browsers may throw off the
formatting, and someone would say, "but those columns aren't lined up". The mpstat here is from
the tail end of a rampup on a benchmark.)
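The idle, wait and lined-up-columns checks lend themselves to a quick script. What follows is a rough Python sketch, not part of the harness: the sample rows are invented for illustration (they are not the figures from the screenshot), the `max_idle` threshold of 1 is an assumption standing in for "as close to 0 as possible", and counting whitespace-separated fields is only a crude proxy for the columns being lined up.

```python
from typing import List

# Invented sample in the Solaris mpstat layout; not real measurements.
SAMPLE = """\
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
  0    5   0   12  410  301 520   40   10   15   0  9000  80  20  0   0
  1    3   0   10  120   15 515   38   11   14   0  8900  79  21  0   0
"""


def check_mpstat(text: str, max_idle: int = 1, max_wait: int = 0) -> List[str]:
    """Return a list of problems found in one mpstat report (empty if none)."""
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    idl, wt = header.index("idl"), header.index("wt")
    problems = []
    for row in rows:
        # Fields that ran together show up as a row with too few columns.
        if len(row) != len(header):
            problems.append(f"CPU {row[0]}: columns do not match the titles")
            continue
        if int(row[wt]) > max_wait:
            problems.append(f"CPU {row[0]}: wt is {row[wt]}")
        if int(row[idl]) > max_idle:
            problems.append(f"CPU {row[0]}: idl is {row[idl]}")
    return problems


print(check_mpstat(SAMPLE))
```

An empty list means every CPU was busy, nothing was waiting on I/O, and each row had as many fields as there are titles.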
Custom Kernels and Standing On the Shoulders of Giants
We mentioned the PerformancePIT and Performance Self Test processes before. For both of
these processes we install what in Sun parlance is known as a bfu (you will hear a lot more
about bfu's when OpenSolaris comes out).
Bill Sommerfeld has posted a bit more about bfu's,
or more accurately about acr, a tool used for resolving conflicts that was recently integrated
directly into the Solaris Express gate. Put simply, tools like this eliminate the need for us to have any
manual interaction with custom kernels. They just work, which again allows us to focus on the
higher value-add areas. (Ask anyone in Sun engineering if they have ever had to resolve bfu
conflicts; grab a coffee before you ask though, or maybe a beer if you're at a BOF.)
And Wrapped Around All Of This
As you might guess we don't go around looking for idle machines and installing them with
benchmarks. Behind the scenes on our server we have a scheduler running which puts new builds
onto machines, makes sure idle machines are running benchmarks, allows us to reserve machines
and so on.
You have also probably guessed that we don't look at every result that comes in. Again we
have automated all of this process as well, and we only look at results which are of interest
to us: either big performance gains (is it a real gain, were we expecting it, if not what
caused it) or small performance drop offs. If a drop off is greater than 0.3% we start analyzing
it, and if a gain is over 5% we will look for what has caused the jump. Invariably we have a heads up
on any performance wins that are going to happen due to the PerformancePIT process; it is very
rare that we have to analyze a big jump that hasn't gone through all of the proper development
processes.
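The two thresholds amount to a simple filter on incoming results. Here is a minimal sketch in Python; the function name and return labels are hypothetical, and it assumes a metric where higher is better and a known baseline to compare against.

```python
def triage(baseline: float, result: float,
           drop_threshold_pct: float = 0.3,
           gain_threshold_pct: float = 5.0) -> str:
    """Decide whether a result deserves a human look."""
    change_pct = 100.0 * (result - baseline) / baseline
    if change_pct < -drop_threshold_pct:
        return "analyze drop"      # a drop off greater than 0.3%
    if change_pct > gain_threshold_pct:
        return "investigate gain"  # a gain over 5%
    return "ignore"


for value in (996.0, 998.0, 1040.0, 1060.0):
    print(value, triage(baseline=1000.0, result=value))
```

The asymmetry is deliberate in the text: small drops matter, while gains only need explaining when they are large.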
Automating Ourselves Into A Job
So why this title? What I have written about here is something that we don't even think about;
it just happens. It may need the occasional nudge (but that's the scheduler more
than anything else), but in general this just goes on in the background. If we had to do this
work manually we would all become very bored, very quickly, so we automate it. We use the same
approach with everything that we encounter: if it can be done by a machine, get a machine to
do it. There is always more work out there and new tech to play with, and in a place
like Sun there is always something interesting to work on.
[1] I was rather chuffed to see Sean and me mentioned on osnews;
got to admit it was a very, very pleasant surprise.
[2] Before I get a mail saying kickstart was around before Jumpstart: it wasn't.
Jumpstart has been in existence since at least Solaris 2.4 (that's the earliest
version I have encountered); kickstart first appeared around Red Hat 5.0 I believe,
which would be around 1997 (please correct me if I'm wrong on this).
(2005-04-25 21:46:28.0)