Benchmarks, Workloads, Micro Benchmarks and In-House Performance Testing
As a contributor to the SPEC performance organisation on behalf of
Sun, I tend to notice and read comments, both negative and positive,
on the benchmarks SPEC creates and administers, and I read with
particular interest articles on the SPECjAppServer benchmarks that I
am involved in. A few days ago I was forwarded a post in which the
author offers the opinion that the SPECjAppServer2004 results provide
no value, along with a pretty negative view of industry standard
benchmarks in general. I certainly don't believe that the SPEC
benchmarks, or any benchmarks, are perfect, nor do I think that they
are the only valuable source of performance information, but to claim
that the results have no value seems, well, absurd. So I thought it
might be useful to offer some background and observations on
performance measurement. In the discussion below I try to categorise
the main sources of performance information and to highlight the main
benefits and shortcomings of each.
(Formal) Benchmark
A benchmark comprises a performance testing workload
(application) or workload definition + a set of run rules and
procedures that define how the workload will be run + a process for
ensuring that the published results conform to the run rules and to
prescribed "fair use" rules about how comparisons may be made
between results.
The workload is generally a (very) complex application and
includes the user simulations, the data models (either in code or in
specification form) and all the information necessary to run
repeatable performance tests over a (potentially) wide variety of
computing environments. The run rules and procedures define how the
workload will be run, what constitutes fair and reasonable tuning,
what the requirements are for the products being tested, the format
and length of test runs, and the reporting requirements. The
benchmark usage rules (fair use) outline how one benchmark result can
be compared to others and effectively put constraints on the claims
that can be made about any particular benchmark result, hence
(ideally) increasing the value of the published results to the end
users who consume them.
Industry standard benchmark organisations such as SPEC or TPC are
made up of (IT) companies and interested individuals who contribute
time and/or money to the organisation to develop (complex) benchmarks
and to help manage them. These organisations exist to create
benchmarks and performance data that are credible, relevant and
useful to the end users who consume this data. There are many
benefits to the contributing vendors in creating and running
benchmarks. Having a forum to prove performance or price/performance
gains in their products is certainly a big motivation, but it is not
the only one: many of the benchmarks defined and created by SPEC, for
example, are used by hardware and software vendors to improve their
products long before a result is ever published on the competitive
public site. So there are very sound engineering reasons, as well as
marketing reasons, for vendors to contribute to the goal of creating
credible, relevant and useful performance benchmarks.
Another valuable source of benchmark and performance data is
vendor benchmarks. The Oracle applications benchmarks and the SAP
benchmarks are good and well known examples. The workload, run rules
and usage rules are defined by the vendor company and then made
available to third parties, or perhaps hardware partners, who want to
run and tune these workloads in their own environments. These
benchmarks have much in common with the industry standard benchmarks,
but their scope is generally limited to the product offered by the
vendor. They are very useful for potential customers of these
systems, who can use them to gain the performance information
required to size implementations of these products and hence to build
confidence in the performance capacity of the system prior to
purchase or implementation.
Benefits
-
Extremely cheap for end users. Normally the large IT vendors
have done most of the work and published results, so end users need
only look at these results and then decide if and how they might be
applied to their business and what comparisons they can make based on
the published data. End users can use the numbers with a degree of
confidence, knowing that the results have been audited or perhaps
peer reviewed to ensure compliance with the benchmark rules.
-
There is a lot of tuning information and real value in the
benchmark results themselves. Consider, for instance, the
SPECjAppServer2004 benchmark results. In each result you will find
the .html result page, which is the full disclosure report (FDR).
The FDR contains not just the benchmark results and the final and
repeat run scores but also a wealth of tuning information: tuning
for the database, the application server, the hardware, the Java
virtual machine, the JDBC driver and the operating system, everything
another user might need to be able to reproduce the result. Also in
the FDR is the full disclosure archive (FDA). The FDA contains the
scripts, database schema, deployment information and instructions on
how the environment was established. The SPECjAppServer2004 FDR and
FDA are valuable resources that I use all the time on customer sites
as a reference on how to tune and configure their production and
test systems.
-
Again in reference to the FDR and FDA, much of the raw data
and data rate information is useful. Examples include the number of
concurrent web tier transactions, the network traffic, or perhaps
the size of the database supported by the database hardware. These
data rates and speeds and feeds can be used to assess the
capabilities of certain parts of the system being tested and can be
useful in sizing some aspect of similar applications.
-
Hardware and software vendors use the benchmarks as tools to
improve their products, which in turn flows through to end users. A
good example of this at SPEC is the decision to use BigDecimal in the
web tier of SPECjAppServer2004. Even before the SPECjAppServer2004
workload was released, it was obvious to the Java virtual machine
vendors participating at SPEC that this was an opportunity to
optimise BigDecimal processing in their JVMs. So before the first
SPECjAppServer2004 results were released the JVMs were already
providing optimisations for BigDecimal, and SPECjAppServer2004 was
helping to quantify the performance gains from these optimisations.
The benefits of these optimisations flowed to all users of Java
BigDecimal who could move to the later JVMs.
-
Competition. Industry standard benchmarks are one way a
vendor can show performance improvements in their products and
performance leadership over their competition, and perhaps gain a
marketing advantage, so there is a fiercely competitive aspect to
industry standard and vendor benchmarking. This competition is
generally good for end users, as it commonly produces tunings and
optimisations in the vendors' products that benefit a wide range of
applications using their technology. Indeed, this is the situation
that the run rules and fair use rules generally strive to promote.
Limitations
-
Inappropriate comparisons, or extrapolation of results. Care
must be taken to make selective and reasonable judgements based on
the information provided in the results or benchmark reports. It
makes no sense, for instance, to use results from SPECjAppServer2004,
which is a transactional benchmark, to size a system for data
warehousing or business intelligence; a TPC-H benchmark would be the
place to go for this information. Looking at the transaction rate
disclosed in a benchmark report and then extrapolating this result
upwards is also risky, as performance is not a continuous function
but can instead have many discrete jumps, and tested configurations
may have hard ceilings such as memory capacity or bus bandwidth.
For example, it might not be accurate to predict the performance of
a single instance of the GlassFish application server on a 64 core
machine based on the JOPS per GlassFish instance measured on a 4
core machine.
-
There will be limitations on how closely industry standard
benchmarks model your chosen or developed application. In the case
of SPECjAppServer2004 the developers and participants, companies
like IBM, Intel, Sun and Oracle, have looked at our customer base and
tried to model the web applications we have seen our customers
developing, or perhaps based our modelling decisions on what they
told us they were going to develop.
-
Industry standard benchmarks trail the technology curve and
hence will often be using an older version of infrastructure /
technology than the market would like. This is because the
benchmark can't come out until there is an established set of
products to run the benchmark and because it takes time for
companies to run and scale the benchmarks and to build a set of
results that is useful for end users. For example, development of the
SPECjAppServer2004 workload was well underway before there were
many/any J2EE 1.4 products available, but it wasn't released until
after most of the major J2EE application server companies had
released their products. Similarly work is underway for the new
version of SPECjAppServer but it is trailing the availability of
the application servers that have implemented the Java Enterprise
Edition 5.0 specification.
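To make the extrapolation risk in the first limitation above a little more concrete, here is a minimal sketch in Python. The throughput figure, the ceiling and the function names are all invented purely for illustration; they are not taken from any published result, and a real system will rarely scale this neatly even below its ceiling.

```python
def linear_extrapolation(measured_throughput, measured_cores, target_cores):
    """Naively scale a measured result in proportion to the core count."""
    return measured_throughput * target_cores / measured_cores


def capped_estimate(measured_throughput, measured_cores, target_cores, ceiling):
    """Same linear scaling, but limited by a hard ceiling
    (for example memory capacity or bus bandwidth)."""
    naive = linear_extrapolation(measured_throughput, measured_cores, target_cores)
    return min(naive, ceiling)


# Hypothetical figures, invented for illustration only.
measured = 400.0   # throughput measured on the 4 core machine
ceiling = 2500.0   # assumed limit imposed by some shared resource

naive = linear_extrapolation(measured, 4, 64)
capped = capped_estimate(measured, 4, 64, ceiling)
print(f"naive: {naive:.0f}, capped: {capped:.0f}")
```

With these made-up numbers the naive estimate is more than double what the capped model allows, which is the kind of gap that makes upward extrapolation from a small configuration risky.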
Workload
A performance workload is similar to the benchmark
described above but it lacks the run rules, process and oversight.
This means that end users can't (in general) have high levels of
confidence in the performance claims made by vendors publishing
results based on these workloads. End users reading benchmark
reports and performance claims made from using workloads without the
process of a formal benchmark will have much more work to do to
decide what comparisons make sense and what comparisons may in fact
even be misleading. For example, a vendor could use a workload like
DBT2 and publish test results comparing, say, the largest server
hardware running database "A" with a tiny single CPU server running
database "B" and then, without disclosing the different hardware
platforms, offer this as data to suggest that database "A" performs
better than database "B". Sure, this is an exaggeration, but it
serves to demonstrate the value of the process and disclosure rules
of the formal benchmarks.
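As a rough sketch of why disclosure matters, the Python below attaches the hardware to each result and refuses to compare raw throughput when the hardware differs. The Result type, its fields and the compare function are hypothetical names made up for this illustration; real run rules and fair use rules cover far more than CPU count and memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    product: str
    throughput: float  # transactions per second, as reported
    cpus: int          # disclosed hardware: number of CPUs
    memory_gb: int     # disclosed hardware: memory in GB


def compare(a: Result, b: Result) -> str:
    """Return the better performing product, but only when both
    results were measured on the same disclosed hardware."""
    if (a.cpus, a.memory_gb) != (b.cpus, b.memory_gb):
        raise ValueError("hardware differs; raw throughput is not comparable")
    return a.product if a.throughput >= b.throughput else b.product
```

Comparing a hypothetical Result("A", 9000.0, 64, 512) against Result("B", 800.0, 1, 8) raises ValueError rather than declaring "A" the winner, which is the outcome the exaggerated example above calls for.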
Benefits
-
Workloads don't have the restrictions of the process imposed
by the industry standard benchmark bodies, and as such it is much
easier to just run workloads and report results from them. For
example, in the open source database world the SysBench workload is a
very valuable tool and is commonly used for performance testing of
code changes to the MySQL database. Results of these tests are widely
and openly reported and used as the basis for even more performance
improvement. One key point here is that the workload is being used
collaboratively for investigation, not primarily competitively to
sell something.
-
Workloads are potentially designed by individuals, and so
development cycles may be shorter than those of the industry
standard benchmarks.
Limitations
-
Risk of error, especially in tuning. Even though running
performance workloads in-house can be relatively straightforward,
there is still the risk of getting the wrong answer. Consider trying
to determine which is the better performing database, "A" or "B",
by doing workload based performance testing of both. The user
running the tests and trying to accurately compare the results has
to have the expertise to tune both "A" and "B" to the point where
they make optimal use of the hardware and operating system
resources; otherwise the results may be misleading.
-
There is a cost to running performance investigations
in-house, and though running performance workloads may be relatively
cheap, it is potentially more expensive than using published industry
standard or vendor benchmark figures where these are available.
Micro Benchmark
A micro benchmark is usually a small, generally simple workload
that tests only a limited number of system or user functions. In
fact, most often the micro benchmark will not have any process such
as reporting rules, or any basis for comparison of results, so I
believe a better term would be micro workload.
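For a sense of scale, a micro benchmark can be as small as the following Python sketch, which times a single decimal addition using the standard timeit module. The function name and the operand values are arbitrary choices for this illustration, and the measured time includes the overhead of calling the lambda, so the figure is only indicative.

```python
import timeit
from decimal import Decimal


def time_decimal_add(iterations=100_000, repeats=5):
    """Return the best observed time in seconds for one Decimal addition."""
    a, b = Decimal("1234.56"), Decimal("0.07")
    timer = timeit.Timer(lambda: a + b)
    # Take the minimum over several repeats to reduce noise from other
    # activity on the machine.
    best = min(timer.repeat(repeat=repeats, number=iterations))
    return best / iterations


if __name__ == "__main__":
    print(f"Decimal add: {time_decimal_add() * 1e9:.1f} ns per operation")
```

Note that it tests exactly one function and carries no run rules or reporting rules of any kind.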
Benefits
-
Generally free or cheap to download or develop.
-
Very easy to run and report results on.
-
Potentially a very powerful tool for diagnosing low-level
performance problems.
-
Because micro benchmarks are generally fairly simple and only
measure a very small set of performance attributes, comparisons
may in fact be valid across platforms.
Limitations
-
It is generally not possible, and rarely a good idea, to
predict larger system or application performance based on micro
benchmarking. Again, by their nature micro benchmarks test and
consider only a small subset of the performance of the system being
tested, so it is quite likely that other factors beyond the scope of
the micro benchmarks will affect total system response and
throughput.
In-House Application Performance Testing (customer benchmarking)
This is where an end user, customer or software developer creates
a purpose built stress test (workload) for their application or
hardware and tests what they intend to run in production. It is
arguably the most reliable way for an end user or application
developer to understand their code / environment, as it involves
running the actual code planned for production. There is no need to
try to apply the results and resource utilisation from some other
performance test (benchmark or workload), as it is your own code that
is being run and directly observed.
Benefits
Limitations
-
By far this is the most expensive of all of the options, but
for a company or individual potentially spending a lot of money on
hardware or software this may well be the best option, and the
potentially substantial costs associated with developing a test
harness and simulation, determining the test parameters and running
the performance tests may be well worth it. One caveat is that as
software costs fall with the increasing enterprise use of open source
software, the costs of running "in-house" performance tests may start
to look large versus the software purchase cost.
-
Running in-house application performance benchmarks is still
not without risk. A wide variety of skills is required to create the
simulation and determine that it covers the expected usage pattern
for the application. Different skills are needed to deploy and tune
the application and any middleware it requires, and DBA skills will
be required as well as general performance tuning skills. The
variables really start to add up.
Summary
I hope to have provided a very high level overview and a useful
categorisation of the main sources of performance data available
today. In my opinion each of these sources or approaches to
obtaining performance data has great value. However, because
performance analysis is always a contentious and often a more
subjective topic than it should be, I do not expect to settle too
many debates. Hopefully, offering a broader perspective on the value
of benchmarking than I have seen in (some) other forums is useful to
those who might need or rely on this data.