Eoin Lawless

Tuesday Sep 22, 2009

Two Recent News Stories

Netflix Prize awarded

The Netflix Prize was finally awarded three years after the competition began. A team of seven individuals (BellKor's Pragmatic Chaos) scooped the $1 million prize.

Netflix is a movie rental company, and in 2006 it was looking for a way to improve its movie recommendation system. This was important to Netflix, as it is to Amazon and pretty much every other online retailer, since it allows customers to quickly find products that will appeal to them, and hence increases sales and return custom.

They released a database of 100 million movie ratings from almost half a million customers. The challenge was to predict the ratings these users would give to a quiz dataset of movies. Netflix knew these ratings but withheld them, and marked competitors by how closely their predictions matched the real scores that customers had given.
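The competition scored entries by root mean squared error (RMSE) between predicted and actual ratings. Here is a minimal sketch of that measure; the function name and the ratings are made up purely for illustration and are not taken from the Netflix data.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up ratings on a 1-5 star scale, purely for illustration.
withheld = [4, 3, 5, 1, 2]
guesses = [3.5, 3.0, 4.0, 2.0, 2.5]
score = rmse(guesses, withheld)
print(f"RMSE = {score:.4f}")
```

A lower score means the predictions are closer to the withheld ratings; a perfect set of predictions scores zero.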

The competition is important in that it significantly advanced the state of the art in recommender systems (aka collaborative filtering) in an open and collaborative manner.

A whole host of techniques were either developed or enhanced to tackle the problem and to handle the large size of the database - the largest publicly available database of customer rating data.

One interesting outcome was that no single method of analysing the data was sufficient to achieve the required quality of predictions. Combinations or ensembles of many different techniques provided the best results.

All the details, including the papers describing the winning algorithms, are available at the Netflix Prize website.

Dead Fish responds to Human Emotion

This story has been doing the rounds of Slashdot and the Register recently, but is pertinent here given my interest in statistics.

The researchers in human brain mapping took the standard statistical tools for analysing MRI data of human brains and applied them to a fish. A dead Atlantic salmon to be exact. The test otherwise followed normal lines: "the salmon was shown a series of photographs depicting human individuals in social situations. The salmon was asked to determine what emotion the individual in the photo must have been experiencing."

Afterwards they found correlations between brain activity, as measured by the MRI equipment and analysed using standard techniques, and the photographs of the people. Thus proving that a dead fish responds to the emotions displayed by humans in social situations... or perhaps that the standard statistical analysis needs more rigour.

The original poster is online.

Friday Sep 04, 2009

Confidence and Deviation


The seventh circle of benchmarking hell is reserved for the analyst who quotes a single number when reporting the performance of a system, and supplies no other information. In the outer circles of this particular hell the analyst might provide details such as how many times the benchmark was run, what units the result is in, complete details on the configuration of the system and benchmark etc.

For marketing purposes it may be desirable to have a pithy result like "15% faster than the competition!" However if we have a genuine wish to understand one or more sets of performance data, then there are lots of statistical tools that can help.

In this entry I'm going to look at two related concepts, and explain the important differences between them: standard deviation and confidence intervals.

Although in theory computers are deterministic - return the same output for a given input - in practice the inputs are so numerous and varied that we cannot completely control them all. For example, consider a disk IO benchmark - a sudden noise or an increase in vibration due to a fan starting can cause a change in performance.

In order to account for the range of possible results we typically perform a benchmark multiple times, while controlling external disturbances as far as possible. At the end we will have a set of numbers. It is not feasible to quote the entire set of numbers to describe the performance of the system. Usually we use the average value to describe the performance. However, this doesn't take any account of the range of individual results. This is where standard deviation makes its appearance. It is estimated from the measurements using a simple formula, but don't worry about the exact details for now.

It is essentially a measure of dispersion or of how spread out the numbers are, while the mean is a measure of the central point of the range.
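For reference, a common form of this estimate is the sample standard deviation. The symbols are my own labels: n is the number of benchmark runs, x_1 to x_n are the individual results, and the n - 1 divisor is the usual choice when estimating from a sample.

```latex
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i ,
\qquad
s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2}
```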

The mean and standard deviation are intrinsic properties of a system. It is usually not possible to measure them directly, but we can estimate them by making a number of measurements. For example: do people prefer the colour blue to the colour red? We can't ask everyone in the world, so we estimate the answer by asking a sample of the population. In the same way we estimate the mean and standard deviation of a system's performance by running a benchmark a number of times, using the average value to estimate the mean and the standard deviation formula to estimate the standard deviation.
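As a rough sketch of that estimation step, here is how it might look with Python's standard statistics module. The ten results are hypothetical numbers invented for the example, not measurements from a real system.

```python
import statistics

# Hypothetical results (say, Mb/s) from ten runs of the same benchmark.
runs = [102, 98, 105, 97, 101, 99, 103, 100, 96, 99]

mean_estimate = statistics.mean(runs)    # estimate of the true mean
stdev_estimate = statistics.stdev(runs)  # sample standard deviation (n - 1 divisor)

print(f"mean  = {mean_estimate:.1f}")
print(f"stdev = {stdev_estimate:.2f}")
```

Note that statistics.stdev uses the n - 1 divisor, while statistics.pstdev divides by n and is only appropriate when the data is the whole population.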

It is important to note that our estimate of the standard deviation does not tell us anything about how accurate our estimate of the mean is. A large (estimated) standard deviation does not imply that our estimate of the mean is bad (it is bad only in so far as it makes the performance of the system less predictable), and conversely a small (estimated) standard deviation does not imply that our estimate of the mean is highly accurate. These two measures are unchanging properties of the system that we are merely estimating. But we don't know how good our estimates are.

This is where confidence intervals come in. We typically calculate the confidence interval for the mean but it can be done for the estimate of the standard deviation as well. It provides a description of the data in the form "we are 95% certain that the mean of the population is in the range X to Y." This is a statement about the accuracy of our estimates. The smaller the range of X to Y, the more accurate. We can generally make this range smaller if required by running more benchmarks. This is quite different to the standard deviation (or mean) which has a fixed value that we estimate, but the real (unknown) value of which will not change by rerunning the benchmark.
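To make this concrete, here is a minimal sketch of a 95% confidence interval for the mean. It assumes the normal approximation (a multiplier of about 1.96); for a small number of runs a Student-t multiplier is more appropriate and gives a wider interval. The function name and the sample data are hypothetical.

```python
import math
import statistics

def mean_confidence_interval(samples, confidence=0.95):
    """Approximate confidence interval for the mean (normal approximation)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean = statistics.mean(samples)
    standard_error = statistics.stdev(samples) / math.sqrt(n)
    # z is about 1.96 for 95%; a Student-t quantile would be larger for small n.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * standard_error, mean + z * standard_error

# Hypothetical results from ten runs of the same benchmark.
runs = [102, 98, 105, 97, 101, 99, 103, 100, 96, 99]
low, high = mean_confidence_interval(runs)
print(f"95% CI for the mean: {low:.2f} to {high:.2f}")
```

The width of the interval depends on the standard error, which is the estimated standard deviation divided by the square root of the number of runs.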

To summarize:

  • Standard deviation is a measure of the intrinsic variability of a system
  • The 95% confidence interval is a measure of the quality of our estimate of the mean (or standard deviation).
  • We can reduce the standard deviation by improving the system itself
  • We can reduce the size of the confidence interval by running the benchmark lots of times.
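A small simulation can illustrate the last two points. This is only a sketch: the "system" is a random number generator with an assumed true mean and standard deviation, and the interval again uses the normal approximation.

```python
import math
import random
import statistics

TRUE_MEAN = 1000.0   # assumed "real" mean of the simulated system
TRUE_STDEV = 50.0    # assumed "real" standard deviation

def summarize(n, rng):
    """Simulate n benchmark runs; return (stdev estimate, 95% CI half-width)."""
    runs = [rng.gauss(TRUE_MEAN, TRUE_STDEV) for _ in range(n)]
    stdev = statistics.stdev(runs)
    half_width = 1.96 * stdev / math.sqrt(n)  # normal approximation
    return stdev, half_width

rng = random.Random(42)  # fixed seed so the sketch is repeatable
results = {n: summarize(n, rng) for n in (10, 100, 1000)}
for n, (stdev, half_width) in results.items():
    print(f"n={n:5d}  stdev={stdev:6.1f}  CI half-width={half_width:6.1f}")
```

As the number of runs grows, the estimated standard deviation should settle near the assumed true value, while the confidence interval for the mean keeps narrowing roughly in proportion to one over the square root of the number of runs.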

Finally, I am not going to describe the mechanics of calculating the 95% confidence interval. Wikipedia describes it quite well, and most stats packages will do the heavy lifting for you; I like to use R.

I hope this helps you understand where and when standard deviation and confidence intervals can be used, and keeps you out of benchmark analysis hell. There are some caveats - for instance, the assumption of normality - but I will leave those for another day.

Wednesday Sep 02, 2009

Erratic network performance: Spin mutexes vs. Interrupts


I was recently investigating the cause of high variance in network performance between Logical Domains on a SunFire T2000. I was running the iperf benchmark from one LDom guest to two other LDom guests. The rig was configured like this:
# ldm ls
NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  UPTIME
primary          active     -n-cv-  SP      8     4G       2.1%  1d 1h
oaf381-ld-1      active     -n----  5000    8     6G        13%  1m
oaf381-ld-2      active     sn----  5001    8     6G       0.0%  1m
oaf381-ld-3      active     sn----  5002    8     6G       0.0%  2m
Sometimes I would see throughput of up to 1360 Mb/s, but on other runs it would drop to as low as 870 Mb/s. Here's a graph of the benchmark results; as you can see, they are very erratic. (You may need to open it in a separate window if your browser scales it).

Looking at mpstat output there seemed to be some sort of connection between a high spin mutex count and performance, but it's hard to get a grasp of tens of mpstat outputs at once.

For example here is mpstat output for a run with a result of 1318 Mb/s

CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0 1490   313    1   66   11    2 4250    0     4    0 100   0   0
  1    0   0 1467   314    0   71    9    4 4678    0     4    0 100   0   0
  2    0   0  486  2207    4 3687    2 1277  187    0    34    0  24   0  76
  3    0   0  192  1048    2 1574    2  526  106    0    21    0  12   0  87
  4    0   0  627  3302    5 6008    9  657  163    0    36    0  30   0  70
  5    0   0  608  3134    6 5597   11  695  159    0    45    0  31   0  68
  6    0   0 3911  6130 4094 4590   31  663  222    0    62    0  44   0  56
  7    0   0 4462  6279 4205 4625   32  666  238    0    50    0  45   0  55

and here is mpstat output for a run with a result of 882 Mb/s
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0  666  5338    4 9695    8  523  318    0    29    0  34   0  66
  1    0   0  540  4593    6 8272    9  506  277    0    34    0  31   0  69
  2    0   0  405  3382    6 5795    9  448  202    0    43    0  28   0  72
  3    0   0 1644  2037  112 3208    5  338  124    0    48    0  25   0  75
  4    0   0   84   928    4 1283    1  402   82    0    30    0  14   0  86
  5    0   0   36   503    2  496    0  152   31    0    15    0   5   0  95
  6    0   0 6490  6540 6102   87   23    1 2197    0     5    0 100   0   0
  7    0   0 6485  6547 6107   92   23    2 2336    0     5    0 100   0   0
The best way I found to see the pattern was to graph it. For each benchmark run I found which CPU had the highest smtx count, and plotted that smtx value against the iperf result, using a different colour for each CPU. The graph is below and reveals an unusual pattern:
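The "highest smtx count" step can be sketched as follows. This is not the script I actually used, just one way it might be done; the function name is hypothetical, and it assumes a single mpstat interval with one header line, like the 882 Mb/s sample above.

```python
def busiest_smtx_cpu(mpstat_text):
    """Return (cpu, smtx) for the CPU with the highest smtx count."""
    lines = [line.split() for line in mpstat_text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    # Look the columns up by name rather than by fixed position.
    cpu_col = header.index("CPU")
    smtx_col = header.index("smtx")
    best = max(rows, key=lambda row: int(row[smtx_col]))
    return int(best[cpu_col]), int(best[smtx_col])

# mpstat sample for the 882 Mb/s run, copied from above.
SAMPLE = """\
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0  666  5338    4 9695    8  523  318    0    29    0  34   0  66
  1    0   0  540  4593    6 8272    9  506  277    0    34    0  31   0  69
  2    0   0  405  3382    6 5795    9  448  202    0    43    0  28   0  72
  3    0   0 1644  2037  112 3208    5  338  124    0    48    0  25   0  75
  4    0   0   84   928    4 1283    1  402   82    0    30    0  14   0  86
  5    0   0   36   503    2  496    0  152   31    0    15    0   5   0  95
  6    0   0 6490  6540 6102   87   23    1 2197    0     5    0 100   0   0
  7    0   0 6485  6547 6107   92   23    2 2336    0     5    0 100   0   0
"""

cpu, smtx = busiest_smtx_cpu(SAMPLE)
print(f"highest smtx: CPU {cpu} with {smtx}")
```

Running this once per benchmark run gives one (CPU, smtx) pair per run, which can then be plotted against the corresponding iperf result.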

A few notes:

  • There appear to be four groupings of behaviour
  • If the highest smtx count is on CPU 6 or 7, the iperf result is low
  • If the highest smtx count is on CPU 1 or 2 the iperf result is high
  • The highest smtx count is never on CPU 3.
  • There is a range of results with very low smtx values, so there may be another variable in play as well.

Another data point is that in every run, the same two CPUs (6 and 7) handled the interrupts for the vnet device. Here is the intrstat output, and it is confirmed by the mpstat output above:

      device |      cpu0 %tim      cpu1 %tim      cpu2 %tim      cpu3 %tim
-------------+------------------------------------------------------------
       vdc#0 |         0  0.0         0  0.0         0  0.0         4  0.0
      vnet#0 |         0  0.0         0  0.0         0  0.0         0  0.0
      vnet#1 |         0  0.0         0  0.0         0  0.0         0  0.0

      device |      cpu4 %tim      cpu5 %tim      cpu6 %tim      cpu7 %tim
-------------+------------------------------------------------------------
       vdc#0 |         0  0.0         0  0.0         0  0.0         0  0.0
      vnet#0 |         0  0.0         0  0.0         0  0.0         0  0.0
      vnet#1 |         0  0.0         0  0.0      3973  3.6      3980  3.6
So the first conclusion I could draw was that if the interrupt handling and whatever generates the spin mutexes are on the same two CPUs, then iperf performance is badly affected.

I will follow up this blog entry with more analysis and some workarounds.

Notes

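I was running with iperf 2.0.4. oaf381-ld-1 is the server, oaf381-ld-2 and oaf381-ld-3 are the clients. (Note that in iperf's own terminology the roles are the other way round: -c runs iperf in client mode and -s in server mode.) It is invoked on the server as: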
iperf204 -c 192.1.44.2 -f m -t 120 -N -l 1M -P 100 &
iperf204 -c 192.1.44.3 -f m -t 120 -N -l 1M -P 100 &
and on the clients as:
iperf204 -s -N -f m -l 1M

Tuesday Sep 01, 2009

Bad practices exposed

There is an interesting (and frightening!) study recently published in the ACM Transactions on Storage: A nine year study of file system and storage benchmarking together with a short summary of the results and a set of recommendations.

It surveys nine years' worth of papers on file system and storage benchmarks and makes for sobering reading:

"We found that most popular benchmarks are flawed, and many research papers used poor benchmarking practices and did not provide a clear indication of the system's true performance."

and:

"Finally, only about 45% of the surveyed papers included any mention of statistical dispersion."

We can only hope that the paper will raise awareness of the importance of rigor and thoroughness in performance benchmarking.
