David Dice's Weblog

Wednesday Sep 23, 2009

The perils of negative scalability

I've seen the following issue confound customers and colleagues of late, so thought it worth a blog entry.

Lets say you have an application that exhibits negative scalability. That is, if you were to plot throughput on the Y-axis and concurrency on the X-axis the shape of the curve would
be convex downward -- performance climbs up to an apex and then falls off. (How can this happen? A common reason is that the communication overheads start to dominate and the
benefit of concurrency is overcome by communication and synchronization costs). Under such circumstances it's common to introduce some type of admission control -- say, simple
back-off or more elaborate mechanisms -- to restrict concurrency. Ideally, this yields an asymptotic curve where performance remains constant after reaching the peak, avoiding any
fall-off.

If you tune the performance of such an application using the usual measure-analyze-modify cycle but pay attention only to the throughput values at high concurrency levels then you
might be badly misled. The usual development feedback loop can fail because poor "optimizations" that slow down the code may actually serve as inadvertent implicit back-off (contention
management) that will attenuate the negative scalability at higher concurrency levels but also needlessly impair performance at lower concurrency levels. Ideally, back-off should
be applied only as needed, in response to contention.

A related effect is that inserting diagnostic probes might yield better performance in the region of negative scalability because of probe overhead -- a performance "Heisenbug" where
performance improves when execution is more closely observed.

The take-away is that we should be careful to measure the performance of any proposed change over a wide range of concurrency values, and not just at the extremes.

Comments:

I've seen it happen usually related to access to databases. You correct a bottleneck in your application, and all of a sudden, all those requests "fly" straight to the database and bring it dow to its knees, ending up with worse performance, because you did not know the bottleneck was actually "protecting" the database from excessive load.
Nothing that cannot be fixed, but it is true that it is sometimes difficult to explain how "fixing" something can result in worse performance... until you fix the hidden-and-now-uncovered problem.
S!

Posted by GreenEyed on September 28, 2009 at 07:25 AM EDT #

Good point ...

This type of effect (or close analogs) manifests in quite a number of ways. A few others that come to mind are CSMA-CD communication protocols when arbitrating for a share resource (spectrum, access to the physical "wire", etc) and "anticipatory scheduling" (http://en.wikipedia.org/wiki/Anticipatory_scheduling).

Regards, Dave

Posted by 192.18.128.5 on September 28, 2009 at 10:16 AM EDT #

See also: www.cs.sfu.ca/~fedorova/papers/wiosca06.pdf.

Alexandra Fedorova, Margo Seltzer and Michael D. Smith, A Non-Work-Conserving Operating System Scheduler for SMT Processors , In Proceedings of the Workshop on the Interaction between Operating Systems and Computer Architecture (WIOSCA), in conjunction with ISCA-33, June 2006

Posted by David Dice on October 26, 2009 at 04:42 PM EDT #

Post a Comment:
  • HTML Syntax: NOT allowed
Copyright © 2006-2010 Dave Dice, All Rights Reserved.

Calendar

Feeds

Search

Links

Navigation

Referers

XML Feed
Creative Commons License
This work is licensed under a Creative Commons Attribution-No Derivative Works 3.0 License.

The views expressed on this [blog; Web site] are my own and do not necessarily reflect the views of Oracle.