While parallelization normally focuses on loops with large trip
counts for its performance gains, there are situations where this
restricted focus can be problematic:
• Small tasks: for applications that are primarily composed of
short-duration tasks, interspersed with synchronization points
(e.g., multi-phase operations), parallelization may be impossible
(or very challenging).
• Single-threaded component: even
when an application's hot loops have been
successfully threaded, any remaining single-threaded components can
rapidly become a bottleneck as the application is scaled to larger
numbers of threads (courtesy of Amdahl's law). For example, if
10% of an application is single-threaded, the performance gain
obtained from using eight threads is limited to 4.7X. If an
additional 5% of the application can be threaded, the performance
gain is increased to almost 6X, clearly illustrating the importance
of achieving as close to complete parallelization as possible.
• Critical threads: the scalability of
multithreaded (MT) applications can be limited by the performance of
certain critical threads. If the work undertaken by these critical
threads cannot be further subdivided, performance may stop scaling
when the critical threads become 100% busy.
• Critical sections: given that only one thread can
occupy a critical section at a time, if threads spend too
long in a critical section, application scalability can suffer, as
other threads stall waiting for access to the same section. While the
time in the critical section may account for only a small
portion of the total processing undertaken by a thread, minimizing
the time spent in the critical section may result in significant
performance benefits.
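
The last point, keeping critical sections short, can be sketched in C as follows. This is only a minimal illustration and is not taken from the original work: the `shared_sum_t` type and the `add_slice` function are hypothetical names, and POSIX threads are assumed. Each thread does its bulk work on private data with no lock held, then enters the critical section just long enough to publish one result.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical shared accumulator protected by a mutex. */
typedef struct {
    pthread_mutex_t lock;
    long total;
} shared_sum_t;

/* Sum a private slice with no lock held, then publish the result in one
   short critical section instead of locking once per element. */
void add_slice(shared_sum_t *s, const int *data, size_t n)
{
    assert(s != NULL);

    long local = 0;
    for (size_t i = 0; i < n; i++)
        local += data[i];            /* bulk of the work: outside the lock */

    pthread_mutex_lock(&s->lock);
    s->total += local;               /* critical section: a single addition */
    pthread_mutex_unlock(&s->lock);
}
```

With this structure the time a thread holds the lock no longer depends on the size of its slice, so other threads stall for a much shorter period.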
In all of the cases listed above, the performance of MT applications
is adversely impacted by small single-threaded sections. These
single-threaded sections can be broadly divided into two categories:
those that are intrinsically serial in nature, and those that are
amenable to parallel processing if the associated threading overheads
can be reduced.
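
The cost of a small single-threaded section can be made concrete with Amdahl's law. The worked figures below reproduce the 4.7X and "almost 6X" numbers quoted earlier; the symbols s (serial fraction) and N (number of threads) are notation chosen here for illustration.

```latex
S(N) = \frac{1}{s + \dfrac{1 - s}{N}}

% 10% serial, 8 threads
S(8)\big|_{s=0.10} = \frac{1}{0.10 + 0.90/8} = \frac{1}{0.2125} \approx 4.7

% 5% serial, 8 threads
S(8)\big|_{s=0.05} = \frac{1}{0.05 + 0.95/8} = \frac{1}{0.16875} \approx 5.9

% Upper bound as the thread count grows
\lim_{N \to \infty} S(N) = \frac{1}{s}
```

Under this model the 10% serial case can never exceed 10X however many threads are added, which is why shrinking the serial fraction matters more as thread counts rise.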
While the serial segments remain problematic on CMP systems, the low
inter-thread communication overheads (resulting from the shared L2
cache) allow even very small tasks to be cost-effectively threaded. In
essence, CMP systems allow 'micro-parallelization'.
Micro-parallelization is widely applicable and can be used to address
all of the performance problems discussed in the previous paragraphs.
To illustrate its benefits, consider Fig. (i). In this example, a simple
copy operation (such as that performed by bcopy or memcpy) is divided
amongst multiple worker threads using lock-free synchronization. Each
worker thread performs a portion of the copy and, as the number of
threads is increased, the work undertaken per thread decreases
proportionally. Fig. (ii) illustrates performance as the number of
threads is increased, for three different copy sizes. For the Niagara
CMP system, performance scales almost linearly with the number of
threads, even when each thread is only copying 64 elements. In
contrast, for the traditional SMP system, while acceptable scaling is
obtained for the 8192-element copy, synchronization overheads are
significant and performance regressions are observed both at higher
thread counts and for the smaller copies, because the coherency
overheads associated with synchronizing the master and worker threads
quickly outweigh the performance benefits of the additional worker
threads.
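
To make the copy experiment more concrete, here is a minimal C sketch of a master thread dividing a copy among pre-started worker threads, synchronizing with atomics rather than locks. It is not the code used to produce the figures: every name (`parallel_copy`, `start_workers`, `stop_workers`, `NWORKERS`), the fixed worker count, the generation-counter handshake, and the use of `sched_yield()` while waiting are assumptions made for illustration, using C11 atomics and POSIX threads.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NWORKERS 3              /* hypothetical: helper threads besides the master */

/* Job descriptor written by the master before each copy (illustrative). */
static struct { char *dst; const char *src; size_t len; } job;

static atomic_uint generation;  /* bumped by the master to publish a job */
static atomic_uint done;        /* workers finished with the current job */
static atomic_int  quit;        /* set to 1 to shut the workers down */
static pthread_t   threads[NWORKERS];

/* Copy the slice of the current job owned by participant `id` (0 = master). */
static void copy_slice(unsigned id)
{
    size_t chunk = (job.len + NWORKERS) / (NWORKERS + 1);  /* ceil(len / participants) */
    size_t begin = (size_t)id * chunk;
    if (begin >= job.len)
        return;                                            /* nothing left for this id */
    size_t n = job.len - begin < chunk ? job.len - begin : chunk;
    memcpy(job.dst + begin, job.src + begin, n);
}

static void *worker(void *arg)
{
    unsigned id = (unsigned)(uintptr_t)arg;
    unsigned seen = 0;
    for (;;) {
        unsigned g;
        /* Wait, without taking a lock, for a new job or a quit request. */
        while ((g = atomic_load_explicit(&generation, memory_order_acquire)) == seen) {
            if (atomic_load_explicit(&quit, memory_order_acquire))
                return NULL;
            sched_yield();      /* assumption: be polite if cores are oversubscribed */
        }
        seen = g;
        copy_slice(id);
        atomic_fetch_add_explicit(&done, 1, memory_order_release);
    }
}

void start_workers(void)
{
    for (uintptr_t i = 0; i < NWORKERS; i++) {
        int rc = pthread_create(&threads[i], NULL, worker, (void *)(i + 1));
        assert(rc == 0);
        (void)rc;
    }
}

void stop_workers(void)
{
    atomic_store_explicit(&quit, 1, memory_order_release);
    for (size_t i = 0; i < NWORKERS; i++)
        pthread_join(threads[i], NULL);
}

/* Master side: publish the job, copy slice 0, then wait for the workers. */
void parallel_copy(void *dst, const void *src, size_t len)
{
    assert(len == 0 || (dst != NULL && src != NULL));
    job.dst = dst;
    job.src = src;
    job.len = len;
    atomic_store_explicit(&done, 0, memory_order_relaxed);
    atomic_fetch_add_explicit(&generation, 1, memory_order_release);
    copy_slice(0);
    while (atomic_load_explicit(&done, memory_order_acquire) != NWORKERS)
        sched_yield();
}
```

A caller would invoke `start_workers()` once, issue any number of `parallel_copy(dst, src, len)` calls from a single master thread, and finish with `stop_workers()`. Every copy costs one handshake on `generation` and one on `done`; it is the price of moving those shared cache lines between threads that the text identifies as cheap on a CMP with a shared L2 cache and expensive on a traditional SMP.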
From Fig. (ii), it is apparent that, with the advent of CMP systems,
many small, short-duration tasks that we traditionally viewed as
single-threaded (e.g., small memcpy operations) can now be
cost-effectively threaded. While the indiscriminate threading of these
operations in MT applications is not beneficial, these
micro-parallelization techniques can be used to alleviate scaling
bottlenecks by improving the performance of these problem components.

Going forward, we need to start to leverage these
'micro-parallelization' techniques more aggressively and exploit the
full potential of CMTs.
[Abstracted from the IJPP publication; see that paper for more details.]