Friday May 09, 2008

Thanks to everyone who attended my CommunityOne presentation earlier this week. The slides can be downloaded from here (search for Microparallelism) [Username: contentbuilder Password: doc789]. There are significantly more Microparallelism examples than were contained in the recent Multicore Expo presentation.


Tuesday Apr 29, 2008

My slides from Multicore are now available on the OpenSPARC website here.


Thursday Apr 10, 2008

I have a more detailed presentation on Microparallelism at the upcoming CommunityOne conference (May 5th). Details here.

Thursday Mar 13, 2008

I have an article titled "Memory-Link Compression Schemes: A Value Locality Perspective" that will appear in the June issue of IEEE Transactions on Computers. The article can be found here.

Tuesday Mar 11, 2008

I'll be presenting at the upcoming Multicore Expo on "Multicore Processors and Microparallelism". The agenda for the conference can be found here.

Wednesday Dec 05, 2007

Located here

Thursday Oct 04, 2007

In comparison to using onchip crypto accelerators, the use of offchip, look-aside accelerators will tend to increase CPU utilization, consume additional I/O bandwidth and introduce additional latency. This tends to make the use of offchip cards problematic for the effective acceleration of bulk ciphers, especially for small or moderately sized packets. While recent announcements of 'high-performance' offchip accelerators, using HT or FSB connectivity, may help reduce some of these issues, repeatedly ping-ponging the data off and on chip is inevitably less efficient than using onchip accelerators that are tightly coupled with the processor cores.


Finally, while in-line (bump-in-the-wire) offchip accelerators can overcome some of these issues, there are other associated issues with this approach – a subject for another day.

Hopefully, I will have actual numbers to illustrate these differences in more detail shortly.

[N.B. FIPS140-2 requirements etc. may also dictate choices]

Monday Sep 17, 2007

Good article on the UltraSPARC T2 processor located here.

Monday Aug 20, 2007

While parallelism normally focusses on large trip count loops for its performance gains, there are situations where this restricted focus can be problematic:

Small tasks: for applications that are primarily composed of short duration tasks, interposed by synchronization points (e.g., multi-phase operations), parallelization may be impossible (or very challenging).

Single-threaded component: even when an application's hot loops have been successfully threaded, any remaining single-threaded components can rapidly become a bottleneck as the application is scaled to larger numbers of threads (courtesy of Amdahl's law). For example, if 10% of an application is single-threaded, the performance gain obtained from using eight threads is limited to 4.7X. If an additional 5% of the application can be threaded, the performance gain is increased to almost 6X, clearly illustrating the importance of achieving as close to complete parallelization as possible.

Critical threads: the scalability of MT applications can be limited by the performance of certain critical threads. If the work undertaken by these critical threads cannot be further subdivided, performance may stop scaling when the critical threads become 100% busy.

Critical sections: given that only one thread can occupy a critical section at once, if threads spend too long in a critical section, application scalability can suffer, as other threads stall waiting for access to the same section. While the time in the critical section may only account for a small portion of the total processing undertaken by a thread, minimizing the time spent in the critical section may result in significant performance benefits.
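
The Amdahl's law figures quoted above can be checked with a short worked calculation. The symbols s (single-threaded fraction) and N (number of threads) are introduced here for illustration only, and the formula is the standard textbook form, ignoring any threading overheads:

```latex
% Amdahl's law: speedup with serial fraction s on N threads
S(N) = \frac{1}{s + \dfrac{1 - s}{N}}

% 10% single-threaded, 8 threads
S(8)\big|_{s=0.10} = \frac{1}{0.10 + 0.90/8} = \frac{1}{0.2125} \approx 4.7

% 5% single-threaded, 8 threads
S(8)\big|_{s=0.05} = \frac{1}{0.05 + 0.95/8} = \frac{1}{0.16875} \approx 5.9
```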

In the above examples, the performance of MT applications is adversely impacted by small single-threaded sections. These single-threaded sections can be broadly divided into two categories: those which are intrinsically serial in nature, and those which are amenable to parallel processing if the associated threading overheads can be reduced.

While the serial segments remain problematic on CMP systems, the low inter-thread communication overheads (resulting from the shared L2$) will allow even very small tasks to be cost-effectively threaded. In essence, CMP systems allow 'micro-parallelization'. The potential benefits of leveraging micro-parallelization are widespread and can be used to address all of the performance problems discussed in the previous paragraphs.

To illustrate the benefits of micro-parallelization, consider Fig. (i). In this example, a simple copy operation (such as that performed by bcopy or memcpy) is divided amongst multiple worker threads using lock-free synchronization. Each worker thread performs a portion of the copy and, as the number of threads is increased, the work undertaken per thread decreases proportionally. Fig. (ii) illustrates performance as the number of threads is increased, for three different copy sizes. For the Niagara CMP system, performance scales almost linearly with the number of threads, even when each thread is only copying 64 elements. In contrast, for the traditional SMP system, while acceptable scaling is obtained for the 8192-element copy, synchronization overheads are significant and performance regressions are observed both at increased levels of threading and for the smaller copies: the coherency overheads associated with synchronizing the master and worker threads quickly outweigh the performance benefits associated with the additional worker threads.
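
Since the figure is not reproduced here, the following is a minimal sketch of how such a copy could be split across worker threads with lock-free (spin-on-flag) synchronization. It is not the code from the paper: the names `micro_copy`, `micro_copy_start`, `micro_copy_stop`, the `job` descriptor and the fixed `NWORKERS` count are all assumptions made for illustration, and it uses C11 atomics with POSIX threads.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NWORKERS 4  /* hypothetical worker count */

/* Job descriptor shared by the master and the workers (illustrative). */
static struct {
    const char *src;
    char *dst;
    size_t len;
    atomic_uint generation;  /* bumped by the master to publish a job */
    atomic_uint done;        /* incremented by each worker when finished */
    atomic_bool shutdown;    /* set by the master to stop the workers */
} job;

static pthread_t workers[NWORKERS];

static void *worker(void *arg)
{
    size_t id = (size_t)(uintptr_t)arg;
    unsigned seen = 0;

    for (;;) {
        /* Spin (no locks) until a new job generation is published. */
        unsigned gen;
        while ((gen = atomic_load_explicit(&job.generation,
                                           memory_order_acquire)) == seen) {
            if (atomic_load_explicit(&job.shutdown, memory_order_acquire))
                return NULL;
        }
        seen = gen;

        /* Each worker copies its own contiguous slice of the buffer. */
        size_t chunk = (job.len + NWORKERS - 1) / NWORKERS;
        size_t begin = id * chunk;
        if (begin < job.len) {
            size_t n = job.len - begin < chunk ? job.len - begin : chunk;
            memcpy(job.dst + begin, job.src + begin, n);
        }
        atomic_fetch_add_explicit(&job.done, 1, memory_order_release);
    }
}

/* Create the persistent worker threads once, up front. */
void micro_copy_start(void)
{
    for (size_t i = 0; i < NWORKERS; i++)
        pthread_create(&workers[i], NULL, worker, (void *)(uintptr_t)i);
}

/* Master side: publish the job, then spin until every worker is done. */
void micro_copy(char *dst, const char *src, size_t len)
{
    job.src = src;
    job.dst = dst;
    job.len = len;
    atomic_store_explicit(&job.done, 0, memory_order_relaxed);
    atomic_fetch_add_explicit(&job.generation, 1, memory_order_release);
    while (atomic_load_explicit(&job.done, memory_order_acquire) != NWORKERS)
        ;  /* busy-wait; cheap when master and workers share an L2 cache */
}

/* Ask the workers to exit and wait for them. */
void micro_copy_stop(void)
{
    atomic_store_explicit(&job.shutdown, true, memory_order_release);
    for (size_t i = 0; i < NWORKERS; i++)
        pthread_join(workers[i], NULL);
}
```

The workers are kept alive and spin on a flag rather than being created per copy, because thread creation would dwarf the cost of a small copy. The handoff through `generation` and `done` is exactly the master/worker synchronization traffic whose cost differs so sharply between a shared-L2 CMP and a traditional SMP.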


From Fig. (ii), it is apparent that with the advent of CMP systems, many small, short duration tasks, which we traditionally viewed as single-threaded (e.g., small memcpy operations), can now be cost-effectively threaded. While the indiscriminate threading of these operations in MT applications is not beneficial, these micro-parallelization techniques can be used to alleviate scaling bottlenecks by improving the performance of these problem components.






Going forward, we need to start to leverage these 'micro-parallelization' techniques more aggressively and exploit the full potential of CMTs.

[Abstracted from the IJPP publication -- look here for more details]

Thursday Aug 16, 2007

While the vast majority of key commercial applications (including databases, webservers and application servers) have been carefully optimized over the years to ensure scalability on traditional MP systems, this can be a time consuming and costly process. Poor scalability is frequently observed due to software design issues, with problems such as hot locks and data sharing (both real and false) being common culprits. Traditionally, dealing with these problems has required detailed knowledge of both the application and the target system.


On CMP systems such as UltraSPARC T1 (Niagara 1), because threads share a common L2 cache, these problems have a much smaller impact on scalability. For instance, consider the code fragment in Fig. (i). In this example, each thread processes a separate array, accumulates a local total and then updates the global accumulation total. To ensure that multiple threads cannot update the global total in parallel, the update is protected by a mutual exclusion lock.

Figure (ii) illustrates aggregate throughput as the number of threads is increased and presents results for two systems: an 8-core UltraSPARC T1 CMP system and a traditional 8-processor UltraSPARC SMP system. As expected, performance on the traditional SMP system scales poorly as the number of threads is increased, due to the overheads associated with continually migrating the lock between processors; this sharing of the hot lock across multiple processors is clearly problematic. In contrast, on the UltraSPARC T1 CMP system, throughput scales almost linearly as the number of worker threads is increased, which is to be expected as the lock is retained in the T1's shared L2 cache for the duration of processing. While the example code is very simple (and the array size small), it is apparent that these problems can still have a noticeable impact on application scaling for more complex codes.
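
The figure with the code fragment is not reproduced here, so the following is a rough sketch of the pattern described: per-thread local accumulation followed by a mutex-protected update of a shared total. It is not the original benchmark; the names `accumulate`, `run_accumulation`, `total_lock` and the sizes `NTHREADS` and `ARRAY_LEN` are assumptions chosen for illustration.

```c
#include <pthread.h>
#include <stddef.h>

#define NTHREADS  8     /* hypothetical thread count */
#define ARRAY_LEN 1024  /* hypothetical (small) per-thread array size */

static int arrays[NTHREADS][ARRAY_LEN];  /* one private array per thread */
static long global_total;                /* shared accumulation total */
static pthread_mutex_t total_lock = PTHREAD_MUTEX_INITIALIZER;

static void *accumulate(void *arg)
{
    const int *a = arg;
    long local = 0;

    /* Sum the thread's own array without touching shared state. */
    for (size_t i = 0; i < ARRAY_LEN; i++)
        local += a[i];

    /* The hot lock: every thread must acquire it to update the total. */
    pthread_mutex_lock(&total_lock);
    global_total += local;
    pthread_mutex_unlock(&total_lock);
    return NULL;
}

/* Run one accumulation pass across all threads and return the total. */
long run_accumulation(void)
{
    pthread_t t[NTHREADS];

    global_total = 0;
    for (size_t i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, accumulate, arrays[i]);
    for (size_t i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return global_total;
}
```

With small arrays, the time between lock acquisitions is short, so the cache line holding `total_lock` and `global_total` moves between threads frequently. On an SMP that line migrates between processor caches, whereas on a CMP it can stay in the shared L2.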


It is apparent that the impact of common scalability problems can be much less pronounced on CMP systems, improving application scalability and significantly simplifying MT application development.

[Abstracted from the IJPP paper]

Friday Aug 10, 2007

Recent publication by James Laudon and myself discussing the CMP space:


Abstract

The performance of microprocessors has increased exponentially for over 35 years. However, process technology challenges, chip power constraints, and difficulty in extracting instruction-level parallelism are conspiring to limit the performance of future individual processors. To address these limits, the computer industry has embraced chip multiprocessing (CMP), predominately in the form of multiple high-performance superscalar processors on the same die. We explore the trade-off between building CMPs from a few high-performance cores or building CMPs from a large number of lower-performance cores and argue that CMPs built from a larger number of lower-performance cores can provide better performance and performance/Watt on many commercial workloads. We examine two multi-threaded CMPs built using a large number of processor cores: Sun’s Niagara and Niagara 2 processors. We also explore the programming issues for CMPs with large number of threads. The programming model for these CMPs is similar to the widely used programming model for symmetric multiprocessors (SMPs), but the greatly reduced costs associated with communication of data through the on-chip shared secondary cache allows for more fine-grain parallelism to be effectively exploited by the CMP. Finally, we present performance comparisons between Sun’s Niagara and more conventional dual-core processors built from large superscalar processor cores. For several key server workloads, Niagara shows significant performance and even more significant performance/Watt advantages over the CMPs built from traditional superscalar processors.


International Journal of Parallel Programming, Volume 35, Number 3 / June, 2007


Link to article

Tuesday Dec 06, 2005


It's amazing to see the pace at which Chip Multithreaded (CMT) processors are evolving. It hasn't been that long since we were all obsessed with high frequency, super-complex OOO processors. Diminishing returns on performance and outlandish power requirements soon put an end to a number of these chip projects and hastened the industry-wide move toward the CMT design-point.


In the last few years alone we have seen CMT processors evolve through two generations, starting with the practice of just putting two uniprocessors on the same die (nothing being shared between the two cores but the offchip resources), and more recently moving to a more integrated design point, where the cores share an onchip level-2 cache (there are a number of obvious reasons why this sharing could be beneficial).


With the release of the UltraSPARC T1 (code-named Niagara), the next-generation of CMT processors is starting to arrive. Rather than just reusing uniprocessor designs, we are seeing the design of the processors tailored to a CMT design point. In the case of the UltraSPARC T1, this design point is commercial server workloads, such as databases, web servers, and application servers.


Server workloads are broadly characterized by high levels of thread-level parallelism (TLP), low instruction-level parallelism (ILP) and large working sets. The potential for further improvements in overall single-thread CPI is limited, but significant performance gains can be observed by leveraging the available TLP -- providing support for many simultaneous hardware threads of execution via a combination of support for multiple cores (Chip Multiprocessors (CMP)) and Multi-Threading (MT).


Sun's UltraSPARC T1 processor provides support for 32 hardware threads using eight 4-way vertically threaded cores. In comparison to other server processors, each of these hardware threads is fairly modest (lower frequency, smaller issue-width etc.). However, the aggregate performance of the 32 hardware threads that comprise the UltraSPARC T1 is significant, often providing several fold the performance of existing dual-core designs. And, given the almost cubic dependence between core frequency and power consumption, it does so at a fraction of the power of other solutions!
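
The "almost cubic" relationship can be motivated with the usual first-order CMOS dynamic power approximation. This is a textbook sketch rather than a derivation from the T1 design data; the symbols (activity factor \alpha, switched capacitance C, supply voltage V, frequency f) are introduced here for illustration, and leakage power is ignored:

```latex
% Dynamic switching power of a CMOS core
P_{\text{dyn}} \approx \alpha \, C \, V^{2} f

% If the supply voltage is scaled roughly in proportion to frequency
V \propto f \quad\Longrightarrow\quad P_{\text{dyn}} \propto f^{3}

% Example: a core clocked at half the frequency (and voltage)
\frac{P_{\text{dyn}}(f/2)}{P_{\text{dyn}}(f)} \approx \left(\tfrac{1}{2}\right)^{3} = \tfrac{1}{8}
```

Under these assumptions, several lower-frequency cores can fit in the dynamic power budget of one high-frequency core, which is the trade-off the paragraph above relies on.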


In Sun's Advanced Processor Architecture group (APA), we have been focusing on next-generation CMT processors for some time and talk more about some of the opportunities and challenges associated with this design trend in a recent publication at the International Symposium on High Performance Computer Architecture (HPCA'05), which can be found here along with the slide set for the presentation.


Another topic we have been investigating is how well the server CMT design point fits with other classes of application. The results have been encouraging, with CMT server processors delivering great performance at a fraction of the power associated with more traditional processors.


One interesting application space is that of Bioinformatics. In this space, significant effort is expended comparing DNA, RNA or protein queries against large (multi-GB) databases of sequences. A variety of different applications have been developed to identify similarities between the query and the sequences in the database. Probably the best known such application is BLAST.


These databases are composed of literally millions of different sequences, so there is an abundance of available parallelism. Most of these applications, including BLAST, have been coded as multithreaded applications and have been widely demonstrated to scale well.


We have been experimenting with T2000 systems running both multithreaded and single-threaded BLAST configurations and have found that performance scales almost perfectly with the number of cores utilized, i.e., the performance observed with 8 cores (32 threads) is almost 8X the performance observed using 1 core (4 threads).



Looks like T2000 could be a nice fit in the Bioinformatics space. Stay tuned....



This blog copyright 2008 by sprack