Sun's CMT servers (currently the T5220/T5240 and the T6320/T6340 blade server modules) continue to gain popularity, and most commercial applications can perform very well on these platforms.
I have been helping quite a few software vendors tune application performance on CMT platforms, most of them processing 'request/response' or event loads, and have observed some recurring patterns:

  1. When first deployed on CMT platforms, some applications initially achieve lower than expected throughput.
  2. CPU utilization is initially exceptionally low.
  3. Increasing application parallelism, in many cases by configuring more threads per process or by adding more processes to distribute the load across, increases CPU utilization and, as a result, throughput, sometimes almost linearly with the growth in CPU utilization.
  4. Throughput is now higher than initially expected...
Before considering any other kind of tuning, concentrate first on increasing application parallelism. If you are facing low throughput and low CPU utilization, you most probably do not have enough threads that are able to run simultaneously. Having lots of threads that cannot run simultaneously (because of inter-thread locking or for any other reason) does not help. Use the 'mpstat <interval-in-seconds>' command (mpstat(1M)) to check whether the load is well spread across the virtual CPUs (the hardware threads). If it is not, there is not enough parallelism. Use the 'prstat -mL' command (prstat(1M)) to see what percentage of time your threads are making real use of the CPU, and what percentage they spend blocked or waiting for something.
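To make the mpstat check less of an eyeball exercise, here is a minimal sketch that reads captured mpstat output and lists the virtual CPUs that are mostly idle. The sample values, the function names and the 10% threshold are invented for illustration; only the column header follows what Solaris mpstat prints.

```python
# Column header as printed by Solaris mpstat; the numbers are made up.
SAMPLE = """\
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    3   0   12   410  205  310    4   11    9    0  1520   71  12   0  17
  1    1   0    8   120   10  280    3    9    7    0  1410   68  10   0  22
  2    0   0    2    15    1   20    0    1    0    0    40    1   1   0  98
  3    0   0    1    12    1   18    0    1    0    0    35    0   1   0  99
"""


def busy_by_cpu(mpstat_text):
    """Return {cpu_id: percent busy}, taking busy as 100 - idl.

    If the text holds several reports, later rows for the same CPU
    overwrite earlier ones, so the last report wins.
    """
    busy = {}
    idl_col = None
    for line in mpstat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "CPU":
            # Header row: locate the idle column by name.
            idl_col = fields.index("idl")
            continue
        if idl_col is None or len(fields) <= idl_col:
            continue
        busy[int(fields[0])] = 100 - int(fields[idl_col])
    return busy


def mostly_idle(busy, threshold=10):
    """Return the sorted CPU ids whose busy percentage is below threshold."""
    return sorted(cpu for cpu, pct in busy.items() if pct < threshold)
```

A long list from `mostly_idle` while throughput is below expectations points to the situation described above: the application is not keeping enough threads runnable at the same time.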


Multi-threaded processes, including applications running in Java EE application servers, are in many cases bounded by their in-process scalability. Beyond a certain number of concurrent threads (say, N), the threads start blocking each other on locks for a high percentage of the time, so that no more than N threads can effectively run together. N is application dependent, and there are of course applications with completely decoupled threads and no such scalability boundary. When you do hit such a boundary, and if your application allows it, try adding more instances of the same process (for example, more application server instances) and spread the load between those instances. If more than one instance cannot live on the same machine, try splitting the machine with Solaris Containers or Logical Domains.
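As a rough back-of-the-envelope aid, the sketch below estimates how many instances it would take to keep every hardware thread busy once you have measured N. The function name is hypothetical, and the estimate assumes each instance scales up to N threads independently of the others, which a real deployment may not achieve.

```python
import math


def instances_needed(hardware_threads, per_process_limit):
    """Estimate the instances needed to cover all hardware threads.

    hardware_threads: virtual CPUs on the machine (assumed known,
        for example from counting the rows mpstat prints).
    per_process_limit: N, the number of threads that can effectively
        run together inside one process (measured, application dependent).
    """
    if hardware_threads < 1 or per_process_limit < 1:
        raise ValueError("both arguments must be positive")
    return math.ceil(hardware_threads / per_process_limit)
```

For example, on a machine with 64 hardware threads and an application that stops scaling at about 16 threads per process, `instances_needed(64, 16)` suggests starting with 4 instances and measuring from there.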


If you cannot add more simultaneous threads or processes, but you do have many threads that are heavily locked, try to find the cause of the locking. Use the 'plockstat' command (plockstat(1M)) for this. For multithreaded processes, in most cases you will gain parallelism by using the libumem allocation library, which allows threads to allocate memory simultaneously (set the LD_PRELOAD environment variable to /usr/lib/libumem before starting the application. More details here).
For Java applications, make sure to use a parallel garbage collector and, when applicable, move to a more recent Java version; there is usually an improvement in most aspects of parallelism.
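If the application is started from a launcher script, the LD_PRELOAD setting can be confined to that one process instead of being exported in the shell. The sketch below shows one way to do that; the helper names are hypothetical, and the library path is left as a parameter because the exact libumem file name depends on your system.

```python
import os
import subprocess


def env_with_preload(library_path, base_env=None):
    """Return a copy of the environment with LD_PRELOAD set.

    Note: this replaces any LD_PRELOAD value already present in the
    base environment rather than appending to it.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["LD_PRELOAD"] = library_path
    return env


def start_app(command, library_path):
    """Start command (a list of arguments) with the library preloaded.

    Only the child process sees LD_PRELOAD; the launcher's own
    environment is left untouched.
    """
    return subprocess.Popen(command, env=env_with_preload(library_path))
```

Pass `start_app` the application's command line as a list, together with the path to the libumem shared object on your machine.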


There is much more fine-tuning that can be done, but in most cases reaching a reasonable CPU utilization with the above considerations yields most of the possible performance gain.

    This blog copyright 2009 by Amit Hurvitz