Monday May 12, 2008
A common observation regarding Niagara-based servers is that system maintenance or database administration tasks can run slower than on previous generations of Sun servers. While single-threaded performance may be lower, these maintenance tasks can often be parallelized, especially with a database engine as mature as Oracle.
Take, for instance, the task of gathering schema statistics. Oracle offers many options for gathering schema statistics, and there are a few ways to reduce the overall gathering time:
- Increased Parallelism
- Reduced Sample Size
- Concurrency
Oracle has published many articles on Metalink that discuss sample size and its various virtues. There have also been many volumes written on optimizing the Oracle cost-based optimizer (CBO).
Jonathan Lewis, a member of the famous OakTable network, has written books and multiple white papers on the topic. You can read these for insight into the Oracle CBO. While a reasonable sample size or the use of "DBMS_STATS.AUTO_SAMPLE_SIZE" can significantly reduce the time needed to gather statistics, I will leave it up to you to choose the sample size that produces the best plans.
Results
The following graph shows the total run time in seconds of a "GATHER_SCHEMA_STATS" operation at various levels of parallelism and sample sizes on a simple 130GB schema. All tests were run on a Maramba T5240 with a 6140 array and two channels.
Note that if higher levels of sampling are required, parallelism can help to significantly reduce the overall runtime of the GATHER_SCHEMA_STATS operation. Of course a smaller sample size can be employed as well.
GATHER_SCHEMA_STATS options
SQL> connect / as sysdba
-- Example with a 10 percent sample and parallel degree 32
-- (the trailing "-" is the SQL*Plus line-continuation character)
SQL> EXECUTE SYS.DBMS_STATS.GATHER_SCHEMA_STATS (OWNNAME=>'GLENNF', -
ESTIMATE_PERCENT=>10, -
DEGREE=>32, -
CASCADE=>TRUE);
-- Example with AUTO_SAMPLE_SIZE and parallel degree 32
--
SQL> EXECUTE SYS.DBMS_STATS.GATHER_SCHEMA_STATS (OWNNAME=>'GLENNF', -
ESTIMATE_PERCENT=>DBMS_STATS.AUTO_SAMPLE_SIZE, -
DEGREE=>32, -
CASCADE=>TRUE);
Note that you must have "parallel_max_servers" set to at least the level of parallelism desired for the GATHER_SCHEMA_STATS operation. I typically set it higher to allow for other parallel operations to get servers.
SQL> alter system set parallel_max_servers = 128;
Finally, you can easily collect statistics on multiple schemas concurrently and in parallel by issuing GATHER_SCHEMA_STATS from multiple sessions and ensuring that "parallel_max_servers" is set high enough to accommodate all of them.
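As a rough sizing aid for the concurrent case, the arithmetic can be sketched in a few lines of Python. The helper name and the "headroom" parameter are illustrative assumptions, not Oracle settings. Oracle can allocate more servers than the requested degree for some operations, so treat the result as a floor rather than a recommendation.

```python
def min_parallel_max_servers(sessions: int, degree: int, headroom: int = 0) -> int:
    """Lower bound for parallel_max_servers when `sessions` concurrent
    GATHER_SCHEMA_STATS calls each request `degree` parallel servers.

    `headroom` is a hypothetical allowance for other parallel work.
    """
    if sessions < 1 or degree < 1 or headroom < 0:
        raise ValueError("sessions and degree must be >= 1, headroom >= 0")
    return sessions * degree + headroom


if __name__ == "__main__":
    # Four schemas gathered concurrently, each at DEGREE=>32.
    print(min_parallel_max_servers(sessions=4, degree=32))  # 128
```

With four concurrent sessions at degree 32, the floor works out to the 128 used in the "alter system" example above.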
Configuration
Monday Apr 21, 2008
Since we just recently announced multi-chip CMT servers that provide up to 128 threads in a 1U or 2U box, it seems fitting to pick up this thread on throughput computing.
The key to fully appreciating the CMT architecture with Oracle is to exploit the available threads. As I discussed earlier in the "throughput computing series", this can be done through "concurrency", "parallelism", or both. Oracle, being the mature product that it is, can achieve high levels of parallelism as well as concurrency.
Concurrent processing with Oracle
For examples of concurrent processing with Oracle, look at recent results on the Oracle E-Business payroll benchmark. These show that by using concurrent processes to break up the batch, you can increase batch throughput. By going from 4 to 64 processes, batch time decreased from 31.53 minutes to 4.63 minutes and throughput increased by 6.8x!
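The 6.8x figure follows directly from the two batch times. A minimal Python sketch of the arithmetic is below. The function names are illustrative, and the "scaling efficiency" measure (speedup divided by the increase in process count) is an added convenience metric, not something reported by the benchmark.

```python
def speedup(time_before: float, time_after: float) -> float:
    """How many times faster the job finished."""
    return time_before / time_after


def scaling_efficiency(time_before: float, time_after: float,
                       procs_before: int, procs_after: int) -> float:
    """Speedup achieved per unit of added concurrency (1.0 = perfect scaling)."""
    return speedup(time_before, time_after) / (procs_after / procs_before)


if __name__ == "__main__":
    s = speedup(31.53, 4.63)
    e = scaling_efficiency(31.53, 4.63, 4, 64)
    print(f"speedup: {s:.2f}x")     # speedup: 6.81x
    print(f"efficiency: {e:.1%}")   # efficiency: 42.6%
```

Sixteen times as many processes bought a 6.8x gain: short of linear scaling, but still a large reduction in batch time.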
With Oracle's E-Business Suite, you can increase the number of "concurrent manager" processes to more fully utilize the available threads on the system. Each ISV has different ways of controlling batch distribution and concurrency, so you will have to check with your software vendors to find all the ways to increase concurrency.
Parallelism in Oracle
People often associate parallelism in Oracle with parallel query. In most cases where CMT is involved, I see a lack of understanding of how to achieve parallelism with more basic administrative functions. Oracle excels in providing parallelism for important administrative tasks.
- Schema analyze
- Index build/rebuild
- Parallel loader
- Parallel export/import with Data Pump
While parallelism exists for these administrative tasks, some configuration is required. I will examine the various ways to achieve optimal throughput on these tasks with CMT-based systems.
Wednesday Jan 16, 2008
Most environments have some open source software that is used as part of the application stack. Depending on the packages, this can take a fair amount of time to configure and compile. To speed the install process, parallelism can easily be used to take advantage of the throughput of CMT servers.
Let us consider the following five open source packages:
- httpd-2.2.6
- mysql-5.1.22-rc
- perl-5.10.0
- postgresql-8.2.4
- ruby-1.8.6
The following experiments time the installation of these packages in serial, parallel, and concurrent fashion.
Parallel builds
After the "configure" phase is complete, these packages are all compiled using gmake. This is where parallelism within each job can be used to speed the install process. By using the "gmake -j" option, the level of parallelism can be specified for each of the packages. This can dramatically improve the overall compile time as seen below.
- Jobs were run in a serial fashion but with parallelism within the job itself.
- 79% reduction in compile time at 32 threads/job.
Concurrency and Parallelism
The build processes for the various packages cannot all be parallelized perfectly. In fact, the best overall gain for any of the packages is 6x out of 32. This is where concurrency comes into play. If we start all the compiles at the same time and use parallelism as well, the overall build time drops further.
- All 5 jobs were run concurrently with 1 and 32 (threads/job).
- 88% overall reduction in compile time from serial to parallel with concurrency.
- 42% reduction in compile time over parallel jobs run serially.
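The percentage reductions quoted above can be converted into speedup factors and cross-checked against each other. The sketch below uses only the rounded percentages from the bullets (79% and 88%), so its results are approximate. The function names are illustrative.

```python
def speedup_from_reduction(reduction: float) -> float:
    """Convert a fractional time reduction (0.79 = 79%) into a speedup factor."""
    return 1.0 / (1.0 - reduction)


def relative_reduction(reduction_a: float, reduction_b: float) -> float:
    """Further reduction gained by moving from case A to case B,
    where both are expressed as reductions against the same baseline."""
    return 1.0 - (1.0 - reduction_b) / (1.0 - reduction_a)


if __name__ == "__main__":
    parallel_only = 0.79        # jobs run serially, 32 threads/job
    parallel_concurrent = 0.88  # all 5 jobs at once, 32 threads/job
    print(f"{speedup_from_reduction(parallel_only):.2f}x")        # 4.76x
    print(f"{speedup_from_reduction(parallel_concurrent):.2f}x")  # 8.33x
    print(f"{relative_reduction(parallel_only, parallel_concurrent):.0%}")  # 43%
```

The derived figure of roughly 43% agrees with the quoted 42% to within the rounding of the inputs.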
Load it up!
Hopefully, this helps to better describe how to achieve better system throughput through parallelism and concurrency. Sun's CMT servers are multi-threaded machines which are capable of a high level of throughput. Whether you are building packages from source or installing pre-built packages, you have to load up the machine to see throughput.
Monday Jan 14, 2008
In this installment of the throughput computing series, I will explore how to get parallelism from the system point of view. The system administrator who first begins to configure the system will start forming impressions from the moment the shrink wrap comes off the server. First impressions and potential parallel options will be explored in this entry.
Off with the shrink-wrap... on with the install
Unfortunately, most installation processes involve a fair number of single-threaded procedures. As mentioned before, the CMT processor is designed to optimize the overall throughput of a server - often to the detriment of single-threaded processes. There are several schools of thought on this one. The first is: why bother? The install process happens but once, and it really doesn't matter. That is true for most typical environments. But the current trend toward grid computing and virtualization often makes "time to provision" a critical factor. To help speed provisioning, there are some things that can be done by using parallelized commands and concurrency.
pbzip2 to the rescue
A very common time-consuming part of provisioning is the packing/unpacking of software packages. Commonly, gzip or bzip2 is used to unpack data and packages, but neither is a parallel program. Fortunately, a parallel version of bzip2 has been made available. "pbzip2" allows you to specify the level of parallelism in order to speed the compression/decompression process.
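To illustrate the idea behind a parallel compressor, here is a minimal Python sketch that splits the input into blocks, compresses the blocks with a pool of workers, and concatenates the resulting bzip2 streams. pbzip2 takes broadly this block-based approach, but this is not its implementation. The function name, worker count, and block size are all illustrative assumptions.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor


def parallel_bz2_compress(data: bytes, workers: int = 4,
                          chunk_size: int = 900_000) -> bytes:
    """Compress `data` as a sequence of independent bzip2 streams.

    Each chunk is compressed separately so the work can be spread
    across `workers` threads; the streams are joined in input order.
    """
    # Always produce at least one (possibly empty) chunk.
    chunks = [data[i:i + chunk_size]
              for i in range(0, len(data), chunk_size)] or [b""]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the order of the input chunks.
        compressed = list(pool.map(bz2.compress, chunks))
    return b"".join(compressed)


if __name__ == "__main__":
    sample = b"trade history line\n" * 200_000
    packed = parallel_bz2_compress(sample, workers=8)
    # bz2.decompress accepts concatenated streams (Python 3.3+).
    assert bz2.decompress(packed) == sample
    print(len(sample), "->", len(packed), "bytes")
```

Because each block is an independent stream, the compression ratio can be slightly worse than a single-stream run. That is the price paid for being able to keep many threads busy.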
I spent a little time experimenting with the pbzip2 program after repeated interactions that always seemed to come back to "gzip" performance. I decided to do some quick benchmarks with pbzip2 using both the T2000 (8 cores @ 1.4GHz) and the v20z (2 AMD cores @ 2.2GHz).
pbzip2 benchmark
The setup used a 135MB text file. This file was the trade_history.txt created using the egen program distributed by the TPC council for the TPC-E benchmark. This file was compressed using the following simple test script:
At lower thread counts, the v20z with two AMD cores does better. This is expected since the AMD x64 processor is optimized for single-threaded performance. But as you crank up the thread count, the T2000 really starts to shine. This demonstrates my main point: to push massive throughput within a single application, you need lots of threads and parallelism.
...The next entry will explore how concurrency and parallelism can help improve build times.
Tuesday Jan 08, 2008
This is the first installment in a series of entries that discuss different aspects of throughput computing. This series aims to improve the understanding of how SPARC CMT servers can be utilized to increase business throughput. Let's start with a definition.
What is throughput computing?
Oxford American defines "throughput" as:
"The amount of materials or items passing through a system or process"
In computer terms, throughput is the amount of "work" that can be done in a given period of time. Things like "orders per second", "paychecks per hour", "queries per second", and "webpages per minute" are all metrics of throughput. These measures help define the amount of work a system can complete in a given period of time.
Misguided throughput metrics
- Latency or Response time is not a throughput metric.
- CPU % is not a throughput metric.
- IO wait% is definitely not a throughput metric... or anything other than a measure of idle time
- The "Load average" of a system is not a measure of throughput.
OK... you get the idea.
Job level parallelism
Job level parallelism is about taking a single job and breaking it into multiple pieces. Say you have 10,000 letters to put stamps on. If it takes 3 seconds per letter, you would need 30,000 seconds or more than 8 hours to complete the task. Now consider you are a teacher and you bring the letters to class. There are 20 students in the class so each student will place stamps on 500 letters. With only 500 letters to complete per student, the job can be done in only 1500 seconds or 25 minutes.
In terms of throughput, one person processes one letter every 3 seconds or 60/3 = 20 letters per minute... and a class of 20 students can process 20*20 = 400 letters per minute.
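The stamping arithmetic generalizes to any job that divides evenly among workers. A small Python sketch, with illustrative function names, reproduces the numbers above:

```python
import math


def job_seconds(items: int, seconds_per_item: float, workers: int = 1) -> float:
    """Elapsed time when `items` are split as evenly as possible among
    `workers`, each handling its share one item at a time."""
    per_worker = math.ceil(items / workers)
    return per_worker * seconds_per_item


def items_per_minute(seconds_per_item: float, workers: int = 1) -> float:
    """Aggregate throughput of `workers` working side by side."""
    return workers * 60 / seconds_per_item


if __name__ == "__main__":
    print(job_seconds(10_000, 3))              # 30000 seconds, alone
    print(job_seconds(10_000, 3, workers=20))  # 1500 seconds, whole class
    print(items_per_minute(3))                 # 20.0 letters per minute
    print(items_per_minute(3, workers=20))     # 400.0 letters per minute
```

Note that each student is no faster than before. The time per letter is unchanged, and only the number of letters in flight at once has grown.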
Concurrency
Concurrency comes from running multiple jobs or applications together on a system. A job may be single-threaded or use multiple threads of execution as discussed above. These jobs need not be related or even from the same application. To further increase concurrency, virtualization is often used to run multiple concurrent OS images on the same machine in order to take advantage of modern multi-threaded systems.
Putting it all together with Chip Multi-Threading
Denis Sheahan sums up Sun's throughput computing initiative in his paper on CMT as:
"Sun’s Throughput Computing initiative represents a new paradigm in system design, with a focus toward maximizing the overall throughput of key commercial workloads rather than the speed of a single thread of execution. Chip multi-threading (CMT) processor technology is key to this approach, providing a new thread-rich environment that drives application throughput and processor resource utilization while effectively masking memory access latencies."
The salient point is that you must have an application that has multiple threads of execution in order to take advantage of CMT. Multiple threads of execution could come from a single job that has been parallelized or from multiple jobs of different types that run concurrently.
Resources
Monday Jan 07, 2008
I was thinking about the development of a CMT throughput benchmark, but it occurred to me that there are many *good* examples of throughput already out with the benchmarks we publish... just look at the bmseer postings on the recent T2 results and the long line of performance records on the T2000.
The biggest disconnect with CMT servers is a misunderstanding of throughput and multi-threaded applications. I made a posting last year which touched on some initial impressions, but I thought it would be a good idea to dig in further.
This entry is to kick off a series of postings that explore different aspects of throughput computing in a CMT environment. The rough outline is as follows:
Overview
- Definition of Throughput computing, multi-threading, and concurrency.
Explore system parallelism
- Unix commands and parallel options
- Concurrent builds/compiles.
- Configuring the system for parallelism
Configuring applications for parallelism
- Concurrency vs multi-threading
- Single-Threaded jobs
Database parallelism with Oracle
- Parallel loader and datapump
- Index build parallelism
- Concurrent processing in Oracle
- Configuring Oracle for CMT servers