Friday Jun 29, 2007
In my previous post, I pointed out some considerations to deploying the Swingbench Order-Entry benchmark on large systems. The main bottle-neck in this case was the database size. When scaling too small of a database to huge transaction rates, concurrency issues in the data prevent scaling. Luckily, Swingbench has a way to adjust the number of "Users" and "Orders"... or so it would seem.
Adjusting Users and Orders
I used the "oewizard" utility to create the maximum number of customers and orders - 1 million each. This created a database that was about 65GB total. The "oewizard" is a single threaded process and therefore takes a little time... Be patient. After doing my 1st run, I was a little concerned at the difference in performance.
Scale-up differences
In the real-world as database size grows, often transactions bloat. This is often noticed by enterprising DBAs and performance analysts. Eventually, this will lead to a re-coding of SQL or some changes in the transaction logic. So as a real-world database scales-up it will go through a series of bloating and fixing.
When designing a benchmark to show scale-up and make comparisons of systems at various database sizes, it is desirable to ensure transactions are presented with a similar load. If this is not the case, it should be noted and comparisons should NOT be made across database sizes. The "Order Products", "New Registration", and "Browse Order" transactions which are part of the SwingBench Order-Entry test, all experience transaction bloat as the database size is increased.
The following response time chart shows the effects of "one" user running on databases of 25,000 and 1,000,000 orders.
The moral-- beware of comparing results of differing database sizes using the Swingbench default Order-Entry kit.
Monday Jun 11, 2007
I applaud tools that aim to make life easier. The cell phone is a wonderful invention that when combined with my palm pilot was wonderful. Now Apple has taken it as step further with the music, movies, internet and birthed the iPhone - nicer still!
Over the past year, I have been seeing more and more IT shops experiment with benchmark tools. One such tool is a kit developed by Dominic Giles of Oracle called Swingbench. Swingbench is a benchmark toolkit that is easy to install and run. Now the DBA can install the benchmark schema and with a few clicks... Wham they are benchmarking! Now comes the hard part - What do these results mean?
After about the 4th call of a customer having performance issues with their application "Swingbench", I was compelled to take a deeper look.
Luckily, all of the performance problems were easily solved by someone who benchmarks for a living. They were typically misconfiguration issues like: filesystem features, lack of io, lack of memory, too small of a dataset, ect... The scary part, these situations all used the supplied "demo" schema's.
By pursuing the Swingbench documentation, I saw that the demo schema's top out at a 100GB database size. This is also alarming. Most IT shops that buy servers or deploy multi-node RAC configurations have more disk than the modern laptop. So you can imagine my surprise when I saw a bake-off of an enterprise class machine that is essentially doing no IO and choking to death on latches... simply the wrong test for the environment.
Event Waits Time (s) Ela Time
-------------------------------------------- ------------ ----------- --------
latch free 4,542,675 1,137,914 79.04
log file sync 242,359 164,671 11.44
buffer busy waits 102,540 61,887 4.30
enqueue 35,142 42,498 2.95
CPU time 25,310 1.76
Benchmarking, is simply not *that* easy. It takes time to scale up a workload that can simulate your environment. No question that Swingbench gives you a nice head start. It allows you to encapsulate your transactions, run simple regression tests, but you have to take the time to customize the kit to include your data and transactions. The demo schema's are simply a starting point.
Wednesday Nov 15, 2006
Using the filesystem page cache in combination with the Oracle buffer cache for database files was commonplace before 64-bit databases were prevalent - machines had a lot of memory and databases could not use more than 4GB. Now after many years of 64-bit databases, there are still a fair number of systems that still use buffered IO via the filesystem or cached QIO. While buffered IO used to provide benefit, it can cause substandard performance and impede scaling on modern large-scale systems. Buffered file system issues include:
- Single-writer lock
- Memory fragmentation: 8k blocks instead of 4M or 32M ISM.
- Double buffering doubles memory bandwidth requirements!
- Segmap thrashing... lots of xcalls!
2x throughput increase with Oracle Caching vs OS buffered IO
A quick experiment was conducted on Oracle 10gR2 on a large memory machine (192GB).
- 1st test: DB cache was set to 1GB and the database was mounted on a buffered file system.
- 2nd test: DB cache was set to 50GB and the database was mounted direct - NOT buffered.
A 46GB table was populated, indexed, and then queried by 100 processes each requesting a range of data. A single row was retrieved at a time to simulate what would happen in an OLTP environment. The data was cached so that no IO occurred during any of the runs. When the dust had settled, the Oracle buffer cache provided a
2X speedup over buffered file systems. There was also a dramatic decrease in getmaps, xcalls, and system CPU time. The table below shows the results.
|
Cache
|
OS
|
Rows/sec
|
getmaps/sec
|
xcalls/sec
|
Usr
|
sys
|
|
FS
|
S9
|
287,114
|
86,516
|
2,600,000
|
71
|
28
|
|
DB
|
S9
|
695,700
|
296
|
3,254
|
94
|
5
|
|
FS
|
S10
|
334,966
|
106,719
|
1,003
|
78
|
21
|
Notice that cross calls for the Solaris 10 with FS cache have been nearly eliminated while the getmaps have increased in proportional to throughput. This is due to the elimination of xcalls associated with the getmap operation. That said, the mild improvement in throughput with S10 on filesystems, it is nothing like the 2x improvement achieved by avoiding buffered IO altogether.
Recognizing buffered IO impact
A higher amount of system CPU time can be observed at the high-level. It is not uncommon to see a usr/sys ratio of 1 or less on systems where buffered IO is in use. This is due to the high number of getmap reclaims and cross-calls (xcal). You can observe cross-calls with
mpstat(1) command. Segmap activity can be best observed using segmapstat utility which is part of the
cachekit utilities. The segmapstat utility polls "kstats" to retrieve hit/miss data in an easy to read format. If you are using Solaris 10, the impact due to cross-calls is less, but segmap activity is still visible.
Finally, it would be nice to be able to see the amount of data in the page cache. If you are on Solaris 8, you will need to install the memtool 8 written by Richard McDougal. If you are on Solaris 9 or greater, you can use mdb(1) with the ::memstat command. Beware, this command will take a long time to run and may affect performance, therefore it is best to run this when the system is not busy.
# mdb -k Loading modules: [ unix krtld genunix ip usba wrsm random
ipc nfs ptm cpc ]
> ::memstat
Page Summary Pages MB %Tot
------------ ---------------- ---------------- ---
Kernel 430030 3359 2%
Anon 805572 6293 3%
Exec and libs 9429 73 0%
Page cache 14974588 116988 52%
Free (cachelist) 2547680 19903 9%
Free (freelist) 9853807 76982 34%
Total 28621106 223602
How do you avoid using buffered IO?
The easiest way to avoid using the OS page cache is to simply use RAW partitions. This is commonly done in combination with SVM or VxVM. More recently, Oracle introduced their own volume manager (ASM) which makes use of async IO and eases the administration of Oracle databases. That said, databases on RAW partitions are not for everyone. Often users perfer to use standard OS tools to view and manpulate database files in filesystems.
Most filesystems have ways of bypassing the OS page cache for Oracle datafiles. UFS, QFS, and VxFS all support mounting filesystems to bypass the OS page cache - the only exeception is ZFS which doesn't allow for direct or async IO. Below, methods for disabling buffered IO with filesystems are discussed.
FILESYSTEMIO_OPTIONS=SETALL (Oracle 9i and greater) init.ora parameter
The first step to avoiding buffered IO is to use the "FILESYSTEMIO_OPTIONS" parameter. When you use the "SETALL" option, this sets all the options for a particular filesystem to enable directio or async IO. Setting the FILSYSTEMIO_OPTIONS to anything other than "SETALL" could reduce performance. Therefore, it is a best practice to set this option.
UFS and directio
With UFS, the only way to bypass the page cache is with directio. If you are using Oracle 9i or greater, then set the
FILESYSTEMIO_OPTIONS=SETALL init.ora parameter. This the preferred way of enabling directio with Oracle. With this method, Oracle uses an api to enable directio when it opens database files. This method allows you to still use buffered IO for operations like backup and archiving. If you are using Oracle 8i, then the only way to enable directio with UFS is via the
forcedirectio mount option.
VxFS with QIO
VxFS has several options for disabling buffered IO. Like UFS, VxFS does support directio but it is not as efficient as Quick IO (QIO) or Oracle Data Management (ODM). With VxFS, async IO is possible with QuickIO or ODM. Data files for use with QIO must be created with a special utility or converted to the QIO format. With QIO you have to be careful that the "cached" QIO option is not enabled. With the cached QIO option, blocks of selected data files will be placed in the OS page cache.
VxFS with ODM
Like QIO, ODM uses async IO. ODM uses an api specified by Oracle to open and manipulate data files. ODM lowers overhead in large systems by sharing file descriptors and eliminating the need for each oracle shadow/server process to open and obtain its own file descriptors.
Convincing Oracle to cache table data
Finally, after all this is done Oracle
still may not properly cache table data. I have seen more than a few persons enable "directio" and increase the SGA only to have response time of their critical queries take longer! If a table is too large or the "cache" attribute is not set, Oracle will not attempt to cache tables when scanning. This is done to avoid flooding the Oracle buffer cache with data that will most likely not be used. Luckily, there is an easy way to correct this behavior by setting the "CACHE" storage parameter on a table.
SQL> alter table BIGACTIVETABLE cache;
Finally, you may need to convince some of the Oracle DBAs of the benefit. DBAs look at Oracle performance data from an Oracle centric point of view. When data such as Oracle's statspack is analyzed, some pretty awsome response times can be seen. Wait events for IO such as "db file sequential read" and "db file scattered reads" can show response times of less than 1ms when reading from the OS page cache. Often when looking at such data, DBA's are reluctant to give up this response time. This should be viewed as an oppurtunity to further improve performance by placing the data in the Oracle buffer cache and avoiding the reads alltogether.
Summary and references
Hopefully this has given you some background on why unbuffered IO is so critical to obtain optimal performance with Oracle. It is far more efficient to obtain an Oracle blocks from the database buffer cache than to go through the OS page cache layers.
Saturday Nov 11, 2006
I had the pleasure of hearing an old Sequent friend Kevin Closson speak about NAS architecture at a recent OSWOUG meeting. It was an interesting and energetic discussion on the direction of NAS in commodity servers. If you are interested at all in the direction of storage technology for databases, you should check out Kevin's blog and paper on this technology.
Tuesday Oct 31, 2006
In light of the recent announcements by Oracle on the creation of "Larry Linux", I thought it would be prudent to comment about where Solaris fits into this picture. Oracle was hoping to fill a perceived hole in the support of Linux for the Enterprise. Oracle believes they can do this better than Red Hat. While won't comment on whether or not Oracle can support Linux better than Red Hat, I am looking forward to seeing the "tars" from Larry Linux
Anyway, this recent announcement led me to realize that Solaris "now more than ever" is the best choice to run your enterprise. If you want open, Solaris is open. If you want iron-clad Unfakeable, Unshakeable, Unbreakable, ... well Solaris has been running enterprises since before Linux was a twinkle in Torvalds eye. Finally, if it comes down to price - Oracle will not cut you any breaks on Database support... You may have to mortgage your house just to get a years support for Oracle. But if you want to save some pocket change on OS support, Solaris beats Red Hat and Larry Linux in that category as well.
Red Hat: $999
Larry Linux: $399
Solaris: $120
Seems like a pretty easy choice to me...
Monday Oct 30, 2006
You would think that the "DUAL" table, a simple stub table, would not be a performance topic - but I have seen this for years on high-end benchmarks. People develop applications or tests for applications which tend to over-use the DUAL table. Most commonly, this comes in the form of "select abc.nextseq from DUAL" and "select sysdate from DUAL". This is typically, not a problem for small severs with a low level of concurrency, but it can be bottle-neck on high-end severs with lots of processors.
The problem with DUAL (in Oracle 9i and below) is that this "fake table" hashes to a "real" cache line
If over-used it can cause a "cache buffers chains" latch contention like crazy. The most dangerous over-use situations are systemic ones. I can get around these issues in most benchmark environments, but cringe when I see the embedded use DUAL.
In BEA websphere, there is a parameter called "TestConnectionsOnReservere". This parameter sends a SQL statement to the database before *EVERY* user statement.... talk about overhead! This not only adds SQL*Net round trips increasing network use, but most commonly uses the "SQL SELECT 1 from DUAL" as the test statement
What is worse, the overhead just continues to increase as the load is increased. Ken Gottry discusses the performance impact in an article he wrote. This study used a 2-way server to show the performance impact. It is much worse on a high-end server.
What can you do?
Avoid setting the TestConnectionsOnReserve within BEA. The performance cost in terms of potential latch contention and network over-head is too high. If you must use this paramenter, use the "X$DUAL" table instead. Oracle 10g, uses this by default and while it avoids the latching issues, the networking component this parameter is still present.
Wednesday Aug 16, 2006
As part of the Second Edition of the famous
Solaris Internals and the new Solaris Performance and Tools book a performance tuning
Wiki has been created. This site is meant to be a living document where best practices, tuning information, and tips are collected.
I have began contributing Oracle performance information to the Solaris applications specific tuning Wiki. I hope you enjoy this repository of information regarding performance on Sun systems.
Friday Aug 04, 2006
"Why does Oracle call times() so often? Is something broken?
When using truss or dtrace to profile Oracle shadow processes, one often sees a lot of calls to "times". Sysadmins often approach me with this query.
root@catscratchb> truss -cp 7700
^C
syscall seconds calls errors
read .002 120
write .008 210
times .053 10810
semctl .000 17
semop .000 8
semtimedop .000 9
mmap .003 68
munmap .003 5
yield .002 231
pread .150 2002
kaio .003 68
kaio .001 68
-------- ------ ----
sys totals: .230 13616 0
usr time: 1.127
elapsed: 22.810
At first glance it would seem alarming to have so many times() calls, but how much does this really effect performance? This question can best be answered by looking at the overall "elapsed" and "cpu" time. Below is output from the "procsystime" tool included in the
Dtrace toolkit.
root@catscratchb> ./procsystime -Teco -p 7700
Hit Ctrl-C to stop sampling...
^C
Elapsed Times for PID 7700,
SYSCALL TIME (ns)
mmap 17615703
write 21187750
munmap 21671772
times 90733199 <<== Only 0.28% of elapsed time
semsys 188622081
read 226475874
yield 522057977
pread 31204749076
TOTAL: 32293113432
CPU Times for PID 7700,
SYSCALL TIME (ns)
semsys 1346101
yield 3283406
read 7511421
mmap 16701455
write 19616610
munmap 21576890
times 33477300 <<== 10.6% of CPU time for the times syscall
pread 211710238
TOTAL: 315223421
Syscall Counts for PID 7700,
SYSCALL COUNT
munmap 17
semsys 84
read 349
mmap 350
yield 381
write 540
pread 3921
times 24985 <<== 81.6% of syscalls.
TOTAL: 30627
According to the profile above, the times() syscall accounts for only 0.28% of the overall response time. It does use 10.6% of sys CPU. The usr/sys CPU percentages are "83/17" for this application. So, using the 17% for system CPU we can calculate the overall amount of CPU for the times() syscall: 100*(.17*.106)= 1.8%.
Oracle uses the times() syscall to keep track of timed performance statistics. Timed statistics can be enabled/disabled by setting the init.ora parameter "TIMED_STATISTICS=TRUE". In fact, it is an *old* benchmark trick to disable TIMED_STATISTICS after all tuning has been done. This is usually good for another 2% in overall throughput. In a production environment, it is NOT advisable to ever disable TIMED_STATISTICS. These statistics are extremely important to monitor and maintain application performance. I would argue that disabling timed statistics would actually hurt performance in the long run.
Tuesday Jul 11, 2006
There are multiple ways to gather trace data. You can instrument the application, pick an oracle sid from sysdba, turn on tracing for all users (ouch), or use a login trigger to narrow down to a specific user. Each of these methods have merit, but recently I desired to gather traces at various user levels.
The problem with most packaged applications, is that they all use the *same* userid. For this Oracle 10G environment, I used this fact to filter only connections of the type that I wanted to sample. I wanted to gather 10046 event trace data when the number of connections was 10, 20, or 30. To achieve this, I used a logon trigger and sampled the number of sessions from v$session to come up with the connection count. I have found this little trick to be very useful in automating collection without modifying the application. I hope this can be useful to you as well.
create or replace trigger trace_my_user
after logon on database
DECLARE
mycnt int;
BEGIN
SELECT count(*)
INTO mycnt
FROM v$session
WHERE username='GLENNF';
if (user='GLENNF') and ((mycnt=10) or (mycnt=20) or (mycnt=30)) then
dbms_monitor.session_trace_enable(null,null,true,true);
end if;
end;
/
Thursday Feb 16, 2006
There seems to be some confusion about how an Oracle
instance uses multi-core processors. From a database point of view, it
can use all CPU resource that is offered by Solaris. To find out what
CPU resources are available, use the "prtdiag" and "psrinfo" commands.
The example below shows
a single board USIV domain on a SF25K. There are a total of 4 * USIV Processors running 8 cores @1.2GHz each. The prtdiag output shows the 4 processors with two CPU ID's each.
The psrinfo command simply shows all 8 cores.
catscratchb:root> /usr/sbin/prtdiag
System Configuration: Sun Microsystems sun4u Sun Fire 15000
System clock frequency: 150 MHz
Memory size: 32768 Megabytes
========================= CPUs =========================
CPU Run E$ CPU CPU
Slot ID ID MHz MB Impl. Mask
-------- ------- ---- ---- ------- ----
/SB01/P0 32, 36 1200 16.0 US-IV 2.2
/SB01/P1 33, 37 1200 16.0 US-IV 2.2
/SB01/P2 34, 38 1200 16.0 US-IV 2.2
/SB01/P3 35, 39 1200 16.0 US-IV 2.2
catscratchb:root> /usr/sbin/psrinfo
32 on-line since 02/07/2006 18:00:23
33 on-line since 02/07/2006 18:00:25
34 on-line since 02/07/2006 18:00:25
35 on-line since 02/07/2006 18:00:25
36 on-line since 02/07/2006 18:00:25
37 on-line since 02/07/2006 18:00:25
38 on-line since 02/07/2006 18:00:25
39 on-line since 02/07/2006 18:00:25
Oracle does size internal latching structures based on the number of CPUs available. This sizing does NOT disallow Oracle shadow process from using CPU resource, it simply makes the system better suited to scale. To find out how many CPUs Oracle thinks are available, run the
following SQL command as sysdba. We have found that Oracle sizes its data structures based on the number or CPU IDs that are reported by psrinfo.
SQL> connect / as sysdba
SQL> select ksppinm name, ksppstvl value, ksppdesc description
from x$ksppi x, x$ksppcv y
where (x.indx = y.indx)
and ksppinm like '%cpu_count%'
order by name;
NAME VALUE DESCRIPTION
--------------- ------ ----------------------------------
cpu_count 8 number of CPUs for this instance
Friday Jan 27, 2006
A colleague and myself just returned from the "Diagnosing Oracle Performance Problems" course with HOTSOS given by Cary Millsap. This course was definitely top-notch with not only the technical aspect but the "people" aspect of solving performance problems. In my own experience I too have found that by partnering across the organization, you are better able to solve performance problems. Often times when a consultant is brought into an organization they can serve as a focal point. A good consultant can help provide a bridge between the user, developer, DBAs, and sysadmins. This course puts all of these aspects into perspective along with method-R.
I was also excited to be able to use the new version of HOTSOS. The new version is far superior to the share-ware versions on the web. I particularly like the ability to show skew in the various events... very useful for determining root cause.
Finally, I was happy to learn that HOTSOS is creating a new products which will use trace application data to help model and size systems for growth. This is a huge development. Other tools which try to model based on system resources completely miss the fact that all aspects of the application do not grow at the same rate. Using this new methodology, predicting grow should be much easier. Hats off once again to HOTSOS.
Tuesday Jun 21, 2005
Ever since I started reading the "Optimizing Oracle Performance" book by Cary Millsap, I have been salivating over getting a copy of the Hotsos profiler - but alas it is too expensive. Recently I tried to find a copy of their free version "Sparky" but it had been pulled due to support issues.
Finally, I stumbled upon orasrp. This analyzer is written in python and can be used standalone to produce html, or you can browse a directory of trace files via your web browser.
Monday Apr 11, 2005
I am going to be presenting at the spring SUPerG next week 4/21 - 4/23. IMHO, this is the best conference Sun offers - A great forum for techies to meet techies. Since I typically, update my presentation a few times before the conference, I thought it best to post LAG copies.
Hope to see you there,
Glenn
Demystifying High-End Sun Fire Behavior
when Scaling Database Applications
by Glenn Fawcett and Marcus Heckel |
|
Paper rev1.02 updated 4/19/05 after blog comments!!
Presentation rev1.03 updated 4/21/05 prior to pres :)
|
This paper attempts to explain the difference in the Sun Fire Server line. After examining the inherent differences, scaling topics will be discussed. The effects of large user, memory, and cpu count will be discussed along with Tips on how to best manage and scale applications.
Monday Aug 30, 2004
In the past you could only sample Oracle Wait Events and often missed interesting information. In 10g Oracle introduced a new view, V$ACTIVE_SESSION_HISTORY, which keeps a historical view of wait events. I ran across a great article in the Database Journal which describes how to use this new feature... Pretty cool!
Monday Aug 30, 2004
The RAW vs COOKED debate that has been going on for some time. You will find little argument which is more efficient at caching DB buffers. If and Oracle process can get a block by doing a block get, this uses much less code than issuing an IO and getting it from the UFS cache. If this is the case, why have I ran into numerous DBAs that have tried Raw once but saw worse or no performance improvement from RAW?
Further investigation usually reveals that the FS cache was providing caching benefit, that was not able to be realized within Oracle. Without modifying the SGA size and possibly storage parameters on active tables/indexes, it is hard to realize any benefit with Raw.
A good place to start is by looking at your largest objects. If your statistics are current, you can use the following SQL. OEM also does a good job of showing table sizes.
SQL> select table_name, CACHE, num_rows, BLOCKS/128 MB
from all_tables
order by MB desc
/
TABLE_NAME CACHE NUM_ROWS MB
------------------------------ ----- ---------- ----------
BIGACTIVETABLE N 20000000 13258.2891
LOOKUP Y 50000 150.5703
...
Also, You need to know the index sizes so..
SQL> select index_name,
num_rows,
leaf_blocks/128 MB ,
BUFFER_POOL
from all_indexes
where owner = 'STSC'
order by MB desc
/
INDEX_NAME NUM_ROWS MB BUFFER_
------------------------------ ---------- ---------- -------
BIGACTIVETABLE_PK 20000000 9245.6875 DEFAULT
....
Since we now know what the largest tables and indexes are, we should expect to see them in the buffer cache. This can be done by querying the v$bh table.
SQL> SELECT object_name, count(*)/128 MB
FROM v$bh, all_objects
WHERE object_id=objd
GROUP BY object_name
ORDER BY MB desc
/
OBJECT_NAME MB
------------------------------ ----------
LOOKUP 120.09375
BIGACTIVETABLE 100.98438
BIGACTIVETABLE_PK 75.23438
....
Notice that "BIGACTIVETABLE" and it's index are not being cached very well. This is due to the fact that the storage parameter on the table and index are set to "NO CACHE". This causes Oracle to not actively try to cache these blocks. By alter these storage parameters
"alter table BIGACTIVETABLE cache;"
and increasing the SGA, these tables and indexes can use more Oracle buffer cache. Note, that you can also separate objects into the DEFAULT, KEEP, and RECYCLE buffer pools. Statspack reports calculate hit ratios for the DEFAULT, KEEP, are RECYCLE buffer pools separately. I tend to put all objects I am actively trying to KEEP into the KEEP pool so I can figure my caching efficiency via statspack.
I hope this helps inspire you to give this a try. It is not black magic, give it a try.
Take Care,
Glenn