Tuesday Oct 13, 2009

You have probably already heard about the Sun Storage F5100 Flash Array and its world-record benchmarks.

But it is not the F5100 that I am going to talk about; it is its smaller sibling, the Sun Flash Accelerator F20 PCIe Card. The name is a mouthful, like all Sun product names, so I will just call it "The Accelerator Card" in the remainder of this blog entry. Of course, the idea is not to start with the answer and go looking for a problem to fit it. What I am going to narrate is how we saw a problem first and then thought of using this answer to solve it.

Recently our group, ISV-E, was doing its standard thing of making applications run best on Sun. In this particular project, with PeopleSoft Enterprise 9.0 on an M5000 system using Sun Storage 6540, we ran into a problem: certain batch jobs were taking a long time to execute. PeopleSoft Enterprise 9.0 does have ways to break up jobs and run them in parallel so as to use the multiple cores of a multi-processor system, yet we still could not leverage the system well enough to be satisfied. The project was using Oracle Database 11g. I have got to give it to Oracle, they do have good tools. We used Oracle Enterprise Manager and saw that, for the troubled batch process, it was showing a lot of blue in its output.


Looking at the top objects, the tool also reported which tables and indexes were troublesome and causing that much blue to appear in the chart. (In the Enterprise Manager activity chart, blue generally represents time waiting on user I/O and green represents CPU time.) This "Blue" problem is what led us to the idea of testing the Accelerator Card in the system to see if it could help. We created a few tablespaces, spread them across the four Flash Modules on the Accelerator Card, and moved the highly active (or "hot") tables and indexes to the newly created tablespaces. What we saw was a huge reduction in the blue area and a lot more green. That led to the slogan in our team:

"Go Green with the Accelerator Card !"

The Accelerator Card reduced the time not only for this process but also for many other batch processes with a high I/O component. Here is a relative comparison of how it helped (with an additional slight boost from upgrading the SPARC64 VII CPUs from 2.4GHz to 2.53GHz).


Of course the next question is what happens if you take the same approach to its bigger sibling, the Sun Storage F5100 Flash Array. Well, that is exactly what we did and, as they say, the rest is history. (Hint: read the world records link and search for PeopleSoft.) For more information, check out Vince's blog entry on PeopleSoft Enterprise Payroll 9.0 NA and also Why Sun Storage F5100 is good for PeopleSoft 9.0 NA Payroll application.

Truly, if you use Oracle with Oracle Enterprise Manager to monitor your application performance and are turning blue from seeing a lot of blue area in the chart, then just remember:

"Go Green with the Accelerator Card !"


Tuesday Sep 15, 2009

Recently I was working on a project that used Infobright as the database. The version tested was 3.1.1, on both OpenSolaris and Solaris 10. Infobright is a column-oriented database engine for MySQL, primarily targeted at data warehouse and data mining deployments.

While everything was working as expected, we did notice one thing: as the number of concurrent connections querying the database increased, query times deteriorated fast, in the sense that not much parallel benefit was being squeezed from the machine. Now this sucks! (Apparently "sucks" is now a technical term.) It sucks because the server definitely has many cores, and each Infobright query can typically peg at most one core. So the expectation is to handle at least as many concurrent queries as there are cores (figuratively speaking; in reality it depends).

Anyway, we started digging into this problem. First we noticed that CPU utilization was heavy, so I/O was probably not the culprit (in this case). Using plockstat we found:

# plockstat -A -p 2039    (where 2039 is the PID of mysqld server running 4 simultaneous queries)

^C 
Mutex hold 

Count     nsec Lock                         Caller 
------------------------------------------------------------------------------- 
3634393     1122 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_Znwm+0x2b 
3626645     1047 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_ZdlPv+0xe 
    2 536317885 0x177b878                    mysqld`_ZN7IBMutex6UnlockEv+0x12 
   12  6338626 mysqld`LOCK_open             mysqld`_Z10open_tableP3THDP13st_table_listP11st_mem_rootPbj+0x55a 
 9057     1275 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_Znwm+0x2b 
 8493     1051 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_ZdlPv+0xe 
 7928     1119 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_ZdlPv+0xe 
    5   326542 0x177b878                    mysqld`_ZN7IBMutex6UnlockEv+0x12 
  683     1189 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_Znwm+0x2b 
  564     1339 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_Znwm+0x2b 
  564     1274 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_Znwm+0x2b 
  564     1156 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_ZdlPv+0xe 
   17    36292 0x1777780                    mysqld`_ZN7IBMutex6UnlockEv+0x12 
    2   246377 mysqld`rccontrol+0x18        mysqld`_ZN7IBMutex6UnlockEv+0x12 
   57     8074 mysqld`_iob+0xa8             libstdc++.so.6.0.3`_ZNSo5flushEv+0x30 
  218     1479 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_Znwm+0x2b 
    4    78172 mysqld`rccontrol+0x18        mysqld`_ZN7IBMutex6UnlockEv+0x12 
    4    75161 mysqld`rccontrol+0x18        mysqld`_ZN7IBMutex6UnlockEv+0x12 
….

R/W reader hold 

Count     nsec Lock                         Caller 
------------------------------------------------------------------------------- 
   44     1171 mysqld`THR_LOCK_plugin       mysqld`_Z24plugin_foreach_with_maskP3THDPFcS0_P13st_plugin_intPvEijS3_+0xa3 
   12     3144 mysqld`LOCK_grant            mysqld`_Z11check_grantP3THDmP13st_table_listjjb+0x38c 
    1    14125 0xf7aa18                     mysqld`_ZN11Query_cache21send_result_to_clientEP3THDPcj+0x536 
    1    12089 0xf762e8                     mysqld`_ZN11Query_cache21send_result_to_clientEP3THDPcj+0x536 
    2     1886 mysqld`LOCK_grant            mysqld`_Z11check_grantP3THDmP13st_table_listjjb+0x38c 
    2     1776 mysqld`LOCK_grant            mysqld`_Z11check_grantP3THDmP13st_table_listjjb+0x38c 
    1     3006 mysqld`LOCK_grant            mysqld`_Z11check_grantP3THDmP13st_table_listjjb+0x38c 
    1     2765 mysqld`LOCK_grant            mysqld`_Z11check_grantP3THDmP13st_table_listjjb+0x38c 
    1     1797 mysqld`LOCK_grant            mysqld`_Z11check_grantP3THDmP13st_table_listjjb+0x38c 
    1     1131 mysqld`THR_LOCK_plugin       mysqld`_Z24plugin_foreach_with_maskP3THDPFcS0_P13st_plugin_intPvEijS3_+0xa3 

Mutex block 

Count     nsec Lock                         Caller 
------------------------------------------------------------------------------- 
 2175 11867793 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_ZdlPv+0xe 
 1931 12334706 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_Znwm+0x2b 
    3 93404485 libc.so.1`libc_malloc_lock   mysqld`my_malloc+0x32 
    1    11581 libc.so.1`libc_malloc_lock   mysqld`_ZN11Item_stringD0Ev+0x49 
    1     1769 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_ZnwmRKSt9nothrow_t+0x20
..
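To rank the locks in output like this, it can help to multiply each row's count by its time and total the result per lock. The following is a minimal sketch, assuming the plockstat output has been saved to a file and treating the nsec column as the average time per event; summarize_plockstat is a hypothetical helper name, not part of plockstat.

```shell
#!/bin/sh
# Sketch: total time per lock from saved plockstat output.
# Data rows look like "Count nsec Lock Caller". Count * nsec is used as an
# approximation of the total time for that row (assumes nsec is an average).

summarize_plockstat() {
    tab=$(printf '\t')
    awk '
        # Section headers such as "Mutex hold", "Mutex block", "R/W reader hold"
        /^(Mutex|R\/W) / { section = $0; sub(/[ \t]+$/, "", section); next }
        # Data rows: the first two fields are integers
        NF >= 4 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ {
            total[section, $3] += $1 * $2
        }
        END {
            for (key in total) {
                split(key, part, SUBSEP)
                # Print section, lock, and total time in milliseconds
                printf "%s\t%s\t%.0f\n", part[1], part[2], total[key] / 1000000
            }
        }
    ' "$@" | sort -t "$tab" -k1,1 -k3,3nr
}
```

Running summarize_plockstat on a saved file (the file name is up to you) prints one tab-separated line per section and lock, largest total first within each section, which makes a dominant lock easy to spot.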

Now, typically, if you see libc_malloc_lock in plockstat output for a multi-threaded program, it is a sign that the default malloc/free routines in libc are the culprit, since the default malloc is not scalable enough for a multi-threaded program. There are alternative implementations that scale better than the default. Two such options that already ship with OpenSolaris and Solaris 10 are libmtmalloc.so and libumem.so. Either one can be forced into use in place of the default, without recompiling the binaries, by preloading it before the startup command.

In the case of the 64-bit Infobright binaries, we did that by modifying the startup script mysqld-ib, adding the following line just before the invocation of the mysqld command.

LD_PRELOAD_64=/usr/lib/64/libmtmalloc.so; export LD_PRELOAD_64

What we found was that the response time of each query was now more in line with what it is when the query runs on its own. Well, not entirely true, but you get the point. With 4 concurrent queries, total execution time improved from the 1X baseline to about a 2.5X reduction.

Similarly, when we used libumem.so, the reduction was more like 3X with 4 queries executing concurrently.

LD_PRELOAD_64=/usr/lib/64/libumem.so; export LD_PRELOAD_64

Definitely something to use for all Infobright installations on OpenSolaris or Solaris 10.
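The preload lines above can be made a little more defensive, so that the startup script only sets LD_PRELOAD_64 when the library is actually present. This is a minimal sketch under that assumption; pick_allocator is a hypothetical helper, not something shipped with Infobright or Solaris, and the library paths are the ones used in this post.

```shell
#!/bin/sh
# Sketch: choose the first allocator library that exists before starting mysqld.
# pick_allocator is a hypothetical helper name used only for illustration.

# Print the first candidate path that is readable; fail if none is.
pick_allocator() {
    for lib in "$@"; do
        if [ -r "$lib" ]; then
            printf '%s\n' "$lib"
            return 0
        fi
    done
    return 1
}

# Possible use inside a startup script such as mysqld-ib, just before mysqld
# is invoked (left commented out so this file only defines the helper):
#
#   if lib=$(pick_allocator /usr/lib/64/libumem.so /usr/lib/64/libmtmalloc.so); then
#       LD_PRELOAD_64=$lib; export LD_PRELOAD_64
#   fi
```

Listing libumem.so first simply reflects the slightly better result it gave in this test; the order is a choice, not a rule.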

In a following blog post we will see other ways to tune Infobright that are not as dramatic as this one but still buy some percentage of improvement. Stay tuned!!










Thursday Jul 23, 2009

Recently I got access to the refreshed Sun Fire X4140 with 2 x 6-core Opterons and 36GB RAM. I had not tried PostgreSQL 8.4 since the release of the final bits, so I downloaded the Solaris 10 binaries of PostgreSQL 8.4 (64-bit) from the download site of postgresql.org and took them for a test drive with the same iGen benchmark that I had used earlier for my PGCon 2009 presentation.

The system already had Solaris 10 5/09 installed, with a couple of SSDs and a RAID LUN for the database. I put the WAL on an internal drive with the ZFS intent log on the SSDs, and the tablespaces on the RAID LUN (on an external storage array).

Notice the crossing of the 400K tpm boundary with PostgreSQL here using this benchmark toolkit. None of my tests have ever done that before. I consider this to be a milestone achievement with PostgreSQL, Solaris 10, Sun Fire Systems with Opterons.




Tuesday Jul 21, 2009

Sun is launching systems with multi-socket 6-core Opterons (Istanbul) today. Last week I got access to a Sun Fire X4140 with 2 x 6-core Opterons and 36GB RAM. It is always great to see a 1RU system packed with so many x64 cores.

# psrinfo -vp
The physical processor has 6 virtual processors (0-5)
  x86 (chipid 0x0 AuthenticAMD family 16 model 8 step 0 clock 2600 MHz)
    Six-Core AMD Opteron(tm) Processor 8435
The physical processor has 6 virtual processors (6-11)
  x86 (chipid 0x1 AuthenticAMD family 16 model 8 step 0 clock 2600 MHz)
    Six-Core AMD Opteron(tm) Processor 8435


I decided to take the system for a test drive with Olio. Olio is a Web 2.0 toolkit consisting of a Web 2.0 event calendar application that can help stress a system. Depending on your favorite language, you can use the PHP, Ruby on Rails, or Java implementation of the application. (I took the easy way out and selected Olio PHP's prebundled binary kit.)

Please don't let the small 2MB kit size fool you into thinking it will be an easy workload to test. While setting it up, I figured out that generating the data population for, say, 5000 users needs at least 500GB of disk space for the content it generates. Yes, I quickly had to figure out how to get a storage array for Olio with about an 800GB LUN.

Olio requires a web server, PHP (of course), and a database for its metadata store (it already has scripts for MySQL in the kit). The system came preconfigured with Solaris 10 5/09. I downloaded MySQL 5.4.1 beta and also the Sun WebStack kit, which has Apache httpd 2.2 and PHP 5.2 (and also MySQL 5.1, which I did not use since I had already downloaded the MySQL 5.4 beta). Memcached 1.2.5 is part of the WebStack download, and Olio is configured to use it by default (though it can be disabled).

Eventually everything was installed and configured on the same X4140, and using the Faban harness on another system I started executing some runs, with the file store and the metadata store preconfigured to handle all the way up to 5000 concurrent users. The results are as follows:

[Chart: OlioPHP results]

Here are my observations/interpretations:

  • Beyond the 10-core run, I find that the system memory (36GB) is not enough to sustain more concurrent users and fully utilize the remaining cores. I would probably need RAM in the range of 48GB or more to handle more users. (PHP is not completely thread-safe, and hence the web server used here spawns processes.)
  • That this 1RU system can handle more than 3200 users (with everything on the same system) with CPU cycles to spare is pretty impressive. It means you still have enough CPU to log into the system without seeing degraded performance.
  • You can actually see here that an SMP (or should it be called SMC, Scalable Multi Cores) type of system helps as the initial cores are added, compared with using multiple single-core systems (a la cloud).

In upcoming blog entries I will talk more about the individual components used.



Wednesday Jun 03, 2009

With the release of OpenSolaris 2009.06, I thought it was time to update the Minimal OpenSolaris 2008.11 Appliance OVF image that I had created earlier. The script create_osol2009006_app.sh has been updated to create minimal OpenSolaris 2009.06 Appliance images for VirtualBox.

How to use the OVF image?

  • Download VirtualBox 2.2.4 and install it on your host platform.
  • Download the OpenSolaris 2009.06 App OVF image zip file and then unzip it.
  • Fire up the VirtualBox GUI and use the menu item VirtualBox->File->Import Appliance to import the image (using the OSOL200906App.ovf file) into a new VirtualBox VM.
  • Start the newly created VM, and in a few minutes you will be ready to log in to OpenSolaris 2009.06. The preset login information is user: root with password: opensolaris.

Comments welcome.

Saturday May 30, 2009

Simon Riggs of 2ndQuadrant recently submitted a patch for testing that should improve the read-only scalability of Postgres. I took it for a test drive on my setup. In the first set of tests I used the same benchmark as before so as to have the same reference point.

It seems that changing the number of buffer partitions does not have any impact for this workload. My dataset for this iGen benchmark is pretty small and should easily fit in under 2GB, and hence may not stress the buffer partitions enough to warrant a bigger number. The patch still helps to get a good, healthy 4-6% gain in peak values.


Friday May 29, 2009

At PGCon 2009, Jesper Pedersen talked to me about the new Binary Transfer patch that was submitted to the JDBC driver for Postgres 8.4. I thought it would be nice to see how the 8.4 JDBC driver compares to the older 8.3 JDBC driver. Hence I took it for a drive.

The 8.4 JDBC driver with the BinaryTransfer patch seems to get to a better peak faster but then seems to taper off at high client counts. I don't know if this was the right benchmark for it. More benchmarks that use JDBC are needed to see the performance difference with this feature.


Thursday May 28, 2009

During my PGCon 2009 presentation there was a question about the sawtooth nature of the workload results at the high end of the benchmark runs, to which Matthew Wilcox (from Intel) commented that it could be scheduler related. I did not give it much thought at the time, until today, when I was trying to do some iGen runs for the JDBC Binary Transfer patch (more on that in another blog post) and also for Simon's read-only scalability patch. Then I realized that I was not following one of my own tuning tips for running Postgres on OpenSolaris: use the FX scheduler class instead of the default TS class. More details on the various scheduler classes can be found on docs.sun.com.

Now, how many times I have forgotten to do that with Postgres on OpenSolaris, I have no idea. But yes, it is highly recommended, especially on multi-core systems, to use the FX scheduler class for Postgres on OpenSolaris. How much gain are we talking about? The following graph gives an indication, comparing the default TS scheduler class with the FX scheduler class using the iGen benchmark.

The gain is about 14% just from switching over to the FX class. How did I get the Postgres server instance to use the FX class? I cheated and put all processes of the user (with userid 236177) in the FX class using the following command line.

# priocntl -s -c FX -i uid 236177

One thing to figure out is how to make sure Postgres uses FX scheduler class out of the box on OpenSolaris so I don't keep forgetting about that minute performance tip.
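One small step toward not forgetting is to wrap the command so it takes a user name instead of a hard-coded numeric ID. The sketch below only prints the priocntl command, in the same form used above, so it can be reviewed before being run as root. fx_command and ID_CMD are hypothetical names introduced for illustration, and the sketch assumes an id command that supports -u (on Solaris 10 that may mean pointing ID_CMD at /usr/xpg4/bin/id).

```shell
#!/bin/sh
# Sketch: build the priocntl command that moves all of a user's processes
# into the FX scheduler class. Nothing is executed here; the command is
# only printed. fx_command and ID_CMD are illustrative names.

fx_command() {
    # Look up the numeric uid for the given user name; fail if unknown.
    uid=$("${ID_CMD:-id}" -u "$1") || return 1
    printf 'priocntl -s -c FX -i uid %s\n' "$uid"
}

# Possible use once the server is running (user name is an assumption):
#
#   fx_command postgres        # review the printed command
#   fx_command postgres | sh   # then run it as root
```

priocntl also has an -e form that launches a command directly in a given class, which might be a way to start the server in FX from its startup script; that variant is untested here.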





Friday May 22, 2009

On the first day of PGCon 2009 I presented the results of my testing with Postgres 8.4beta1 versus the earlier version (8.3.7). The good news is that upgrading should not cause any regressions for existing users of 8.3.7, who can then exploit the new features of Postgres 8.4.


Comments/Questions welcome.


Tuesday May 19, 2009

While working on my upcoming presentation for PGCon 2009 on Thursday, I found that it is sometimes misleading to take just one snapshot of locks to figure out the hot locks when characterizing a PostgreSQL workload.

So again starting from one of the DTrace scripts I arrived at pglockwait_84.d

NOTE: It only works with operating systems that support DTrace. I have only tested it on OpenSolaris as of now.

It can be used either to summarize all PostgreSQL backends (using '*') or to track a selected one by process ID, reporting at 10-second intervals. It also prints the time so that the output can be dumped into a file for post-processing.

An example output is shown below, from dbt-2 runs using PostgreSQL 8.4 beta1.

# ./pglockwait_84.d '*'

2009 May 19 02:52:14
Lock-Id           Mode       Wait-Time(ms)  Count
Dynamic Locks     Exclusive  0              5
ProcArrayLock     Shared     0              37
Dynamic Locks     Shared     1              52
CLogControlLock   Exclusive  1              85
BufFreelistLock   Exclusive  1              81
CLogControlLock   Shared     1              103
ProcArrayLock     Exclusive  2              112
BgWriterCommLock  Exclusive  10             123
BufMappingLock    Exclusive  11             636
XidGenLock        Exclusive  17             2
BufMappingLock    Shared     34             1566
WALInsertLock     Exclusive  49             2305
LockMgrLock       Exclusive  65             852

2009 May 19 02:52:24
Lock-Id           Mode       Wait-Time(ms)  Count
XidGenLock        Shared     0              1
XidGenLock        Exclusive  0              12
ProcArrayLock     Shared     1              86
BufFreelistLock   Exclusive  4              240
BgWriterCommLock  Exclusive  5              213
Dynamic Locks     Shared     5              157
CLogControlLock   Exclusive  6              238
CLogControlLock   Shared     6              384
ProcArrayLock     Exclusive  57             360
Dynamic Locks     Exclusive  158            7
WALInsertLock     Exclusive  187            7837
LockMgrLock       Exclusive  226            3251
BufMappingLock    Exclusive  289            2141
BufMappingLock    Shared     895            5513

2009 May 19 02:52:34
Lock-Id           Mode       Wait-Time(ms)  Count
XidGenLock        Shared     0              0
Dynamic Locks     Exclusive  0              6
XidGenLock        Exclusive  0              5
ProcArrayLock     Shared     1              76
BufFreelistLock   Exclusive  3              183
BgWriterCommLock  Exclusive  4              118
ProcArrayLock     Exclusive  5              229
Dynamic Locks     Shared     5              91
CLogControlLock   Exclusive  29             198
CLogControlLock   Shared     62             272
BufMappingLock    Exclusive  141            1685
LockMgrLock       Exclusive  206            2175
WALInsertLock     Exclusive  221            5540
BufMappingLock    Shared     279            4180

2009 May 19 02:52:44
Lock-Id           Mode       Wait-Time(ms)  Count
XidGenLock        Shared     0              0
Dynamic Locks     Exclusive  0              3
XidGenLock        Exclusive  0              5
ProcArrayLock     Shared     0              67
BgWriterCommLock  Exclusive  1              69
BufFreelistLock   Exclusive  2              148
CLogControlLock   Shared     3              262
CLogControlLock   Exclusive  4              199
ProcArrayLock     Exclusive  47             277
WALWriteLock      Exclusive  64             2
BufMappingLock    Exclusive  79             1599
WALInsertLock     Exclusive  151            5949
LockMgrLock       Exclusive  198            2377
BufMappingLock    Shared     223            4345
Dynamic Locks     Shared     1568           144
^C

It prints output every 10 seconds, showing the time spent acquiring the locks. For BufMappingLock, LockMgrLock and Dynamic Locks, it aggregates all the locks of each kind together. It is a bit heavy on system resources if you track all Postgres backends, but if you already know which backend you want, the overhead can be low. I hope it is as useful to you as it was for my purpose.
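Since the output carries a timestamp per interval, it lends itself to simple post-processing. The following is a minimal sketch that reports the lock with the highest wait time in each interval, assuming the output was saved to a file with one timestamp, header, or lock row per line; top_lock_per_interval is a hypothetical helper name and is not part of pglockwait_84.d.

```shell
#!/bin/sh
# Sketch: for each 10-second interval, print the lock/mode with the largest
# wait time. Assumes one timestamp, header, or lock row per input line.

top_lock_per_interval() {
    awk '
        # Print the best row seen for the interval that just ended, if any
        function emit() {
            if (stamp != "" && best != "")
                printf "%s\t%s\t%d\n", stamp, best, max
        }
        # Timestamp line, e.g. "2009 May 19 02:52:14"
        /^[ \t]*[0-9][0-9][0-9][0-9] [A-Z][a-z][a-z] [0-9]+ [0-9][0-9]:/ {
            emit()
            stamp = $1 " " $2 " " $3 " " $4
            best = ""; max = -1
            next
        }
        # Lock row: name (may contain spaces), mode, wait in ms, count
        NF >= 4 && $NF ~ /^[0-9]+$/ && $(NF-1) ~ /^[0-9]+$/ {
            w = $(NF-1) + 0
            if (w > max) {
                max = w
                best = $1
                for (i = 2; i <= NF - 2; i++) best = best " " $i
            }
        }
        END { emit() }
    ' "$@"
}
```

On the four intervals in the sample output, read one row per line, the top waiter works out to LockMgrLock Exclusive (65 ms), BufMappingLock Shared (895 ms), BufMappingLock Shared (279 ms) and Dynamic Locks Shared (1568 ms), which shows why a single snapshot can point at the wrong lock.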


This blog copyright 2009 by Jignesh Shah