Ever more powerful hardware can stress software and lead to bottlenecks not previously seen. I was working recently on Olio scalability where the challenge was to see how many active users our configuration could support while meeting specified response time targets. In our setup, an active user on average did one operation every 5 seconds, and an operation required on average 7 web transactions. Supporting 10000 users across two X2270 web servers therefore involved a transaction rate of about 7200 transactions per second per server. The web tier contained PHP application code that in turn accessed memcached, NFS-mounted filestore, and backend replicated MySQL database.
Needless to say, the initial configuration failed to support 10000 users. The history of the exercise then became one of iterating through a process of identifying bottlenecks, eliminating or working around them, and moving on to discover the next bottleneck. There will always be at least one bottleneck of course, but artificial bottlenecks in software that prevent the underlying hardware from being fully utilized are especially to be avoided.
One of the things we did in the course of a run is use Solaris' lockstat(1M) command to collect information about hot locks in the kernel. Solaris uses several different kinds of lock: locks where it is acceptable to be put to sleep, locks where you are only permitted to spin if an attempt to get the lock fails, Read-Write locks that can allow multiple concurrent readers but only one writer at a time, etc. In our case, contention for Read-Write locks dominated. Here's an example:
R/W writer blocked by writer: 604708 events in 518.141 seconds (1167 events/sec)
Count indv cuml rcnt nsec Lock Caller
-------------------------------------------------------------------------------
604575 100% 100% 0.00 1376373 0xffffff0e9e1893c8 ufs_rwlock+0xf6
54 0% 100% 0.00 26305 0xffffff0d38a73630 tmp_remove+0x70
51 0% 100% 0.00 32393 0xffffff0d38a73630 tmp_create+0x126
15 0% 100% 0.00 31474 0xffffff0d38a73630 tmp_rename+0x1b4
The output shows a major bottleneck on a particular ufs rwlock. But what file is that lock associated with? Via mdb(1), the Lock address can be used to find the filename. The Caller gives a strong clue that the lock is ufs-related, and mdb's ::whatis dcmd can be used to identify it:
> 0xffffff0e9e1893c8::whatis ffffff0e9e1893c8 is ffffff0e9e1892f0+d8, allocated from ufs_inode_cache
Again we have a strong clue that the lock is associated with an inode. Proceeding on the assumption that we're dealing with the i_rwlock field of an inode, a sanity check can be performed to verify that the offset of i_rwlock in inode_t is 0xd8, which matches the offset in the ::whatis output. [I admit, I did also look at kernel source to get these fieldnames.]
> ::offsetof inode_t i_rwlock offsetof (inode_t, i_rwlock) = 0xd8
The next step is to get the vnode corresponding to the inode, and from that get the path of the file it represents. The inode's base address was infact printed by the ::whatis dcmd, but it could also be computed by subtracting i_rwlock offset from the lock address. Either works. From there, just print the value of i_vnode, then use the ::vnode2path dcmd.
> 0xffffff0e9e1893c8-0xd8::print inode_t i_vnode i_vnode = 0xffffff0df3646040 > ffffff0e9e1892f0::print inode_t i_vnode i_vnode = 0xffffff0df3646040 > 0xffffff0df3646040::vnode2path /var/apache2/2.2/logs/access_log
Aha! Apache's access_log was a bottleneck. This probably should not be surprising given the rate at which transactions were being logged from 5000+ processes on a single server. Although by this point we had reached our target of 10000 users, we did test what happens when the log was put on /tmp, which was to improve response times further. It does however point to the need for a new, more scalable, logging implementation for Apache. At a minimum, high-activity logs need to go on the fastest available storage.
Speaking of /tmp, the original lock output shows some rwlock activity associated with /tmp as well (as implied by the Callers). And yes, even before we got to the ufs lock issue there was a bigger bottleneck on /tmp, one that was stopping us reaching our target. While the mechanics differ in detail (different structure, fieldname, offset), mdb can be used to identify the files responsible for /tmp contention too:
> 0xffffff0d38a73630::whatis ffffff0d38a73630 is ffffff0d38a73570+c0, allocated from kmem_alloc_224 > ::offsetof struct tmpnode tn_rwlock offsetof (struct tmpnode, tn_rwlock) = 0xc0 > ffffff0d38a73570::print struct tmpnode tn_vnode tn_vnode = 0xffffff0d38c9e480 > 0xffffff0d38c9e480::vnode2path /tmp
The PHP application code was saving session state for each session in a file in /tmp, creating a bottleneck on the rwlock associated with the /tmp directory itself. The workaround for this was to edit php.ini and tell PHP to distribute its session state files across multiple subdirectories. This is done via session.save_path:
< session.save_path = "/tmp" --- > session.save_path = "1;/tmp/http_sessions"
With the /tmp change in place and other tuning of caching layer and MySQL, Apache scaled to the level we needed to support 10000 active Olio users across two X2270 servers. Access_log is important for Service Level Management amongst other things, and while our response times were being met with it on JBOD, rwlock contention indicates a more scalable solution will be required in the future. /tmp can provide some relief albeit at the risk of being on volatile storage.