Tuesday Apr 14, 2009
Today, Sun is releasing a vast array of servers and blades with Intel's new Xeon 5560 (Nehalem) processor. We have significantly improved the performance of the crypto algorithms in the Solaris Cryptographic Framework (SCF). Some of these changes were covered in my previous blog entries, and I would like to summarize them here.
I must first commend Dan Anderson for doing an excellent job of incorporating a lot of hand-coded assembly into the SCF. Those enhancements were available in OpenSolaris 2008.10. Since then we have made the following enhancements, which will be available in OpenSolaris 2009.06. You can also try the preview bits of 2009.06 at genunix.org.
1) CR 6799218: RSA using Solaris Kernel Crypto framework lagging behind OpenSSL. Our changes made RSA decrypt operations 1.8 times faster. The details are documented here.
2) CR 6811474 and CR 6823192 make a number of changes to the big_mont_mul() and big_mul() routines, which form the core of Montgomery multiplication. These changes improve RSA decrypt operations by 10%.
3) CR 6812615: 64-bit RC4 has poor performance on Intel Nehalem. We made changes to the RC4 encrypt routine which delivered an improvement of 25% on Intel Nehalem. These changes are documented here.
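For readers unfamiliar with RC4, here is a minimal reference sketch of the algorithm in Python. It is the textbook algorithm, not the tuned SCF routine; the function name `rc4` is illustrative. It shows that each output byte depends on the state swap made for the previous byte, so the inner loop is a serial chain of loads, adds and stores.

```python
def rc4(key: bytes, data: bytes) -> bytes:
    """Textbook RC4: encrypts or decrypts data with key (same operation)."""
    # Key-scheduling algorithm: permute the 256-byte state using the key.
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) & 0xFF
        s[i], s[j] = s[j], s[i]

    # Pseudo-random generation: one state swap and one lookup per byte.
    out = bytearray()
    i = j = 0
    for byte in data:
        i = (i + 1) & 0xFF
        j = (j + s[i]) & 0xFF
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) & 0xFF])
    return bytes(out)


ciphertext = rc4(b"Key", b"Plaintext")
print(ciphertext.hex())
```

Because encryption and decryption are the same XOR with the keystream, calling `rc4` again with the same key recovers the plaintext.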
The performance of these and other crypto algorithms may be examined using the PKCS#11-compliant Sun Software Crypto plugin. Applications can be linked against the library /usr/lib/amd64/libpkcs11.so. To benchmark the SCF, we patched OpenSSL 0.9.8j (patch available at Jan Pechanec's blog) to use PKCS#11. The OpenSSL speed benchmark gives us the following numbers on a pre-release Sun Fire X4270 system with two Intel(R) Xeon(R) X5560 CPUs at 2.8 GHz, HT enabled:
| Benchmark | 1-thread | 16-threads |
|---|---|---|
| RSA-1024 encrypt | 24.9 K ops/s | 199.2 K ops/s |
| RSA-1024 decrypt | 1760 ops/s | 14048 ops/s |
| RC4 encrypt (8k message) | 317 MBytes/s | 2265 MBytes/s |
| MD5 Hash (8k message) | 531 MBytes/s | 6085 MBytes/s |
| SHA-1 Hash (8k message) | 356 MBytes/s | 2545 MBytes/s |
| AES-256 encryption (8k message) | 136.9 MBytes/s | 1212.6 MBytes/s |
Please note that these numbers are with Hyper-Threading (HT) enabled on the Nehalem processor, in which two virtual processors share the same execution pipeline. The performance of all algorithms scales roughly linearly from one core to eight cores. Disabling HT did not make much of a difference to the benchmarks. This could be because crypto operations incur few stalls in the execution pipeline, which leaves little benefit from the extra virtual processors.
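To make the scaling claim easy to check, here is a small Python sketch that computes the 16-thread to 1-thread ratio for each row of the table. The figures are copied from the table above; the dictionary layout and variable names are just for illustration.

```python
# (1-thread, 16-threads) results copied from the benchmark table.
# RSA rows are in ops/s, the rest in MBytes/s; only the ratio matters here.
results = {
    "RSA-1024 encrypt": (24.9e3, 199.2e3),
    "RSA-1024 decrypt": (1760, 14048),
    "RC4 encrypt": (317, 2265),
    "MD5 hash": (531, 6085),
    "SHA-1 hash": (356, 2545),
    "AES-256 encryption": (136.9, 1212.6),
}

# Speedup of the 16-thread run over the single-thread run.
speedups = {name: multi / single for name, (single, multi) in results.items()}

for name, ratio in speedups.items():
    print(f"{name:20s} {ratio:5.2f}x")
```

The system has eight physical cores, so a ratio near 8 corresponds to linear scaling across cores. The two RSA rows come out at almost exactly that value.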
For further notes on Sun's Intel Nehalem-based servers and blades, I recommend reading Heather's blog, which cross-links all Nehalem-based blog entries. And please do leave your comments and feedback.
Tuesday Apr 07, 2009
I wrote about our work on RSA-decrypt in OpenSolaris two posts ago. One of the biggest obstacles we faced while improving RSA is that the bignum library (which RSA uses for the expensive multiplication routines) is a kernel module. Anything in kernel land is much harder to work with (to analyze, improve, debug, or measure), so we ported the kernel code to a userland program. We have something ready now, and the userland code is identical to the current version of bignum in OpenSolaris.
More interestingly, we can use this code as a simple benchmark for CPU performance. The current code has been tuned for performance only on x86_64 (CMT processors have hardware accelerators, so the software performance of crypto algorithms there does not interest us much). As an example, here are the results for RSA-1024 decrypt on various systems (and processors):
| System (processor) | RSA-1024 decrypt |
|---|---|
| Sun Fire X4150 (Intel Xeon 5450 3.16 GHz Processor) | 493544 nsec |
| Sun Fire X4600 (AMD Opteron(tm) Processor 885, 2.6 GHz) | 557934 nsec |
| Sun Fire X4200 (AMD Opteron(tm) Processor 280, 2.31 GHz) | 606147 nsec |
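To compare these timings with throughput figures quoted in ops/s, the conversion is a single division. The sketch below assumes that each nsec value is the time for one RSA-1024 decrypt operation, which the table does not state explicitly. The variable names are illustrative.

```python
# Timings copied from the table above. Assumption: each value is the
# time in nanoseconds for a single RSA-1024 decrypt operation.
timings_nsec = {
    "Sun Fire X4150": 493544,
    "Sun Fire X4600": 557934,
    "Sun Fire X4200": 606147,
}

NSEC_PER_SEC = 1_000_000_000

# Operations per second is the reciprocal of the time per operation.
ops_per_sec = {name: NSEC_PER_SEC / t for name, t in timings_nsec.items()}

for name, ops in ops_per_sec.items():
    print(f"{name}: {ops:.0f} ops/s")
```

Under that assumption, a lower nsec figure translates directly into proportionally more decrypts per second on one thread.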
The benchmark code is available for download here.
However, our main objective remains to improve this benchmark, achieve better RSA decrypt performance, and then deliver the result to OpenSolaris. This effort has already led to a CR (which will be fixed soon in OpenSolaris). Hopefully you can contribute more improvements to the code. Please send in your ideas by e-mail or post a comment on this blog.
Here is how the benchmark performs on an X4150 (2-socket, quad-core Intel Xeon 3.16 GHz processors). The hottest function is big_mul_add_vec, which is already implemented in assembly for x64. The analysis was done with the collect and er_print tools available in Sun Studio 12.
%collect ./bignum_test
Creating experiment database test.1.er ...
%er_print -functions test.1.er
Functions sorted by metric: Exclusive User CPU Time
Excl. Incl. Name
User CPU User CPU
sec. sec.
51.956 51.956 <Total>
26.899 26.899 big_mul_add_vec
9.236 47.623 big_mont_mul
4.953 11.428 big_sqr_vec
3.663 18.863 big_mul
1.601 1.601 big_mul_set_vec
1.181 49.234 big_modexp_ncp_int
0.921 0.921 big_sub_vec
0.570 0.570 big_sub_pos
0.380 0.380 rand_r
0.370 0.911 genrandomstring
0.280 3.613 big_mul_vec
0.240 0.240 big_cmp_abs
0.220 1.301 big_div_pos
0.170 0.170 big_mulhalf_high
0.160 0.160 _free_unlocked
0.160 0.160 big_copy
0.160 0.540 rand
0.120 0.120 big_mulhalf_low
0.120 0.140 mutex_unlock
0.070 0.130 _malloc_unlocked
0.070 0.640 big_sub_pos_high
0.050 0.050 big_shiftleft
0.050 0.050 sigon
0.040 0.070 mutex_lock_impl
0.030 0.030 _smalloc
0.030 0.300 big_init1
0.030 0.030 cleanfree
0.030 0.030 gettimeofday
0.020 0.090 big_cmp_abs_high
0.020 0.310 big_finish
0.020 0.020 big_n0
0.020 0.290 free
0.020 0.270 malloc
0.020 0.090 mutex_lock
0.010 0.010 big_add_abs
0.010 0.010 big_numbits
0.010 0.010 big_shiftright
0. 0. _init
0. 0. _rt_boot
0. 0. _setup
0. 51.956 _start
0. 51.016 big_modexp_crt_ext
0. 50.065 big_modexp_ext
0. 0.200 big_mont_conv
0. 0.490 big_mont_rr
0. 0. call_init
0. 0. dtrace_dof_init
0. 0. ioctl
0. 51.956 main
0. 0. setup
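The inclusive times in the profile show a call chain from big_modexp_crt_ext through big_modexp_ext down to big_mont_mul. As a rough illustration of why the time ends up in multiplication, here is a minimal Python sketch of RSA decryption using the Chinese Remainder Theorem on top of Montgomery modular exponentiation. It uses the textbook algorithms with Python integers and is not a transcription of the bignum code. All function names and parameters here are hypothetical, and the key values in the usage lines are toy numbers.

```python
def mont_setup(n):
    """Precompute Montgomery constants for an odd modulus n, with R = 2**bits > n."""
    bits = n.bit_length()
    r = 1 << bits
    n_prime = (-pow(n, -1, r)) % r   # n * n_prime == -1 (mod R)
    rr = (r * r) % n                 # R^2 mod n, used to enter Montgomery form
    return bits, n_prime, rr


def mont_mul(a, b, n, bits, n_prime):
    """Montgomery product: returns a * b * R^-1 mod n, for 0 <= a, b < n."""
    mask = (1 << bits) - 1
    t = a * b
    m = ((t & mask) * n_prime) & mask
    u = (t + m * n) >> bits          # exact division by R
    return u - n if u >= n else u


def mont_modexp(base, exp, n):
    """Left-to-right square-and-multiply using Montgomery products."""
    bits, n_prime, rr = mont_setup(n)
    x = mont_mul(base % n, rr, n, bits, n_prime)   # base in Montgomery form
    acc = mont_mul(1, rr, n, bits, n_prime)        # 1 in Montgomery form
    for bit in bin(exp)[2:]:
        acc = mont_mul(acc, acc, n, bits, n_prime)
        if bit == "1":
            acc = mont_mul(acc, x, n, bits, n_prime)
    return mont_mul(acc, 1, n, bits, n_prime)      # leave Montgomery form


def rsa_crt_decrypt(c, p, q, d):
    """RSA decrypt via CRT: two half-size exponentiations plus a recombination."""
    dp, dq = d % (p - 1), d % (q - 1)
    q_inv = pow(q, -1, p)
    m1 = mont_modexp(c, dp, p)
    m2 = mont_modexp(c, dq, q)
    h = (q_inv * (m1 - m2)) % p
    return m2 + h * q


# Toy key for illustration only: n = 61 * 53, e = 17, d = 2753.
message = 65
cipher = pow(message, 17, 61 * 53)
print(rsa_crt_decrypt(cipher, 61, 53, 2753))
```

Every bit of the exponent costs one or two Montgomery products, so for a 1024-bit key almost all of the work is multi-precision multiplication. That is consistent with big_mul_add_vec and big_mont_mul topping the exclusive-time list.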