In a previous post, threads offloading to the accelerators were bound to specific cores (i.e. the 8-thread result used only 1 accelerator, the 16-thread result used 2 accelerators, and so on). For the following results, threads are allowed to float across all 8 accelerators (i.e. the 8-thread result uses all 8 accelerators, as does the 16-thread result, and so on).




From the above it is apparent that:

  • As previously observed, a single UltraSPARC T2 processor significantly outperforms a dual-socket quad-core Clovertown

  • On the UltraSPARC T2 processor, the vast majority of the performance that can be derived from the hardware accelerators can be delivered using a small number of threads, i.e. with the CPU largely idle

  • Using just 8 threads (12.5% CPU utilization) on the UltraSPARC T2 processor, it is possible to significantly outperform 8 Clovertown cores that are 100% utilized!
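
The thread counts behind these observations can be checked with a little arithmetic. The sketch below is only a model, not a measurement: it assumes 8 cores with 8 hardware threads and one accelerator each, that bound threads fill one core at a time (as in the previous post), and that floating threads end up spread evenly across cores. The function names are made up for illustration.

```python
import math

CORES = 8             # UltraSPARC T2 cores; assumed one crypto accelerator per core
THREADS_PER_CORE = 8  # hardware threads per core


def accelerators_bound(threads: int) -> int:
    """Accelerators reachable when threads are packed onto one core at a time."""
    return min(CORES, math.ceil(threads / THREADS_PER_CORE))


def accelerators_floating(threads: int) -> int:
    """Accelerators reachable when threads float (assumed to spread evenly)."""
    return min(CORES, threads)


def cpu_utilization(threads: int) -> float:
    """Percentage of the chip's hardware threads that are in use."""
    return 100.0 * threads / (CORES * THREADS_PER_CORE)


for n in (8, 16, 64):
    print(n, accelerators_bound(n), accelerators_floating(n), cpu_utilization(n))
```

Under these assumptions, 8 floating threads already reach every accelerator while occupying one eighth of the hardware threads, which is where the 12.5% figure comes from.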



=============================================

Details of competitive benchmarking

System Under Test:
      Clovertown, snv_70
Crypto library:
      OpenSSL 0.9.8e
Compiler:
      /usr/sfw/bin/gcc

To measure crypto performance on the x86 processor, the OpenSSL speed test was used. This is the standard microbenchmark included in the open-source OpenSSL package, and it measures raw cryptographic algorithm performance as implemented in the OpenSSL library (libcrypto.so) via the library's own native crypto APIs. For AES the metric is throughput (Gb/s). On multi-core systems, multiple threads are used to achieve the maximum throughput. For AES, performance is quoted for 8KB operations.

For the UltraSPARC T2 processor, the benchmark was developed by Sun to measure maximum AES-128 performance when utilizing the Solaris Cryptographic Framework. It is part of a set of cryptographic microbenchmark programs developed internally by the Crypto Product Group (NSN), which measure the performance provided by the Solaris Cryptographic Framework via the PKCS#11 API. On the T2 processor, the framework utilizes the hardware accelerators, and multiple threads are used to achieve the maximum throughput. For AES, performance is quoted for 8KB operations.
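
As a rough illustration of how a Gb/s figure follows from a count of fixed-size operations, here is a minimal sketch. It assumes 8KB means 8192 bytes and 1 Gb means 10^9 bits; neither benchmark's actual accounting is shown in this post, and the function name and the sample numbers are invented for the example.

```python
def throughput_gbps(operations: int, bytes_per_op: int, seconds: float) -> float:
    """Aggregate throughput in gigabits per second (assuming 1 Gb = 10**9 bits)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return operations * bytes_per_op * 8 / seconds / 1e9


OP_SIZE = 8 * 1024  # 8KB operations, assumed here to mean 8192 bytes

# Made-up run, not a measurement: 1,000,000 operations summed over all
# threads, completed in 10 seconds of wall-clock time.
print(throughput_gbps(1_000_000, OP_SIZE, 10.0))
```

With multiple threads, the operation count is the total across all threads and the time is elapsed wall-clock time, so the result is the aggregate throughput of the whole system.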


We have two variants of the code. The first repeatedly performs the following for the cipher and object length specified:

C_EncryptInit()
C_Encrypt()

The second performs

C_EncryptInit()

once and then repeatedly performs:

C_EncryptUpdate()

for the cipher and object length specified.

The second variant will obviously deliver higher performance, as the cost of the initialization operation is amortized across all subsequent operations. The OpenSSL speed test is closest to the second variant of the code, and it is this variant that we use in these comparisons.
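
The difference between the two loop shapes can be sketched without any crypto at all. The code below is not the Sun benchmark: CountingSession is a hypothetical stand-in that only counts calls, and its method names merely mirror the init, single-part encrypt and update steps described above.

```python
class CountingSession:
    """Hypothetical stand-in for a crypto session; it counts calls and encrypts nothing."""

    def __init__(self) -> None:
        self.init_calls = 0
        self.data_calls = 0

    def encrypt_init(self) -> None:
        # Plays the role of the initialization step.
        self.init_calls += 1

    def encrypt(self, block: bytes) -> None:
        # Plays the role of a single-part encrypt.
        self.data_calls += 1

    def encrypt_update(self, block: bytes) -> None:
        # Plays the role of a multi-part update.
        self.data_calls += 1


def variant_one(session: CountingSession, block: bytes, iterations: int) -> None:
    """Initialize before every operation."""
    for _ in range(iterations):
        session.encrypt_init()
        session.encrypt(block)


def variant_two(session: CountingSession, block: bytes, iterations: int) -> None:
    """Initialize once, then only update."""
    session.encrypt_init()
    for _ in range(iterations):
        session.encrypt_update(block)


block = bytes(8 * 1024)  # one 8KB object of zero bytes
first, second = CountingSession(), CountingSession()
variant_one(first, block, 1000)
variant_two(second, block, 1000)
print(first.init_calls, first.data_calls)
print(second.init_calls, second.data_calls)
```

Both variants process the same number of objects, but the second pays the initialization cost once rather than once per object, which is the amortization referred to above.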



Comments:

So is there code in the kernel to identify threads that can use the crypto accelerators and spread them evenly between the cores, mixing them with threads that don't use the accelerators?

Posted by Marc on November 06, 2007 at 05:22 AM PST #

In this microbenchmark, only crypto threads are running, so this is just a simple case of Solaris spreading load across the processor's 8 cores.

However, the N2 crypto device driver, n2cp, which controls the bulk cipher and secure hash accelerators, will load-balance across the processor when under heavy load, automatically ensuring optimal crypto performance.

Posted by Lawrence Spracklen on November 14, 2007 at 11:28 AM PST #

This blog copyright 2008 by sprack