Thursday May 08, 2008

As stated in an earlier entry, when running on an UltraSPARC T2 processor, applications using the Java Cryptography Extension (JCE) should (when applicable) automatically leverage the on-chip cryptographic accelerators.

Following a recent conversation with a Java guru, here is what to check if you experience problems:

Java on Solaris automatically sets SunPKCS11-Solaris (which calls into
the Solaris Crypto Framework) as the default security provider, so you
need to do nothing.

This has been the case since some release of J2SE 5.0. You can check the
${java.home}/lib/security/java.security file. It should contain a line
that looks like:

security.provider.1=sun.security.pkcs11.SunPKCS11 ${java.home}/lib/security/sunpkcs11-solaris.cfg




Wednesday Apr 30, 2008

I typically witter on about crypto performance at the microbenchmark level, but I was recently browsing the SPECweb2005 results and I was impressed to see how the T2 performs, especially on the Banking workload, which is 100% HTTPS:


Processor                                          SPECweb2005_Banking
1 x T2 [1.4GHz]                                    70,000
2 x Quad-core Opteron Processor (2356) [2.3GHz]    50,856
2 x Quad-core Xeon Processor X5460 [3.2GHz]        51,840
4 x Quad-core Xeon Processor X7350 [3.0GHz]        71,104


Intel 2-chip http://www.spec.org/web2005/results/res2008q1/web2005-20080225-00104.txt
Intel 4-chip http://www.spec.org/web2005/results/res2007q4/web2005-20071203-00101.html

Opteron http://www.spec.org/web2005/results/res2008q2/web2005-20080409-00107.txt
T2 http://www.spec.org/web2005/results/res2008q2/web2005-20080408-00105.txt


Pretty impressive! So a single-socket UltraSPARC T2 system provides performance comparable to 4-socket x64 systems containing quad-core processors! On a per-socket basis, the T2 outperforms the competition by over 2.7X!


Now, this performance leadership is not all down to the HW crypto support; I'm sure the on-chip NICs and abundance of threads help somewhat too. However, the cryptographic overheads associated with HTTPS are pretty significant: RSA ops for session establishment, and then RC4 and MD5 operations (these are the algorithms used for SPECweb2005 anyway) to secure and authenticate the subsequent traffic. In fact, looking at the following figures:



Figure 1: Relative costs in an HTTPS transaction for different file sizes. Referenced from here






Figure 2: Typical breakdown of overheads for SPECweb2005 banking





it is apparent that a significant proportion of the total application-level overheads are associated with cryptographic processing. It's therefore not surprising that providing HW support to accelerate cryptographic processing provides a significant performance advantage to the UltraSPARC T2 processor on SPECweb2005 banking...


It's nice to see that the good microbenchmark numbers actually translate into significant gains at an application level....

Monday Apr 14, 2008

The other day I was looking for a C code example illustrating how to leverage the softtoken keystore when directly interacting with the Solaris crypto framework. There's substantial documentation available but I couldn't find a basic example. So here's what I concocted:


1. Configure my softtoken keystore via the command line:


	pktool setpin keystore=pkcs11 
	pktool genkey label=test_key keytype=aes keylen=128 
	pktool list objtype=key 


where the first operation updates the passphrase required to access the keystore. If the keystore doesn't exist, the keystore is first created. The default passphrase is “changeme”. The second operation creates a 128-bit AES key and installs it in the keystore – the label associated with the key is “test_key”. The third operation displays the contents of the keystore, so it is possible to confirm that the key has been created correctly.


2. Use the AES key in the keystore from an application that is directly using the Solaris crypto framework:


#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <security/cryptoki.h>
#include <security/pkcs11.h>

int
main(void)
{
  CK_RV rv;
  CK_ULONG found_keys = 0;
  CK_MECHANISM mechanism;
  CK_OBJECT_HANDLE hKey, key_list[1];
  CK_SESSION_HANDLE hSession;
  CK_UTF8CHAR label[] = {"test_key"};

  /* All-zero IV keeps the example short; use a fresh IV per message in real code */
  unsigned char ivec[16] = {0};
  unsigned char userPIN[] = {"mykeystore"};

  mechanism.mechanism = CKM_AES_CBC;
  mechanism.pParameter = ivec;
  mechanism.ulParameterLen = sizeof (ivec);

  rv = C_Initialize(NULL);
  if (rv != CKR_OK) {
      fprintf(stderr, "C_Initialize: rv = 0x%.8lX\n", (unsigned long)rv);
      exit(1);
  }

  /* Use metaslot i.e. slot 0 */
  rv = C_OpenSession(0, CKF_SERIAL_SESSION | CKF_RW_SESSION,
                     NULL, NULL, &hSession);
  if (rv != CKR_OK) {
      fprintf(stderr, "C_OpenSession: rv = 0x%.8lX\n", (unsigned long)rv);
      exit(1);
  }

  /* Log in using the correct passphrase (length excludes the trailing NUL) */
  rv = C_Login(hSession, CKU_USER, userPIN, sizeof (userPIN) - 1);
  if (rv != CKR_OK) {
      fprintf(stderr, "C_Login: rv = 0x%.8lX\n", (unsigned long)rv);
      exit(1);
  }

  /* Get the key object, where label is the label of the
   * key we wish to leverage
   */
  CK_ATTRIBUTE template[] = {
      {CKA_LABEL, label, sizeof (label) - 1}
  };

  rv = C_FindObjectsInit(hSession, template,
                         sizeof (template) / sizeof (CK_ATTRIBUTE));
  if (rv != CKR_OK) {
      fprintf(stderr, "C_FindObjectsInit: rv = 0x%.8lX\n", (unsigned long)rv);
      exit(1);
  }

  rv = C_FindObjects(hSession, key_list, 1, &found_keys);
  if (rv != CKR_OK) {
      fprintf(stderr, "C_FindObjects: rv = 0x%.8lX\n", (unsigned long)rv);
      exit(1);
  }

  /* End the search operation now that the result has been fetched */
  (void) C_FindObjectsFinal(hSession);

  if (found_keys != 1) {
      fprintf(stderr, "C_FindObjects found %lu objects\n",
              (unsigned long)found_keys);
      exit(1);
  }

  /* Only read the handle once we know exactly one key was found */
  hKey = key_list[0];

  /* Initialize the encryption operation in the session */
  rv = C_EncryptInit(hSession, &mechanism, hKey);
  if (rv != CKR_OK) {
      fprintf(stderr, "C_EncryptInit: rv = 0x%.8lX\n", (unsigned long)rv);
      exit(1);
  }

  .
  .
  .


In the above example it is assumed that the passphrase is set to “mykeystore”.


The keys stored in the Solaris softtoken keystore are encrypted and they are also checked for integrity. The PBKDF2 function defined in PKCS#5 is used for generating the keys from the passphrase.


Wednesday Apr 09, 2008

As I've mentioned in previous entries, Sun's latest UltraSPARC T2 Plus processors, which are launched today, continue to provide hardware acceleration for a wide variety of important cryptographic operations.


Acceleration is provided in an identical manner to the original UltraSPARC T2 processor - each core has its own hardware cryptographic accelerator that provides support for public-key operations (RSA, DSA, DH, ECC), bulk ciphers (RC4, DES, 3DES, AES-{128/192/256}) and secure hashes (MD5, SHA-1, SHA-256). For the bulk ciphers the currently supported chaining modes are ECB, CBC and CFB64 for DES/3DES and ECB, CBC, and CTR for AES.


The Sun SPARC Enterprise T5240 and T5140 Servers both support 2 UltraSPARC T2 Plus processors for a total of up to 16 cryptographic accelerators per system. Access to the accelerators is via the Solaris Cryptographic Framework (either directly, or indirectly via Java, OpenSSL or NSS) and the framework will automatically load balance requests across the 16 accelerators.


For both the T5240 and the T5140 the accelerators provide an aggregate throughput of up to 80Gb/s of AES-128 (enabling wire-speed encryption), and over 70,000 RSA-1024 sign operations/sec. And this performance can be delivered while the processor is largely idle and available for other processing, essentially eliminating the normally significant overheads associated with crypto processing (zero-cost security!).


Friday Mar 07, 2008

An interesting blog entry can be found here on how to modify OpenSSL to use the UltraSPARC T2 HW crypto accelerators (assuming you don't just want to use the version in Solaris 10 that has already been modified to take advantage of the accelerators).


If you don't use the EVP functions that are engine aware, but instead use some of the low-level functions, it's pretty easy to modify these to use the accelerators too. Just as a simple experiment I modified the latest version of OpenSSL to use the UltraSPARC T2 crypto HW for AES-CBC operations, by modifying AES_cbc_encrypt() in aes_cbc.c. Things work fine, and you see a very nice performance improvement! I'll try and clean up my hacked code and post it in the next few days.



UPDATE (2008/03/11):
Pointers to patches for the latest versions of OpenSSL can be found here. The support provided via the PKCS11 engine can be determined via:


/usr/sfw/bin/openssl engine -c -t

Thursday Nov 29, 2007

One interesting benefit of using the UltraSPARC T2 hardware cryptographic accelerators is that many traditional side-channel attacks are impossible.

For instance, many cache-based attacks are not feasible, and a number of timing attacks are also thwarted.

While this immunity can be achieved in software, there is typically a significant performance penalty involved. With the hardware crypto accelerators, not only is immunity to many potential attacks achieved, but vastly improved performance is also delivered.

I will try and post more details on my reasoning shortly.

Wednesday Nov 14, 2007

In the previous post, the performance of userland offloads was discussed. For a userland offload, the source data is copied from user space to kernel space, and the SPU operates on the kernel-space version. Similarly, the SPU results are dumped to kernel space and are then copied back from kernel to user space before the operation completes.

This copying adds some additional overhead that is not present for kernel offloads to the accelerator. The following performance data is for 8KB AES-128-CBC offloads for a kernel consumer:


Krishna discusses how to use the kernel crypto framework directly in his blog here.

Monday Nov 05, 2007

In a previous post, threads offloading to the accelerators were bound to specific cores (i.e. for the 8-thread result, only 1 accelerator was used, 16-threads used 2 accelerators and so on). For the following, threads are allowed to float across all 8 accelerators (i.e. the 8-thread result uses all 8 accelerators, as does the 16-thread result and so on).




From the above it is apparent that:

  • As previously observed, a single UltraSPARC T2 processor significantly outperforms a dual-socket quad-core Clovertown

  • On the UltraSPARC T2 processor, the vast majority of the performance that can be derived from the hardware accelerators can be delivered using a small number of threads i.e. with the CPU largely idle

  • Using just 8 threads (12.5% CPU utilization) on the UltraSPARC T2 processor it is possible to significantly outperform 8 Clovertown cores which are 100% utilized!



=============================================

Details of competitive benchmarking

System Under Test:
      Clovertown, snv_70
Crypto library:
      OpenSSL 0.9.8e
Compiler:
      /usr/sfw/bin/gcc

To measure crypto performance on the x86 processor, the OpenSSL speed test was used. This is the standard microbenchmark included in the open-source OpenSSL package; it measures raw cryptographic algorithm performance as implemented in the OpenSSL library (libcrypto.so) via OpenSSL's own crypto APIs. For AES the metric is throughput (Gb/s). On multi-core systems, multiple threads are used to achieve the maximum throughput. For AES, performance is quoted for 8KB operations.

For the UltraSPARC T2 processor, the benchmark was developed by Sun to measure maximum AES-128 performance when utilizing the Solaris Cryptographic Framework. On the multi-core T2 processor, multiple threads are used to achieve the maximum throughput. This benchmark is part of a set of cryptographic microbenchmark programs internally developed by the Crypto Product Group (NSN). They measure the performance provided by the Solaris Cryptographic Framework via the PKCS#11 API. On the T2 processor, the framework will utilize the hardware accelerators. For AES, performance is quoted for 8KB operations.


We have two variants of the code. The first repeatedly performs the following for the cipher and object length specified:

C_EncryptInit()
C_Encrypt()

while the second performs

C_EncryptInit()

once and then repeatedly performs:

C_EncryptUpdate()

for the cipher and object length specified.

The 2nd version will obviously deliver higher performance, as the cost of the initialization operation is amortized across all subsequent operations. The OpenSSL speed test is closest to the 2nd variant of the code and it is this variant that we use in these comparisons.



Monday Oct 22, 2007

Following from a prior post discussing the benefits of on-chip accelerators, I just wanted to illustrate how rapidly the minimum 'break-even' object size would increase, even for a processor like the UltraSPARC T2 (where single-strand performance is not the only design-point), as the offload cost is increased:




From the above, it is very apparent that, with long-latency off-chip accelerators, it is difficult to cost-effectively accelerate all but the largest bulk cipher and secure hash operations.

Finally, for more traditional processors, the situation is even bleaker; per-strand SW crypto performance is much higher, causing the break-even points to increase much more rapidly with offload cost.

Thursday Oct 11, 2007


Interesting to note that:

1) The UltraSPARC can hit the HW peak accelerator performance with the majority of the threads idle, allowing other useful work to be conducted while the RSA operations are being performed.


Comparing the crypto performance of a 2-socket quad-core system against a single-socket UltraSPARC T2 processor shows the very significant performance advantage this CMT processor has over more traditional processors.

Thursday Sep 13, 2007

In the OpenSSL demos/sign subdirectory there is a simple demo program (sign.c) that signs and verifies a short message, leveraging RSA.

The modifications required in order to offload the RSA operations to the accelerator are fairly simple. At the start of main, the following is required to instruct OpenSSL to leverage the PKCS11 engine:

  ENGINE *e;

  ENGINE_load_builtin_engines();
  e = ENGINE_by_id("pkcs11");
  if(!e) exit(1);
  ENGINE_set_default_RSA(e);

[For reference, the modified application can be found here]

It's also necessary to use the version of OpenSSL that ships with Solaris:

cc -fast -I /usr/sfw/include -L /usr/sfw/lib -lcrypto sign.c -o sign.out

You can check that the HW accelerators were utilized via kstat:

kstat -m ncp | grep rsa

If you check the counters before running the test:

 kstat -m ncp | grep rsa
	rsaprivate                      33003
	rsapublic                       5

and after running the test:

  kstat -m ncp | grep rsa
	rsaprivate                      33004
	rsapublic                       6

it is apparent that both the sign and verify operations were offloaded to the HW accelerators.

Basically, as long as you are using the EVP_ functions, rather than using the low-level OpenSSL functions directly, it is a simple matter to modify an application to use the accelerators.

Wednesday Sep 12, 2007

The UltraSPARC T2 hardware crypto features are exposed via 3 drivers in Solaris:

1) ncp - similar to the UltraSPARC T1; handles RSA, DSA, DH and ECC (more details can be found here)

2) n2cp - handles bulk ciphers and hashes [more details can be found here (the supported modes of operation are also detailed)]

3) n2rng - access to the HW random number generator (more details can be found here )

Tuesday Sep 11, 2007

Very detailed info on the UltraSPARC T2 cryptographic accelerators can be found here on the OpenSPARC website (the pertinent info can be found in chapter 21 of the doc).

Sunday Sep 02, 2007

In a previous post, just to illustrate how traditional processors compare to the UltraSPARC T2, I posted OpenSSL results for Clovertown and Opteron processors. To broaden the comparison, here's the same data for the Power5 processor:

On a 1.9GHz P5, each core seems capable of:
 RSA-1024 (sign)   : 275 Ops/sec
 AES-128-cbc (8KB) : 80 MB/sec

I would expect the 4.6GHz Power6 to do better, but even if one (generously) assumes linear scaling, it's still a drop in the ocean compared to the per-core performance of the UltraSPARC T2:

RSA-1024 (sign)   :  4.6K Ops/sec
AES-128-cbc (8KB) :  640MB/sec

This blog copyright 2008 by sprack