Monday Nov 17, 2008

Playing with libsrtp recently and just experimenting with enhancing the library to use the T2 HW crypto. Generally, strp uses AES counter mode. Looking at the libstrp code, there is a keystream_buffer buffer which is XORed with the packet stream. Once the keybuffer is emptied it is refilled. Currently, the buffer is 128-bits i.e. 1 block. This approach is not too ineffecient when performing AES in SW, but will lead to suboptimal performance when using crypto HW. There are typically some SW overheads associated with accessing the crypto hardware, and so performance generally increases with the size of the object being processed. Accordingly, in libsrtp it is preferable if the keystream_buffer is increased considerably in size (e.g. 8KB) and refills are performed much less frequently.

Browsing a few internet forums, it interesting to note that there is quite a debate on the use of autopar in SPEC results, with some calling for its use to be disallowed. While it is true that when autopar is permitted, the resulting performance is no longer single-thread performance, it nevertheless it an interesting measure of how well a particular processor/compiler combination perform (Even before the common use of autopar, SPEC was never just a measure of performance, but was heavily influenced by the sophistication of the compiler). Further, in today's multi-core processors, where caches & off-chip bandwidth are shared, even single-thread SPEC runs don't give all the of information one needs to fully understand the impact on the processor. Rather, SPEC ratio is merely an indication of of well the processor/system can run a single-instance of an application (i.e. peak single-application performance) and SPECrate is a measure of the peak throughput.


Finally, it can be argued that many HPC codes are amenable to autopar, so its use in SPECfp is relevant. Its use in SPECint is more problematic as most integer codes are more difficult to noticeably accelerate using autopar -- its just unfortunate that SPEC06int includes libquantum.....

Friday Nov 14, 2008

With my recent talk of T2s, CMTs and autopar I thought it might be interesting to provide a link to Sun's recently released OpenSPARC Internals book. This provides detailed information on many of these topics and can be downloaded for free here. I contributed the first chapter, so its probably safe to skip that one :-)

Thursday Nov 13, 2008

Following from previous discussions on the benefits of autopar for SPECcpu, the next logical question is "what kind of benefit does autopar provide on CMT processors like the UltraSPARC T2". On UltraSPARC T2, we have a multitude of hardware strands available and, as previously discussed, the bare-metal inter-thread communication latencies are extremely low. I talked with some of the compiler gurus at Sun and sure enough this analysis had been undertaken for SPECfp. The results are as follows:



Pretty cool! 7 of the suites X benchmarks show some benefit, with 1 showing over 30X speedup, a further 3 showing a benefit of over 10X, and the remainder showing 2-4X improvements. Some of the benchmarks show peak performance when using less than 64-threads. This is not unexpected, as this is an out-of-the-box run and given that T2 has shared pipelines and, like most multi-core processors, shared caches and offchip bandwidth, some tweaking is required to maximize performance.

Tuesday Nov 11, 2008

Continuing the SPECcpu theme, an interesting paper from Intel describing the performance upticks from autopar, SSE and other optimizations can be found here. On a dual-core they show decent gains on 6 benchmarks and slight gains on a further two.


Similarly, using the Sun Studio compiler, 8 benchmarks benefit from autopar, delivering a 16% improvement in the geometric mean on a dual-core processor (as illustrated here).



Wednesday Oct 08, 2008

It is interesting to look at recent SPEC2006 results and compare them with the results from just a year or so ago. In addition to the normal improvements one would expect as compilers become increasingly familiar with this fairly new benchmark suite, autoparallelism is being employed to boost scores on this traditionally single-thread performance benchmark.

For SPECint, the autopar gains seem to be limited to one benchmark – libquantum, while on the FP side, there are several. Looking at libquantum:

1) August 07,    3.00 GHz Intel Dual core, 4MB L2$  (4MB between 2-cores), no autopar : libquantum  31.6
2) August 07,    3.00 GHz Intel Dual core, 4MB L2$  (4MB between 2-cores), autopar :    libquantum  78.9
3) September 08, 3.20 GHz Intel Quad core, 12MB L2$ (6MB between 2-cores), autopar :    libquantum 283
4) September 08, 2.66 Ghz Intel Six core,  9MB L2$ (3MB between 2-cores), autopar :     libquantum 316

For libquantum there are obviously other compiler optimizations being undertaken in these latest benchmark submissions (as evidenced by various interim results e.g. here), but the comparison of (3) and (4) alone illustrate the power of threading – 50% reduction in per core L2$, and a 20% reduction in frequency, but the libquantum score improves by 10%. Indeed, if you factor out the frequency difference, the 4 to 6 score scaling is very good. It is also interesting to note that the new libquantum scores are around 10X higher than the scores for the other benchmarks. In fact, if you assume that (3) would be say 3X lower without autoparallelism support in the compiler, then the processors whole SPECint ratio score (which is a geometric mean of all of the benchmark scores) falls by over 10%. If the libquantum score is reduced to that seen in (1), then the overall score falls by almost 20%!


It is interesting to note that comparing (1) and (3), the overall score has improved from 21 to 29.3 – an almost 40% improvement. However, roughly half of that gain comes from threading libquantum! Of the remainder, some comes from frequency, some comes from compiler optimizations that significantly improve hmmer performance, and the remainder from a number of other small increases.


It is interesting that the majority of the gains on this traditionally single-threaded benchmark that we have observed in the last several processor generations come from multithreading.....

Similarly, for SPECfp, there are a number of benchmarks which benefit significantly from autoparallelism. This is not surprising, as FP workloads are typically more amenable to this style of optimization. Comparing the recent results from a quad-core Opteron both with and without autopar, it looks like there are 4 FP benchmarks that benefit significantly:


bwaves : 3X improvement

cactusADM : 7.6X improvement

gemsFDTD : 2.2X improvement

wrf : 1.48X improvement


Cumulatively, threading these 3 benchmarks delivers about a 30% improvement in the SPECfp score!

Monday Sep 22, 2008

Following from the recent post discussing modifying OpenSSL to enable OpenSSH to take advantage of the UltraSPARC T2 crypto accelerators, I should also mention that it is possible to just use the PKCS11 engine modified OpenSSL that Sun provides. You should use the –with-ssl-engine when you configure OpenSSH. Further, it may just be my mistake, but I am having problems getting OpenSSH to use the PKCS11 engine unless I modify openssl-compat.c. In the unmodified code, ssh_SSLeay_add_all_algorithms() does:

/* Enable use of crypto hardware */
ENGINE_load_builtin_engines();
ENGINE_register_all_complete();

I changed this to:

ENGINE *pkcengine;
/* Enable use of crypto hardware */
ENGINE_load_builtin_engines();
pkcengine = ENGINE_by_id("pkcs11");
ENGINE_init(pkcengine);
ENGINE_set_default_ciphers(pkcengine);

and things started working fine. I need to find some cycles to go back I see if I had things misconfigured.

Thursday Sep 18, 2008

Following from the last entry about recent enhancements to SunSSH to enable it to take advantage of the UltraSPARC T2 cryptographic accelerators, for those who use OpenSSH, its also possible to leverage the T2 cryptographic accelerators. One simple way to achieve this, without modifying OpenSSH itself, is just to use a version of OpenSSL that has been modified to take advantage of the HW crypto; for the standard aes-128-cbc operating mode, the simplest way to achieve this is to modify aes_cbc.c to call libpkcs11. Its about a 10 line modification and can be applied to any version of OpenSSL. I will post the required code later today here.

Monday Sep 15, 2008

Great to see from Jan's recent blog entry that SunSSH has been enhanced to take advantage of the UltraSPARC T2 hardware cryptographic accelerators – see here for more details.

I will spend some time playing with this later this week and report more generally on the performance benefits I observe

Friday Sep 05, 2008

Recent discussion about whether AMD's upcoming SSE5 instructions can be used to significantly accelerate (5X) AES can be seen here. Given they don't seem to provide dedicated AES instructions, its tricky to see how this can be readily achieved -- especially given for AES-CBC even special purpose AES instructions only provide around 6X.....

Any thoughts?



Thursday Aug 21, 2008

A new CMT whitepaper discussing how to accelerate multithreaded applications on CMT processors can be found here. The whitepaper touches on high-performance cryptography on CMT processors and microparallelism techniques.

Wednesday Aug 13, 2008

The busstat tool can be a useful performance tuning aid, allowing one to drill into the load an application is placing on the memory sub-system. However, one note of caution, on the T2/T2+ the bank_busy_stalls counters should not be used, as erroneous results are returned – makes it looks like the application is causing bank busy stalls, even when this is not the case. Future revs of busstat are aware of this, but in the interim, this is a performance counter to ignore when tuning your app.

Friday Jul 11, 2008

On a recent rev of Nevada, I just ran some ECC (elliptic curve cryptography) ubenchmarks, comparing a UltraSPARC T2 using the HW crypto accelerators and a 3GHz Xeon:




These numbers are for ecdsa operations. The Xeon #s are from openSSL speed (optimized compilation), while the T2 #s are generating from interacting directly with the framework. These numbers are for binary curves using Galois field operations.

While cryptography is typically viewed as computationally intensive (and so less well suited to CMT processors), software implementations of common cryptographic algorithms can be readily optimized to excel on CMT processors. Current software implementations have been optimized for traditional processors, with multiple lookup tables sized to all fit in a processors small level-1 caches. It is this use of multiple small tables that leads to the high computational overheads associated with most cryptographic implementations -- due to the significant arithmetic operations needed to manipulate access to the tables and recombine the results from the various tables.

For example, consider the Kasumi algorithm, which is essential to 3G mobile telephony. In Kasumi, a block is 8-bytes, the key is 128-bits (although it is expanded to a 1024-bit key schedule before use), and processing consists of 8 rounds per block. While a variety of operations are performed per block, the most costly operation is termed FI and consists of the following (in C notation):

nine = (u16)(in>>7);
seven = (u16)(in&0x7F);
nine = (u16)(S9[nine] ^ seven);
seven = (u16)(S7[seven] ^ (nine & 0x7F));

seven ^= (subkey>>9);
nine ^= (subkey&0x1FF);
nine = (u16)(S9[nine] ^ seven);
seven = (u16)(S7[seven] ^ (nine & 0x7F));
in = (u16)((seven<<9) + nine);
return( in );

where in and subkey are two-byte variables, S9 is a 512-element lookup table and S7 is a 128-element lookup table. This operation is performed three times per round, for a total of 24 times per block. Each FI operation requires 22 instructions (for SPARC), for a total of 576 FI-derived instructions per block. Given the abundance of logical and shift operations, it is apparent that superscalar processors will perform this function very well, with an Instructions Per Cycle (IPC) of 2.5 or more. In contrast, the Niagara single-strand IPC is around 0.65. Further, due to the compute intensive nature of the code, the single Niagara strand uses around two thirds of the processor core's issue resources. As a result, performance does not scale as additional Kasumi threads run on a core.

To overcome this problem, the implementation can be optimized to radically reduce the instruction count. A reduction in instruction count may be achieved by replacing large parts of the FI function using a large lookup table. In the original Kasumi code, the 16-bit elements are divided into two smaller elements, one 7-bits and one 9-bits. These smaller elements are processed independently and the results combined. While this ensures that the lookup tables are small, significant logical and arithmetic operations are required to split the 16-bit elements and later recombine the smaller 7-bit and 9-bit elements back into the 16-bit elements. Significant computational saving may be achieved by processing an entire 16-bit element at once, using large lookup tables, as shown below:


t0 = LT0[in];
t0 = t0 ^ subkey;
in = LT1[t0];


The new lookup tables (LT0 and LT1) are now much larger, each being composed of 65536 2-byte elements. Note that the lookup tables are constant, may be precomputed, and are independent of the keys. However, using this approach, the FI function now only requires five instructions, a four times reduction from the original implementation. Further note that in both the optimized and the original code, the lookup table accesses are dependent and cannot be performed in parallel or prefetched in advance.


The lookup tables that once fitted in the L1 cache are now much larger and will now largely reside in the L2 cache -- this instruction count reduction has been at the expense of increased memory stalls, but here we are laying to the strengths of a CMT processor. As a result, it would appear that the performance of the code will remain largely unchanged, having traded increased instruction count for increased memory stalls. This optimization technique is beneficial for at least two reasons. First, MT (multithreading) performance is improved. For the initial implementation, due to the large computational requirements of the algorithm, as additional strands are leveraged, aggregate core performance improves very little. Given that a single strand is capable of consuming almost all of a processor core’s resources, as additional VT/SMT strands are leveraged, these strands rapidly start to deprive the other strands of resources, and the aggregate core performance is improved very little. In contrast, in the optimized version, the strands spend most of their time stalled waiting for accesses to the lookup tables to complete and consume a much smaller fraction of a processor core’s resources. As a result, as the number of strands is increased, performance scales almost linearly. Indeed, for Niagara, per-core Kasumi performance is around 8 times the performance of a single strand, and per-chip Kasumi performance is close to 64X single-strand performance. Indeed, single-core Kasumi performance is around 1.3X the performance of a single-core of a 3GHz Xeon processor.


Monday Jun 30, 2008

I've been gradually expanding the crypto wiki (which can be found here); adding additional info and some code examples. Please let me know what additional information would be useful to add, how the wiki could be improved, and even add your own thoughts....



This blog copyright 2009 by sprack