High performance software crypto on CMT processors
While cryptography is typically viewed as computationally
intensive (and so less well suited to CMT processors), software
implementations of common cryptographic algorithms can be readily
optimized to excel on CMT processors. Current software
implementations have been optimized for traditional processors, with
multiple lookup tables sized to all fit in a processors small level-1
caches. It is this use of multiple small tables that leads to the
high computational overheads associated with most cryptographic
implementations -- due to the significant arithmetic operations
needed to manipulate access to the tables and recombine the results
from the various tables.
For example, consider the Kasumi
algorithm, which is essential to 3G mobile telephony. In Kasumi, a
block is 8-bytes, the key is 128-bits (although it is expanded to a
1024-bit key schedule before use), and processing consists of 8
rounds per block. While a variety of operations are performed per
block, the most costly operation is termed FI and consists of the
following (in C notation):
nine = (u16)(in>>7);
seven
= (u16)(in&0x7F);
nine = (u16)(S9[nine] ^ seven);
seven =
(u16)(S7[seven] ^ (nine & 0x7F));
seven ^= (subkey>>9);
nine ^= (subkey&0x1FF);
nine = (u16)(S9[nine] ^ seven);
seven = (u16)(S7[seven] ^ (nine & 0x7F));
in =
(u16)((seven<<9) + nine);
return( in );
where in
and subkey are two-byte variables, S9 is a 512-element lookup table
and S7 is a 128-element lookup table. This operation is performed
three times per round, for a total of 24 times per block. Each FI
operation requires 22 instructions (for SPARC), for a total of 576
FI-derived instructions per block. Given the abundance of logical and
shift operations, it is apparent that superscalar processors will
perform this function very well, with an Instructions Per Cycle (IPC)
of 2.5 or more. In contrast, the Niagara single-strand IPC is around
0.65. Further, due to the compute intensive nature of the code, the
single Niagara strand uses around two thirds of the processor core's
issue resources. As a result, performance does not scale as
additional Kasumi threads run on a core.
To overcome this
problem, the implementation can be optimized to radically reduce the
instruction count. A reduction in instruction count may be achieved
by replacing large parts of the FI function using a large lookup
table. In the original Kasumi code, the 16-bit elements are divided
into two smaller elements, one 7-bits and one 9-bits. These smaller
elements are processed independently and the results combined. While
this ensures that the lookup tables are small, significant logical
and arithmetic operations are required to split the 16-bit elements
and later recombine the smaller 7-bit and 9-bit elements back into
the 16-bit elements. Significant computational saving may be achieved
by processing an entire 16-bit element at once, using large lookup
tables, as shown below:
t0 = LT0[in];
t0 = t0 ^
subkey;
in = LT1[t0];
The new lookup tables (LT0 and
LT1) are now much larger, each being composed of 65536 2-byte
elements. Note that the lookup tables are constant, may be
precomputed, and are independent of the keys. However, using this
approach, the FI function now only requires five instructions, a four
times reduction from the original implementation. Further note that
in both the optimized and the original code, the lookup table
accesses are dependent and cannot be performed in parallel or
prefetched in advance.
The lookup tables that once fitted
in the L1 cache are now much larger and will now largely reside in
the L2 cache -- this instruction count reduction has been at the
expense of increased memory stalls, but here we are laying to the
strengths of a CMT processor. As a result, it would appear that the
performance of the code will remain largely unchanged, having traded
increased instruction count for increased memory stalls. This
optimization technique is beneficial for at least two reasons. First,
MT (multithreading) performance is improved. For the initial
implementation, due to the large computational requirements of the
algorithm, as additional strands are leveraged, aggregate core
performance improves very little. Given that a single strand is
capable of consuming almost all of a processor core’s
resources, as additional VT/SMT strands are leveraged, these strands
rapidly start to deprive the other strands of resources, and the
aggregate core performance is improved very little. In contrast, in
the optimized version, the strands spend most of their time stalled
waiting for accesses to the lookup tables to complete and consume a
much smaller fraction of a processor core’s resources. As a
result, as the number of strands is increased, performance scales
almost linearly. Indeed, for Niagara, per-core Kasumi performance is
around 8 times the performance of a single strand, and per-chip
Kasumi performance is close to 64X single-strand performance. Indeed,
single-core Kasumi performance is around 1.3X the performance of a
single-core of a 3GHz Xeon processor.

Posted by c0t0d0s0.org on July 12, 2008 at 01:02 AM PDT #
Really great post ! thanks
Posted by seb on July 16, 2008 at 06:12 AM PDT #