Hot Networking Technologies

Improving RC4 encrypt performance.

Tuesday Mar 24, 2009

RC4 is one of the best-known and commonly used ciphers. It is a very small and simple algorithm which can be executed very fast. The RC4 encrypt routine (as was in OpenSolaris) is the following:

Inputs: in (input stream), out: (output stream), key: (RC4 key), len: length of the streams.

Version 1
i = key->i;
j = key->j;
arr = key->arr;
for (ii = 0; ii < len; ++ii) {
++i;
ti = arr[i];
j = j + ti;
arr [i] = arr[j];
arr[j] = ti;
out[ii] = in[ii] ^ arr[(arr[i] + arr[j]) & 0xff]; /* Line 1 */
}

We went through an interesting exercise trying to optimize this loop. We analyzed the performance of this loop using er_kernel/ analyzer tools available in Sun Studio 12. Here is an output from analyzer which shows a snapshot from analyzer depicting the stalls (marked in green) from Line 1. Two stalls are observed during load operations, one before the arr[i] + arr[j] add in Line 1, and the other before the XOR operation in Line 1.

Looking through RC4 encrypt code available in OpenSSL (rc4_enc.c), we observed that they had the steps a little differently. Adopting that into OpenSolaris, the code would be:

Version 2
i = key->i;
j = key->j;
arr = key->arr;
for (ii = 0; ii < len; ++ii) {
++i;
ti = arr[i];
j = j + ti;
tj = arr[j];
arr[j] = ti;
arr [i] = tj;
out[ii] = in[ii] ^ arr[(ti + tj) & 0xff]; /* Line 1 */
}

This version stores the contents of arr[i] and arr[j] in the initial loads, and re-uses them in Line 1. The compiler was not able to recognize and do the same automatically for us because arr is a pointer reference. Analyzing this code with er_kernel/ analyzer, we get the following output:

Now the stalls have moved to the XOR step. The XOR depends on arr[(ti + tj) & 0xff] to be computed, before the same may be executed. This leads to the stall.

We can re-arrange the loop in the following way in order to avoid the stall. This code is now available at OpenSolaris.org and in the latest version of Solaris Express Community Edition (Nevada) available .

Version 3
++i;
ti = arr[i];
j = j + ti;
tj = arr[j];
arr[j] = ti;
arr[i] = tj;
arr_ij = arr[(ti + tj) & 0xff];
--len;

for (ii = 0; ii < len;) {
++i;
ti = arr[i];
j = j + ti;
tj = arr[j];
arr[j] = ti;
arr[i] = tj;

/* save result from previous loop: */
out[ii] = in[ii] ^ arr_ij;

++ii;
arr_ij = arr[(ti + tj) & 0xff];
}
/* save result from last loop: */
out[ii] = in[ii] ^ arr_ij;

This code spaces out the computation of arr_ij and the XOR operations, thereby avoiding the stall and allowing for more pipelining. Analyzing this code in er_kernel, we can observe that the stalls for the XORs are absent. There are still a couple of stalls, which are hard to avoid because of the dependency between the steps of RC4 encrypt. Here is the output.

The following graph shows the improvement that we get for RC4 encrypt from Version 1 to Version 3 on a Sun Fire X4150 server. It is interesting to observe that a 15% performance gain can be achieved by making the code more efficient to avoid stalls.

[3] Comments
Like this post? del.icio.us | furl | slashdot | technorati | digg

About RSA decrypt performance

Tuesday Mar 24, 2009

We started looking at some common cryptography algorithms and investigate their performance on our x86 based servers about two months ago. Crypto performance is important, particularly because x86 processors do not have the crypto hardware accelerator that the CMT systems have.

RSA is the defacto standard for public key cryptography and is widely used for the key-exchange step in the Transport Layer Security (TLS)/ Secure Socket Layer (SSL) protocols. A client which desires a secure connection with the server negotiates a symmetric-key algorithm such as RC4, Triple DES, AES, etc. Then, in order to send the cipher, it encrypts it using the public key of the server. Only the server can decrypt this using its private-key. Following the key exchange, a secure channel is established by encrypting every communication with the cipher.

Turns out that RSA decrypt is one of the most expensive crypto operations that the server performs. RSA decrypt involves modular exponentiation, i.e computing a^q mod n where a,q, and n - all are huge sized numbers. RSA1024 requires this operation with 512-bit sized a,q and 1024-bit sized n. The operation can be simplified a little bit with a technique known as Montogomery reduction, but that too required 512-bit multiplications. Thus improving RSA decrypt is often the key to improving the scalability for how many sessions a web server can support.

We started out comparing how RSA decypt provided as part of the Solaris Crypto Framework(SCF) with that of OpenSSL, a publicly available SSL library. It turned out that OpenSSL 0.9.8j has much improved RSA-decrypt performance, compared to OpenSSL 0.9.8a, the version shipped with Solaris and some flavours of Linux. A comparison of the two versions using the OpenSSL speed benchmark, compiled with Sun Studio 12 on a Sun Fire X4150 server is shown below. It is very interesting to note that with the same hardware, RSA decrypt performs 3 times faster with a new version of OpenSSL.



So what buys RSA decrypt such a lot of improvement? Looking through OpenSSL changelog, bn_mont_mul(), the montogomery multiplication operation was introduced as an assembly routine for x86_64 in 0.9.8h. Montogomery multiplication is the most vital step in RSA decrypt; it allows for easy modular multiplication (a*b mod n) arithmetic by multiplications using bit-shifting operations. An optimized assembly implementation of the same in OpenSSL 0.9.8j leads to 3x faster RSA decrypt. This assembly is available at /crypto/bn/asm/x86_64-mont.pl in the OpenSSL source.

When we compared the OpenSSL 0.9.8j number with the speed of RSA decrypt provided by SCF (measured using the pkcs11 engine), we found that the latter was about 2x slower. The pkcs11 engine can be used with OpenSSL with the following command:

/usr/bin/amd64/openssl -engine pkcs11

To measure RSA performance:

/usr/bin/amd64/openssl speed rsa -engine pkcs11

Investigating the performance of the same with the er_kernel/analyzer tools available in Sun Studio 12, we found that the biggest CPU consumer was the big_mul_add_vec32() routine, which does the multiplications. Dan Anderson worked through a number of changes as part of the fix for CR 6799218 in which the bignum data structure was changed to support 64-bit chunks instead of 32-bit chunks. With 64-bit multiplication support in 64-bit processors, the number of multiplicative operations reduces by a factor of 4, and this gave us nearly 1.8x time gain for RSA decrypt. We also found an unnecessary copy operation in the big_mul() routine, and took it out as a fix for CR 6811474. You can check out the new code of big_mul() at OpenSolaris.org. The following figure shows RSA decrypt performance with OpenSolaris pkcs11 library on a X4150 server, before and after the change.



This exercise shows how big a difference a performance tuned software can make. There are usually a few hot spots of code which get executed a lot more times than the others. In crypto operations, hot spots are very prominent. As Amdahl's law suggests, it is important to prioritize performance improvement on areas which would give the best overall system improvement, and targeting hotspots is a great example of this. In my next blog, I will discuss how we tuned another crypto operation, RC4 encrypt for better performance.

[0] Comments
Like this post? del.icio.us | furl | slashdot | technorati | digg

Trip Report from IOM-2009

Thursday Feb 19, 2009

Last weekend I participated at IOM-2009, a workshop on The Influence of I/O on Microprocessor Architecture, co-located with the High Performance Computer Architectures (HPCA) conference-2009 at Raleigh, NC. Ram Huggahalli, Principal Engineer, and Platform Architect in Communication Technology Lab, Intel served as the workshop chair. I must say that this workshop was very well organized. There were 8 presentations, 45 minutes each, leaving ample time for valuable discussions and feedback.

Challenges
The workshop addressed the challenge of how to provide more I/O to systems, particularly with 40 GigE and 100 GigE getting standardized very soon. Here are the challenges as listed by the workshop chair.

1) Making I/O an integral part of chip/system design instead of having it as a peripheral device. Networking and other I/O needs to be integrated with microprocessor design instead of being thought as a peripheral device.
2) Making some kind of revolutionary change in the way network I/O is performed because (i) Memory access wont become faster, (ii) Demand for I/O (networking) will only increase and the current way will probably not scale.

Summary of viewpoints:
Main viewpoints expressed in the workshop:
1) A presentation from Sandia National Labs demonstrated experimental data on pre-release Nehalem systems. Main points: (i) Nehalem showing much improved network I/O because of reduced local memory latency,
(ii) NUMAness in Nehalem plays major role, many benchmarks show great difference using local memory than using remote memory. a list of commercial benchmarks for which performance varies a lot depending on whether memory is located locally or remotely. Memory is controlled in linux with numactl.

2) Main stumbling bottleneck for Network I/O: Avoiding copy required from user space to kernel space for transmit and vice-versa for receive. Strategies presented:
(i) Cache injection (2 papers: 1 from IBM Labs, Zurich, and 1 from Univ. of Victoria): Strategy is to inject data received directly into cache so that when receive() is issued, there is no cache-miss and data is readily available. Challenge is that data that is displaced in cache will cause cache-misses, and therefore it is hard to come up with an algorithm which is suited to all workloads. Presenters from Univ. of Victoria presented strategies in the context of MPI running on a IBM Cell processor.

(ii) IOMMU to drive hardware accelerators (1 paper from Univ. Pittsburgh, Intel): Using IOMMU hardware to access physical memory by supplying virtual address so that a hardware accelerator device can directly access memory. The presenter demonstrated this approach in the context of a USB drive.

(iii) Creating DMA Cache (Chinese Academy of Sciences) : Having separate cache to keep I/O data before it can be read by application. As a result primary cache is not affected.

(iv) Intelligent NICs (Virginia Tech): NICs which can interact with the CPU and transfer data when required.

(3) Other Interesting papers:
(i) Using network processors to create virtual NICs. (Univ Massachussetts, Lowell)
(ii) Active end-system analysis for finding bottleneck rate for receive network I/O: This work is mainly from UCDavis, while I have contributed to the theoretical part. This work demonstrates the importance of pacing on the transmit side, and illustrates how to compute the bottleneck at the receiver using a stochastic model. Slides are available here.

[0] Comments
Like this post? del.icio.us | furl | slashdot | technorati | digg

CPU Utilization in networking - the TCP transmit path

Wednesday Dec 17, 2008

I am planning to write a series of blogs on CPU Utilization in networking. Here is the first one focusing on the TCP transmit path. This blog article primarily considers Solaris, although equivalent concepts can be applied to other Unix based operating systems.

Let us first examine CPU utilization in Solaris. Here are the results for network I/O using TCP transmit. Our System Under Test(SUT) is a Sun Fire X4440 server, a 16-core, 4-socket AMD Opteron based system with 64 GB memory, and 1 Myricom 10 Gig Ethernet card. It is connected to 15 Sun v20z clients (using one on-board 1 GigE NIC on each system), via a Cisco Catalyst 6500 switch. We use uperf 1.0.2 in these measurements. The profile is a bulk throughput oriented profile, while the write size is varied. The results are collected using the Network Characterization Suite (NCS) that we have developed in our group, and which will be open-sourced soon. Very briefly, the cycles per second is measured using DTrace, and describes how many CPU cycles are required to transmit every KByte of data. usr/sys/idle is measured using vmstat, while intr is measured using another Dtrace script. Throughput is reported by uperf at the end of the 120 second run.

Here are the results:

TCP Transmit tests using uperf with write size = 64K
#conn    Wnd     (usr/sys/intr/idle)     cycles/Kbyte    Throughput
1        256k    (0/3/0/96)              11792          928.11Mb/s
4        256k    (0/4/0/95)              3365           3.71Gb/s
8        256k    (0/7/1/92)              2815           7.42Gb/s
32       256k    (0/8/1/91)              2793           9.22Gb/s
100      256k    (0/9/1/90)              3161           9.24Gb/s
400      32k     (0/24/3/74)             8392           8.93Gb/s
1000     32k     (0/24/3/74)             8406           8.80Gb/s
2000     32k     (0/31/4/68)             12869          7.01Gb/s
4000     32k     (0/32/5/67)             14418          6.44Gb/s
6000     32k     (0/35/5/63)             17053          6.37Gb/s

TCP Transmit tests using uperf with msg size = 8K
#conn    Wnd     (usr/sys/intr/idle)     cycles/Kbyte    Throughput
1        256k    (0/4/0/95)              14259          896.23Mb/s
4        256k    (0/5/0/93)              4276           3.72Gb/s
8        256k    (0/10/1/89)             4385           7.13Gb/s
32       256k    (0/14/2/85)             4951           8.46Gb/s
100      256k    (0/16/2/83)             5515           8.11Gb/s
400      32k     (0/29/3/69)             10738          7.46Gb/s
1000     32k     (0/31/4/68)             11388          7.31Gb/s
2000     32k     (0/36/6/62)             16818          6.44Gb/s
4000     32k     (0/37/7/61)             14951          6.28Gb/s
6000     32k     (1/34/5/64)             18752          6.19Gb/s

Section: TCP Transmit tests using uperf with msg size = 1K
#conn    Wnd     (usr/sys/intr/idle)     cycles/Kbyte    Throughput
1        256k    (0/4/1/95)              13450          915.02Mb/s
4        256k    (0/21/4/77)             19239          3.53Gb/s
8        256k    (0/38/6/60)             20890          5.42Gb/s
32       256k    (0/46/8/52)             18792          5.97Gb/s
100      256k    (0/48/8/50)             21831          5.77Gb/s
400      32k     (1/58/9/40)             24547          5.81Gb/s
1000     32k     (1/53/9/45)             31557          4.73Gb/s
2000     32k     (1/51/9/47)             38520          3.89Gb/s
4000     32k     (1/58/11/40)            40116          3.98Gb/s
6000     32k     (1/53/9/45)             40209          3.97Gb/s
The key metric to see above is cycles/Kbyte. We would like to spend as few cycles as possible for a bytes of transmission. So we want this number to be as low as possible. From the results above, we can infer the following about CPU utilization:

(1) The CPU utilization drops with increase in number of connections. The single connection case is an exception.

(2) The CPU utilization drops with a smaller write size.

(3) Most of the CPU is consumed in the kernel (sys). With increase in number of connection, the usr column goes up too due to the overhead of so many threads (In uperf, each connection is established on an independent thread).

(4) The throughput follows the same trend as CPU Utilization.

Most of the above is on expected lines, but to understand this better, let us profile CPU utilization for the case of 4000 connections doing 8k sized writes, and the case of 100 connections doing 32k writes. We use dtrace based er_kernel to gather the profile data, and then er_print to view the CPU utilization. er_print displays both inclusive (including function calls origination from the mentioned function), and exclusive (excluding all other function calls). The following syntax of er_print is used to list functions and sort them in order of inclusive CPU utilization.
er_print -metrics i.%kcycles:e.%kcycles -sort i.%kcycles -function pae4440oneportmyri.er/
Filtering through the data, we gather the following CPU utilization for various function calls.

For 4000 connections with 8K sized writes (Throughput=6.28 Gbps):

FunctionInclusive CPU Utilization %
write()5.28%
tcp_wput_data()4.88%
myri10ge_onetrack()6.3%
tcp_rput_data()7%

For 100 connections with 64K sized writes (Throughput=9.24 Gbps):

FunctionInclusive CPU Utilization %
write()4.62%
tcp_wput_data()1.82%
myri10ge_onetrack()1.01%
tcp_rput_data()2.02%

Ratio of CPU Utilizations normalized to bandwidth:

FunctionNormalized CPU Utilization Ratio (4000 connections, 8K writes/ 100 connections, 32 K writes)
write()1.68
tcp_wput_data()3.94
myri10ge_onetrack()9.17
tcp_rput_data()5.09

Comparing the normalized values, the cost of the system call write() doesn't change much. Copying becomes a little more efficient with a increase in write size from 8K to 64K. Increase in number of connections is not expected to add to the cost of write().

tcp_wput_data() turns more expensive as the effectiveness of Large Segment Offload (LSO) decreases with higher number of connections, resulting in increased number of function calls and reduced efficiency. Please read my blog about LSO on the Solaris networking stack here.

The driver send routine myri10ge_one_track() turns more expensive due to a combination of smaller LSO segments, and increased cost of DMAing the higher number of segments. We observer that in terms of increase, the cost of driver send operations increases the most (>9x).

Finally, with higher number of connections, we observe a TCP ACK ratio of 2:1, instead of the maximum of 8:1 that is possible on a LAN. A lower ACK ratio leads to higher number of ACK packets and subsequently, a higher cost of tcp_rput_data().

In conclusion, CPU efficiency in the transmit path may reduce due to the following factors.

(i) Poor LSO efficiency: This causes higher number of function calls for driving the same volume of data.
(ii) Higher number of DMA calls: More number of DMA operations leads to reduced CPU efficiency since each DMA operation would require binding and later freeing DMA handles which are expensive operations.
(iii) Poor ACK ratio: A 8:1 ACK ratio leads to lower volume of TCP ACKs and frees CPU cycles. The ACK ratio is seen go reduce with increase in connections.

[3] Comments
Like this post? del.icio.us | furl | slashdot | technorati | digg