Improving RC4 encrypt performance.
Tuesday Mar 24, 2009
RC4 is one of the best-known and commonly used ciphers. It is a very small and simple algorithm which can be executed very fast. The RC4 encrypt routine (as was in OpenSolaris) is the following:
Inputs: in (input stream), out: (output stream), key: (RC4 key), len: length of the streams.
Version 1
i = key->i;
j = key->j;
arr = key->arr;
for (ii = 0; ii < len; ++ii) {
++i;
ti = arr[i];
j = j + ti;
arr [i] = arr[j];
arr[j] = ti;
out[ii] = in[ii] ^ arr[(arr[i] + arr[j]) & 0xff]; /* Line 1 */
}
We went through an interesting exercise trying to optimize this loop. We analyzed the performance of this loop using er_kernel/ analyzer tools available in Sun Studio 12. Here is an output from analyzer which shows a snapshot from analyzer depicting the stalls (marked in green) from Line 1. Two stalls are observed during load operations, one before the arr[i] + arr[j] add in Line 1, and the other before the XOR operation in Line 1.
Looking through RC4 encrypt code available in OpenSSL (rc4_enc.c), we observed that they had the steps a little differently. Adopting that into OpenSolaris, the code would be:
Version 2
i = key->i;
j = key->j;
arr = key->arr;
for (ii = 0; ii < len; ++ii) {
++i;
ti = arr[i];
j = j + ti;
tj = arr[j];
arr[j] = ti;
arr [i] = tj;
out[ii] = in[ii] ^ arr[(ti + tj) & 0xff]; /* Line 1 */
}
This version stores the contents of arr[i] and arr[j] in the initial loads, and re-uses them in Line 1. The compiler was not able to recognize and do the same automatically for us because arr is a pointer reference. Analyzing this code with er_kernel/ analyzer, we get the following output:
Now the stalls have moved to the XOR step. The XOR depends on arr[(ti + tj) & 0xff] to be computed, before the same may be executed. This leads to the stall.
We can re-arrange the loop in the following way in order to avoid the stall. This code is now available at OpenSolaris.org and in the latest version of Solaris Express Community Edition (Nevada) available .
Version 3
++i;
ti = arr[i];
j = j + ti;
tj = arr[j];
arr[j] = ti;
arr[i] = tj;
arr_ij = arr[(ti + tj) & 0xff];
--len;
for (ii = 0; ii < len;) {
++i;
ti = arr[i];
j = j + ti;
tj = arr[j];
arr[j] = ti;
arr[i] = tj;
/* save result from previous loop: */
out[ii] = in[ii] ^ arr_ij;
++ii;
arr_ij = arr[(ti + tj) & 0xff];
}
/* save result from last loop: */
out[ii] = in[ii] ^ arr_ij;
This code spaces out the computation of arr_ij and the XOR operations, thereby avoiding the stall and allowing for more pipelining. Analyzing this code in er_kernel, we can observe that the stalls for the XORs are absent. There are still a couple of stalls, which are hard to avoid because of the dependency between the steps of RC4 encrypt. Here is the output.
The following graph shows the improvement that we get for RC4 encrypt from Version 1 to Version 3 on a Sun Fire X4150 server. It is interesting to observe that a 15% performance gain can be achieved by making the code more efficient to avoid stalls.












Links to the snapshots are broken.
What is the...
There were errors from the permalink. Links should...
Excellent work! Thanks for taking time to exp...