Today's Page Hits: 285
This page validates as XHTML 1.0, and will look much better in a browser that supports web standards, but it is accessible to any browser or Internet device. It was created using techniques detailed at glish.com/css/.
Optimizing Crypto for Intel Nehalem
Sun Blade X6275 is one example platform with 2 Intel Nehalem (Xeon 5500) processors with 4 cores, 8 threads each |
Amitabha Banerjee of the Performance Group wrote 4 blog articles about improving OpenSolaris performance on Intel's "Nehalem" processor. Nehalem is the codename for Intel Xenon 5500 series processors. In this article I want to highlight how we improved performance and refer to his blogs.
What's different about Intel Nehalem? Currently-released Intel Nehalem processors ("Intel Core i7") have 4 cores and 8 threads using 45nm process. But more importantly, the instructions are executed differently from not only AMD64 (as expected), but also older Intel 64 processors. Nehalem has lookahead speculative execution optimizations. These optimizations predict what instructions will probably execute in the future ("branch prediction") and executes instructions ahead of time on a preliminary basis ("translation lookaside buffer" or TLB). If they are in fact executed, the results are available immediately. If they are not executed, or the onboard data cache is invalidated with writes, the results are thrown out and instructions reexecuted. One way to optimize use of this processor is to carefully arrange instructions that doesn't invalidate (write to) memory soon after it is used (read).
Performance Improvement Process Amitab describes how he optimized OpenSolaris cryptography (and therefore SSL/HTTPS web transactions) for Nehalem processors. Basically the steps are:
The last step was the most time consuming, and involved ensuring not only speed, but that it still produced correct results. These optimizations were all with C source. Instead of hand-coding assembly, he would re-code C source. The source would be more complex, but faster, and certainly not as complex as assembly!
I integrated these optimizations, and added my own ideas to improve performance further, along with suggestions from others during coding, regression testing, and code review.
So here's Amitab's blogs:
Related blogs While I'm at it, I'd also like to refer to 2 previous blogs about OpenSolaris x86-64 crypto performance:
Posted at 06:02PM Jun 04, 2009 by DanX in Solaris | Comments[8]
In the new code proposed in CR6823193, the second loop looks valid but strange. It is just the addition of two bignums, but I am sure that if you think of it this way you won't implement it that way...
Also, modifying c[i] in the second loop is strange, it means writing to memory something that doesn't need to be (unless maybe the compiler can detect that the values written in c are never read later?).
Posted by Marc on June 04, 2009 at 11:54 PM PDT #
Marc,
Yes the code seems strange. Optimizing code often makes it less understandable. The "for" loop was split into 2 "for" loops. That's because the assignment "c = 1" was preventing the TLB from executing instructions in advance because c was always being modified. Creating an array c[] solved that problem (as each loop iteration modifies different memory).
The "c[i] = 1" assignment in the second "for" loop is used above in the "while (rr[j] < c[i]" test.
- Dan
Posted by Dan Anderson on June 05, 2009 at 08:05 AM PDT #
Oh I understand about the optimization. What I am saying (if you reread my post) is that there seems to be room for improvement in the second loop, if you want to optimize the code a bit further.
Posted by Marc on June 05, 2009 at 11:37 AM PDT #
Hmmm. Well I don't see anything being assigned that's unused, c[i] is used in the while loop immediately after it's set and rr[] is used later in the function (not shown).
Posted by Daniel Anderson on June 05, 2009 at 12:18 PM PDT #
For c[i], it is used locally, but not later. Storing it in a local variable means it will just be in a register as long as needed. Storing it in c[i] is more likely to mean writing it to memory, which is slower. But this is a minor point.
Most importantly, this second loop looks like an addition between 2 bignums where you add one limb of the second bignum to the first, propagate the carry as far as it will go, then move on to the second limb, add it, propagate as far as it will go, etc. You probably have a bignum_add function somewhere that does something closer to the schoolbook addition (no nested loops, coded in asm to use the adc instruction).
Posted by Marc on June 05, 2009 at 08:33 PM PDT #
Marc,
OK, now I understand your point. Function big_mul_add_vec() is written in amd64 asm and uses adcq. It is called by the Montgomery Multiply big_mul().
If the entire big_mul() function is recoded in assembly, it will probably improve RSA a lot, since a lot of RSA is spent in bignum Montgomery Multiplication. Assembly recoding is time-consuming, but it may be worth it.
- Dan
Posted by Dan Anderson on June 05, 2009 at 09:34 PM PDT #
Depending on the exact interface of big_mul_add_vec(), my point was that it may be possible to replace the second loop by a single call to that function (without coding any new asm). Anyway, good job on the optimizations that have already been done.
Posted by Marc on June 07, 2009 at 12:43 AM PDT #
Er, ignore my last comment, or at least s/big_mul_add_vec/big_add/, I did not think before posting, sorry.
Posted by Marc on June 07, 2009 at 12:46 AM PDT #