Cost-effective crypto offloading
Following on from a prior post discussing the benefits of on-chip accelerators, I wanted to illustrate how rapidly the minimum 'break-even' object size (the smallest object for which offloading beats doing the work in software) increases as the offload cost increases, even for a processor like the UltraSPARC T2 (where single-strand performance is not the only design point):

From the above, it is clear that, with long-latency off-chip accelerators, it is difficult to cost-effectively accelerate all but the largest bulk cipher and secure hash operations.
Finally, for more traditional processors, the situation is even bleaker: per-strand software crypto performance is much higher, so the break-even points rise much more rapidly with offload cost.
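
To make the relationship concrete, here is a minimal sketch of a break-even calculation. It assumes a simple linear cost model that is my own simplification, not necessarily what was used to produce the data above: software takes size / sw_rate, while offloading takes a fixed offload cost plus size / hw_rate. The function name, parameter names, and the sample rates are all hypothetical and chosen only for illustration.

```python
def break_even_bytes(offload_cost_s, sw_bytes_per_s, hw_bytes_per_s):
    """Object size (bytes) at which offloading takes the same time as software.

    Assumed model (a simplification):
        software time = n / sw_bytes_per_s
        offload time  = offload_cost_s + n / hw_bytes_per_s
    Setting the two equal and solving for n gives the break-even size.
    """
    if hw_bytes_per_s <= sw_bytes_per_s:
        # The accelerator is no faster than software, so offload never pays off.
        return float("inf")
    return offload_cost_s / (1.0 / sw_bytes_per_s - 1.0 / hw_bytes_per_s)


if __name__ == "__main__":
    HW_RATE = 1e9  # hypothetical accelerator throughput, bytes/s
    # Hypothetical per-strand software rates: a slow strand and a fast one.
    for label, sw_rate in (("slow strand", 50e6), ("fast strand", 400e6)):
        for cost in (1e-6, 10e-6, 100e-6):  # hypothetical offload costs, seconds
            n = break_even_bytes(cost, sw_rate, HW_RATE)
            print(f"{label}: offload cost {cost * 1e6:.0f} us -> break-even {n:.0f} bytes")
```

Under this model the break-even size grows linearly with the offload cost, and the slope steepens as per-strand software throughput approaches the accelerator's throughput, which is the effect described above for more traditional processors.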
