Optimizing Byte Swapping for Fun and Profit
Performance Improvements in OpenSolaris 2008.11
I recently pushed source changes to OpenSolaris build NV99 to improve byte swapping performance. These changes will be a part of OpenSolaris 2008.11.
To review, byte swapping reverses the order of bytes in an integer (whether 2, 4, or 8 bytes long). This is necessary because x86 processors store the low-order byte of an integer first ("little endian"), while SPARC processors store the high-order byte first ("big endian"). Furthermore, data on the network is usually transmitted in big endian order and needs to be translated on little endian machines. As to which "byte endianness" (aka "byte sex") is better, this has been a subject of heated "religious" debates (see Cary and Cohen in references).
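To make the idea concrete, here is a minimal sketch in portable C of a 32-bit byte swap and a host-to-network conversion built on it. The names swap32, is_little_endian, and to_network32 are hypothetical, chosen for illustration; they are not the OpenSolaris interfaces.

```c
#include <stdint.h>
#include <string.h>

/* Reverse the four bytes of a 32-bit integer using shifts and masks. */
static uint32_t swap32(uint32_t v)
{
    return (v << 24) |
        ((v << 8) & 0x00ff0000U) |
        ((v >> 8) & 0x0000ff00U) |
        (v >> 24);
}

/* Return 1 if this machine stores the low-order byte first. */
static int is_little_endian(void)
{
    uint32_t probe = 1;
    uint8_t first;

    memcpy(&first, &probe, 1);  /* look at the byte at the lowest address */
    return first == 1;
}

/* Convert a host-order value to big endian (network) order. */
static uint32_t to_network32(uint32_t host)
{
    return is_little_endian() ? swap32(host) : host;
}
```

On a little endian machine to_network32() swaps; on a big endian machine it returns its argument unchanged. Either way the high-order byte ends up at the lowest address.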
[Images: byte swapping | bite swapping]
Don't confuse byte swapping with bite swapping
My changes are as follows:
/ amd64.il
/ uint64_t htonll(uint64_t hostlonglong);
    .inline htonll,4
    movq    %rdi, %rax    / copy parameter 1 to return value
    bswapq  %rax          / byte swap return value
    .end
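For readers who don't use inline assembly templates, roughly the same effect can be sketched in C with a compiler builtin. This is not part of the OpenSolaris change. It assumes GCC or Clang, which provide __builtin_bswap64 and the __BYTE_ORDER__ predefined macro, and my_htonll is a hypothetical name. On x86-64 these compilers typically compile the builtin down to a single bswap instruction.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for htonll(): convert a 64-bit value from host
 * order to big endian (network) order. Assumes GCC or Clang; if
 * __BYTE_ORDER__ is not defined, this sketch assumes a little endian host.
 */
static uint64_t my_htonll(uint64_t hostlonglong)
{
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    return hostlonglong;                     /* already in network order */
#else
    return __builtin_bswap64(hostlonglong);  /* reverse all eight bytes */
#endif
}
```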
What were the performance benefits? Refer to the chart below. The upper left shows the performance improvement for BSWAP_32 and BSWAP_64 on x86-64 class systems. The most dramatic gain was for BSWAP_32 in 32-bit binaries, but every category improved.
[Chart: byte swapping performance results]
Legend: "AMD64 32b" is a 32-bit binary running on AMD64. "EM64T 32b" is a 32-bit binary running on Intel EM64T. "AMD64 64b" and "EM64T 64b" similarly are 64-bit binaries running on AMD64 and EM64T, respectively. Times are in microseconds, measured with a microbenchmark (100 million function calls in a loop).
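A microbenchmark of this kind can be sketched as below. This is not the harness used for the chart; it is only an illustration of timing a byte swap function called repeatedly in a loop. The names swap32 and time_swap32 are hypothetical, and clock() measures CPU time at whatever resolution the platform offers.

```c
#include <stdint.h>
#include <time.h>

/* Function under test: reverse the four bytes of a 32-bit integer. */
static uint32_t swap32(uint32_t v)
{
    return (v << 24) |
        ((v << 8) & 0x00ff0000U) |
        ((v >> 8) & 0x0000ff00U) |
        (v >> 24);
}

/*
 * Call swap32() `calls` times and return the elapsed CPU time in
 * microseconds. The final value is stored through `sink`, and the
 * working variable is volatile, so the compiler cannot delete the loop.
 */
static double time_swap32(unsigned long calls, uint32_t *sink)
{
    volatile uint32_t value = 0x11223344U;
    clock_t start, end;
    unsigned long i;

    start = clock();
    for (i = 0; i < calls; i++)
        value = swap32(value);
    end = clock();

    *sink = value;
    return (double)(end - start) * 1e6 / CLOCKS_PER_SEC;
}
```

Calling time_swap32(100000000UL, &sink) matches the 100 million iterations mentioned in the legend. An optimizing compiler may inline swap32(), so the numbers reflect the swap itself more than function call overhead.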
Next, refer to the bottom half of the chart. This shows x86-64 performance improvements with various byte swapping macros. These come from substituting inline assembly for the LE_* and BE_* byte swapping macros (LE for Little Endian and BE for Big Endian). Performance gains for the LE_IN32 macros were marginal or negative, so I left them unchanged (that is, they remain implemented as C << and >> shift operations). However, the LE_*64 and BE_*64 macros showed consistent improvement, and these are now implemented in inline assembly.
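The shift-based style mentioned above can be sketched as follows. The real LE_* and BE_* macro definitions are not shown in this post, so these are hypothetical functions (le_in32, be_in32, be_out32) written under the assumption that the macros read or write an integer at a memory address in a fixed byte order, regardless of the host's endianness.

```c
#include <stdint.h>

/* Read a 32-bit little endian value from a byte buffer using shifts. */
static uint32_t le_in32(const uint8_t *p)
{
    return (uint32_t)p[0] |
        ((uint32_t)p[1] << 8) |
        ((uint32_t)p[2] << 16) |
        ((uint32_t)p[3] << 24);
}

/* Read a 32-bit big endian value from a byte buffer using shifts. */
static uint32_t be_in32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) |
        ((uint32_t)p[1] << 16) |
        ((uint32_t)p[2] << 8) |
        (uint32_t)p[3];
}

/* Write a 32-bit value to a byte buffer in big endian order. */
static void be_out32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}
```

Because these routines work one byte at a time, they give the same result on x86 and SPARC and have no alignment requirements.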
Even SPARC optimization was possible (see the top right chart). This was done by rewriting the BSWAP_32 and BSWAP_64 macros, not in assembly, but in more efficient C. Consider this BSWAP_64 macro definition:
#define BSWAP_64(x) ((BSWAP_32(x) << 32) | BSWAP_32((x) >> 32))
The BSWAP_64 definition looks straightforward and innocent, but it's very inefficient. BSWAP_64 expands to BSWAP_32, then BSWAP_16, then BSWAP_8. The full macro expansion turns out to be this frightening monster, suitable only for Halloween:
#define BSWAP_64(x) ((((((((x) & 0xff) << 8) | ((x) >> 8) & 0xff) << 16) | (((((x) >> 16) & 0xff) << 8) | (((x) >> 16) >> 8) & 0xff)) << 32) | ((((((x) & 0xff) << 8) | ((x) >> 8) & 0xff) << 16) | (((((x) >> 16) & 0xff) << 8) | (((x) >> 16) >> 8) & 0xff)) >> 32)
I rewrote it as this less elegant, but faster, implementation:
#define BSWAP_64(x) (((uint64_t)(x) << 56) | \
    (((uint64_t)(x) << 40) & 0xff000000000000ULL) | \
    (((uint64_t)(x) << 24) & 0xff0000000000ULL) | \
    (((uint64_t)(x) << 8) & 0xff00000000ULL) | \
    (((uint64_t)(x) >> 8) & 0xff000000ULL) | \
    (((uint64_t)(x) >> 24) & 0xff0000ULL) | \
    (((uint64_t)(x) >> 40) & 0xff00ULL) | \
    ((uint64_t)(x) >> 56))
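As a sanity check, the two styles can be compared side by side. The sketch below renames the macros OLD_* and NEW_* (hypothetical names) so both can coexist in one file. The bodies of OLD_BSWAP_8, OLD_BSWAP_16, and OLD_BSWAP_32 are my assumption, reconstructed from the description of how BSWAP_64 expands, and the nested form assumes a 64-bit unsigned argument.

```c
#include <stdint.h>

/* Nested style: each level is built from two uses of the level below. */
#define OLD_BSWAP_8(x)  ((x) & 0xff)
#define OLD_BSWAP_16(x) ((OLD_BSWAP_8(x) << 8) | OLD_BSWAP_8((x) >> 8))
#define OLD_BSWAP_32(x) ((OLD_BSWAP_16(x) << 16) | OLD_BSWAP_16((x) >> 16))
#define OLD_BSWAP_64(x) ((OLD_BSWAP_32(x) << 32) | OLD_BSWAP_32((x) >> 32))

/* Flat style: one shift and at most one mask per byte. */
#define NEW_BSWAP_64(x) (((uint64_t)(x) << 56) | \
    (((uint64_t)(x) << 40) & 0xff000000000000ULL) | \
    (((uint64_t)(x) << 24) & 0xff0000000000ULL) | \
    (((uint64_t)(x) << 8) & 0xff00000000ULL) | \
    (((uint64_t)(x) >> 8) & 0xff000000ULL) | \
    (((uint64_t)(x) >> 24) & 0xff0000ULL) | \
    (((uint64_t)(x) >> 40) & 0xff00ULL) | \
    ((uint64_t)(x) >> 56))

/* Return 1 if both macros produce the same result for v. */
static int bswap64_macros_agree(uint64_t v)
{
    return OLD_BSWAP_64(v) == NEW_BSWAP_64(v);
}
```

Both forms reverse the eight bytes; the difference is in how many shift, mask, and OR operations the compiler has to untangle from the expansion.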
Posted at 02:56PM Oct 31, 2008 by DanX in Solaris | Comments[2]
Will these improvements be used by ZFS for its adaptive endianness code?
Posted by Derek Morr on October 31, 2008 at 03:16 PM PDT #
It should in theory. In practice, I had difficulty observing the performance improvement. I used zons created on x86 and sparc, and exported/imported copied zfs pool images between x86/sparc. I saw performance improvement of 4-5% real time in reading filesystem metadata (find . -exec ls) on amd64, but none on Intel em64t. Little or no difference on sparc. This was images copied to swapfs (/tmp). Maybe it's not a big factor or I measured wrong. Please see details at this thread:
http://www.opensolaris.org/jive/thread.jspa?messageID=273757
Posted by Daniel Anderson on October 31, 2008 at 04:07 PM PDT #