Today I pushed the changes for 6823354 which adds intrinsics for {Integer,Long}.{numberOfLeadingZeros,numberOfTrailingZeros}() methods. The speedups are quite good:
| |
Integer | Long | |||
| numberOfLeadingZeros | numberOfTrailingZeros | numberOfLeadingZeros | numberOfTrailingZeros | ||
| Intel Nehalem | 32-bit | 3.18 | 3.96 | 1.36 | 1.90 |
| 64-bit | 3.83 | 3.74 | 2.02 | 2.17 | |
| AMD Shanghai | 32-bit | 1.94 | 3.55 | 0.98 | 2.44 |
| 32-bit w/ lzcnt | 4.90 | - | 1.46 | - | |
| 64-bit | 2.52 | 3.09 | 1.86 | 3.26 | |
| 64-bit w/ lzcnt | 6.77 | - | 3.71 | - | |
| UltraSparc T2 | 32/64-bit | 2.01 | 2.22 | 1.55 | 1.91 |
"w/ lzcnt" in the table means the numbers are using AMD's LZCNT (count leading zeros) instruction which is part of SSE4a.
The SPARC intrinsics need a hardware implementation of the POPC instruction.
Yet I haven't found a real-world application that uses these methods extensively (including bitCount), but if anyone knows one, please let me know.
java.util.BitSet.nextSetBit / nextClearBit (typically used when sieving primes, using the Erathostenes sieve) can benefit from such optimizations. It's a pity that I have no access to such processors to publish a benchmark.
Posted by Edson Watanabe on May 08, 2009 at 07:20 PM CEST #
A quick search didn't reveal an application that uses these methods extensively. But if you provide me a benchmark, I will publish some numbers.
Posted by Christian Thalinger on May 08, 2009 at 08:01 PM CEST #
Today's strongest chess-playing problems use so-called "bitboard" representations and could benefit from those operations. See, for instance, "How DarkThought Plays Chess":
http://people.csail.mit.edu/heinz/dt/node7.html
Posted by Mauro Persano on May 08, 2009 at 09:58 PM CEST #