The importance of good hash functions
I have always been told that it is important to design hash functions carefully in order to ensure that they return values distributed evenly across the entire range of possible values. Otherwise, you risk that many distinct keys map to the same bucket in the hash table, and the hash lookups in the worst case degenerate to linked list traversals. Although this sounds like a very good piece of advice in theory, I have never seen a real-world example where a naive implementation of a hash function has done any harm, until now.
I was making some changes to a stress test that ran simple join queries against an Apache Derby database. The changes were quite innocent, I was just changing the order in which the tables and indices used in the joins were created, so I didn't expect them to influence the test noticeably. But they did! When I got the test back up and running, its performance had dropped to less than 50% of what it was before.
My first thought was that the changed table/index order had somehow tricked Derby's query optimizer into choosing a different query execution plan than before. If the optimizer makes a bad decision (or an uninformed decision, like in this case recently discussed on the derby-user mailing list), performance drops of this size, or even greater, may be experienced. However, when I made Derby print the query plans, I couldn't see any difference at all.
The next step was to profile the test in order to find out why it took so long. I have found collect/analyzer from Sun Studio to be a valuable tool for such tasks. You simply start your application from the command line with "collect -j on" followed by the command you normally start it with:
collect -j on java MyApp arg1 arg2 ...
When I opened the collected performance data in analyzer, this is what I saw:
Okay, more than 60% of the CPU time is spent
in RecordId.equals(), whose only job is to check the
equality of four integer fields in two RecordId
objects. Since it's a cheap method, it means that it must have been
called very frequently in order to spend that much of the total CPU
time. Further investigation showed that almost all the calls to this
method came from a call to ConcurrentHashMap.get() in Derby's
lock manager, which used RecordId objects as hash table
keys. However, there was nothing in the profiling data indicating
that ConcurrentHashMap.get() was invoked more frequently
now than it was before. So then there had to be more calls
to RecordId.equals() per call to
to ConcurrentHashMap.get().
Now, what could cause that? The only thing I could think of, was
collisions in the hash table caused by a bad hash function. I inserted
some print statements in the Derby code and found out that the test
visited about 350 000 distinct records in the database. In the original
version of the test, 350 000 distinct RecordId values
mapped to only 12 000 distinct hash values, which is bad to start
with. In the new version of the test, the number of distinct hash
values had dropped even further down to 6000. When I looked at the calculation of the
hash value in RecordId.hashCode(), it was pretty obvious
how this could happen. The hash value was constructed by taking the
bitwise XOR of these four integers:
recordId- A record number which is unique within the page on which the record exists
pageNumber- The id of the page containing the record
containerId- The id of the container (that is, table or index) containing the page
segmentId- Uninteresting variable in this discussion, since it's always 0
Since all of these variables essentially are counters starting from 0,
they tend to be in the lower value range, and all the higher bits are
unused. When you take the bitwise XOR of such integers, the higher
bits will still be unused, and you'll end up with lots of values
clustered in a relatively small range. This basically means that
collisions are bound to happen. When I changed the table and index
creation order in the test, I must have been unlucky and got less
variation in the containerId variable for the tables
involved in the join, hence the lower number of distinct hash values
and the increased frequency of collisions.
So, now the problem had been identified: the hash function is biased. But how could it be fixed? I mentioned that I've always heard that careful design is needed for hash functions. Unfortunately, I must have missed the lecture where we were taught how to write good hash functions, if there ever was one. Luckily, after a couple of unsuccessful attempts to fix it, I discovered that NetBeans had this nice Insert Code menu.
I only needed to pick Generate hashCode() and tell which
fields I wanted to include in the hash code calculation. NetBeans then
generated a RecordId.hashCode() method which looked like
this:
public int hashCode() {
int hash = 7;
hash = 89 * hash + pageId.hashCode();
hash = 89 * hash + recordId;
return hash;
}
It doesn't look much more complex than the old method which only
XOR'ed the fields, but thanks to the repeated addition and multiplication with a
prime number, the hash values will be distributed across a wider
range, even if the individual fields don't differ very much.
(Apparently, this is very similar to the approach suggested by Joshua Bloch in his
Effective Java Programming Language Guide.)
I repeated the procedure for PageKey.hashCode(), which is
called from RecordId.hashCode().
Then, when I reran the test with the modified Derby code, the performance was back at its normal level, more than twice as high as without the fixed hash functions. And the 350 000 records that previously mapped to 12 000 or 6000 distinct hash values, now mapped to 280 000 distinct hash values. Not bad to get such an improvement with just a couple of mouse clicks in NetBeans, and not writing a single line of code myself! :)
That's a nice pattern -- it's based on FNV (Fowler/Noll/Vo) hashing, but uses very small prime values, which generally yield poor hashes. FNV actually recommends different very large primes based on the hash length. I've used this very happily on maps containing millions of elements. http://isthe.com/chongo/tech/comp/fnv/
Posted by Andy Martin on January 10, 2008 at 03:15 PM CET #