I have always been told that it is important to
design hash
functions carefully in order to ensure that they return values
distributed evenly across the entire range of possible
values. Otherwise, you risk that many distinct keys map to the same
bucket in the hash table, and the hash lookups in the worst case
degenerate to linked list traversals. Although this sounds like a very
good piece of advice in theory, I have never seen a real-world example
where a naive implementation of a hash function has done any harm,
until now.
I was making some changes to a stress test that ran simple join
queries against an Apache
Derby database. The changes were quite innocent, I was just
changing the order in which the tables and indices used in the joins
were created, so I didn't expect them to influence the test
noticeably. But they did! When I got the test back up and running, its
performance had dropped to less than 50% of what it was before.
My first thought was that the changed table/index order had somehow
tricked Derby's query optimizer into choosing a different query
execution plan than before. If the optimizer makes a bad decision (or
an uninformed decision, like
in this
case recently discussed on the derby-user mailing list), performance
drops of this size, or even greater, may be experienced. However, when
I made Derby print the query plans, I couldn't see any difference at
all.
The next step was to profile the test in order to find out why it took
so long. I have found collect/analyzer
from Sun Studio to
be a valuable tool for such tasks. You simply start your application
from the command line with "collect -j on" followed by the command you
normally start it with:
collect -j on java MyApp arg1 arg2 ...
When I opened the collected performance data in analyzer, this is what
I saw:
Okay, more than 60% of the CPU time is spent
in RecordId.equals(), whose only job is to check the
equality of four integer fields in two RecordId
objects. Since it's a cheap method, it means that it must have been
called very frequently in order to spend that much of the total CPU
time. Further investigation showed that almost all the calls to this
method came from a call to ConcurrentHashMap.get() in Derby's
lock manager, which used RecordId objects as hash table
keys. However, there was nothing in the profiling data indicating
that ConcurrentHashMap.get() was invoked more frequently
now than it was before. So then there had to be more calls
to RecordId.equals() per call to
to ConcurrentHashMap.get().
Now, what could cause that? The only thing I could think of, was
collisions in the hash table caused by a bad hash function. I inserted
some print statements in the Derby code and found out that the test
visited about 350 000 distinct records in the database. In the original
version of the test, 350 000 distinct RecordId values
mapped to only 12 000 distinct hash values, which is bad to start
with. In the new version of the test, the number of distinct hash
values had dropped even further down to 6000. When I looked at the calculation of the
hash value in RecordId.hashCode(), it was pretty obvious
how this could happen. The hash value was constructed by taking the
bitwise XOR of these four integers:
recordId- A record number which is unique within the
page on which the record exists
pageNumber- The id of the page containing the
record
containerId- The id of the container (that is, table or
index) containing the page
segmentId- Uninteresting variable in this discussion,
since it's always 0
Since all of these variables essentially are counters starting from 0,
they tend to be in the lower value range, and all the higher bits are
unused. When you take the bitwise XOR of such integers, the higher
bits will still be unused, and you'll end up with lots of values
clustered in a relatively small range. This basically means that
collisions are bound to happen. When I changed the table and index
creation order in the test, I must have been unlucky and got less
variation in the containerId variable for the tables
involved in the join, hence the lower number of distinct hash values
and the increased frequency of collisions.
So, now the problem had been identified: the hash function is
biased. But how could it be fixed? I mentioned that I've always heard
that careful design is needed for hash functions. Unfortunately, I
must have missed the lecture where we were taught how to write good
hash functions, if there ever was one. Luckily, after a couple of
unsuccessful attempts to fix it, I discovered
that NetBeans had this
nice Insert Code menu.
I only needed to pick Generate hashCode() and tell which
fields I wanted to include in the hash code calculation. NetBeans then
generated a RecordId.hashCode() method which looked like
this:
public int hashCode() {
int hash = 7;
hash = 89 * hash + pageId.hashCode();
hash = 89 * hash + recordId;
return hash;
}
It doesn't look much more complex than the old method which only
XOR'ed the fields, but thanks to the repeated addition and multiplication with a
prime number, the hash values will be distributed across a wider
range, even if the individual fields don't differ very much.
(Apparently, this is very similar to the approach suggested by Joshua Bloch in his
Effective Java Programming Language Guide.)
I repeated the procedure for PageKey.hashCode(), which is
called from RecordId.hashCode().
Then, when I reran the test with the modified Derby code, the
performance was back at its normal level, more than twice as high as
without the fixed hash functions. And the 350 000 records that
previously mapped to 12 000 or 6000 distinct hash values, now mapped to
280 000 distinct hash values. Not bad to get such an improvement
with just a couple of mouse clicks in NetBeans, and not writing a
single line of code myself! :)