Björk vs. Bjork vs. Bork
Artist name matching can be a real pain in the neck. First, there are all of those international spellings and affectations like KoЯn, Die Ärzte, 宇多田ヒカル, and Spın̈al Tap. In addition, there are variations: "Emerson, Lake and Palmer" vs. "Emerson, Lake & Palmer", or "The Beatles" vs. "Beatles, The". And finally, there are the many misspellings (500+ ways to spell Britney Spears). A typical collection of MP3s has many such matching challenges infecting the artist, album and track names.
So what to do? You can use Dan Ellis's method for artist name normalization, but it doesn't work too well for international names, nor does it handle misspellings. You could use an audio fingerprinter like MusicDNS, which extracts an encoding-insensitive hash from an audio file that can be turned into a song ID, but if you have a million songs to process, this can take quite a while.
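To make the normalization idea concrete, here is a minimal sketch in Python. It is not Dan Ellis's method; the function name `normalize_artist` and the particular rules (strip accents, fold case, treat "&" as "and", move a trailing ", The" and drop the leading article) are my own assumptions, chosen only to illustrate the approach.

```python
import re
import unicodedata


def normalize_artist(name: str) -> str:
    """Hypothetical normalizer: reduce an artist name to a rough matching key."""
    # Move a trailing ", The" to the front: "Beatles, The" -> "the Beatles".
    match = re.fullmatch(r"(.+),\s*the", name.strip(), flags=re.IGNORECASE)
    if match:
        name = "the " + match.group(1)
    # Decompose accented characters and drop the combining marks: "Björk" -> "Bjork".
    decomposed = unicodedata.normalize("NFKD", name)
    name = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Fold case and treat "&" as the word "and".
    name = name.casefold().replace("&", " and ")
    # Replace punctuation with spaces, then collapse runs of whitespace.
    name = re.sub(r"[^\w\s]", " ", name)
    name = re.sub(r"\s+", " ", name).strip()
    # Drop a leading article so "The Beatles" and "Beatles" share a key.
    if name.startswith("the "):
        name = name[4:]
    return name
```

With these rules, "Björk" and "Bjork" share a key, as do "Emerson, Lake and Palmer" and "Emerson, Lake & Palmer", and "The Beatles" and "Beatles, The". The limits show up quickly, though: the Я in KoЯn and the dotless ı in Spın̈al Tap are distinct letters rather than accented ones, so they survive normalization, and a misspelling like "Bork" is untouched.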
An alternative is to use one of the algorithms designed specifically for name matching. These algorithms deal well with the typical errors found in names. One of them, the Jaro-Winkler metric, was designed by a couple of statisticians at the U.S. Census Bureau explicitly to deal with name matching.
There's a good overview of the various name-matching algorithms in this paper, and SecondString is a neat open-source Java package that implements many of them.
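For a feel of how Jaro-Winkler scores names, here is a small Python sketch of the metric as it is commonly described. It is not taken from SecondString; the function names and the defaults `prefix_scale=0.1` and `max_prefix=4` are assumptions based on the usual textbook formulation.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Two characters match if they are equal and at most `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that line up in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (
        matches / len1
        + matches / len2
        + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity with a bonus for a shared prefix (up to max_prefix chars)."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1.0 - score)
```

The comparison is case-sensitive, so it makes sense to fold case (and perhaps strip accents) first. With this sketch, `jaro_winkler("Bjork", "Bork")` comes out at about 0.94, well above `jaro_winkler("Bjork", "Korn")`, which is the behavior you want when deciding whether a misspelling refers to the same artist. Where to set the acceptance threshold is something to tune against your own collection.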
