Artist name matching can be a real pain in the neck. First, there are all of those international spellings and affectations like KoЯn, Die Ärzte, 宇多田ヒカル, and Spın̈al Tap. In addition, there are variations: "Emerson, Lake and Palmer" vs. "Emerson, Lake & Palmer", or "The Beatles" vs. "Beatles, The". And finally, there are the many misspellings (500+ ways to spell Britney Spears). A typical collection of MP3s has many such matching challenges infecting the artist, album and track names.

So what to do? You can use DAn Ellis's method for artist name normalization, but this doesn't work too well for international names, nor does it handle misspellings. You could use an audio fingerprinter like MusicDNS, which extracts an encoding-insensitive hash from an audio file that can be turned into a song ID, but if you have a million songs to process, this can take quite a while.
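To make the "variations" problem concrete, here is a minimal sketch of rule-based normalization in Python. It is not the normalization method mentioned above, just an illustration; the function name `normalize_artist` and the particular rules (case folding, moving a trailing article to the front, treating "&" as "and", dropping punctuation) are my own assumptions. Note that it does nothing for misspellings.

```python
import re
import unicodedata


def normalize_artist(name: str) -> str:
    """Hypothetical helper: reduce an artist name to a rough comparison key."""
    # Unify compatibility forms (e.g. full-width characters) and ignore case.
    text = unicodedata.normalize("NFKC", name).casefold().strip()

    # Move a trailing article to the front: "beatles, the" -> "the beatles".
    match = re.fullmatch(r"(.+),\s*(the|a|an)", text)
    if match:
        text = f"{match.group(2)} {match.group(1)}"

    # Treat "&" and "and" as the same word.
    text = text.replace("&", " and ")

    # Replace punctuation with spaces; \w keeps letters and digits of any script.
    text = re.sub(r"[^\w\s]", " ", text)

    # Collapse runs of whitespace.
    return " ".join(text.split())


if __name__ == "__main__":
    for raw in ("The Beatles", "Beatles, The", "Emerson, Lake & Palmer"):
        print(raw, "->", normalize_artist(raw))
```

Two names are then treated as the same artist when their keys are equal, which is cheap enough to run over a very large collection.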

An alternative is to use some of the algorithms designed specifically for name matching. These algorithms deal well with the typical errors found in names. One of them, the Jaro-Winkler metric, was designed by a couple of statisticians at the U.S. Census Bureau explicitly to deal with name matching.
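For the curious, here is a rough Python sketch of the Jaro-Winkler metric as it is usually described. It is not taken from any particular package; the function names, and the defaults of 0.1 for the prefix scale and 4 for the maximum prefix length, are the commonly quoted choices and should be read as assumptions.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters count as matching only if they are within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        start = max(0, i - window)
        end = min(len2, i + window + 1)
        for j in range(start, end):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2

    return (
        matches / len1
        + matches / len2
        + (matches - transpositions) / matches
    ) / 3.0


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1,
                 max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings that share a common prefix."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1.0 - score)


if __name__ == "__main__":
    # Case-fold first so that capitalization differences are not penalized.
    a, b = "Britney Spears".casefold(), "Brittany Spears".casefold()
    print(f"{jaro_winkler(a, b):.3f}")
```

In practice you would pick a threshold on the score and treat pairs above it as candidate matches; the right threshold depends on the collection and is worth tuning against names you have checked by hand.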

There's a good overview of the various name-matching algorithms in this paper, and there's a neat open source Java package called SecondString that implements many of them.
 


Comments:

I know this sounds crazy, but how about just editing the filenames / m3u info whenever you notice something isn't where it needs to be? Works for me :) -Ken

Posted by Rackhelp on May 01, 2007 at 10:38 PM EDT #

Would using something like the "soundex" function in MySQL be a good choice?

Posted by Nick on May 02, 2007 at 02:03 AM EDT #

Ken - yep, fixing on the fly works for personal collections, but what do you do when you have an industrial-sized collection? A typical MIR researcher is likely to have a database of several hundred thousand tracks. That's a whole lot of time spent editing filenames.

Posted by Paul on May 02, 2007 at 08:03 AM EDT #

If you are looking for a good Jaro-Winkler implementation, take a look at LingPipe; LingPipe 3.0 was just released. Here is a post on their blog about their Jaro-Winkler implementation.

Posted by Jeff on May 02, 2007 at 08:24 AM EDT #

"DAn Ellis" - case in point

Posted by Hal on May 02, 2007 at 08:50 AM EDT #


This blog copyright 2009 by plamere