Saturday Jul 05, 2008
Monday Jun 30, 2008
Tuesday Jun 24, 2008
For a first experiment, I'd like to see if I can automatically find synonyms for the tag "female vocalists". To do this, I need to establish some ground truth. By hand, I've gone through the 5,000 most frequently applied artist tags, looking for tags that may be related to "female vocalists". I found 59 of them, shown here along with the tag rank and tag frequency:
12 89277 female vocalists
80 15874 female
113 8281 Female fronted metal
128 7150 female vocalist
227 2841 riot grrrl
236 2716 female vocals
278 2180 Female Voices
365 1424 female artists
480 955 female vocal
569 698 female singers
571 691 female fronted
619 619 Girl Groups
633 600 Girl Rock
722 501 girls
739 475 diva
786 433 riot grrl
880 374 female singer-songwriter
885 370 women
1023 301 girl group
1064 289 chick rock
1067 287 girl power
1119 269 female singer-songwriters
1202 245 female singer
1224 239 female voice
1544 179 female-vocalists
1587 173 female rock
1625 167 female dance vocals
1650 163 girl music
1727 154 french female
1757 151 chick music
2113 120 Girl
2130 118 Female-fronted Metal
2246 110 chicks
2252 109 woman singer
2681 89 female fronted rock
2757 86 Female Artist
2803 84 girl bands
2893 81 girlie rock
2895 81 female fronted band
3077 75 japanese female vocalists
3082 75 divas
3135 73 girly
3136 73 girl band
3194 71 girl pop
3201 71 eleCtro grrls
3263 69 solo female
3405 65 with female singers
3426 65 front girl band
3609 62 60s girls
3676 60 grrl
3788 58 females
3884 56 woman
3957 55 female vocal trance
4044 54 Female country
4188 51 grrls
4370 49 Female solo artists
4778 44 Favourite Females
4828 43 girl punk
4923 42 girls aloud
Now I want to place the tags into 3 separate buckets - In bucket 1, I'll put tags that I think are synonyms for "female vocalists". In bucket 2, I'll put tags that are related but not synonyms, and in bucket 3, I'll place tags that are not related to "female vocalists" at all.
Bucket #1 - Synonyms for "female vocalists"
These are female-oriented tags (singular or plural) that don't imply any type of genre.
12 89277 female vocalists
80 15874 female
128 7150 female vocalist
236 2716 female vocals
278 2180 Female Voices
365 1424 female artists
480 955 female vocal
569 698 female singers
571 691 female fronted
619 619 Girl Groups
722 501 girls
739 475 diva
885 370 women
1023 301 girl group
1202 245 female singer
1224 239 female voice
1544 179 female-vocalists
1650 163 girl music
1757 151 chick music
2113 120 Girl
2246 110 chicks
2252 109 woman singer
2757 86 Female Artist
2803 84 girl bands
2895 81 female fronted band
3082 75 divas
3135 73 girly
3136 73 girl band
3263 69 solo female
3405 65 with female singers
3426 65 front girl band
3788 58 females
3884 56 woman
4370 49 Female solo artists
Bucket #2 - Related to, but not synonyms for, "Female Vocalists"
These are female-oriented tags that imply a genre or include another type of qualifier such as 'favorite'.
113 8281 Female fronted metal
227 2841 riot grrrl
633 600 Girl Rock
786 433 riot grrl
880 374 female singer-songwriter
1064 289 chick rock
1067 287 girl power
1119 269 female singer-songwriters
1587 173 female rock
1625 167 female dance vocals
1727 154 french female
2130 118 Female-fronted Metal
2681 89 female fronted rock
2893 81 girlie rock
3077 75 japanese female vocalists
3194 71 girl pop
3201 71 eleCtro grrls
3609 62 60s girls
3676 60 grrl
3957 55 female vocal trance
4044 54 Female country
4188 51 grrls
4778 44 Favourite Females
4828 43 girl punk
4923 42 girls aloud
Bucket #3 - Not related to "Female Vocalists"
(all of the rest of the 5,000 tags).
Dividing the female-oriented tags like this is not so clear cut, but we have to start somewhere ... I'm open to any other suggestions as to how to divide this space up. Once we have some ground truth (imperfect as it may be), we can develop an evaluation criterion that will let us determine how well our synonym detector works.
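As a rough illustration of how this hand-labelled ground truth might be recorded for later evaluation, here is a minimal Python sketch. The names SYNONYMS, RELATED and bucket_for are made up for this example, and only a handful of tags from each bucket are included.

```python
# Hypothetical encoding of the hand-labelled buckets described above.
# Only a few tags from buckets 1 and 2 are listed, for brevity.
SYNONYMS = {"female vocalists", "female", "female vocalist", "female vocals"}
RELATED = {"Female fronted metal", "riot grrrl", "Girl Rock", "girl punk"}


def bucket_for(tag):
    """Return 1 (synonym), 2 (related) or 3 (not related) for a tag."""
    if tag in SYNONYMS:
        return 1
    if tag in RELATED:
        return 2
    # Everything else among the 5,000 tags falls into bucket 3.
    return 3


for tag in ("female vocals", "girl punk", "death metal"):
    print(tag, "->", bucket_for(tag))
```

A synonym detector's output could then be compared against bucket_for, tag by tag.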
The next step is to figure out how we can evaluate our synonym predictor. That will be the next post.
Sunday Jun 22, 2008
Another excellent resource is the podcast interview by Jon Udell of Jean-Claude Bradley. Jean-Claude is a professor of chemistry at Drexel University who started to make the scientific process as transparent as possible by publishing all research work in real time to a collection of public blogs, wikis and other web pages. He coined the term Open Notebook Science which he describes as: "... there is a URL to a laboratory notebook (like this) that is freely available and indexed on common search engines. It does not necessarily have to look like a paper notebook but it is essential that all of the information available to the researchers to make their conclusions is equally available to the rest of the world. Basically, no insider information."
I really like the idea of 'no insider information' - to put all your successes and failures, every experiment, every bad result out there for everyone to see. Your lab notebook is open for the whole world to see.
Jean-Claude has a presentation that outlines how they use Open Notebook Science at Drexel, which is pretty interesting (albeit quite focused on chemistry). They use a wiki as the lab notebook and rely on the wiki software's automatic versioning to keep track of edits, so they always know 'who did what' and 'when'. This addresses some of the concerns Elias raised about making the research record permanent. They then use their blog to highlight interesting results or questions.
To me, this is all pretty interesting, especially for the Music Information Retrieval community. The MIR community spans many disciplines: musicology, signal processing, machine learning, library science, text IR, coding, user interface, and so on. No one can know all there is to know in this field, so anything that can help increase the opportunities for sharing ideas can really push forward the whole research community.
There are already a number of MIR researchers who are putting their research online (Mark Godfrey and Yves Raimond, to name just a couple). I suspect that more researchers would work this way if the tools were available and the advantages were laid out. Perhaps this would be a good topic for a panel at ISMIR this year. Since Jean-Claude Bradley is at Drexel, there could even be an opportunity to have Jean-Claude sit on the panel to serve as the 'expert'. This panel could be about how to do 'open notebook science', with some feedback from folks who are already doing that in the MIR community. I, myself, would find this panel very interesting.
Of course, we are way past the time to submit panels to ISMIR, so there may not be any chance to have such a panel, but it doesn't hurt to try. If enough MIR folks express some interest to me (just add a comment to this post or send me an email), I'll talk with the ISMIR organizers to see if this would be possible.
Saturday Jun 21, 2008
Friday Jun 20, 2008
From Comedy Central insider (Thanks, Zac!)
Thursday Jun 19, 2008
Monday Jun 16, 2008
(Via Andrew Huff.)
Friday Jun 13, 2008
- Acid Rockabilly
- British Hip Pop
- Blackened Broken Metal
- British Stoner Rap
- Coast Rock
- Depressive political rock
- Dirty Neo-Prog
- Emo Metal
- German uplifting death-grind
- Gothic country grrrl
- Hardcore chick trance
- Indie anarcho pop
- Industrial old northern house
- Liquid Psychedelia
- New symphonic hip-hop
- Organic Metal
- Progressive teen Hardcore
- Suicidal Swedish viking rock
- Traditional ambient deutschpunk
- Twee Afrobeat
Thursday Jun 12, 2008
Wordle is pretty cool (via Information Aesthetics)
My hand-edited list of genres can be found here: genre.txt
Update: Zac suggested linking each genre to its last.fm tag page, so you can see the full list with links after the jump:
Update #2: Oscar went one better: he made a genre link page with links sized by frequency of occurrence: Oscar's last.fm genre tag cloud
Tuesday Jun 10, 2008
i am a child in a field and i grow things in my dreams to wake and water them to go to sleep and raise them from the earth into existence from an idea to physical existence what power what madness a gentle madness an eleven year olds madness
The longest tag that has been applied multiple times is:
songs that i sing along to but i always forget the words so i say duh duh while trying to sound like i do know the words and no one is falling for it but they keep quiet because they are embarassed for me
Some long tags that seem to be about music:
bands or people from india or who sound like they might be from india or who sound like they might be from around india or who sound like they might enjoy bands or people who sound like they are from around india
A number of the longer tags seem to be the result of someone being confused about how to enter separate tags. For example, this single tag was probably meant to be 20 individual tags:
rock - metal - industrial - electronic - punk - emo - ebm - alternative - punk - dark electro - japanese - metalcore - female fronted metal - techno - psytrance - love metal - new wave - synth pop - indie
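A compound tag like the one above could be broken back into individual tags with a simple split. This is only a sketch: the function name split_compound_tag is made up, and it assumes the tagger used " - " (a hyphen with a space on each side) as the separator, so hyphenated tags such as "singer-songwriter" are left alone.

```python
def split_compound_tag(tag, sep=" - "):
    """Split a run-together tag into individual tags.

    The separator is an assumption; requiring spaces around the hyphen
    keeps tags like 'singer-songwriter' in one piece.
    """
    parts = [part.strip() for part in tag.split(sep)]
    # Drop empty pieces left by doubled or trailing separators.
    return [part for part in parts if part]


compound = "rock - metal - industrial - electronic - punk - synth pop"
print(split_compound_tag(compound))
```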
Some long single word tags:
63 Eksprimentell-teknisk-avantgarde-saer-dop-steikbra-metal-musikk
55 SkankledoodleskattysuperSKAalifragilisticexpialidocious
55 No-break-twitch-screaming-grindcore-ninja-commando-team
47 indiepostteletronicrockistsbaroquechamberaltpop
47 hatebreedlookalikessoundsalikeswannabesgoodshit
46 rrrrrrrruuuuuuuuuuuuuuuuuuuuuuggggghhhhhhhhhhh
43 up-down-up-down-left-right-A-B-select-start
40 put-your-brain-in-a-blender-and-drink-it
40 baseballbatmusicforcoolindiekidswithbats
39 fffffffffffffffffffffffffffffffffffffff
38 where-do-thoughts-like-those-come-from
37 good-old-animation-guitar-syntheziser
36 gothic-techno-industrial-dance-disco
36 MAurrrrriiiiittttttiiiiiuuuuuussssss
36 21AAD6B6-A1E6-4e07-B2E9-9F512F446E4B
35 onemightbecomebarrenlisteningtothis
35 not-to-be-confused-with-THE-Citadel
35 Supercallifragilisticexpialidocious
35 Lie-on-the-floor-and-smoke-to-music
35 BatmanUcanBeMySupermanSaveMeHereIam
34 Progressive-Industrial-Atmospheric
34 Industrial-Electroica-Jungle-Death
Once again, many of these seem to be the results of the confused tagger not understanding how to apply multiple tags at once.
There are 317 unique tags that start with "I ...", such as:
11 I would like to own or listen to more music by these bands and artists
11 I got a patch
11 I dig it
11 I dig
10 I must try
10 I can be cool sometimes
10 I LOVE THIS MUSIC
10 I Have Seen Them
9 I think its a doom metal band
9 I like this stuff
9 I cant account for these
9 I am so sad
8 I love this bands
8 I love them all
8 I like this music
8 I like this
8 I like these guys a lot
My favorite is: I was never a Swedish teenager with swedish teen angst so now I will attempt at having it. Speaking of 'favorite', Last.fm taggers have hundreds of ways of tagging their favorite artists. Among them are:
FavoriteArtists favoritebands FAVORTIEEESSS Favoritmusik Favoritizims favoritttes favoiteness faveartists favorites1 favorieten faveorites Favourites Favoritter favroites favoutite favourits favourite favour-01 favorites favoriten favoritas faveorite favaurite fav0urit3 Favouites Favoritos Favoriter favorito favoriet favoirte favbands Favoritt Favorits Favorite favorit Favoris FavBand favour favori favor8 favies favela Favvis Favess FavArt favvy favou favos Favvo Favor Faves favs favo fava Fave fav
Tags can be crazy - there are just so many different reasons why people tag, there's tons of noise, there are tag abusers - but because there are so many tags, they can also be extremely useful. Next up, we'll try to sift through the tags and categorize them into big buckets like genre, mood, opinion and so on.
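A list like the 'favorite' variants above can be pulled out with a crude prefix filter. The sketch below is an assumption about how one might do it, not necessarily how the list was produced, and favorite_like is a made-up name. A plain prefix match also picks up unrelated tags such as 'favela', and it misses tags where the word appears later, such as 'my favorites'.

```python
def favorite_like(tags):
    """Return the tags that start with 'fav', ignoring case."""
    return [tag for tag in tags if tag.lower().startswith("fav")]


sample = ["Favourites", "faves", "favela", "rock", "FavBand", "my favorites"]
print(favorite_like(sample))
```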
First, here's a plot of tag frequency. The first thing to notice is the power-law distribution. Not surprisingly, it looks like tags follow Zipf's law (where the frequency of a tag is inversely proportional to its rank). Interestingly, 45,000 of the 100,000 or so tags have been applied only once.
Looking at a log-log plot of the frequency vs. rank data, we see that the data is linear starting at a rank around 25. For tags at rank less than 25 we see a tailing off from Zipf's law. I think this tailing off is to be expected. The most applied tag 'rock' is not really very descriptive. I suspect that many taggers will self-edit and not apply obvious tags.
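One way to check the log-log linearity numerically is to fit a straight line to log(count) against log(rank), skipping the first ranks where the curve tails off. The sketch below uses an ordinary least-squares fit; the function name loglog_slope and the synthetic counts are made up for illustration and are not the real tag data. Under Zipf's law the slope should come out close to -1.

```python
import math


def loglog_slope(counts, min_rank=1):
    """Least-squares slope of log(count) against log(rank).

    counts must be sorted in descending order; the first entry is rank 1.
    Ranks below min_rank are left out of the fit.
    """
    points = [
        (math.log(rank), math.log(count))
        for rank, count in enumerate(counts, start=1)
        if rank >= min_rank and count > 0
    ]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


# Synthetic counts that follow Zipf's law exactly: count = 1,000,000 / rank.
synthetic = [1_000_000 / rank for rank in range(1, 5001)]
print(loglog_slope(synthetic, min_rank=25))
```

With real counts, comparing the slope for min_rank=1 against min_rank=25 would show how much the top-ranked tags pull the fit away from the line.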
This next plot gives a closer look at the 5,000 most frequently applied tags. In this dataset, about 2,500 tags have been applied 100 times or more.
The 26 most frequently applied tags are:
440854 rock
343901 seen live
277747 indie
245259 alternative
184491 metal
158252 electronic
136691 punk
124599 pop
119930 indie rock
102937 classic rock
97264 alternative rock
89277 female vocalists
79497 emo
77455 death metal
76898 Hip-Hop
76668 hardcore
74650 electronica
73034 singer-songwriter
69169 black metal
62284 jazz
60559 hard rock
59763 folk
59729 punk rock
58135 Progressive rock
57860 heavy metal
54398 industrial
In the top 25, almost all of the tags are genre related (with the exception of 'seen live' and 'female vocalists'). The data appears to skew away from what one would expect from a general listening population. There's no 'Country' in the top 25, but there are 4 kinds of metal.
In future posts, I'll take a closer look at the various types of tags.
The dataset is available for download here: Lastfm-ArtistTags2007
Here are the details as told in the README file:
The LastFM-ArtistTags2007 Data set
Version 1.0
June 2008
What is this?
This is a set of artist tag data collected from Last.fm using
the Audioscrobbler webservice during the spring of 2007.
The data consists of the raw tag counts for the 100 most
frequently occurring tags that Last.fm listeners have applied
to over 20,000 artists.
An undocumented (and deprecated) option of the audioscrobbler
web service was used to bypass the Last.fm normalization of tag
counts. This data set provides raw tag counts.
Data Format:
The data is formatted one entry per line as follows:
musicbrainz-artist-id<sep>artist-name<sep>tag-name<sep>raw-tag-count
Example:
11eabe0c-2638-4808-92f9-1dbd9c453429<sep>Deerhoof<sep>american<sep>14
11eabe0c-2638-4808-92f9-1dbd9c453429<sep>Deerhoof<sep>animals<sep>5
11eabe0c-2638-4808-92f9-1dbd9c453429<sep>Deerhoof<sep>art punk<sep>21
11eabe0c-2638-4808-92f9-1dbd9c453429<sep>Deerhoof<sep>art rock<sep>18
11eabe0c-2638-4808-92f9-1dbd9c453429<sep>Deerhoof<sep>atmospheric<sep>4
11eabe0c-2638-4808-92f9-1dbd9c453429<sep>Deerhoof<sep>avantgarde<sep>3
Data Statistics:
Total Lines: 952810
Unique Artists: 20907
Unique Tags: 100784
Total Tags: 7178442
Filtering:
Some minor filtering has been applied to the tag data. Last.fm will
report tags with counts of zero or less on occasion. These tags have
been removed.
Artists with no tags have not been included in this data set.
Of the nearly quarter million artists that were inspected, 20,907
artists had 1 or more tags.
Files:
ArtistTags.dat - the tag data
README.txt - this file
artists.txt - artists ordered by tag count
tags.txt - tags ordered by tag count
License:
The data in LastFM-ArtistTags2007 is distributed with permission of
Last.fm. The data is made available for non-commercial use only under
the Creative Commons Attribution-NonCommercial-ShareAlike UK License.
Those interested in using the data or web services in a commercial
context should contact partners at last dot fm. For more information
see http://www.audioscrobbler.net/data/
Acknowledgements:
Thanks to Last.fm for providing the access to this tag data via their
web services
Contact:
This data was collected and filtered by Paul Lamere of Sun Labs. Send
questions or comments to Paul.Lamere@sun.com
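The line format described in the README can be read with a few lines of Python. This is a minimal sketch that assumes the separator is the literal string <sep> exactly as shown in the README examples; parse_line and tag_totals are made-up helper names, and reading the real ArtistTags.dat file is left out.

```python
from collections import Counter

SEP = "<sep>"  # literal field separator, as shown in the README examples


def parse_line(line):
    """Split one data line into (mbid, artist, tag, raw_count)."""
    mbid, artist, tag, count = line.rstrip("\n").split(SEP)
    return mbid, artist, tag, int(count)


def tag_totals(lines):
    """Sum the raw counts for each tag across all artists."""
    totals = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        _, _, tag, count = parse_line(line)
        totals[tag] += count
    return totals


sample = [
    "11eabe0c-2638-4808-92f9-1dbd9c453429<sep>Deerhoof<sep>american<sep>14",
    "11eabe0c-2638-4808-92f9-1dbd9c453429<sep>Deerhoof<sep>art punk<sep>21",
]
for tag, count in tag_totals(sample).most_common():
    print(count, tag)
```

Sorting the totals in descending order gives the kind of tag-by-count listing that tags.txt is described as containing.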
What's all this then? This is an experiment in 'open research' - I'm going to blog my research on a particular topic. Suggestions are welcome.
This blog copyright 2008 by plamere
Social Tags and MIR - a tutorial at ISMIR 2008
Social Tags are free text labels that are applied to items such as artists, playlists and songs. These tags have the potential to have a positive impact on music information retrieval research. In this tutorial we describe the state of the art in commercial and research social tagging systems for music. We explore some of the motivations for tagging. We describe the factors that affect the quantity and quality of collected tags. We present a toolkit that MIR researchers can use to harvest and process tags. We look at how tags are collected and used in current commercial and research systems. We explore some of the issues and problems that are encountered when using tags. We present current MIR-related research centered on social tags and suggest possible areas of exploration for future research.
I am really excited about working on this tutorial with Elias and JJ. One of the highlights of 2007 for me was presenting a tutorial on music recommendation with Oscar Celma. I learned so much about the subject matter while preparing the tutorial, and it was great fun to work with someone as smart as Oscar. I am really looking forward to repeating the experience.
Posted on: Jun 21, 2008
Posted by: plamere
Category: General