Sun Labs staff engineers Stephen Green and Paul Lamere presented a fun and intriguing session, laced from time to time with humor, titled “Project Aura: Recommendation for the Rest of Us,” about recommendation engines, or recommenders. Project Aura strikes me as very important: there is so much amazing stuff on the web, so much junk, and so little time to sort through it. It is not grandiose to say that we are in the midst of a shift in human consciousness that historians and sociologists will spend a long time trying to comprehend. Psychologists are already wondering whether the shift from a literate culture to a web culture, where you can Google and instantly gain masses of information, is altering our minds in a way similar to what happened when mass literacy spread in the 19th century. We now have huge virtual libraries available to us in our homes.
So we really need recommenders.
Here are the basics:
Project Aura is an open-source recommendation engine, written in the Java language, being developed at Sun Labs. Recommendation technology is a key enabler for the next-generation web. Recommenders will be essential to helping us wade through the huge volume of content found in news, music, video, blogs, and podcasts. However, most recommenders work through collaborative filters (the wisdom of crowds), which basically correlate groups of people who bought similar things. So if people who bought X also tend to buy Y, and you just bought X, they recommend Y. Sometimes the correlation is merely between people who visited one product site and also visited another. Project Aura takes a novel approach to recommendation, avoiding many of the problems inherent in traditional recommender systems.
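To make the "people who bought X also bought Y" idea concrete, here is a minimal sketch of co-purchase counting. It is not code from Project Aura or from any real store; the class name `CoPurchaseRecommender`, the method `alsoBought`, and the toy baskets are all assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of a very simple collaborative filter.
public class CoPurchaseRecommender {

    /**
     * Returns the items bought together with {@code item}, most frequently
     * co-purchased first. Each basket is the set of items one customer bought.
     */
    public static List<String> alsoBought(List<Set<String>> baskets, String item) {
        Map<String, Integer> counts = new HashMap<>();
        for (Set<String> basket : baskets) {
            if (!basket.contains(item)) {
                continue; // only customers who bought the item are relevant
            }
            for (String other : basket) {
                if (!other.equals(item)) {
                    counts.merge(other, 1, Integer::sum);
                }
            }
        }
        List<String> ranked = new ArrayList<>(counts.keySet());
        // Descending co-purchase count; ties broken alphabetically so output is stable.
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return ranked;
    }

    public static void main(String[] args) {
        // Toy data: three customers bought X, one did not.
        List<Set<String>> baskets = List.of(
                Set.of("X", "Y"),
                Set.of("X", "Y", "Z"),
                Set.of("X", "Y"),
                Set.of("W", "Z"));
        System.out.println(alsoBought(baskets, "X")); // prints [Y, Z]
    }
}
```

Note that an item nobody has bought yet never appears in any basket, so this kind of filter has no way to recommend it.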
It’s obvious that there is a huge amount on the web. Anyone who’s gotten even a little bit addicted to YouTube knows that, with the proliferation of user-generated content, the sky’s the limit. Here’s my eccentric example. I have a darling Maine Coon kitty who is astonishingly obsessed with Q-tips and loves to curl up in the kitchen sink. There’s no one else like him, or so I thought until my boyfriend told me there are 18 YouTube videos of Maine Coons in kitchen sinks and a large number showing cats who are obsessed with Q-tips. Everyone wants their kitty to be a star. So I’m one of those strange people who are fascinated with cats who love Q-tips and sinks. How should software classify me so that I might find other things of interest? How should recommendation engines be organized to suggest other videos for someone who is fascinated by cats who love Q-tips and sinks?
Recommendation engines are a huge business. Two-thirds of the movies rented through Netflix were recommended, and Netflix is offering a $1 million prize to the first group to improve its recommendations by 10%. At Amazon, 35% of product sales result from recommendations.
The Long Tail
Stephen Green pointed out that in pop culture a few things tend to be hugely popular, like best-selling books, prime-time sitcoms, and top 40 songs. These show up on a popularity graph as the short head at one end of the curve. But under the long tail of the curve there is actually more material than in the short head, representing the diversity that the web does a decent job of offering. Amazon provides access to a huge number of books; iTunes offers access to three million songs. And traditional search technologies are pretty good at helping you find things on the web, especially if you know their names.
But what about recommendation engines? There are some problems. First, they come up with some crazy recommendations. Paul Lamere presented a recommendation from iTunes, the world’s largest music store: if you liked Britney Spears, you’ll also like a report on pre-war intelligence. “I can’t imagine any context in the world that would make this be a good recommendation,” said Lamere.
A second recommendation from iTunes: after a CD of Gregorian chants was put in the cart, iTunes recommended Amy Winehouse and some other popular music that had absolutely nothing to do with Gregorian chants. Why? Probably because so few people buy Gregorian chants at iTunes that there was nothing to correlate the CD with, so by default the store recommended what’s popular. The wisdom of crowds as a last resort.
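The presenters' guess about the Gregorian chant case can be sketched as a fallback rule. This is only an illustration of the suspected behavior, not how iTunes actually works; the class `PopularityFallback`, the `minSupport` threshold, the item names, and the sales numbers are all invented.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: use co-purchase data if there is enough of it,
// otherwise recommend whatever sells best overall.
public class PopularityFallback {

    /** Returns the keys of {@code counts} by descending count, ties alphabetical. */
    static List<String> rankByCount(Map<String, Integer> counts) {
        List<String> ranked = new ArrayList<>(counts.keySet());
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return ranked;
    }

    /**
     * Recommends up to {@code limit} items. Co-purchased items are used only
     * when their count reaches {@code minSupport}; if none qualify, the method
     * falls back to the overall best sellers.
     */
    public static List<String> recommend(Map<String, Integer> coPurchases,
                                         Map<String, Integer> overallSales,
                                         int minSupport, int limit) {
        Map<String, Integer> supported = new HashMap<>();
        for (Map.Entry<String, Integer> e : coPurchases.entrySet()) {
            if (e.getValue() >= minSupport) {
                supported.put(e.getKey(), e.getValue());
            }
        }
        Map<String, Integer> source = supported.isEmpty() ? overallSales : supported;
        List<String> ranked = rankByCount(source);
        return new ArrayList<>(ranked.subList(0, Math.min(limit, ranked.size())));
    }

    public static void main(String[] args) {
        // Made-up numbers: the chant CD has almost no co-purchase history.
        Map<String, Integer> chantCoPurchases = Map.of("Plainsong Collection", 1);
        Map<String, Integer> overallSales = Map.of(
                "Chart Pop Album", 900,
                "Top 40 Compilation", 700,
                "Plainsong Collection", 3);
        System.out.println(recommend(chantCoPurchases, overallSales, 5, 2));
        // prints [Chart Pop Album, Top 40 Compilation]
    }
}
```

With so little data for the niche item, the relevant but rarely bought "Plainsong Collection" loses out to the best sellers, which is the same effect the Gregorian chant example shows.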
A third, hilarious recommendation from Amazon: after ordering a box set of C. S. Lewis’s “Chronicles of Narnia,” Amazon recommended a nose, ear, and hair groomer. Go figure. It’s an example of Amazon’s cross-domain recommendation that somehow turned up a correlation between the two.
To a collaborative filter, the world consists of consumers who get clustered together based upon their consumption. Such filters work when a site has popular content and numerous users defining the items. They’re easy to implement with relatively simple math, but they fall apart if you want them to recommend something deep in the long tail, where items have few users. And if a new band has a new track with no users yet, how can a collaborative filter recommend it? It’s a self-reinforcing system in which the same old bands stay at the top and unknown bands find it hard to get recommended. Lamere compared it to a chameleon sitting on a mirror.
Scaling can also be a problem, at both large and small sites. Sites with little content lack enough data points for collaborative filtering to work well. At the other end, companies like Google, Amazon, and Netflix have millions of users and billions of data points. That is enough data to do a lot of recommending, but systems that handle such large data sets are hard to build.
Project Aura: Recommendations for the Rest of Us
The goal of Project Aura is to give everyone a chance to do good recommending, regardless of size. Instead of focusing on the correlation between users of a product, the project focuses on the similarity between two items based upon the “aura” surrounding each one, which is not as amorphous as it first sounds. This is very different from a collaborative filter, which focuses on the users who consume the same or similar content. The key is to create a set of words that describe the content itself and to base recommendations of other items on the similarity of the word auras that surround them. Since the words can be derived from the content, this approach can be immune to many of the issues of collaborative filter systems. If we can derive the words in the aura from the items themselves, a new item can come into the system and be recommended from day one. It won’t have to wait for thousands of users before it can be recommended.
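One common way to compare two bags of descriptive words is cosine similarity over word counts. The sketch below uses that as a stand-in; the session write-up does not say which similarity measure or word weighting Project Aura uses, and the class `AuraSimilarity`, its methods, and the sample descriptions are hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of comparing items by the words that describe them.
public class AuraSimilarity {

    /** Builds a word-count "aura" by lowercasing the text and splitting on non-letters. */
    public static Map<String, Integer> aura(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Cosine similarity of two auras: 0.0 when they share no words, 1.0 when proportional. */
    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += (double) e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    /** Euclidean length of a word-count vector. */
    private static double norm(Map<String, Integer> v) {
        double sumOfSquares = 0.0;
        for (int count : v.values()) {
            sumOfSquares += (double) count * count;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        // A brand-new track with no listeners still has words that describe it.
        Map<String, Integer> newTrack = aura("dreamy acoustic folk guitar");
        Map<String, Integer> folkAlbum = aura("acoustic folk guitar ballad");
        Map<String, Integer> metalAlbum = aura("loud industrial metal");
        System.out.println(cosine(newTrack, folkAlbum));  // prints 0.75
        System.out.println(cosine(newTrack, metalAlbum)); // prints 0.0
    }
}
```

No purchase or listening history appears anywhere in the calculation, which is why a new item can be matched to similar items immediately.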
Where do they get the words? Take the music space. For popular or moderately popular content, there are many sites that offer a repository of descriptions of music. Data can be garnered from many sites, from Wikipedia to MusicBrainz, a community music meta-database that attempts to create a comprehensive music information site. And lyric sites provide lyrics to album tracks.
But what about new music that has no listeners yet? A sister project to Project Aura involves searching inside music and doing content analysis directly on the audio to extract features that can be used to tag it. The program will take a brand-new song, analyze it, and apply the tags that people would apply. The program extracts features from the audio that correspond to timbre, harmonic content, and rhythm. It learns from already well-labeled music to build models of appropriate tags. Very cool.
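The idea of learning tags from already labeled music can be sketched with a one-nearest-neighbor classifier. This is a deliberately simple stand-in: the write-up does not say which learning method the sister project uses, and the class `NearestNeighborTagger`, the three-number feature vectors, and the tags are made up. Real audio features for timbre, harmony, and rhythm would be far richer.

```java
import java.util.List;

// Hypothetical sketch: tag a new song with the tag of the most similar labeled song.
public class NearestNeighborTagger {

    /** A labeled example: numeric audio features plus a tag applied by a person. */
    public static final class LabeledTrack {
        final double[] features;
        final String tag;

        public LabeledTrack(double[] features, String tag) {
            this.features = features;
            this.tag = tag;
        }
    }

    /** Returns the tag of the training track closest to {@code features}. */
    public static String predictTag(List<LabeledTrack> training, double[] features) {
        if (training.isEmpty()) {
            throw new IllegalArgumentException("training set is empty");
        }
        LabeledTrack best = training.get(0);
        double bestDistance = squaredDistance(best.features, features);
        for (LabeledTrack track : training) {
            double distance = squaredDistance(track.features, features);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = track;
            }
        }
        return best.tag;
    }

    /** Squared Euclidean distance between two feature vectors of equal length. */
    static double squaredDistance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("feature vectors differ in length");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Made-up features, imagined as {timbre, harmonic content, rhythm} scores.
        List<LabeledTrack> training = List.of(
                new LabeledTrack(new double[] {0.9, 0.1, 0.2}, "mellow"),
                new LabeledTrack(new double[] {0.1, 0.9, 0.8}, "energetic"));
        double[] brandNewSong = {0.2, 0.8, 0.7};
        System.out.println(predictTag(training, brandNewSong)); // prints energetic
    }
}
```

The predicted tag can then join the song's word aura, so a track with no listeners and no written descriptions still has something to be matched on.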
After exploring some of the issues and complexities of tagging, the presenters shared some Project Aura Java API code, and closed with a quick demo of a research prototype for Project Aura that generated music recommendations based upon text auras.
See Also
Paul Lamere's blog
Janice J. Heiss