In May (yes, I'm a little behind in my posting, thanks for asking!), John Battelle pointed out Ask Jeeves' new Zoom tool. Run a search for The Beatles and you're presented with a selection of phrases that you can use to narrow or expand your search.

All of these phrases appear to have been generated using a statistical model based on document clustering. That contrasts with the knowledge-based approach I described last week.

The good thing about this kind of clustering approach is that, in some sense, the data you've indexed tells you which relationships are important. Furthermore, you're not necessarily tied to a particular language (depending, of course, on how much language modelling you're doing to pick out good representative phrases), and you don't have to do all the work of building your own knowledge base of relationships.

Another important thing to keep in mind: finding noun phrases and classifying them correctly into a semantic taxonomy (e.g., classifying a hairy brown dog as both a kind of hairy dog and a kind of brown dog) is a pretty computationally intensive task.
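
To make the "hairy brown dog" example concrete, here's a minimal sketch of just the last step: given a noun phrase, list its immediate parents by dropping one modifier at a time. The function name broader_phrases is made up for illustration, and the sketch assumes the phrase is already correctly extracted and that its last word is the head noun. Getting to that point (parsing, tagging, disambiguating) is where the real expense lives, and none of that is shown here.

```python
def broader_phrases(noun_phrase):
    """Return the phrases one step broader than `noun_phrase`.

    Assumption (not from any real system): the last word is the head noun
    and every word before it is a modifier that can be dropped on its own.
    """
    words = noun_phrase.split()
    if not words:
        return []
    modifiers, head = words[:-1], words[-1]
    parents = []
    for i in range(len(modifiers)):
        # Drop the i-th modifier and keep everything else in order.
        kept = modifiers[:i] + modifiers[i + 1:]
        parents.append(" ".join(kept + [head]))
    return parents


print(broader_phrases("hairy brown dog"))  # ['brown dog', 'hairy dog']
```

Even in this toy form, a phrase with n modifiers has n immediate parents and up to 2^n ancestors in total, which hints at why doing this across a whole index gets expensive.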

The bad thing about these approaches is that it can be difficult to explain (or even figure out) why two phrases are related. An example: in the search linked above, "Woodstock" is offered as a phrase under "Expand Your Search." As far as I know, The Beatles didn't appear at Woodstock, but documents that mention The Beatles must also mention Woodstock fairly often. With our taxonomies, the relationships are much easier to explain.
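
Here's a rough sketch of how a pairing like Beatles/Woodstock can fall out of plain co-occurrence counting. I don't know what Zoom actually does under the hood, so treat this as one simple statistical technique and not a description of their system. The toy corpus and the function names are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Toy corpus, invented for illustration: each document is the set of
# phrases that were extracted from it.
docs = [
    {"the beatles", "woodstock", "1969"},
    {"the beatles", "abbey road", "1969"},
    {"the beatles", "woodstock", "jimi hendrix"},
    {"jimi hendrix", "woodstock"},
]


def cooccurrence_counts(docs):
    """Count how many documents each unordered pair of phrases shares."""
    counts = Counter()
    for phrases in docs:
        for a, b in combinations(sorted(phrases), 2):
            counts[(a, b)] += 1
    return counts


def related(phrase, docs, top_n=3):
    """Rank other phrases by how often they share a document with `phrase`."""
    scores = Counter()
    for (a, b), n in cooccurrence_counts(docs).items():
        if a == phrase:
            scores[b] += n
        elif b == phrase:
            scores[a] += n
    return scores.most_common(top_n)


print(related("the beatles", docs))
```

In this corpus "woodstock" shares two documents with "the beatles," so it ranks near the top. The counts tell you that the two phrases travel together, but nothing in them says why, which is exactly the explainability problem.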

As with most things, I think an approach that integrates knowledge-based and statistical techniques is probably the way to go.

One weird thing about Zoom: when you click on one of the narrower or broader terms, it appears to just run a Boolean search with those words. If you've gone to all the trouble of clustering the documents and finding the representative phrases, it seems like you would want to save the documents from which you drew each phrase. I guess when you're dealing with the whole Web, that's too much stuff to keep around.
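
To show the difference I mean, here's a small sketch comparing the two options: re-running a Boolean AND over the words versus looking up the documents a phrase was originally drawn from. The documents, the phrase_sources table, and both function names are hypothetical. This is my guess at the trade-off, not how Zoom is actually built.

```python
# Toy index, invented for illustration.
docs = {
    1: "the beatles recorded at abbey road studios",
    2: "beatles fans walk down a road near the abbey",
    3: "the beatles released the album abbey road",
}

# Hypothetical table saved at clustering time: each representative phrase
# mapped to the ids of the documents it was drawn from.
phrase_sources = {"abbey road": {1, 3}}


def boolean_and(terms, docs):
    """Return ids of documents containing every term as a separate word."""
    return {
        doc_id
        for doc_id, text in docs.items()
        if all(term in text.split() for term in terms)
    }


def from_saved_sources(phrase, phrase_sources):
    """Return the documents the phrase was originally extracted from."""
    return phrase_sources.get(phrase, set())


print(boolean_and(["beatles", "abbey", "road"], docs))   # {1, 2, 3}
print(from_saved_sources("abbey road", phrase_sources))  # {1, 3}
```

The Boolean search pulls in document 2, which mentions all three words but has nothing to do with the phrase. The saved lookup avoids that, at the cost of storing one set of document ids per phrase, which is the part that presumably doesn't scale to the whole Web.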

This blog copyright 2009 by searchguy