Disclaimer: Please don't construe this post as trashing Google. I like Google: I've been using it since it was google.stanford.edu. I know some very smart people that work there. I think they do a great job under very difficult conditions. This post is about the inherent difficulty of search.

In a previous post, I pointed out (somewhat elliptically, I've been told) that the Google query "google spelling corrector" showed some of the deficiencies of straightforward keyword search.

What I meant to demonstrate with that particular query (and what I should have said) was that a straightforward, weighted boolean search engine doesn't always find the best documents. Part of the reason is that the measure of "aboutness" that such an engine uses to decide whether a document is relevant to a particular query is based solely on the relative frequencies of the query terms in the document and in the collection as a whole.
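To make the frequency idea concrete, here is a minimal sketch in Python of that kind of scoring. It is not Google's ranking function or any particular engine's: the function name `tf_idf_scores`, the toy documents, and the specific TF-IDF weighting are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tf_idf_scores(query, docs):
    """Score each document against the query with a basic TF-IDF sum.

    Hypothetical helper for illustration only; real engines use more
    elaborate weighting and normalization.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: how many documents contain each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue  # term absent from this document
            # Frequency of the term relative to the document's length.
            rel_tf = tf[term] / len(tokens)
            # Terms that are rare in the collection count for more.
            idf = math.log(n_docs / df[term])
            score += rel_tf * idf
        scores.append(score)
    return scores


# Toy collection (made up): an explanatory page, a page that merely
# repeats the keywords, and an unrelated page.
docs = [
    "how the google spelling corrector works",
    "google google google spelling spelling corrector corrector",
    "a recipe for bread",
]
scores = tf_idf_scores("google spelling corrector", docs)
```

In this toy collection the page that just repeats the keywords outscores the page that explains how the corrector works, which is the weakness in a nutshell: frequency alone rewards mentioning the terms, not being useful about them.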

So, yesterday I was talking to Josh Simons and he said that he didn't quite understand what I was getting at. The reason he didn't get it is that the original post containing that query is now the top hit in a Google search for "google spelling corrector"!

These are both effects of Google's PageRank algorithm. Part of their aboutness measure is a notion of popularity [1]: if lots of pages point to the target page, then the score for the target page is increased.
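For the curious, the popularity idea can be sketched in a few lines of Python. This is the textbook power-iteration form of PageRank, not Google's production system: the function name `pagerank`, the damping factor of 0.85, the fixed iteration count, and the tiny link graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank over a dict of page -> outlinks.

    Illustrative sketch only; the parameter values are assumptions.
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page gets a small baseline share of the total score.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its score evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its score over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Made-up link graph: three blogs all link to one post.
links = {
    "blog_a": ["post"],
    "blog_b": ["post"],
    "blog_c": ["post", "blog_a"],
    "post": [],
}
ranks = pagerank(links)
```

Here "post" ends up with the highest score simply because everything points at it, regardless of what it actually says.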

Bloggers write blogs that contain lots of links to pieces in other blogs [2], so blog posts start to look pretty "about" (as far as Google is concerned) whatever keywords they contain. This is not necessarily wrong: in some sense my previous post was about Google's spelling corrector, but probably not in the sense that someone searching for information about the corrector would find useful.

If you want another example starring me, here it is: Up until this week, a Google search for "steve green" would get you pages and pages of hits about the gospel singer with whom I share a name. I'm fairly certain that this means that I'm the evil twin. To get to me, you had to run a query like "steve green computational linguistics". Right now (on Friday night), I'm hit number 2.

So, this is one of the hard problems (you could probably get away with saying that it's the hard problem) in search: what is a reliable measure of the aboutness of a page, given a query? I don't think there's any one answer to this question. I think it would be difficult to come up with a measure of aboutness that would scale well from collections of 10,000 emails to collections of 8,058,044,651 web pages.

Keep in mind that sometimes it's difficult for people to describe to other people what they are interested in, so it seems like a bit much to expect computers to be able to do it flawlessly.

At this point you may be saying to yourself: but what about natural language processing and AI and the semantic web? I'll be talking more about the semantic aspects of search in future posts.

[1] That's a big oversimplification that people use all the time, but I

  1. don't want to bore you, and
  2. don't want to have to type mathematical formulas in HTML

[2] For example, in this post I've linked to myself and to Josh.

Comments:

I thought Google already has some sort of phrase searching (when searching for "google corrector spelling", your blog entry is only the third result). It's just that they give more weight to the "popularity" of the page/site.

Posted by Dan on January 31, 2005 at 06:16 AM EST #


This blog copyright 2009 by searchguy