Wednesday Jan 26, 2005

Boolean Search

In an earlier post, I mentioned in passing the problems with boolean search. Our very own Tim Bray provided a pretty fair description of how boolean search works. One thing that you should notice about this description is that boolean search doesn't take into account the positions of the words in the document. So, for example, if we have a query like:

man bites dog

then a 2,000 word document where man is word 1, bites is word 1,000, and dog is word 2,000 is considered to be just as relevant as a document where man is word 10, bites is word 11, and dog is word 12. Clearly, though, if you're looking for stories about men biting dogs, the second document is more likely to be relevant to your query.

Boolean search is all about document retrieval. When you issue a query, what you get back is a list of documents and it's up to you to decide whether any of the documents that you got back are relevant to your query. Most search engines will try to help by showing you a snippet of the document's content. These days that snippet may be dynamically generated from your query, but a lot of the time it seems that the same statistical techniques used to rank a document are used to generate the snippet!

Passage Search

At Sun Labs, we have been focused on passage search. Given a query, our passage retrieval algorithm will find the closest matching passage in the indexed material. A "perfect passage" for us is a passage that has all the words from the query, in the order that they appeared in the query, with no intervening text.

Any passage that we find that deviates from this ideal is assigned a penalty that increases with the severity of the deviation. If there is some text intervening in the passage that wasn't in the query, the passage is penalized proportional to the amount of intervening text. The idea here is that, as the words get farther apart in the text, it's less likely that they are being used in the sense that you intended in your query.

If the words in the passage are in a different order than in the query, the passage is penalized a slightly larger amount proportional to how far the terms are out of order. The idea here is that if the terms are out of order in the passage, then the likelihood increases that the syntax of the query differs from that of the passage.

Even a small deviation like the passage containing dogs when the query contained dog is penalized, because it's not exactly what the user asked for.

Depending on which query operator you use to do a passage search, we can even allow passages that don't have one or more of the query terms in them, although this is penalized very heavily.

Once we've found the best passages, we can rank the documents containing those passages by listing the documents with the best passages (i.e., those with the smallest penalty) first.

Let's talk about that example for a second:

man bites dog

Using our passage retrieval approach, the second document we described above will definitely appear higher in the ranking than the first, because it has a perfect passage. Furthermore, that second document would also rank higher than a document that contains the (much more likely) phrase: dog bites man. Again, this is a distinction that most boolean searches will not make.

What About Phrase Search?

It's true, our perfect passages are what you would get if you used a phrase query. The problem is that if you use a phrase query for something, then that is all you will get. If there are no matching phrases, then you get nothing, even if there are some very close matches in the index (e.g., a document containing man bites a dog!)

Given a powerful query language, here's what typically happens: You start out with a phrase query and you get nothing because it's too specific a query. Then you try an OR query but you get too much stuff because that query is too general. So you decide on what terms have to be in your query and you AND those and use an OR on the rest. But, hey, a couple of those words should be pretty close together, so you toss in a NEAR. How close together? Uh, let's say seven words, so that's a NEAR/7.

Each of these queries requires you to compose the query (Oh, by the way, normal people can't or won't do this!), wait for the search engine to evaluate it, and then look at the results to see whether you got any documents that will satisfy whatever information need that you have.

The beauty of the passage retrieval algorithm is that it does all of this for you: it starts out with the most restrictive query and when it can't find any good matches, it starts to gradually relax the restrictions, penalizing any passages that it finds. Thus the full name of the algorithm: Relaxation Ranking Passage Retrieval.

Another nice thing about the passage retrieval algorithm is that it gives you a very nice, focused snippet to display in the query results. In fact, even when our engine is doing straight boolean retrieval, we use the passage retrieval algorithm to find the best snippet from a document.

Specific Information Retrieval

What we're talking about here is specific information retrieval. This is the kind of searching that we do most of the time: you're doing something and a question comes up. You want to do a search, get an answer and get back to what you were doing.

The passage retrieval algorithm is great for this kind of search: if you type in a piece of a question it will do its best to try to find a match for that question. If you can find the question, it seems likely that the answer will be nearby. Usually, we show a context of a few dozen words around the best passage when displaying a result so that if the answer is very nearby, you don't even have to click through to the document!

This blog copyright 2009 by searchguy