Just so you don't think I'm only blowing my own horn here, I wanted to point out another interesting approach to passage retrieval. It has a number of things in common with our approach, primarily that it tries to find the smallest containing passage for the query terms, without resorting to tricks like breaking a document into paragraphs for indexing.

The engine that embodies this technique is called Multitext. It was developed at the University of Waterloo. In the interests of full disclosure, I went to Waterloo and I know most of the folks involved in this project.

The Multitext system treats a document as an ordered sequence of words:

D = d1, ..., dm

A query is treated as a set of terms:

Q = {q1, q2, ...}

An extent (u,v) is said to satisfy a term set T (which is a subset of Q), if the subsequence of D defined by the extent contains all the terms in T. An extent ( u,v) is a cover for T if (u,v ) satisfies T and the subsequence corresponding to the extent contains no other subsequence that satisfies T.

Passages are scored by their length and the weight assigned to the terms from the query that they match. Of course, the tricky part is to do this quickly across millions of documents. This approach has proved itself successful in several of the the TREC competitions in areas as diverse as Web-scale retrieval and question answering.

One of the other interesting properties of Multitext is that, while it's aware of things like XML markup and it allows you to use that markup in queries, it doesn't require either the documents or the queries to be particularly well formed, and you don't have to teach it about your DTDs.

This technology is being commercialized by another old friend of mine. His company, isagn, has implemented XTeXT, which is a front end to the Multitext engine. If you're interested in digital library applications of search, the demos I've seen look pretty good.

Comments:

Hi Steve, You may want to take a look at the Highlighter package I wrote for the Lucene project. It provides a framework for passage retrieval that could also be used outside of Lucene. It defines interfaces (and default implementations) for TokenStreams, Fragmenters, Scorers and Formatters which are each responsible for the different tasks involved in retrieving passages. All implementations are pluggable into the central coordinating Highlighter class. See here: http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/highlighter/

Posted by Mark Harwood on February 03, 2005 at 09:01 AM EST #

Post a Comment:
Comments are closed for this entry.

This blog copyright 2009 by searchguy