Another approach to passage retrieval
Just so you don't think I'm only blowing my own horn here, I wanted to point out another interesting approach to passage retrieval. It has a number of things in common with our approach, primarily that it tries to find the smallest containing passage for the query terms, without resorting to tricks like breaking a document into paragraphs for indexing.
The engine that embodies this technique is called Multitext. It was developed at the University of Waterloo. In the interests of full disclosure, I went to Waterloo and I know most of the folks involved in this project.
The Multitext system treats a document as an ordered sequence of words:
D = d1, ..., dm
A query is treated as a set of terms:
Q = {q1, q2, ...}
An extent (u,v) is said to satisfy a term set T (which is a subset of Q), if the subsequence of D defined by the extent contains all the terms in T. An extent ( u,v) is a cover for T if (u,v ) satisfies T and the subsequence corresponding to the extent contains no other subsequence that satisfies T.
Passages are scored by their length and the weight assigned to the terms from the query that they match. Of course, the tricky part is to do this quickly across millions of documents. This approach has proved itself successful in several of the the TREC competitions in areas as diverse as Web-scale retrieval and question answering.
One of the other interesting properties of Multitext is that, while it's aware of things like XML markup and it allows you to use that markup in queries, it doesn't require either the documents or the queries to be particularly well formed, and you don't have to teach it about your DTDs.
This technology is being commercialized by another old friend of mine. His company, isagn, has implemented XTeXT, which is a front end to the Multitext engine. If you're interested in digital library applications of search, the demos I've seen look pretty good.


Posted by Mark Harwood on February 03, 2005 at 09:01 AM EST #