Minion and Lucene: Query Languages and Ranking
Thursday May 08, 2008
The query languages for the two engines provide the usual suite of boolean operators with their usual interpretations. One difference is that, in Lucene, the NOT operator must be used in opposition to a term, it can not appear on its own. That is, the query: NOT dog is not valid. Only queries like cat NOT dog are valid. In Minion, NOT is a fully general operator that can be applied to any subexpression in a query.
As we've mentioned in previous installments, Minion provides a number of query operators that Lucene does not, most of which are aimed at modifying the case behavior or variant behavior when looking up query terms in a dictionary.
Minion provides a full suite of proximity operators, all of which are based on the Sun Labs Relaxation Ranking Passage Retrieval algorithm. This allows for compatibility of scoring between (for example) NEAR, PHRASE, and PASSAGE queries. Minion's query evaluator allows you to do some pretty cool things, like finding phrases that are near each other.
Lucene provides PHRASE and NEAR operators in its query language, but there is no indication of how the nearness of terms will affect the score assigned to a document. Recent versions of Lucene have introduced the notion of query spans, which are document, start position, and end position tuples. These spans can be used to implement any number of proximity operators, although they must be used programatically (i.e., they are not part of the query language.) They seem similar in intent to the work done at Waterloo on Multitext.
Minion is designed to provide any number of query languages that can be selected per-query. We have developed a Lucene query parser that handles the Lucene query language as well as parsers for our own query language and a "Web" query language.
One thing that Lucene has that Minion doesn't (and we want!) is an API for constructing queries programatically. This is really useful, especially in situations where you want to build queries from input from a user interface. If the users are entering full strings (e.g., for substring matches in an email subject), if you want to build a query for Minion, then you need to build a string containing whatever query operators are appropriate. Of course, you need to make sure that the query you build is syntactically valid so that the query parser doesn't choke on it.
Lucene provides an API that you can use to construct queries, essentially it allows you to build the structure that the query parser would generate (not exactly, but you get the idea!). You could do this in Minion, but it would be extremely unpleasant to do so. I expect that we'll be adding a programmatic query API in the relatively near future.
As far as document ranking goes, if I understand correctly, both Minion and Lucene use a variation of the standard TFIDF weighting function, and both allow for users of the system to implement any particular weighting function that they like (e.g., I'm pretty sure there's a BM25 implementation for Lucene).
I would expect that the ranking differences that you would see between Minion and Lucene on a given query on a given document collection would probably be due more to how the respective engines are configured (e.g., was stemming or morphological expansion used?), rather than any fundamental differences in the performance of the engines.






Ah, the NOT functionality sounds great. What exac...
Yeah, we use the old trick of having a negative se...
I also think that there is no BM25 implementa...
I'll go look at the passage info in a bit.
But you...
Can any of these operators be used to run proximit...