Minion and Lucene: Using the Engines
When we started development on the precursor to Minion, we were targeting Portal Server and (eventually) Web Server. The developers on Portal Server and Web Server weren't search guys, so we developed an indexing and retrieval API for the engine. This API lives in the com.sun.labs.minion package, and is intended to be all that developers need to know to use all of the capabilities of the engine.
If developers use the API, then we can make sure that when we modify the internals of the engine, their code will continue to work (although such changes might require re-indexing data).
Lucene doesn't really have a well-defined API. Developers are expected to understand more of the inner workings of the engine and to construct the appropriate parts of the indexing pipeline themselves. This may seem a bit daunting to the new user, but there are lots of samples and the Lucene community is very helpful. It's also important to note that the Lucene developers are very serious about backward compatibility and about versioning the files in the index.
One important aspect of engine usability is the workload that the engine imposes on the applications that use it. Minion tries to handle most of the housekeeping associated with indexing and searching documents. For example, Minion views all indexing as updating. Documents have unique keys, and when you index data using a key that's already in the index, the older document is deleted before the new document is added. Data from deleted documents is removed from the index when partitions are merged. Lucene requires the application to delete old documents before indexing new ones.
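To make the "all indexing is updating" idea concrete, here is a minimal sketch in plain Java. It is a toy model, not Minion's code or API: the class ToyUpdatingIndex and all of its methods are made up for illustration. Re-indexing an existing key only marks the old document as deleted, and the deleted data is physically dropped when merge() runs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: these names are hypothetical and are not part of Minion.
class ToyUpdatingIndex {
    // One stored document; 'deleted' is set when a newer version replaces it.
    static final class Doc {
        final String key;
        final String text;
        boolean deleted;

        Doc(String key, String text) {
            this.key = key;
            this.text = text;
        }
    }

    // Everything physically held by the index, including deleted documents.
    private final List<Doc> stored = new ArrayList<>();
    // Maps each key to its current (live) version.
    private final Map<String, Doc> live = new HashMap<>();

    // Indexing is always an update: if the key exists, the old version is
    // marked deleted before the new version is added.
    void index(String key, String text) {
        Doc old = live.get(key);
        if (old != null) {
            old.deleted = true;
        }
        Doc doc = new Doc(key, text);
        stored.add(doc);
        live.put(key, doc);
    }

    // Merging is the point at which data from deleted documents goes away.
    void merge() {
        stored.removeIf(d -> d.deleted);
    }

    // Returns the text of the live version, or null if the key is unknown.
    String get(String key) {
        Doc d = live.get(key);
        return d == null ? null : d.text;
    }

    int storedCount() {
        return stored.size();
    }

    int liveCount() {
        return live.size();
    }
}
```

The point of the sketch is that the application only ever calls index(); it never has to find and delete the old version itself.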
When Minion opens an index, it starts a thread that watches the list of partitions that are active in the index (the cleverly named "active list"). When this list changes, Minion updates the set of partitions that are used for running queries by loading any new partitions and closing any partitions that have been merged into larger partitions. Lucene requires the application to periodically reopen the Searcher to load changes to an index. In some circumstances, this can be a very expensive operation.
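The active-list bookkeeping described above can be sketched roughly as follows. Again, this is a toy illustration under stated assumptions and not Minion's implementation: ToyPartitionManager, refresh(), startWatcher() and the use of integers as partition identifiers are all hypothetical, and a real engine might well be notified of changes rather than polling.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Supplier;

// Toy model only: these names are hypothetical and are not part of Minion.
class ToyPartitionManager {
    // Partitions currently used for running queries.
    private Set<Integer> open = new LinkedHashSet<>();
    // Record of what was loaded and closed, kept for illustration.
    private final List<String> events = new ArrayList<>();

    // Bring the open set in line with the active list: load partitions that
    // are new, and close partitions that have disappeared (for example,
    // because they were merged into a larger partition).
    synchronized void refresh(Set<Integer> activeList) {
        for (Integer p : activeList) {
            if (!open.contains(p)) {
                events.add("load " + p);
            }
        }
        for (Integer p : open) {
            if (!activeList.contains(p)) {
                events.add("close " + p);
            }
        }
        open = new LinkedHashSet<>(activeList);
    }

    synchronized Set<Integer> openPartitions() {
        return new LinkedHashSet<>(open);
    }

    synchronized List<String> events() {
        return new ArrayList<>(events);
    }

    // Start a daemon thread that re-reads the active list every pollMillis
    // milliseconds, so the application never has to reopen anything itself.
    Thread startWatcher(Supplier<Set<Integer>> activeListSource, long pollMillis) {
        Thread t = new Thread(() -> {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    refresh(activeListSource.get());
                    Thread.sleep(pollMillis);
                }
            } catch (InterruptedException e) {
                // Restore the interrupt flag and let the thread exit.
                Thread.currentThread().interrupt();
            }
        }, "active-list-watcher");
        t.setDaemon(true);
        t.start();
        return t;
    }
}
```

Queries would always run against openPartitions(), so a newly dumped or merged partition becomes visible without any action by the application.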
The engines also differ in their approaches to how a search engine is configured. In Lucene, configuration of the engine is done in source code. A developer selects an Analyzer, a Reader, a Directory, etc. and then compiles the program. Changing the configuration of the engine requires a recompile.
In Minion, the bulk of the configuration of the engine is done via an XML configuration file that specifies the properties associated with a number of the components that make up the system. This allows for quick reconfiguration of the engine for experimental runs, changing cache sizes for dictionaries, and so on. The configuration system that we're using is a modification of the one used for the Sphinx 4 speech recognition system. Most of our changes have been around adding capabilities to the configuration system. For example, we allow components to inherit properties from other components and we provide for transparently handling registering and fetching components from a Jini service registrar.
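As a rough illustration of what property inheritance between components buys you, here is a small Java sketch. It does not show Minion's configuration file format or its configuration classes; ToyComponentConfig and the property names (dictionary_cache_size, weighting_function) are invented for the example. A component that doesn't define a property falls back to the component it inherits from.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: these names are hypothetical and are not part of Minion
// or of the Sphinx 4 configuration system.
class ToyComponentConfig {
    final String name;
    // The component whose properties are inherited; null if there is none.
    final ToyComponentConfig parent;
    private final Map<String, String> props = new HashMap<>();

    ToyComponentConfig(String name, ToyComponentConfig parent) {
        this.name = name;
        this.parent = parent;
    }

    void set(String key, String value) {
        props.put(key, value);
    }

    // Look the property up locally first, then walk up the inheritance
    // chain. Returns null if no component in the chain defines it.
    String get(String key) {
        for (ToyComponentConfig c = this; c != null; c = c.parent) {
            String value = c.props.get(key);
            if (value != null) {
                return value;
            }
        }
        return null;
    }
}
```

With this arrangement, an experimental configuration only needs to state what differs from the base component, for example a different weighting function, and inherits everything else such as cache sizes.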
While this is a fundamental difference, it's not clear that one approach is more useful than the other, especially in situations where the engine will be configured by the developers and delivered to end users as part of a product.
The flexibility of file-based configuration could be useful in situations where one would, for example, like to try different term weighting functions on a given collection to see which is the most suitable for a given application. For Minion, this requires only a few small configuration files to use as input for the indexing application, with no recompile.

