Before I dive in, some Minion basics
I'm about to start posting about the internals of Minion. The problem is that it's hard to find a place to start. That is, I'm trying to describe a single part without having to describe all the parts.
In writing a first post about Minion, I find myself making blank blog entries for the things that I need to get to later. So far there are about a dozen. In this post I will try to set up some basic terminology that should forestall (I hope!) some confusion. Still, if you encounter something that you don't understand or that doesn't make sense, say so in the comments and I'll try to clarify.
Minion indexes documents. Each document is expected to have a unique key. The application doing the indexing is in charge of deciding what the keys are and making sure that distinct documents have distinct keys. A document is composed of fields and values. Although we say that a document is a map from field names to field values, it's actually a multimap: a single field can have multiple values.
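To make the multimap idea concrete, here's a rough Java sketch of a keyed document whose fields can hold more than one value. The names `SketchDocument`, `addValue` and `getValues` are made up for illustration; they are not Minion's actual classes or methods.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is not Minion's actual document class.
final class SketchDocument {
    private final String key; // unique key chosen by the indexing application
    private final Map<String, List<Object>> fields = new LinkedHashMap<>();

    SketchDocument(String key) {
        this.key = key;
    }

    String getKey() {
        return key;
    }

    // Adding a value never overwrites: a field accumulates values (multimap).
    void addValue(String field, Object value) {
        fields.computeIfAbsent(field, f -> new ArrayList<>()).add(value);
    }

    // Returns every value for a field, or an empty list if the field is absent.
    List<Object> getValues(String field) {
        return Collections.unmodifiableList(
                fields.getOrDefault(field, Collections.emptyList()));
    }
}
```

Calling `addValue("author", ...)` twice on the same document leaves both values in place, which is the behaviour that distinguishes a multimap from a plain map.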
Fields can have attributes that specify how the field should be treated during indexing and affect what kind of queries can be asked about the field. The attributes are:
- Tokenized: The text in the field will be broken into tokens using our universal tokenizer.
- Indexed: The tokens in the field will be added as terms to the main dictionary. When querying, you can query for terms in a particular field.
- Vectored: The terms in the field will be added to a document vector for this field. These document vectors can be used in document similarity computations as well as classification and clustering operations.
- Saved: An exact copy of the data in the field is stored in the index. Saved fields must have a type associated with them. We currently support string, numeric (64-bit) and date fields. This data can be queried using parametric query operators (e.g., <, substring) and retrieved for display for a particular search result.
Fields can have any combination of these attributes.
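One way to picture "any combination of these attributes" is as a set of flags on each field. The following is a minimal sketch under that assumption; `SketchAttribute`, `SketchType` and `SketchField` are hypothetical names, and Minion's real field configuration may look quite different.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical names; Minion's real field configuration API may differ.
enum SketchAttribute { TOKENIZED, INDEXED, VECTORED, SAVED }

// NONE is used here for fields that are not saved.
enum SketchType { NONE, STRING, NUMERIC, DATE }

final class SketchField {
    final String name;
    final Set<SketchAttribute> attributes;
    final SketchType type;

    SketchField(String name, Set<SketchAttribute> attributes, SketchType type) {
        // Saved fields must have a type associated with them.
        if (attributes.contains(SketchAttribute.SAVED) && type == SketchType.NONE) {
            throw new IllegalArgumentException("saved field '" + name + "' needs a type");
        }
        this.name = name;
        // Defensive copy; EnumSet.copyOf rejects an empty non-EnumSet collection.
        this.attributes = attributes.isEmpty()
                ? EnumSet.noneOf(SketchAttribute.class)
                : EnumSet.copyOf(attributes);
        this.type = type;
    }

    boolean has(SketchAttribute attribute) {
        return attributes.contains(attribute);
    }
}
```

For example, a title field might be created with `EnumSet.of(SketchAttribute.TOKENIZED, SketchAttribute.INDEXED, SketchAttribute.SAVED)` and `SketchType.STRING`, while a field that is only used for similarity computations might carry just `TOKENIZED` and `VECTORED`.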
Minion indexes documents until a (configurable, of course!) memory limit is reached. At this point an index partition is dumped to disk. A partition consists of a number of files. When enough partitions of a given size have been dumped to disk, the partitions are merged into a larger partition. A Minion index that's been in use for quite a while will typically have a couple of large partitions and several smaller partitions (think of a pyramid with the big partitions at the top).
This approach will be familiar to anyone who has used Lucene.
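The dump-and-merge cycle can be sketched with a toy model in which a partition is nothing but a document count. Everything here is an assumption made for illustration: the class name `SketchPartitionManager`, the use of a document count in place of the memory limit, the `mergeFactor` parameter, and the rule that a merge happens when the newest `mergeFactor` partitions are exactly the same size. Minion's actual merge policy isn't described in this post.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model: a partition is represented only by its document count.
final class SketchPartitionManager {
    private final int maxBufferedDocs; // stand-in for the configurable memory limit
    private final int mergeFactor;     // assumed: same-size partitions that trigger a merge
    private final List<Integer> partitions = new ArrayList<>(); // oldest first
    private int buffered = 0;

    SketchPartitionManager(int maxBufferedDocs, int mergeFactor) {
        if (maxBufferedDocs < 1 || mergeFactor < 2) {
            throw new IllegalArgumentException("need maxBufferedDocs >= 1 and mergeFactor >= 2");
        }
        this.maxBufferedDocs = maxBufferedDocs;
        this.mergeFactor = mergeFactor;
    }

    // Buffers one document in memory and dumps a partition when the limit is hit.
    void indexDocument() {
        buffered++;
        if (buffered >= maxBufferedDocs) {
            dump();
        }
    }

    private void dump() {
        partitions.add(buffered);
        buffered = 0;
        // Keep merging while the newest mergeFactor partitions are the same size.
        while (tailIsUniform()) {
            int total = 0;
            for (int i = 0; i < mergeFactor; i++) {
                total += partitions.remove(partitions.size() - 1);
            }
            partitions.add(total);
        }
    }

    private boolean tailIsUniform() {
        int n = partitions.size();
        if (n < mergeFactor) {
            return false;
        }
        int size = partitions.get(n - 1);
        for (int i = n - mergeFactor; i < n; i++) {
            if (partitions.get(i) != size) {
                return false;
            }
        }
        return true;
    }

    // Returns a copy of the on-disk partition sizes, oldest first.
    List<Integer> partitionSizes() {
        return new ArrayList<>(partitions);
    }
}
```

With a limit of 2 documents and a merge factor of 3, indexing 26 documents in this model leaves partitions of 18, 6 and 2 documents: a few large partitions and some smaller ones, as described above.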
In Minion, all indexing is updating: if an application indexes a document with a key that already exists in the index, the old information will be removed and the new information will be returned for queries that match the document.
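Here's a small sketch of what update-on-index means from the application's point of view. It is a toy, not a description of how Minion does it internally: `SketchUpdatingIndex` and its methods are hypothetical, partitions are plain in-memory maps, and eagerly removing the old copy from every partition is an assumption chosen for simplicity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy index: each partition maps a document key to its text.
final class SketchUpdatingIndex {
    private final List<Map<String, String>> partitions = new ArrayList<>();

    SketchUpdatingIndex() {
        partitions.add(new HashMap<>());
    }

    // Stands in for dumping the current partition and starting a new one.
    void startNewPartition() {
        partitions.add(new HashMap<>());
    }

    // Indexing an existing key replaces the old document wherever it lives.
    void index(String key, String text) {
        for (Map<String, String> partition : partitions) {
            partition.remove(key); // assumed: old information is dropped eagerly
        }
        partitions.get(partitions.size() - 1).put(key, text);
    }

    // Returns the keys of all documents whose text contains the given term.
    List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        for (Map<String, String> partition : partitions) {
            for (Map.Entry<String, String> entry : partition.entrySet()) {
                if (entry.getValue().contains(term)) {
                    hits.add(entry.getKey());
                }
            }
        }
        return hits;
    }
}
```

If a document with key "d1" is indexed as "red apple" and later re-indexed as "green pear", a search for "apple" no longer finds it and a search for "pear" returns "d1" exactly once, even when the two versions landed in different partitions.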
A Minion index can safely be opened, indexed into, and queried by multiple Java threads and multiple Java processes. Minion periodically accesses the index to make sure that it has the most up-to-date set of partitions for querying.


Steve, good start.
Your post looks like it would be great as a Wiki article that could evolve over time. Perhaps a Minion Wiki is in order?
Posted by Jeff on April 21, 2008 at 09:53 PM EDT
I'm sure you'll get to this, but can you explain a little about in what circumstances people should consider Minion instead of Lucene?
Posted by Nick L on April 22, 2008 at 06:31 AM EDT
Jeff, in fact you've pretty much hit the nail on the head. One of the things that we need is more documentation about how to use Minion and about the guts of the engine. This series of blog posts is meant to encourage me to write about this stuff and I expect that it will make its way to a wiki in fairly short order.
Nick, I'll be posting about Minion and Lucene within the next couple of days.
Posted by Stephen Green on April 23, 2008 at 08:46 AM EDT