Wednesday Jun 17, 2009

Our intern arrived at the beginning of June and promptly began putting bugs into my pristine and utterly bug free search engine code.

One of the bugs he planted/discovered was that the recent change to use a BST for part of the dictionary lookup resulted in a BST search that was a bit, shall we say, overzealous in searching the tree. I fixed the bug and committed the change. Today I had a bit of time, so I decided to re-run the multi threading tests to see how the change affected our dictionary lookup times.

I was expecting the times to be worse, because I thought that perhaps we were not finding the right entries (but the dictionary code checks for that), but here's what I found:

# of threadsTotal Number of LookupsAverage Lookup Time (ms)
12385001.26
23010001.99
45165002.32
85885004.08
1610080004.77
3245200021.66
6456650034.48
128116350033.62
200167950036.30

That's a huge speedup at the low end: somewhere between 3 to 5 times! Something weird happens in the middle where the results are worse, but then at the top end is settles down to be more than twice as fast. Also notice that I added a result for 200 threads (hey, why not, right?)

Here's the graph:

I have to tell you: it's pretty cool to see a machine with 256 hardware threads looking like this in prstat:

   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP       
  1529 stgreen   100M   86M cpu169  30    0   4:00:48  71% java/223

Onward to query performance...

Thursday May 28, 2009

I've been working on the scalability properties of the dictionaries again. The last time, thanks to Jeff, we managed to get the dictionary to scale pretty well up to 32 threads and acceptably up to 64 threads by removing the synchronization bottlenecks on allocating lookup states during dictionary lookups and by using positional reads on the files containing the dictionaries.

Once we'd done this, I went back to collect/analyze and had a look at where the synchronization time was being spent during a run of our dictionary test program with 128 threads. I was surprised to see that a lot of synchronization time was being spent adding elements to the caches that we're mainitaining. This was a bit weird, because there's no explicit synchronization in the java.util.LinkedHashMap that we were using for the cache. I suspected that we were hitting an allocation problem with the objects that it allocated to put on the linked list that preserves the insertion order.

Aside from the synchronization problems, the main problem with the caches, in particular the entry-by-name cache, is that we're not getting that many hits on the cache during querying. In some of our query tests, the cache hit rate is about 10%, which means we're doing a lot of synchronization for not a lot of benefit.

So, I did away with both the entry-by-name cache and the name-by-position cache that we were using. The name-by-position cache actually was used: over time it was building up the top few layers of the binary search tree for the names in the dictionary. Unfortunately, this useful property was overwhelmed by the need to synchronize on the cache to fetch out the entry name at a given position while binary searching.

So I decided to add an honest-to-goodness binary search tree to the dictionary. This tree is populated when the dictionary is loaded, and since it never changes, it's completely thread safe. Because we took out the other caches and because the BST is so simple, we can afford to devote more entries to the BST. Every level that we add to the tree may save us a read from the underlying file, which is good, because the fastest read is the one that you don't have to make.

Each node in the BST stores the name at that position in the dictionary and the upper and lower extents of the search, so that once we've traversed the BST we know which chunk of the dictionary entries to continue the search in.

Here's the results for the new dictionary test, where we use a 1024 node BST (i.e., a tree 9 levels deep), along with a graph (note the log scale!) comparing this one to the old one:

# of threadsTotal Number of LookupsAverage Lookup Time (ms)
1485006.20
2855007.04
41520007.91
82640009.16
1645750010.63
3275650012.86
6481700023.86
12852500074.22

That gives us a speedup of about 1.8 times at 64 and 128 threads, and a slightly smaller speedup of 1.6 times with smaller numbers of threads.

The last time I did this, I got a comment asking why we were even hitting the disk when we had so much memory on this box. That's a good question, but I'm pretty sure that we are hitting the disk. If we copy the index into a RAM disk and run the dictionary test, here are the numbers and the corresponding graph:

# of threadsTotal Number of LookupsAverage Lookup Time (ms)
16850000.43
210720000.56
419270000.62
833875000.71
1638580001.24
3239335002.44
6439605004.86
12839925009.67

So, yeah, I'd say we're hitting the disk. That 9.67ms per lookup with 128 threads is pretty nice. That's about 123 times faster than original code was doing with a disk-based lookup.

While I was debugging my BST, I went ahead and modified the DictionaryFactory so that when you open a dictionary that is smaller than the specified cache size, we just return you a CachedDiskDictionary that loads the dictionary into a hash table when the dictionary is opened, since it all would have been cached eventually anyways.

Tuesday May 19, 2009

Another post about Minion's dictionaries today. We recently got hold of a really big box: it has 256 hardware threads and 256GB of RAM. This lead us to ask the question: How does Minion scale on this kind of hardware. Our initial experiments running queries on a couple of million documents with a varying number of threads (powers of 2 up to 128) showed us that as we increased the number of threads we were spending more and more time doing dictionary lookups.

Because of our EIDAP philosophy, we need to be sure that our dictionaries have good performance especially the multi-threaded case. We've tried out things on 4 or 8 processor machines, but nothing like the new beast. Although I'm writing about it, Jeff did all of the hard work here. The Sun Studio collect/analyze tools turned out to be exceedingly useful for doing this analysis.

We built a test program that selects a number of terms from a dictionary on-disk and then looks them up in the dictionary. A lot. For the runs that we'll be describing, we selected 10,000 terms. This list of terms is handed out to a number of threads. Each thread shuffles its list and then looks up the terms from its list in the dictionary until a time limit (300 seconds by default) passes.

Here's the state of affairs before we started:

Number of threadsTotal Number of LookupsAverage Lookup Time (ms)
13000010.05
23100019.49
43100039.31
83250077.97
1635500152.74
3242000296.15
6452500565.72
128640001195.01

Oh, dear. Not very good: we're pretty close to doubling the time when we double the number of threads, which is kind of the opposite of the definition of scalability. These times are fairly acceptable when we're doing querying with a small number of threads, because they're swamped by all of the other work that we're doing, like uncompressing postings. Once we get up to larger numbers of threads (around 16 or 32), the dictionary lookup time starts to dominate the times for the other work.

We started out by inspecting the code for doing a get from the dictionary. I described the way that it worked in a previous post, but the basic idea is that we do a binary search to find the term. We have an LRA cache for entries in the dictionary indexed by name that is meant to speed up access for commonly used terms. We also have an LRA cache for entries indexed by their position in the dictionary that is meant to speed up the binary search. Since dictionary access is multithreaded, we need to synchronize the cache accesses.

This was the only synchronization that was happening in the dictionary's get method, so we figured that was what was causing the scalability bottleneck. Unfortunately, a quick change to reduce the amount of synchronization by removing the entry-by-position cache didn't make any difference!

This is where collect/analyze comes in. It turns out that it can do a pretty good job of giving you visibility into where your Java code is spending its synchronization time, but it also shows you what's happening underneath Java as well. Jeff ran up the tools on our big box and I have to say that we were surprised at what he found.

The first step of a dictionary fetch is to create a lookup state that contains copies of the file-backed buffers containing the dictionary's data. Although we provided for a way to re-use a lookup state, the test program was generating a new lookup state for every dictionary lookup, which meant that it was duplicating the file-backed buffers for every lookup. The tools showed us two synchronization bottlenecks in the buffer code: the first was that we were using a non-static logger, and getting the logger caused synchronization to happen. The second was that we were blocking when allocating the byte arrays that we used to buffer the on-disk dictionary data.

We were surprised that the allocator wasn't doing very well in a multithreaded environment, and it turns out that there are (at least) two multithreaded allocators that you can use in Solaris. Unfortunately, Java isn't linked against these libraries, so using them would require LD_PRELOAD tricks when running Minion. We've always tried to avoid having to have a quadruple bucky Java invocation, and we didn't want to start now.

The answer was to use thread-local storage to store a per-thread lookup state. When a thread does its first dictionary lookup the lookup state gets created and then that lookup state is used for all future lookups. Don't worry: Jeff was careful to make sure that the lookup states associated with unused threads will get garbage collected.

Once we had that working, we re-ran our tests and got better results, but still not great. So, back to collect/analyze. Now we were seeing a bottleneck when reading data from the file. This turned out to be synchronization on the RandomAccessFile in the FileReadableBuffer. In order to read a chunk of data from the dictionary, we need to seek to a particular position in the file and then read the data. Of course, this needs to be atomic!

An NIO FileChannel offers a positional read method that does the seek-and-read without requiring synchronization (this may not be the case on some OSes, so caveat optimizer!) Our final modification was therefore to introduce a new file-backed buffer implementation, NIOFileReadableBuffer that uses a FileChannel and an NIO buffer to store the data.

We added a configuration option to the dictionaries so that we could select one or the other of the file-backed buffer implementations and then re-ran our tests. Here's the results after this change, along with a nice graph.

# of threadsTotal Number of LookupsAverage Lookup Time (ms)Speedup
13000010.141.0
25800010.401.9
43100012.413.2
83250015.115.2
163550017.748.6
324200022.4113.2
645250043.8012.9
12864000131.769.1

Clearly, this is a keeper. At 32 threads we're doing a lot better than we were at 4 threads and almost better than we were doing at 2 threads! We start to see the times doubling again as we get to 64 and 128 threads.

Because of the nature of the test, we're not hitting the dictionary cache very much, so I expect we're starting to run into some contention for the disk here (the index is residing on a ZFS pool that's on disks that are in the box). Of course, I thought I knew what the problem was at the beginning, so back to collect/analyze we go!

Friday May 01, 2009

Here in the Labs we have a Big Data reading group. The idea is that we get together once a week and discuss a paper of interest. We've covered a lot of the famous ones, like the initial papers for GFS and MapReduce. A couple of weeks ago, I volunteered to tackle the paper from Stanford that lays out methods for running a number of standard machine learning techniques in a MapReduce framework.

The Apache Mahout project was started to build the algorithms described in the paper on the Hadoop MapReduce framework (the original paper describes running the algorithms on multicore processors.) They've also brought in the Taste Collaborative Filtering framework, which is interesting to us as recommendation folks. As it turns out, they had just released Mahout 0.1. around the time we were going to read the paper.

Coincidentally, Amazon had just announced their Elastic MapReduce (EMR) service that lets you run a MapReduce job on EC2 instances, so I decided to see what it would take to get Mahout running on EMR.

I didn't manage to get it running in time for the reading group, but one Mahout issue and a few "Oh, that's the way it works"es later, I had it running.

Apparently I'm the first person to have run Mahout on Elastic MapReduce, which just shows, as my father used to say, that brute force has an elegance all its own.

If you're interested the details are on the Mahout wiki.

Tuesday Apr 21, 2009

Leonard Lin has some interesting notes on distributed key stores. He implemented a distributed key store for a client based on Tokyo Cabinet and a consistent hashing scheme to distribute data.

The really interesting part of this though, is that he had a look at a lot of the available options (admittedly on a pretty tight schedule) and his conclusion is that, given the maturity of the options available you could probably write your own in a day.

I'm pretty interested in retrying some of this evaluation with our own data. I'm not sure how the numbers will compare given that we're doing inverted indexing on the text in the items that we're storing, but it will be interesting to find out!

Thursday Apr 09, 2009

A new album from The Tragically Hip came out this week. I'm listening to it now. I bought it from Amazon, and I just went to Amazon to see if the fine folks there could recommend any related artists for me. Today, after I searched for the band, Amazon offered a link to the "Tragically Hip Store", where "a fan can be a fan." I'm a fan, and I sure do want a place to be a fan, so let's check it out!

Here's The Tragically Hip Store at Amazon:

Now, don't get me wrong, I like Amazon (it's where I bought the new album, after all), but I'm very surprised to see that there's no recommendations on the page. There's links to the albums, but wouldn't a fan who's looking for a place to be a fan already have the albums and possibly be looking for other music?

Over there is The Tragically Hip on the Explaura where the it's all about the recommendations, and a fan can find other things to be a fan of.

Wednesday Apr 08, 2009

Here's a handy hint for controlling a load of EC2 instances (or some other griddy type of thing) from a single terminal in MacOS.

Check out the awesome screenshot:

Of course, if your data store is running on 128 instances, you're going to have teeny-tiny terminal windows...

Monday Apr 06, 2009

As a Canadian, I grew up in a culture that was dominated to a pretty large extent by American culture (e.g., watch any good Canadian movies lately?) The one area where this was not as much of an issue was music, mostly due to the CanCon regulations that required radio stations to air a certain percentage of Canadian music. This could lead to some bad results (hearing enough Celine Dion on the radio, yet? No? Well, here's some more!), but also to some good results.

I did my postdoc work at Macquarie University in Sydney. I was already a fan of a couple of Aussie bands (most notably Midnight Oil) but I was pleasantly surprised to discover a lot of other great Aussie bands (e.g., Regurgitator and Spiderbait — you gotta love a band that can put together a two minute song that leaves you wanting more) mostly by listening to Triple M and Triple J.

A bug report for the Explaura dropped me right back into this Aussie music. I must admit that I had forgotten how much I liked some of this music. To a first approximation you would never hear about these bands in the states, so that's one in the win column for the Explaura.

Friday Apr 03, 2009

Paul said some pretty nice things about The Explaura today. He mentioned that I hadn't talked much about the backend system architecture, which we call the AURA Data Store (or just "the data store"). This is something that I will be talking about in future posts, but since he mentioned it, I wanted to give you an idea of what the AURA data store is like.

Here's a snapshot of our data store browsing and visualization tool (click for the full sized version):

You can see that the data store has almost 2 million items in it. These are the actual artists along with items for the photos, videos, albums and so on. Along with that are about 1.2 million attentions. An attention is a triple consisting of a user, an item, and an attention type. So, for example, if we know that a particular user listened to a particular artist then we would add an attention for that user with the item corresponding to that artist, and an attention type of LISTEN.

The layout of the visualization tool gives you a pretty good idea of the layout of the data store. At the top level there are data store heads, which is what clients of the data store will use for adding and fetching items and attention and for doing searches in the data store.

The data store heads talk to the second level of the data store, which is composed of a number of partition clusters. Each of the partition clusters manages a number of replicants (currently that number is 1, but stay tuned throughout the summer), which is meant to provide for scalabilty and redundancy in the data store.

Each replicant is composed of a BDB key-value store and a Minion search index. BDB is responsible for storing and retrieving the item and attention data and Minion is responsible for almost all of the searching and all of the similarity computations.

The addition of the Minion search index means that the AURA data store is a little different from typical key value stores because it makes it possible to ask the kinds of queries that a search engine can handle (e.g., find the artists that contain the word progressive in their Wikipedia bios, find artists whose social tags are similar to a particular artist) as well as the kind of queries that a traditional key-value store can handle (e.g., fetch me the item with this key.)

We refer to this data store as a 16-way store, because there are 16 partition clusters and 16 replicants. Each of the boxes that you see in the visualization represents a JVM running a chunk of the data store code. We use Jini in combination with our configuration toolkit to do service registration and discovery. It's all running on the Project Caroline infrastructure, and we'll be migrating it to the new Sun Cloud offering as soon as we can.

In our load tests, a data store like this one was capable of serving about 14,000 concurrent users doing typical recommendation tasks (looking at items, finding similar items, adding attention data) with sub-500ms response time at the client.

Thursday Apr 02, 2009

Doop-de-do, logging in to the new machine.

[stgreen@hogwarts 13:38:21 ~]$ 

Huh. I wonder how many processors this machine has.

[stgreen@hogwarts 13:38:22 ~]$ /usr/sbin/psrinfo -v
Status of virtual processor 0 as of: 04/02/2009 13:38:26
  on-line since 03/31/2009 14:56:19.
  The sparcv9 processor operates at 1414 MHz,
        and has a sparcv9 floating point processor.
Status of virtual processor 1 as of: 04/02/2009 13:38:26
  on-line since 03/31/2009 14:56:22.
  The sparcv9 processor operates at 1414 MHz,
        and has a sparcv9 floating point processor.
Status of virtual processor 2 as of: 04/02/2009 13:38:26
  on-line since 03/31/2009 14:56:22.
  The sparcv9 processor operates at 1414 MHz,
        and has a sparcv9 floating point processor.

[...Many lines deleted...]

Status of virtual processor 253 as of: 04/02/2009 13:38:27
  on-line since 03/31/2009 14:56:23.
  The sparcv9 processor operates at 1414 MHz,
        and has a sparcv9 floating point processor.
Status of virtual processor 254 as of: 04/02/2009 13:38:27
  on-line since 03/31/2009 14:56:23.
  The sparcv9 processor operates at 1414 MHz,
        and has a sparcv9 floating point processor.
Status of virtual processor 255 as of: 04/02/2009 13:38:27
  on-line since 03/31/2009 14:56:23.
  The sparcv9 processor operates at 1414 MHz,
        and has a sparcv9 floating point processor.

OK. That's a lot of processors there. I wonder how much memory it has.

[stgreen@hogwarts 13:38:27 ~]$ /usr/sbin/prtconf | head -2
System Configuration:  Sun Microsystems  sun4v
Memory size: 261856 Megabytes

Uh, wow. That's a lot of RAM. I wonder how much disk space.

[stgreen@hogwarts 13:43:49 ~]$ df -hl 
Filesystem             size   used  avail capacity  Mounted on
/dev/dsk/c0t0d0s0       15G    10G   4.6G    69%    /
swap                   161G   414M   161G     1%    /tmp
swap                   161G    56K   161G     1%    /var/run
scratch                134G    60G    74G    45%    /scratch

Heh. So I guess at start up we should just cache the disk then, right?

Anyone have any single instance, multi-threaded search scalability tests they want me to try?

Wednesday Apr 01, 2009

Today I'm (finally!) announcing the first offering from the AURA Project: The Music Explaura. The Explaura is a way for you to explore musical artists and find new ones that you might like, based on the words that people have used to describe the artists. We call the set of words used to describe an artist the textual aura for that artist.

You start out by searching for an artist that you know, say one of your favorite bands. The data store contains information for about 30,000 artists. Over on the left, you can see what the Explaura knows about one of my favorite bands, The Tragically Hip.

It's a bit hard to see (embiggened version), but this gives you some idea of the information that the Explaura collects for each band. There's a tag cloud (more on that in a bit), the artist's bio from Wikipedia, videos from YouTube, photos from Flickr, album covers from Amazon and upcoming events from Upcoming. You can click on the play icon to listen to that artist's radio at Last.fm.

On the left of the artist page, you see the list of similar artists generated by the AURA recommenders. This list of artists is generated using a technique that's quite a bit different than you're probably used to. Rather than relying on the wisdom of the crowds via a technique like collaborative filtering, the AURA system computes the similarity between artists by computing the similarity between their textual auras.

The tag cloud that the Explaura displays for an artist is a portion of the textual aura that the system uses to compute the similarity between two artists (in this case, it's social tags collected from Last.fm.) This cloud is a little different than the tag clouds that you typically see: here the size of a tag is not proportional to its frequency, but rather to its importance for this artist. Here's a better view of the cloud for The Tragically Hip:

As you can see, The Hip are a Canadian band that plays energetic, indie rock. How do we compute the importance of a particular tag in the cloud? Using our good friend from the information retrieval world, TFIDF. The idea is that a tag is important for an artist if it is applied frequently to that artist and infrequently to other artists (i.e., it does a good job of distinguishing this artist from others.)

Because we're using the textual aura to compute the similarity, it's easy to generate a set of words that explain the similarity betwen two artists. If you click on the "Why?" link next to one of the recommended artists, you'll be shown the overlap tag cloud for these artists. Here's the overlap cloud for The Tragically Hip and Sloan:

In this tag cloud, the size of a tag is related to how much that particular tag contributed to the similarity between the artists. So the fact that both The Hip and Sloan are Canadian played a pretty big part in their similarity, along with the fact that they're both literate indie rock outfits.

One more thing about the artist's tag cloud: if you click on one of the tags in this cloud, you'll be taken to a page for that tag. This page will look a lot like the artist page: it shows information about the tag itself including the artists for whom the tag is important. The tag cloud that is shown on the tag page is built from the tags that are most similar to the tag that you clicked on. Here's the tag page for classic rock:

But what if I want things that are like The Hip, but I don't just want Canadian music? That's where steerability comes into play. Each artist has a little steering wheel icon next to it. When you click on that icon you're taken to the steering interface:

The steering interface starts out with a tag cloud that has the most important tags for the artist. On the left, you see the artists that the AURA system recommended based on their similarity to this steering tag cloud. On the right, you can see a selection of tags from the artist. Clicking on one will add it to the steering cloud. Note that as you add tags, the recommended artists are updated in real time. You're not restricted to the tags that have been applied to that particular artist, either. You can search for tags to add using the handy search box.

The really cool thing here is that the tag cloud is interactive: you can drag a tag to increase or decrease its importance. If you drag upwards on a tag, the tag gets larger and more important. If you drag downwards on a tag, the tag gets smaller and less important. If you drag a tag small enough, it goes negative and is shown with a strike-through. When a tag is negative, no artists with that tag will be recommended.

If we make the canadian tag smaller, then it's less important and we get bands for which canadian is less important. We can add the literate tag (because we like literate music!) and make it bigger, which makes it more important. Again, the recommendations are updated for each change in the cloud, so you get direct feedback as to how your changes are affecting the recommended artists. Here's my new steering page:

And there's a band that I've never seen before: Classic Case. Now I can click on the play button and see if I like their music.

If you don't want a tag in the steering cloud, you can right-click on it and select "Delete" from the menu. If you click on "Sticky" in that menu, then any recommended artists must have that tag in their aura. You can click on "Negative" in this menu to quickly make a tag negative.

It's probably a lot easier to see this in the demo video that Paul made last year:

As you can see, there's been lots of updates since the video was made, but there's still lots more to be done (for example, it's very annoying that canada and canadian are considered to be different tags), but we're pretty proud of how good the recommendations are turning out to be. I've discovered several new bands that I like using the Explaura.

There's a link for email feedback at the bottom of the Explaura interface, so let us know what you think. I'll be posting more about the Explaura and AURA in the future, so stay tuned.

Thursday Mar 26, 2009

Well, Penny Arcade pretty much nails cloud computing. It happens in the sky!.

OnLive does sound good, but I'll be interested to see how bad the input lag is. If it works, can I finally get Verizon to give me a SunRay desktop at home, so that I don't have to dick around with Windows anymore?

And while we're at it: Violence Fight!

Monday Mar 23, 2009

From CNet, we see that Craigslist bests MySpace as top search term. The article contains a graph from hitwise showing the relative popularity of searches for craigslist, myspace, and facebook.

This is weird to me, because these are the actual domain names of the things that the users are (in all likelihood) searching for. Typing them into the location bar in Firefox takes you right to the appropriate place in each case. But a search engine is how you find things right?

It seems pretty obvious to me that anyone searching for one of these is unlikely to want to find all the occurrences of the term craigslist and it seems like the obvious thing to do with such a query is to just have the search engine issue a redirect to the appropriate resource. Of course, Google lives by showing you ads, so they're not going to do that, but they should (and do!) have the site come up at the top of the listings.

Not to give away too much, but if you look at the top searches for Sun's intranet, you see the same phenomenon: a number of the top searches in any given month are the actual host names for internal tools (the one you use to book vacation, the one you use to reserve a conference room, etc.) as well as for things like java.

Properly speaking, this is a search problem, but not a search engine problem, per se. A search engine can easily find all the places where a given word (or domain name) occurs, but it takes someone who understands the user community of the search engine and the purposes to which a search interface is going to be put to make sure that useful things happen.

It seems to me that if you're going to offer a search interface to a wide community, you need to be sure that you're watching your query logs and noticing things like the fact that people are search for the internal tools and then making sure that queries for those things are being handled in the appropriate way.

For example, behind the firewall, a search for a particular Web-based application should simply re-direct to that application (perhaps with a delay for the one-in-a-million person who actually wanted to search for that term.) A search for a person's name (or a close spelling) should generate a results page with just their information from the corporate directory.

In my experience, however, search seems to start cycling around how much stuff is being crawled and what document formats are supported for indexing, rather than on how it can be made truly useful.

At least it means there's lots of low-hanging fruit for a dedicated search guy!

Friday Mar 20, 2009

I've actually started to convert our AURA project code to use the new Minion query API, and as a result I decided that having a single Relation class for all of the relational operators really makes code that uses the query API look ugly.

I've just checked in a change to the query API that provides individual classes for the relational operators. These are all simple specializations of the Relation class, but it makes the code using the API a lot prettier. For example, this:

        Element query = 
	   new And(new Relation("aura-type",
                                Relation.Operator.EQUALS,
                                "artist"),
                   new Relation("aura-name", 
		                Relation.Operator.SUBSTRING,
        			artistName));

becomes this:

        Element query = 
	   new And(new Equals("aura-type", "artist"),
                   new Substring("aura-name", artistName));

Which is a lot neater, in my opinion. Don't ever forget that your API is the user interface for your library!

While debugging some problems yesterday I discovered the need to print out the queries, so while I was in there, I added toString methods that produce a nice string version of a query expressed using the API. The above query will be converted to the string:

(And (= aura-type artist) ( aura-name coldplay))

(assuming that artistName is "coldplay", obviously!) And yes, that is an s-expression. Lisp programmers represent! (OK, OK, it's not really an s-expression (spaces in the field values will screw things up), but it's close, and the parentheses delineate the expressions nicely.)

The changes to the code have been committed and we should have new javadoc up later on today.

Wednesday Mar 18, 2009

Mostly just a note to myself here, but this post on speeding up K-means clustering should come in handy when I get back to Minion's clustering code.

I think we might already be covered, because we're usually clustering document vectors, and those are represented sparsely by nature. The memory locality is something we'd need to keep in mind, though.

They're getting a minute per epoch on a "few hundred thousand" text messages. I'll have to see what Minion's clustering performance would be like on that size of data. In our internal use of the clustering we have good interactive-time (i.e., you can run clustering as part of an async HTTP request in a Web app) clustering performance up to about a thousand (short) docs using K-means.

This blog copyright 2009 by searchguy