Mark A. Basler's Weblog

Friday March 10, 2006

Lucene Search Engine, Web Crawlers and Tagging...

Recently, I have been working on adding search functionality to a soon-to-be-released JavaEE5 application that deploys on Glassfish.  After looking over the open source search engine options that were available for us to bundle with the application, we decided on the Lucene search engine available from Apache.  Lucene is a robust search engine that supplies APIs that let the developer design an indexing scheme to match their needs.  The project also follows the same methodology I strongly believe in: keep it simple (KISS).  Lucene doesn't supply the functionality that actually walks through your data to create the indexes, or the functionality to capture search criteria and display the results, because its role is strictly to be a search engine.  The Lucene Tutorial, with the accompanying demo applications, walks through common index creation and search scenarios and shows a straightforward way to write your own interfaces between your data and Lucene.

Web Crawlers:
A common desire is to have a web crawler or robot walk the web site, indexing relevant content so it can be searched later.  There are third-party alternatives, such as Apache's Nutch, which is built on top of Lucene, or WebSPHINX, which can be modified to store data in Lucene.  These and other third-party open source solutions will save a lot of time.  If you decide to go the web crawling route, you could try to write your own like I did using the JavaSE5 javax.swing.text.html.parser classes, but it is laborious, and the time goes into the crawler rather than into the real problem of getting your data correctly indexed.  I did find the effort very educational, but it is not to be attempted by the faint of heart.
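To give a feel for what the hand-written route involves, here is a minimal sketch of the page-scanning part only, using the JDK's javax.swing.text.html.parser classes.  It is not the crawler described above: the class name PageScanner and its methods are made up for illustration, and a real crawler would still need URL fetching, link resolution, duplicate detection and robots.txt handling.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

/** Hypothetical helper: collects link targets and text from one HTML page. */
final class PageScanner extends HTMLEditorKit.ParserCallback {
    private final List<String> links = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        // Remember the href of every anchor so the caller can queue it for a later visit.
        if (tag == HTML.Tag.A) {
            Object href = attrs.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                links.add(href.toString());
            }
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        // Accumulate text chunks; this is what would be handed to the indexer.
        text.append(data).append(' ');
    }

    List<String> getLinks() {
        return links;
    }

    String getText() {
        return text.toString().trim();
    }

    /** Parses the given HTML string and returns the populated scanner. */
    static PageScanner scan(String html) {
        PageScanner scanner = new PageScanner();
        try {
            new ParserDelegator().parse(new StringReader(html), scanner, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return scanner;
    }
}
```

Calling PageScanner.scan(html) on a fetched page gives the links to follow and the text to index.  Content that only appears after a JavaScript event fires never reaches the callback, because the parser only sees the markup as served.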

Our Approach:
Considering our needs, and the fact that our JavaEE5 application's web interface is largely based on Web 2.0 techniques that use AJAX to present most page content, we decided to write our own interface that pulls data from the database to create the indexes.  We found that when page content is served using AJAX and DHTML, the web crawling paradigm becomes convoluted, because it is hard for the crawler to know which content is associated with a specific item.  This is especially true when the content is retrieved through JavaScript events that the crawler never fires, like a JavaScript mouseover.  The problem can be mitigated by methodical use of the robots.txt file and of meta tags in the HTML pages that are served, so the web crawler is given the correct pages with the correct data.  Amazon uses this approach by including meta tags like "description" and "keywords" coupled with a restrictive robots.txt file to help companies like Google index their site correctly.
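As an illustration of the meta tag side of this, the sketch below renders "description" and "keywords" tags from an item's data on the server, so a crawler sees them without running any JavaScript.  The class and method names (MetaTags, forItem) are hypothetical and not taken from the application; the only point is that the values come from the same data that backs the item and are escaped before being written into the page.

```java
import java.util.List;

/** Hypothetical helper that renders crawler-facing meta tags for one item. */
final class MetaTags {

    /** Escapes the characters that would break an HTML attribute value. */
    static String escape(String value) {
        return value.replace("&", "&amp;")
                    .replace("\"", "&quot;")
                    .replace("<", "&lt;")
                    .replace(">", "&gt;");
    }

    /** Renders a single meta tag with the given name and content. */
    static String meta(String name, String content) {
        return "<meta name=\"" + escape(name) + "\" content=\"" + escape(content) + "\">";
    }

    /** Renders the description and keywords tags for an item, one per line. */
    static String forItem(String description, List<String> keywords) {
        return meta("description", description) + "\n"
             + meta("keywords", String.join(", ", keywords));
    }
}
```

The returned string would be written into the head section of each page that robots.txt leaves open to crawlers.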

We are using meta tags to help external web crawlers index our site properly, but thought our own search results should be as accurate as possible for the items we are offering.  To store the data to be indexed, it was easy to use the new POJO-based persistence APIs in JavaEE5.  With our development environment consisting of Netbeans 5.0 and the Glassfish AppServer, development went very smoothly.  All that was required to make the Lucene APIs available was to package the Lucene library with our application.  One note: be careful where you store your indexes.  If you store them under the deployed application directory, they will be removed when you redeploy or update your application.  We decided to store the indexes under the domain's lib directory (e.g. "/glassfish/domains/domain1/lib/indexDir"), which can be located in Glassfish using System.getProperty("com.sun.aas.instanceRoot") + "/lib/indexDir", but the location is totally up to you.
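A minimal sketch of resolving that directory follows.  The system property name and the "lib/indexDir" suffix come from the paragraph above; the class name IndexLocation, the decision to create the directory if it is missing, and the exception thrown when the property is absent are assumptions made for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper that locates the index directory outside the deployed application. */
final class IndexLocation {
    static final String INSTANCE_ROOT_PROPERTY = "com.sun.aas.instanceRoot";

    /** Returns <instanceRoot>/lib/indexDir, creating it when it does not exist yet. */
    static Path resolve() {
        String root = System.getProperty(INSTANCE_ROOT_PROPERTY);
        if (root == null) {
            // Assumption: outside the app server we fail fast instead of guessing a location.
            throw new IllegalStateException(INSTANCE_ROOT_PROPERTY + " is not set");
        }
        Path dir = Paths.get(root, "lib", "indexDir");
        try {
            Files.createDirectories(dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return dir;
    }
}
```

Because the path hangs off the domain rather than the application, a redeploy leaves the index files in place.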

Tagging:
We also added the functionality to allow users to add their own custom tags to the items, so the items are also searchable by other users through those tags.  Tagging has become very popular and can be seen by browsing pioneering sites like del.icio.us and Flickr.  We wanted the ability to weight the tags based on subsequent user clicks, so the tag information also had to be persisted in the database.
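The weighting scheme itself is not described here, so the following is only one possible reading: each tag carries a click counter, and tags are ranked by that counter.  The class TagClicks is hypothetical, and its in-memory map stands in for the database table that would hold this data in the real application.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical per-item tag store; the map stands in for a persisted table. */
final class TagClicks {
    private final Map<String, Integer> clicksByTag = new HashMap<>();

    /** Registers a tag with zero clicks unless it is already known. */
    void addTag(String tag) {
        clicksByTag.putIfAbsent(normalize(tag), 0);
    }

    /** Counts one user click on the tag. */
    void recordClick(String tag) {
        clicksByTag.merge(normalize(tag), 1, Integer::sum);
    }

    /** Assumed weight: simply the number of clicks recorded so far. */
    int weight(String tag) {
        return clicksByTag.getOrDefault(normalize(tag), 0);
    }

    /** Tags ordered by descending weight, ties broken alphabetically. */
    List<String> tagsByWeight() {
        List<String> tags = new ArrayList<>(clicksByTag.keySet());
        tags.sort(Comparator.comparing((String t) -> clicksByTag.get(t))
                            .reversed()
                            .thenComparing(Comparator.naturalOrder()));
        return tags;
    }

    private static String normalize(String tag) {
        return tag.trim().toLowerCase(Locale.ROOT);
    }
}
```

Keeping the counters in the database rather than only in the search index matters later, when the index entry for an item has to be rebuilt.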

Updating Indexes in Lucene for Tagging:
One thing to keep in mind is that Lucene doesn't allow an item's index entry to be updated in place; the entry has to be deleted and then re-created.  When adding a new tag to an item or updating a document's index entry, you have to be able to access all the data that was originally indexed before re-creating it.  This sounds straightforward, but there is one caveat.  If you index items using an approach that doesn't allow retrieval of all the data in the index, you will have to read the data from a persistent store so the entry can be completely re-created.  You can get into this state when you create an org.apache.lucene.document.Field for the document using the "UnStored" method, or the "Text" method with a Reader.  When these methods are used, the data can't be retrieved via the exposed APIs.  This really isn't a big deal once you factor it into your approach.  Our tagging requirement came after the initial implementation was completed, and it caused some problems that forced us to re-think our indexing scheme.
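The delete-then-re-create cycle can be sketched without Lucene at all.  In the example below, TinyIndex is a toy inverted index written for this illustration, not a Lucene class; like an unstored field, it keeps only the searchable terms and cannot give the original text back.  The reindex method therefore takes the full description and tag list as arguments, standing in for a read from the database.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Illustration only: a toy index that keeps terms but not the original text. */
final class ReindexSketch {

    static final class TinyIndex {
        // term -> ids of the items containing it; the source text is not retained
        private final Map<String, Set<String>> postings = new HashMap<>();

        void add(String id, String text) {
            for (String token : text.toLowerCase().split("\\W+")) {
                if (!token.isEmpty()) {
                    postings.computeIfAbsent(token, k -> new TreeSet<>()).add(id);
                }
            }
        }

        void delete(String id) {
            postings.values().forEach(ids -> ids.remove(id));
            postings.values().removeIf(Set::isEmpty);
        }

        Set<String> search(String term) {
            return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
        }
    }

    /**
     * Replaces the entry for one item. The description and tags must be supplied
     * by the caller (for example from the database), because the index itself
     * cannot return the text it was built from.
     */
    static void reindex(TinyIndex index, String id, String description, Collection<String> tags) {
        index.delete(id);
        index.add(id, description + " " + String.join(" ", tags));
    }
}
```

Adding a tag then amounts to loading the item's description and its complete tag list from the persistent store and calling reindex; passing only the new tag would silently drop everything else that had been indexed for the item.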

Conclusion:
I found this task very educational, both in terms of the impact of Web 2.0 on web crawlers and the general workings of the Lucene search engine.  I will be following up with a more detailed article including code samples once the application is released, but for now let me end this entry with some lessons learned.


- Don't write your own web crawler.  There are many available that can be altered to suit most purposes.  It may seem simple, but once you get into the task, you will realize it is not for the faint of heart.  Keep in mind that with Web 2.0, web crawlers are going to become even more complicated to design.

- Do use a robots.txt file to steer external search engines to the relevant content to be indexed.  You most likely don't want irrelevant data, like pages that perform cart functions, to be indexed.

- Do use meta tags on your relevant pages that are to be indexed (steered by robots.txt) so the search engine knows exactly what to index.  If you let the web crawler try to figure it out, there is a strong possibility that it will be wrong.  This is even more important if your site uses advanced Web 2.0 features to retrieve content.

- Completely work out your indexing scheme, including updates and tagging if applicable, before you finalize your design.  All of an item's indexed data must be available in order to properly re-create its index entry in the event of an update.

- Don't store your indexes where they can be wiped out by an updated version of the application.  Also make sure that the domain's server.policy file grants the application read/write access to the directory you have chosen.



I hope this entry helps someone else save time in their development of an indexing approach - Good Luck - Mark


Posted by basler Mar 10 2006, 02:18:03 PM PST
