Friday March 10, 2006
Mark A. Basler's Weblog
Lucene Search Engine, Web Crawlers and Tagging...
Recently, I have been working on adding search functionality to a
soon-to-be-released JavaEE5 application that deploys on Glassfish. After
looking over the open source search engine options that were available for us to bundle with the application,
we decided on using the Lucene search engine available from Apache.
Lucene is a robust search engine whose APIs let the
developer design an indexing scheme to match their needs. It
also follows the same methodology I strongly believe in: keep it simple
(KISS). Lucene doesn't supply the functionality that actually
walks through your data to create the indexes, or the functionality to
capture the search criteria and display the results, because its role is
strictly to be a search engine. The Lucene Tutorial and its
accompanying demo applications walk through common index creation and
search scenarios, showing a straightforward way to write your own
interfaces between your data and Lucene.
Web Crawlers:
A common desire is to have a web crawler or robot walk the web site,
indexing relevant content so it can be searched later. There are
third-party alternatives, such as Apache's Nutch, which is
built on top of Lucene, or WebSPHINX, which can
be modified to store data into Lucene. These and other third-party
open source solutions will save a lot of time. If you decide
to go the web crawling route, you could try to write your own
like I did using the JavaSE5 javax.swing.text.html.parser classes, but
it is laborious and takes time away from the real
problem of getting your data correctly indexed. I considered
the effort a very educational endeavor, but it is not to be attempted
by the faint of heart.
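To give a feel for what the javax.swing.text.html.parser route involves, here is a minimal sketch, not the crawler described above. It parses a single page held in a string and collects link targets and meta tags. The class name PageScanner and the sample HTML are made up for illustration; a real crawler would still need fetching, URL resolution, duplicate detection, and robots.txt handling.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects <a href> targets and <meta name/content> pairs.
class PageScanner extends HTMLEditorKit.ParserCallback {
    final List<String> links = new ArrayList<>();
    final Map<String, String> meta = new LinkedHashMap<>();

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.A) {
            Object href = attrs.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                links.add(href.toString());
            }
        }
    }

    @Override
    public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        // <meta> has no closing tag, so the parser reports it as a simple tag.
        if (tag == HTML.Tag.META) {
            Object name = attrs.getAttribute(HTML.Attribute.NAME);
            Object content = attrs.getAttribute(HTML.Attribute.CONTENT);
            if (name != null && content != null) {
                meta.put(name.toString().toLowerCase(), content.toString());
            }
        }
    }

    // Parses the given HTML text and returns the populated scanner.
    static PageScanner scan(String html) {
        PageScanner scanner = new PageScanner();
        try {
            // 'true' tells the parser to ignore charset declarations in the page.
            new ParserDelegator().parse(new StringReader(html), scanner, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return scanner;
    }

    public static void main(String[] args) {
        String html = "<html><head><meta name=\"keywords\" content=\"gnome, garden\"></head>"
                + "<body><a href=\"/item/1\">One</a></body></html>";
        PageScanner result = scan(html);
        System.out.println("links: " + result.links);
        System.out.println("meta:  " + result.meta);
    }
}
```

Note that this only sees markup present in the served HTML; anything loaded later by script is invisible to it.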
Our Approach:
Our JavaEE5 application's web interface is largely Web 2.0 based and
uses AJAX to present most page content, so, considering our needs, we
decided to write our own
interface that pulls data from the database to create the
indexes. We found that when page content is served using AJAX and
DHTML, the web crawling paradigm becomes convoluted, because
it is hard for the crawler to know which
content is associated with a specific item. This is
especially true when the content is retrieved through JavaScript events
that the crawler never fires, like a JavaScript
mouseover. This problem can be mitigated by methodical use of the
robots.txt file
and meta tags in the HTML pages, which steer the web crawler to the correct pages with the correct data. Amazon uses this
approach by including meta tags like "description" and "keywords" coupled with a restrictive robots.txt file to
help companies like Google index their site correctly.
We are using meta tags to help external web crawlers index our site
properly, but we wanted our own search results to be as accurate as
possible for the items we are offering. Storing the data to be
indexed was easy with the new POJO-based persistence APIs in
JavaEE5. With our
development environment consisting of Netbeans
5.0 and the Glassfish AppServer,
development went very smoothly. All that was required to make the
Lucene APIs available was to package the Lucene library with our
application. One
note: be careful where you store your indexes. If you store them
under the deployed application directory, they will be removed when you
redeploy/update your application. We decided to store the indexes
under the domain's lib directory (e.g.
"/glassfish/domains/domain1/lib/indexDir"), which can be located using
the Glassfish system property, System.getProperty("com.sun.aas.instanceRoot") +
"/lib/indexDir", but the location is totally up to you.
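Resolving that location might look like the sketch below. Only the com.sun.aas.instanceRoot property and the lib/indexDir path come from the text above. The class and method names are hypothetical, and the fallback to the JVM temp directory when the property is missing (for example, when running outside the app server) is my own assumption.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical helper for locating the index directory outside the deployed app.
class IndexDirLocator {

    // Builds <instanceRoot>/lib/indexDir. Falls back to the JVM temp directory
    // when instanceRoot is null (an assumption, handy for tests outside Glassfish).
    static File resolveIndexDir(String instanceRoot) {
        String root = (instanceRoot != null)
                ? instanceRoot
                : System.getProperty("java.io.tmpdir");
        return new File(new File(root, "lib"), "indexDir");
    }

    // Creates the directory if needed and fails early if it is not writable,
    // e.g. because the server.policy file does not grant access.
    static File ensureWritable(File dir) throws IOException {
        if (!dir.isDirectory() && !dir.mkdirs()) {
            throw new IOException("Could not create index directory: " + dir);
        }
        if (!dir.canWrite()) {
            throw new IOException("Index directory is not writable: " + dir);
        }
        return dir;
    }

    public static void main(String[] args) throws IOException {
        File dir = ensureWritable(
                resolveIndexDir(System.getProperty("com.sun.aas.instanceRoot")));
        System.out.println("Index directory: " + dir.getAbsolutePath());
    }
}
```

Checking writability at startup surfaces permission problems before the first indexing run rather than in the middle of it.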
Tagging:
We also added the functionality to allow users to add their own custom
tags to the items so they are also searchable by other users. Tagging has become
very popular, as can be seen by browsing pioneering sites like del.icio.us and flickr.
We wanted the ability to weight the tags based on subsequent user
clicks, so the tag information also had to be persisted in the
database.
Updating Indexes in Lucene for Tagging:
One thing to keep in mind is that Lucene doesn't allow an indexed
document to be updated in place; the document has to be deleted from
the index and then re-created.
When adding a new tag to an item or otherwise updating a document, you have to
be able to access all the data that was originally indexed before
re-creating it. This sounds straightforward, but there is one
caveat. If you index items using an approach that doesn't allow retrieval of all the data in the index, you will have to read the data from a
persistent store so the document can be completely re-created. You
can get into this state when you create an
org.apache.lucene.document.Field for the document using the
"UnStored" method, or the "Text" method with a Reader. When using these
methods, the data can't be retrieved via the exposed APIs. This
really isn't a big deal once you factor it into your approach. Our
tagging requirement came after the initial implementation was completed,
and it caused some problems that forced us to re-think our indexing
scheme.
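The delete-then-re-create cycle can be sketched without Lucene itself. What follows is a toy model, not Lucene's API: ToyIndex, Item, and the in-memory "database" map are all hypothetical stand-ins. It only illustrates why unstored fields force you back to the persistent store when a tag is added.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class ReindexSketch {

    // Toy index: every field is searchable, but only "stored" fields can be read back.
    static class ToyIndex {
        private final Map<String, Set<String>> postings = new HashMap<>();
        private final Map<String, Map<String, String>> stored = new HashMap<>();

        void add(String id, Map<String, String> storedFields, Map<String, String> unstoredFields) {
            stored.put(id, new HashMap<>(storedFields));
            indexTerms(id, storedFields);
            indexTerms(id, unstoredFields);
        }

        private void indexTerms(String id, Map<String, String> fields) {
            for (String value : fields.values()) {
                for (String term : value.toLowerCase().split("\\W+")) {
                    if (term.isEmpty()) continue;
                    postings.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
                }
            }
        }

        void delete(String id) {
            stored.remove(id);
            for (Set<String> ids : postings.values()) ids.remove(id);
        }

        Set<String> search(String term) {
            Set<String> ids = postings.get(term.toLowerCase());
            return ids == null ? Collections.<String>emptySet() : new TreeSet<>(ids);
        }

        // Unstored fields (description, tags) are deliberately absent here.
        Map<String, String> storedFields(String id) {
            Map<String, String> fields = stored.get(id);
            return fields == null ? new HashMap<>() : new HashMap<>(fields);
        }
    }

    // Hypothetical item record as it lives in the database.
    static class Item {
        final String id, name, description;
        final Set<String> tags = new TreeSet<>();
        Item(String id, String name, String description) {
            this.id = id; this.name = name; this.description = description;
        }
    }

    // Deletes any old entry, then re-creates it from the full database record.
    static void index(ToyIndex index, Item item) {
        Map<String, String> storedFields = new HashMap<>();
        storedFields.put("name", item.name);
        Map<String, String> unstoredFields = new HashMap<>();
        unstoredFields.put("description", item.description);
        unstoredFields.put("tags", String.join(" ", item.tags));
        index.delete(item.id);
        index.add(item.id, storedFields, unstoredFields);
    }

    // The description cannot be read back from the index, so the tag update
    // starts from the database record instead.
    static void addTag(ToyIndex index, Map<String, Item> database, String itemId, String tag) {
        Item item = database.get(itemId);
        item.tags.add(tag);
        index(index, item);
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        Map<String, Item> database = new HashMap<>();
        database.put("42", new Item("42", "Garden Gnome", "A cheerful ceramic statue"));
        index(index, database.get("42"));
        addTag(index, database, "42", "kitsch");
        System.out.println("kitsch -> " + index.search("kitsch"));
        System.out.println("stored fields -> " + index.storedFields("42"));
    }
}
```

The design point is that addTag never tries to rebuild the entry from what the index can return; the database stays the source of truth for every indexed field.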
Conclusion:
I found this task very educational in terms of the impact of Web 2.0 on
web crawlers and the general workings of the Lucene Search Engine.
I will be following up with a more detailed article including
code samples once the application is released, but for now let me end
this entry with some lessons learned.
- Don't write your own web crawler. There are many
available that can be altered to suit most purposes. It may seem
simple, but once you get into the task, you will realize it is not for
the faint of heart. Keep in mind that with Web 2.0, web crawlers
are going to become even more complicated to design.
- Do use a robots.txt file to steer external search engines to the
relevant content to be indexed. You most likely don't want irrelevant data,
like pages that perform cart functions, to be indexed.
- Do use meta tags on your relevant pages that are to be indexed
(steered by robots.txt) so the search engine knows exactly what to
index. If you let the web crawler try to figure it out, there is
a strong possibility that it will be wrong. This is even more
important if your site uses advanced Web 2.0 features to retrieve
content.
- Completely work out your indexing scheme, including updates and
tagging if applicable, before you finalize your design. All of an
item's indexed data must be available to properly re-create the
index entry in the event of an update.
- Don't store your indexes where they can be wiped out by an updated
version of the application. Also make sure that the domain's
server.policy file grants the application read/write access to the
directory you have chosen.
I hope this entry helps someone else save time in their development of
an indexing approach - Good Luck - Mark
Posted by basler
Mar 10 2006, 02:18:03 PM PST

