
Search on corporate intranets is difficult, often because algorithms
based on PageRank don't work particularly well. In short, PageRank is
a “vote”, by all the other pages on the Web, about how important a
page is. A popular document on the corporate intranet may have very
few pages linking to it. Without PageRank, the search results are
ordered largely by keyword frequency, metadata, and currency. This
makes it almost impossible to find a given page with a popular or
overloaded term in the title, such as Solaris 10, Cloud Computing, or
Identity Management.
SunSpace, Sun's internal enterprise wiki with an integrated document
repository, is based around the notion of Community Equity. Each
person and each document is assigned an equity value.
A document's Information Equity is mostly based on:
- Hits or downloads.
- Updates.
- Number of different people accessing or updating the page (very
important in a collaborative wiki!).
- Currency. The equity decreases over time.
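As a rough illustration only (this is not the actual SunSpace formula), the factors above could be combined along these lines. The weights, the half-life, and the choice of exponential decay are all hypothetical, as are the names `DocumentActivity` and `information_equity`.

```python
from dataclasses import dataclass


@dataclass
class DocumentActivity:
    hits: int                   # page views or downloads
    updates: int                # number of edits
    distinct_people: int        # different people who accessed or updated the page
    days_since_activity: float  # age of the most recent activity


# Hypothetical weights and half-life, chosen only for illustration.
HIT_WEIGHT = 1.0
UPDATE_WEIGHT = 5.0
PERSON_WEIGHT = 10.0
HALF_LIFE_DAYS = 90.0


def information_equity(doc: DocumentActivity) -> float:
    """Weight the activity counts, then decay the total as activity ages."""
    raw = (HIT_WEIGHT * doc.hits
           + UPDATE_WEIGHT * doc.updates
           + PERSON_WEIGHT * doc.distinct_people)
    # Assumed exponential decay: the value halves every HALF_LIFE_DAYS.
    decay = 0.5 ** (doc.days_since_activity / HALF_LIFE_DAYS)
    return raw * decay
```

With these made-up weights, a page with 100 hits, 4 updates, and 3 distinct visitors scores 150 on the day of its last activity and 75 ninety days later.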
So how does this fit into search ... ?
SunSpace search has a three-tier architecture. The back end is a
commercial search engine that indexes the content (wiki pages and
uploaded documents) and delivers the search results. We spend a lot of
time tuning this engine so that the optimal weighting is given to
titles, URLs, keywords, and various metadata.
The middle tier is a set of feed readers that monitor all the updates
on SunSpace. The feed readers create a "stubfile" for every document
that contains all the metadata, including equity, tags, creator, and
last-updated date. The feed readers run continuously and notify the
back end whenever a page is created or updated. A big advantage is
that new pages and updates are added to the search index immediately:
no more waiting days for the crawler to find the new documents.
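To make the stubfile idea concrete, here is a minimal sketch of what a feed reader might do for each created or updated page. The JSON layout, the field names, the `Stub` and `handle_feed_entry` names, and the `notify_backend` callback are all assumptions for illustration. The real stubfile format and the search engine's notification interface are not described here.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path
from typing import Callable


@dataclass
class Stub:
    doc_id: str        # hypothetical identifier for the page or document
    equity: float
    tags: list[str]
    creator: str
    last_updated: str  # ISO 8601 date


def handle_feed_entry(stub: Stub, stub_dir: Path,
                      notify_backend: Callable[[str], None]) -> Path:
    """Write the stubfile for one page, then tell the search back end about it."""
    stub_dir.mkdir(parents=True, exist_ok=True)
    path = stub_dir / f"{stub.doc_id}.json"
    path.write_text(json.dumps(asdict(stub), indent=2), encoding="utf-8")
    # Stand-in for whatever call triggers re-indexing in the real engine.
    notify_backend(stub.doc_id)
    return path
```

Passing the notification in as a callback keeps the sketch independent of any particular search engine.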
The front end is where most of the action is. When a user types in a
search query, it is submitted to the back end and 100 results are
returned in XML format. The stubfile for each result is then read. The
initial ordering of the results comes directly from the back-end
search engine, but the user is given the option to Sort by Information
Equity or Sort by Date. We also display other relevant information,
such as tag clouds and communities, for these 100 results. Each
document has a creator, who is generally a primary contributor to the
page. We assemble a list of the creators for the results, credit each
creator with the Information Equity of the results they created, and
display the list sorted by the sum of Information Equity.
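The re-sorting and the creator list described above can be sketched in a few lines. This is an illustration, not the SunSpace code: the `Result` record and the function names are hypothetical, and it assumes the equity, creator, and date have already been read from each result's stubfile.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Result:
    title: str
    creator: str
    equity: float
    last_updated: str  # ISO 8601 date, so string order matches date order


def sort_by_equity(results):
    """Highest Information Equity first."""
    return sorted(results, key=lambda r: r.equity, reverse=True)


def sort_by_date(results):
    """Most recently updated first."""
    return sorted(results, key=lambda r: r.last_updated, reverse=True)


def rank_creators(results):
    """Sum the equity of each creator's results, largest total first."""
    totals = defaultdict(float)
    for r in results:
        totals[r.creator] += r.equity
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


results = [
    Result("Solaris 10 tips", "alice", 30.0, "2009-01-10"),
    Result("Solaris 10 zones", "bob", 50.0, "2009-02-01"),
    Result("Solaris 10 FAQ", "alice", 40.0, "2008-12-05"),
]
print(rank_creators(results))  # [('alice', 70.0), ('bob', 50.0)]
```

Because the totals are computed only over the results of the current query, the same person can rank high for one search and not appear at all for another.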
A few simple use cases:
- Find a new document. I am constantly creating documents on
SunSpace, and generally am not careful with the titles. Two hours later
when I have forgotten the title and need to forward the URL to a
colleague, I simply search for myself, then Sort by Date.
- Find a popular document. The second case refers back to
the beginning of this article. Let's say a colleague mentions a cool
wiki page on Solaris 10 that's all the buzz. I'd search for "Solaris
10", then sort by Information Equity.
- Find the expert. I have a big presentation on Cloud
Computing, and need to seek the advice of a knowledgeable colleague. I
search for Cloud Computing, then refer to the right of the results page
for the people with the most Information Equity (for that particular search).