January 2005
Introduction
There are a lot of areas on TheServerSide that we would like to change. Trust
us. Ever since I joined TheServerSide I have cringed at our search engine implementation.
It didn’t do a good job, and that meant that our users couldn’t
get to information that they wanted. User interface analysis has shown that
search functionality is VERY important on the web (see: http://www.useit.com/alertbox/20010513.html),
so we really had to clean up our act here.
So, we wanted a good search engine, but what are the choices? We were using
ht://Dig, and having it crawl our site, building the index as it went along.
This process wasn’t picking up all of the content, and didn’t give
us a nice clean API to tune the search results. It did do one thing well,
and that was searching through our news. This was a side effect of having news
on the home page, which helps the rankings (the more clicks ht://Dig needed
to navigate from the home page, the lower the rankings).
Although ht://Dig wasn’t doing a great job, we could have tried to help
it on its way. For example, we could have created a special HTML file which
linked to various areas of the site, and used that as the “root”
page for it to crawl. Maybe we could have put in a servlet filter that checked
for the ht://Dig user agent, and returned content in a different manner
(cleaning up the HTML and such).
We looked into using Google to manage our searching for us. I mean, they are
pretty good at searching aren’t they? Although I am sure we could have
had a good search using them, we ran into a couple of issues:
- It wasn’t that easy for us (a small company) to get much information
from them
- For the type of search that we needed, it was looking very expensive
- We would still have had the issues of a crawler-based infrastructure
While we were looking into Google, I was also looking at Lucene. Lucene has
always interested me, as it isn’t a typical open source project. In my
experience, most open source projects are frameworks that have evolved. Take
something like Struts. Before Struts many people were rolling their own MVC
layers on top of Servlets/JSPs. It made sense to not have to reinvent this wheel,
so Struts came around.
Lucene is a different beast. It contains some really complicated low level
work, NOT just a nicely designed framework. I was really impressed that something
of this quality was just put out there!
At first I was a bit disappointed with Lucene, as I didn’t really understand
what it was. Immediately I was looking for crawler functionality that would
allow me to build an index just like ht://Dig was doing. At the time I found
LARM in the lucene-sandbox (and have since heard of various other sub-projects),
but found it strange that this wouldn’t be built into the main distribution.
It took me a day to realize that Lucene isn’t a product that you just
run. It is a top notch search API which you can use to plugin to your system.
Yes, you may have to write some code, but you also get great power and flexibility.
This case study discusses how TheServerSide built an infrastructure that allows
us to index and search our different content using Lucene.
We will chat about:
- High level infrastructure
- Building the Index
- Allowing various index sources
- Tweaking the weighting of various search results
- Searching the Index
- Configuration
- Employing XML configuration files to allow easy tweaking
- Web Tier: we want to search via the web don’t we?
High level infrastructure
When you look at building your search solution, you often find that the process
is split into two main tasks: building an index, and searching
that index. This is definitely the case with Lucene (and the only time when
this isn’t the case is if your search goes directly to the database).
We wanted to keep the search interface fairly simple, so the code that interacts
with the system sees two main interfaces: IndexBuilder
and IndexSearch.
Any process which needs to build an index goes through the IndexBuilder. This
is a simple interface which gives you two entry points to the indexing process:
- By passing individual configuration settings to the class
- E.g. path to the index, if you want to do an incremental build, and
how often to optimize the Lucene index as you add records.
- By passing in an index plan name
- IndexBuilder will then look up the settings it needs from the configuration
system. This allows you to tweak your variables in an external file, rather
than code.
You will also see a main(..) method. We created this to allow for a command
line program to kick off a build process.
Index Sources
The Index Builder abstracts the details of Lucene, and the Index Sources that
are used to create the index itself. As we will see in the next section, TheServerSide
has various content that we wanted to be able to index, so a simple design is
used where we can plug ‘n play new index sources.
The search interface is also kept very simple. A search is done via:
IndexSearch.search(String inputQuery, int resultsStart,
int resultsCount);
e.g. Look for the terms EJB and WebLogic, returning up to the first 10 results:
IndexSearch.search("EJB and WebLogic", 0, 10);
The query is built via the Lucene QueryParser (or rather a subclass that we
created, which you will see in detail later). This allows our users to input typical
Google-esque queries. Once again, a main() method exists to allow for command
line searching of indexes.
Building the Index: Details of the index building process
We have seen that the external interface to building our search index is the
class IndexBuilder. Now we will discuss the index building process, and the
design choices that we made.
What fields should comprise our index?
We wanted to create a fairly generic set of fields that our index would contain.
We ended up with the following fields:
We created a simple Java representation of this data, SearchContentHolder,
which our API uses to pass this information around. It contains the modified
and created dates as java.util.Date, and the full contents are stored as a StringBuffer
rather than a String. This was refactored into our design, as we found that
some IndexSources contained a lot of data, and we didn’t want to follow
the path of:
Get some data from source → Add to some String → Pass the entire String to the index.
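A minimal sketch of why this matters (the class and method names here are illustrative, not the real SearchContentHolder API): appending chunks to a single StringBuffer avoids the repeated copying that plain String concatenation incurs for large content.

```java
// Illustrative sketch: accumulate content chunks in one StringBuffer.
// Repeated String concatenation copies the whole buffer on every append
// (quadratic for large content); StringBuffer appends stay linear.
public class ContentAccumulation {
    public static String concatenate(String[] chunks) {
        StringBuffer full = new StringBuffer();
        for (int i = 0; i < chunks.length; i++) {
            full.append(chunks[i]); // no intermediate String copies
        }
        return full.toString();
    }
}
```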
What types of indexing?
Since the TSS content that we wanted to index is a) fairly large, and
b) largely unchanging, we wanted to have the concept of incremental
indexing, as well as a full indexing from scratch. To take care of this we have
an incrementalDays variable which is configured for the index process. If this
value is set to 0 or less, then we do a full index. Otherwise, content
that is newer (created / modified) than “today – incrementalDays”
is indexed. In this case, instead of creating a new index, we simply
delete the record (if it already exists) and insert the latest data into it.
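As a rough sketch (the class and method names here are ours, not from the TSS codebase), the incremental decision boils down to a flag check and a cutoff date:

```java
import java.util.Calendar;
import java.util.Date;

// Illustrative sketch of the incrementalDays logic described above.
public class IncrementalPlan {
    /** incrementalDays of 0 or less means a full rebuild from scratch. */
    public static boolean isFullBuild(int incrementalDays) {
        return incrementalDays <= 0;
    }

    /** Content created/modified after this cutoff gets (re)indexed. */
    public static Date cutoff(Date today, int incrementalDays) {
        Calendar cal = Calendar.getInstance();
        cal.setTime(today);
        cal.add(Calendar.DAY_OF_MONTH, -incrementalDays); // today - incrementalDays
        return cal.getTime();
    }
}
```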
How do you delete a record in Lucene again? We need the org.apache.lucene.index.IndexReader.
Here is the snippet that does the work:
IndexReader reader = null;
try {
this.close(); // closes the underlying index writer
reader = IndexReader.open(SearchConfig.getIndexLocation());
Term term = new Term("path", theHolder.getPath());
reader.delete(term);
} catch (IOException e) {
... deal with exception ...
} finally {
if (reader != null) { // guard against open() having failed
try { reader.close(); } catch (IOException e) { /* suck it up */ }
}
}
this.open(); // reopen the index writer
Listing 1: Snippet from IndexHolder which deletes the entry from the index
if it is already there
As you can see, we first close the IndexWriter, then we open the index via
the IndexReader. The “path” field is the ID that corresponds to
this “to be indexed” entry. If it exists in the index, it will be
deleted, and shortly after we will re-add the new index information.
What to index?
As TheServerSide has grown over time, we have the side effect of possessing
content that lives in different sources. Our threaded discussions lie in the
database, but our articles live in a file system. The Hard Core Tech Talks also
sit on the file system, but in a different manner than our articles.
We wanted to be able to plug in different sources to the index, so we created
a simple IndexSource interface, and a corresponding Factory class which returns
all of the index sources to be indexed.
public interface IndexSource {
public void addDocuments(IndexHolder holder);
}
Listing 2: Shows the simple IndexSource interface
As you can see, there is just one method, addDocuments(), which an IndexSource
has to implement. The IndexBuilder is charged with calling this method on each
IndexSource, and passing in an IndexHolder. The responsibility of the IndexHolder
is to wrap the Lucene-specific search index (via org.apache.lucene.index.IndexWriter).
The IndexSource is responsible for taking this holder, and adding records to
it during the index process.
Let’s look at an example of how an IndexSource does this by looking at
the ThreadIndexSource.
ThreadIndexSource
This index source goes through the TSS database, and indexes the various threads
from all of our forums. If we are doing an incremental build, then the results
are simply limited by the SQL query that we issue to get the content.
When we get the data back from the database, we need to morph it into an instance
of a SearchContentHolder. If we don’t have a summary, then we simply crop
the body to a summary length governed by the configuration.
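A minimal sketch of that cropping logic (names assumed; the real code lives inside ThreadIndexSource and reads its length from the configuration):

```java
// Illustrative sketch: fall back to a cropped body when no summary exists.
public class SummaryCrop {
    /** Use the given summary if present; otherwise crop the body to maxLength. */
    public static String cropSummary(String summary, String body, int maxLength) {
        if (summary != null && summary.length() > 0) {
            return summary; // an explicit summary always wins
        }
        if (body.length() <= maxLength) {
            return body; // short bodies need no cropping
        }
        return body.substring(0, maxLength);
    }
}
```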
The main field that we search is “fullcontents”. To make sure that
a user of the system finds what they want, we make this field NOT only the body
of a thread message, but rather a concatenation of the title of the message,
the owner of the message, and then finally the message contents itself. You
could try to use Boolean queries to make sure that a search
finds a good match, but we found it a LOT simpler to put in a cheeky concatenation!
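Assuming a small helper along these lines (hypothetical names; the real concatenation happens inside ThreadIndexSource), the field would be built as:

```java
// Illustrative sketch: build the searchable "fullcontents" field as
// title + owner + body, so a default query matches all three.
public class FullContents {
    public static String build(String title, String owner, String body) {
        StringBuffer sb = new StringBuffer();
        sb.append(title).append(' ')
          .append(owner).append(' ')
          .append(body);
        return sb.toString();
    }
}
```

With this, a search for an author’s name or a thread title hits the same field as a search for the body text, with no Boolean query gymnastics.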
So, this should show how simple it is to create an IndexSource. We created
sources for articles and tech talks (and in fact a couple of versions to handle
an upgrade in content management facilities). If someone wants us to search
a new source, we create a new adapter and we are in business.
How to tweak the ranking of records
When we hand the IndexHolder a SearchContentHolder, it does the work of
adding it to the Lucene index. This is a fairly trivial task of taking the values
from the object and adding them to a Lucene document:
doc.add(Field.UnStored("fullcontents", theHolder.getFullContents()));
doc.add(Field.Keyword("owner", theHolder.getOwner()));
Listing 3: Adding a couple of fields from the SearchContentHolder
There is one piece of logic that goes above and beyond munging the data into
a Lucene-friendly form. It is in this class that we calculate any boosts that
we want to place on fields, or the document itself. It turns out that we end
up with the following boosters:
The date boost has been really important for us. We have data that
goes back a long time, and the index seemed to be returning “old reports”
too often. The date-based booster trick has gotten around this, allowing
the newest content to bubble up.
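The article doesn’t show the exact formula, but one plausible reading of the date-base-amount and date-boost-per-count settings from the configuration file is a linear bump per time period since some starting point, something like:

```java
// ASSUMPTION: a linear date boost. The real formula used on TheServerSide
// is not shown; this only illustrates how the two config values could combine.
public class DateBoost {
    /**
     * baseAmount     - boost for the oldest content (date-base-amount)
     * boostPerCount  - extra boost per elapsed period (date-boost-per-count)
     * periods        - how many periods newer this document is than the epoch
     */
    public static float boost(float baseAmount, float boostPerCount, int periods) {
        return baseAmount + boostPerCount * periods;
    }
}
```

The resulting value would then be handed to Lucene via Document.setBoost(), so newer documents score proportionally higher.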
The end result is that we now have a nice simple design which allows us to
add new sources to our index with minimal development time!
Searching the index
Now we have an index. It is built from the various sources of information that
we have, and is just waiting for someone to search it.
Lucene made this very simple for us to whip up. The innards of searching are
hidden behind the IndexSearch class, as mentioned in the high level overview.
The work is so simple that I can even paste it here:
public static SearchResults search(String inputQuery, int resultsStart,
int resultsCount) throws SearchException {
try {
Searcher searcher = new
IndexSearcher(SearchConfig.getIndexLocation());
String[] fields = { "title", "fullcontents" };
Hits hits = searcher.search(CustomQueryParser.parse(inputQuery,
fields, new StandardAnalyzer()));
SearchResults sr = new SearchResults(hits, resultsStart, resultsCount);
searcher.close();
return sr;
} catch (Exception e) {
throw new SearchException(e);
}
}
Listing 4: IndexSearch.search(…) method contents
As you can see, this method simply wraps around the Lucene IndexSearcher,
and in turn envelopes the results as our own SearchResults.
The only slightly different item to note is that we created our own simple
QueryParser. The CustomQueryParser
extends Lucene’s, and is built to allow a default search query to search
BOTH the title, and fullcontents fields. It
also disables the useful, yet expensive wildcard and fuzzy queries. The last
thing we want is for someone to do a bunch of queries such as ‘a*’,
causing a lot of work in the Lucene engine.
Here is the class in its entirety:
public class CustomQueryParser extends QueryParser
{
/**
* Static parse method which will query both the title and the
fullcontents fields via a BooleanQuery
*/
public static Query parse(String query, String[] fields, Analyzer
analyzer) throws ParseException {
BooleanQuery bQuery = new BooleanQuery();
for (int i = 0; i < fields.length; i++) {
QueryParser parser = new CustomQueryParser(fields[i], analyzer);
Query q = parser.parse(query);
bQuery.add(q, false, false); // combine the queries, neither
requiring or prohibiting matches
}
return bQuery;
}
public CustomQueryParser(String field, Analyzer analyzer) {
super(field, analyzer);
}
final protected Query getWildcardQuery(String field, String term) throws
ParseException {
throw new ParseException("Wildcard Query not allowed.");
}
final protected Query getFuzzyQuery(String field, String term) throws
ParseException {
throw new ParseException("Fuzzy Query not allowed.");
}
}
Listing 5: CustomQueryParser.java contents
That’s all folks. As you can see it is fairly trivial to get the ball
rolling on the search side of the equation.
Configuration: One place to rule them all
There have been settings in both the indexing and search processes
that were crying out for abstraction. Where should we put the index location,
the category lists, the boost values, and register the index sources? We didn’t
want to have this in code, and since the configuration was hierarchical we resorted
to using XML.
Now, I don’t know about you, but I am not a huge fan of the low level
APIs such as SAX and DOM (or even JDOM, DOM4j, and the like). In cases like
this we don’t care about parsing at this level. I really just want my
configuration information, and it would be perfect to have this information
given to me as an object model. This is where tools such as Castor-XML, JIBX,
JAXB, and Jakarta Commons Digester come in.
We opted for the Jakarta Digester in this case. We created the object model
to hold the configuration that we needed, all behind the SearchConfig façade.
This façade holds a Singleton instance which loads the configuration via:
/**
* Wrap around a Singleton instance which holds a ConfigHolder
* @return the loaded ConfigHolder
*/
public synchronized static ConfigHolder getConfig() {
if (ourConfig == null) {
try {
String configName = "/search-config.xml";
File input = new File( PortalConfig.getSearchConfig()
+ configName);
File rules = new File( PortalConfig.getSearchConfig()
+ "/digester-rules.xml" );
Digester digester = DigesterLoader.createDigester( rules.toURL() );
ourConfig = (ConfigHolder) digester.parse( input );
} catch (Exception e) {
... deal with exception ...
}
}
return ourConfig;
}
Listing 6: SearchConfig.getConfig() static method to load in the search
configuration
This method tells the tale of Digester. It takes the XML configuration file
(search-config.xml) and the rules for building the object model (digester-rules.xml),
throws them in a pot together, and you end up with the object model (ourConfig).
XML Configuration File
The config file drives the index process, and aids the search system. To “register”
a particular index source, simply add an entry under the <index-source>
element. Here is an example of our configuration:
<search-config>
<!-- The path to where the search index is kept -->
<index-location windows="/temp/tss-searchindex" unix="/tss/searchindex" />
<!-- Starting year of content which is indexed -->
<beginning-year>2000</beginning-year>
<!-- Information on search results -->
<search-results results-per-page="10" />
<!-- Index Plan Configuration -->
<index-plan name="production-build">
<optimize-frequency>400</optimize-frequency>
</index-plan>
<index-plan name="test-build">
<optimize-frequency>0</optimize-frequency>
</index-plan>
<index-plan name="daily-incremental">
<incremental-build>1</incremental-build>
<optimize-frequency>0</optimize-frequency>
</index-plan>
<!-- Category Config Mapping -->
<categories>
<category number="1" name="news" boost="1.3" />
<category number="2" name="discussions" boost="0.6" />
<category number="3" name="patterns" boost="1.1" />
<category number="4" name="reviews" boost="1.08"/>
<category number="5" name="articles" boost="1.1" />
<category number="6" name="talks" boost="1.0" />
</categories>
<!-- Boost Value Configuration -->
<boost date-base-amount="1.0" date-boost-per-count="0.02" title="2.0"
summary="1.4" />
<!-- List all of the Index Sources -->
<index-sources>
<thread-index-source summary-length="300"
class-name="com.portal.util.search.ThreadIndexSource">
<excluded-forums>
<forum>X</forum>
</excluded-forums>
</thread-index-source>
<article-index-source class-name="com.portal.util.search.ArticleIndexSource"
directory="web/tssdotcom/articles" category-name="articles"
path-prefix="/articles/article.jsp?l=" default-creation-date="today"
default-modified-date="today" />
</index-sources>
</search-config>
Listing 7: Sample search-config.xml file
If you peruse the file you see that now we can tweak the way that the index
is built via elements such as <boost>, the <categories>, and information
in <index-sources>. This flexibility allowed us to play with various boost
settings, until it “felt right”.
Digester Rules File
How does the Digester take the search-config.xml and KNOW how to build the
object model for us? This magic is done with a Digester Rules file. Here we
tell the Digester what to do when it comes across a given tag.
Normally you will tell the engine to do something like:
- Create a new object IndexPlan when you find an <index-plan>
- Take the attribute values and call set methods on the corresponding object
- E.g. category.setNumber(…), category.setName(…), etc
Here is a snippet of the rules that we employ:
<?xml version="1.0"?>
<digester-rules>
<!-- Top Level ConfigHolder Object -->
<pattern value="search-config">
<object-create-rule
classname="com.portal.util.search.config.ConfigHolder" />
<set-properties-rule/>
</pattern>
<!-- Search Results -->
<pattern value="search-config/search-results">
<set-properties-rule>
<alias attr-name="results-per-page"
prop-name="resultsPerPage" />
</set-properties-rule>
</pattern>
<!-- Index Plan -->
<pattern value="search-config/index-plan">
<object-create-rule
classname="com.portal.util.search.config.IndexPlan" />
<bean-property-setter-rule pattern="incremental-build"
propertyname="incrementalBuild" />
<bean-property-setter-rule pattern="optimize-frequency"
propertyname="optimizeFrequency" />
<set-properties-rule/>
<set-next-rule methodname="addIndexPlan" />
</pattern>
... more rules here ...
</digester-rules>
Listing 8: A snippet of the digester-rules.xml
All of the rules for the digester are out of scope of this paper, yet you can
probably guess a lot from this snippet. For more information check out <insert
links here>.
So, thanks to another open source tool we were able to create a fairly simple
yet powerful set of configuration rules for our particular search needs. We
didn’t have to use an XML configuration route, but it allows us to be
flexible. If we were REALLY good people, we would have refactored the system
to allow for programmatic configuration. To do that nicely would be fairly
trivial: we would have a configuration interface, and use Dependency Injection
(IoC) to allow the code to set up any implementation (one being the XML file
builder, the other coming from manual coding).
Web Tier: TheSeeeeeeeeeeeerverSide?
At this point we have a nice clean interface into building an index, and searching
on one. Since we need users to search the content via a web interface, the last
item on the development list was to create the web layer hook into the search
interface.
TheServerSide portal infrastructure uses a “home grown” MVC web
tier. It is home grown purely because it was developed before the likes of Struts,
WebWork, or Tapestry. Our system has the notion of Actions (or as we call them,
Assemblers), so to create the web glue we had to:
- Create a web action: SearchAssembler.java
- Create a web view: The search page and results
SearchAssembler Web Action
The web tier action is responsible for taking the input from the user, passing
through to IndexSearch.search(…), and packaging the results in a format
ready for the view.
There isn’t anything at all interesting in this code. We take the search
query input from the user and build the Lucene query, ready for the search infrastructure.
What do I mean by “build the query”? Simply put, we add all of the
query information given by the user into one Lucene query string.
For example, if the user typed “Lucene” in the search box,
selected a date “after Jan 1 2003”, and narrowed the search categories
to “news”, we would end up building:
Lucene AND category:news AND modifieddate_range:[20030101 TO 20100101]
So our code contains small snippets such as:
if (dateRangeType.equals("before")) {
querySB.append(" AND modifieddate_range:[19900101 TO " + dateRange + "]");
} else if (dateRangeType.equals("after")) {
querySB.append(" AND modifieddate_range:[" + dateRange + " TO 20100101]");
}
Listing 9: Example building part of the query string. In this case the
date range.
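Putting those pieces together, a rough sketch of the whole query-assembly step might look like this (the class and method names are ours, not from SearchAssembler.java):

```java
// Illustrative sketch: concatenate user input into one Lucene query string,
// mirroring the category and date-range snippets shown above.
public class QueryAssembler {
    public static String buildQuery(String userQuery, String category,
                                    String dateRangeType, String dateRange) {
        StringBuffer querySB = new StringBuffer(userQuery);
        if (category != null) {
            querySB.append(" AND category:").append(category);
        }
        if ("before".equals(dateRangeType)) {
            querySB.append(" AND modifieddate_range:[19900101 TO ")
                   .append(dateRange).append("]");
        } else if ("after".equals(dateRangeType)) {
            querySB.append(" AND modifieddate_range:[")
                   .append(dateRange).append(" TO 20100101]");
        }
        return querySB.toString();
    }
}
```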
Search View
The view technology that we use is JSP (again, for legacy reasons). We use
our MVC to make sure that Java code is kept out of the JSPs themselves. So,
what we see above is basically just HTML with a couple of JSP tags here and
there.
The one piece of real logic is when there are multiple results (see below).
Here we have to do some math to show the result pages, what page you are on,
etc. This should look familiar from pagination in Google and the like. The only
difference is that we always show the first page, as we have found that MOST
of the time, page 1 is really what you want. This is where we could have really
copied Google and placed TheSeeeeeeeeeerverside along the pages.
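The pagination math itself is simple; a sketch (illustrative names, not the real view helper):

```java
// Illustrative sketch of the result-page math described above.
public class Pagination {
    /** Total number of result pages, rounding up the last partial page. */
    public static int pageCount(int totalHits, int resultsPerPage) {
        if (totalHits <= 0) {
            return 0;
        }
        return (totalHits + resultsPerPage - 1) / resultsPerPage;
    }

    /** Which 1-based page a zero-based result offset falls on. */
    public static int currentPage(int resultsStart, int resultsPerPage) {
        return resultsStart / resultsPerPage + 1;
    }
}
```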
So, you have hopefully seen that the web tier is clean and kept as thin as
possible. We leverage the work done in the IndexBuilder and IndexSearch high level
interfaces to Lucene.
Conclusion
You have seen all of the parts and pieces of TheServerSide search subsystem.
We leveraged the power of Lucene, yet exposed an abstracted search view. If we
had to support another search system, we could plug it in behind the scenes,
and the “users” of the search packages wouldn’t be affected.
Having said that, we don’t see any reason to move away from Lucene. It
has been a pleasure to work with, and is one of the best pieces of open source
software that I have personally ever worked with.
TheServerSide search used to be a weak link on the site. Now it is a powerhouse.
I am constantly using it as Editor, and now manage to find exactly what I want.
Indexing our data is so fast that we don’t even need to run the incremental
build plan that we developed. At one point we mistakenly had an IndexWriter.optimize()
call every time we added a document. When we relaxed that to run less frequently
we brought down the index time to a matter of seconds. It used to take a LOT
longer, even as long as 45 minutes.
So to recap: We have gained relevance, speed, and power with this approach.
We can tweak the way we index and search our content with little effort.
Thanks SO much to the entire Lucene team.
Please go check out the Lucene in Action book.