January 2005
Introduction
There are a lot of areas on TheServerSide that we would like to change. Trust
us. Ever since I joined TheServerSide I have cringed at our search engine implementation.
It didn’t do a good job, and that meant that our users couldn’t
get to information that they wanted. User interface analysis has shown that
search functionality is VERY important on the web (see: http://www.useit.com/alertbox/20010513.html),
so we really had to clean up our act here.
So, we wanted a good search engine, but what are the choices? We were using
ht://Dig, and having it crawl our site, building the index as it went along.
This process wasn’t picking up all of the content, and didn’t give
us a nice clean API to tune the search results. It did do one thing well,
and that was searching through our news. This was a side effect of having news
on the home page, which helps the rankings (the more clicks ht://Dig needed
to navigate from the home page, the lower the rankings).
Although ht://Dig wasn’t doing a great job, we could have tried to help
it on its way. For example, we could have created a special HTML file which
linked to various areas of the site, and used that as the “root”
page for it to crawl. Maybe we could have put in a servlet filter that checked
for the ht://Dig user agent, and returned content in a different manner
(cleaning up the HTML and such).
We looked into using Google to manage our searching for us. I mean, they are
pretty good at searching aren’t they? Although I am sure we could have
had a good search using them, we ran into a couple of issues:
- It wasn’t that easy for us (a small company) to get much information
from them
- For the type of search that we needed, it was looking very expensive
- We would still have had the issues of a crawler-based infrastructure
While we were looking into Google, I was also looking at Lucene. Lucene has
always interested me, as it isn’t a typical open source project. In my
experience, most open source projects are frameworks that have evolved. Take
something like Struts. Before Struts many people were rolling their own MVC
layers on top of Servlets/JSPs. It made sense to not have to reinvent this wheel,
so Struts came around.
Lucene is a different beast. It contains some really complicated low level
work, NOT just a nicely designed framework. I was really impressed that something
of this quality was just put out there!
At first I was a bit disappointed with Lucene, as I didn’t really understand
what it was. Immediately I was looking for crawler functionality that would
allow me to build an index just like ht://Dig was doing. At the time I found
LARM in the lucene-sandbox (and have since heard of various other sub-projects),
but found it strange that this wouldn’t be built into the main distribution.
It took me a day to realize that Lucene isn’t a product that you just
run. It is a top notch search API which you can use to plugin to your system.
Yes, you may have to write some code, but you also get great power and flexibility.
This case study discusses how TheServerSide built an infrastructure that allows
us to index and search our different content using Lucene.
We will chat about:
- High level infrastructure
- Building the Index
- Allowing various index sources
- Tweaking the weighting of various search results
- Searching the Index
- Configuration
- Employing XML configuration files to allow easy tweaking
- Web Tier: we want to search via the web don’t we?
High level infrastructure
When you look at building your search solution, you often find that the process
is split into two main tasks: building an index, and searching
that index. This is definitely the case with Lucene (and the only time when
this isn’t the case is if your search goes directly to the database).
We wanted to keep the search interface fairly simple, so the code that interacts
with the system sees two main interfaces: IndexBuilder
and IndexSearch.
Any process which needs to build an index goes through the IndexBuilder. This
is a simple interface which gives you two entry points to the indexing process:
- By passing individual configuration settings to the class
- E.g. path to the index, if you want to do an incremental build, and
how often to optimize the Lucene index as you add records.
- By passing in an index plan name
- IndexBuilder will then look up the settings it needs from the configuration
system. This allows you to tweak your variables in an external file, rather
than code.
You will also see a main(..) method. We created this to allow for a command
line program to kick off a build process.
Index Sources
The Index Builder abstracts the details of Lucene, and the Index Sources that
are used to create the index itself. As we will see in the next section, TheServerSide
has various content that we wanted to be able to index, so a simple design is
used where we can plug ‘n play new index sources.
The search interface is also kept very simple. A search is done via:
IndexSearch.search(String inputQuery, int resultsStart,
int resultsCount);
e.g. Look for the terms EJB and WebLogic, returning up to the first 10 results:
IndexSearch.search("EJB and WebLogic", 0, 10);
The query is built via the Lucene QueryParser (or rather a subclass that we
created, which you will see in detail later). This allows our users to input typical
Google-esque queries. Once again, a main() method exists to allow for command
line searching of indexes.
Building the Index: Details of the index building process
We have seen that the external interface to building our search index is the
class IndexBuilder. Now we will discuss the index building process, and the
design choices that we made.
What fields should comprise our index?
We wanted to create a fairly generic set of fields that our index would contain.
We ended up with the following fields:
We created a simple Java representation of this data, SearchContentHolder,
which our API uses to pass this information around. It contains the modified
and created dates as java.util.Date, and the full contents are stored as a StringBuffer
rather than a String. This was refactored into our design, as we found that
some IndexSources contained a lot of data, and we didn’t want to follow
the path of:
Get some data from source → Add to some String → Pass the entire String to the index.
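A minimal sketch of why this matters (the class and method names here are illustrative, not the real SearchContentHolder API): appending chunks to a single StringBuffer avoids the repeated copying that plain String concatenation incurs for large content.

```java
// Illustrative sketch: accumulate content chunks in one StringBuffer.
// Repeated String concatenation copies the whole buffer on every append
// (quadratic for large content); StringBuffer appends stay linear.
public class ContentAccumulation {
    public static String concatenate(String[] chunks) {
        StringBuffer full = new StringBuffer();
        for (int i = 0; i < chunks.length; i++) {
            full.append(chunks[i]); // no intermediate String copies
        }
        return full.toString();
    }
}
```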
What types of indexing?
Since the TSS content that we wanted to index is a) fairly large, and
b) largely unchanging, we wanted to have the concept of incremental
indexing, as well as a full indexing from scratch. To take care of this we have
an incrementalDays variable which is configured for the index process. If this
value is set to 0 or less, then we do a full index. Otherwise, content
that is newer (created / modified) than “today – incrementalDays”
is indexed. In this case, instead of creating a new index, we simply
delete the record (if it already exists) and insert the latest data into it.
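As a rough sketch (the class and method names here are ours, not from the TSS codebase), the incremental decision boils down to a flag check and a cutoff date:

```java
import java.util.Calendar;
import java.util.Date;

// Illustrative sketch of the incrementalDays logic described above.
public class IncrementalPlan {
    /** incrementalDays of 0 or less means a full rebuild from scratch. */
    public static boolean isFullBuild(int incrementalDays) {
        return incrementalDays <= 0;
    }

    /** Content created/modified after this cutoff gets (re)indexed. */
    public static Date cutoff(Date today, int incrementalDays) {
        Calendar cal = Calendar.getInstance();
        cal.setTime(today);
        cal.add(Calendar.DAY_OF_MONTH, -incrementalDays); // today - incrementalDays
        return cal.getTime();
    }
}
```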
How do you delete a record in Lucene again? We need the org.apache.lucene.index.IndexReader.
Here is the snippet that does the work:
IndexReader reader = null;
try {
this.close(); // closes the underlying index writer
reader = IndexReader.open(SearchConfig.getIndexLocation());
Term term = new Term("path", theHolder.getPath());
reader.delete(term);
} catch (IOException e) {
... deal with exception ...
} finally {
if (reader != null) { // guard against open() having failed
try { reader.close(); } catch (IOException e) { /* suck it up */ }
}
}
this.open(); // reopen the index writer
Listing 1: Snippet from IndexHolder which deletes the entry from the index
if it is already there
As you can see, we first close the IndexWriter, then we open the index via
the IndexReader. The “path” field is the ID that corresponds to
this “to be indexed” entry. If it exists in the index, it will be
deleted, and shortly after we will re-add the new index information.
What to index?
As TheServerSide has grown over time, we have the side effect of possessing
content that lives in different sources. Our threaded discussions lie in the
database, but our articles live in a file system. The Hard Core Tech Talks also
sit on the file system, but in a different manner than our articles.
We wanted to be able to plug in different sources to the index, so we created
a simple IndexSource interface, and a corresponding Factory class which returns
all of the index sources to be indexed.
public interface IndexSource {
public void addDocuments(IndexHolder holder);
}
Listing 2: Shows the simple IndexSource interface
As you can see, there is just one method, addDocuments(), which an IndexSource
has to implement. The IndexBuilder is charged with calling this method on each
IndexSource, and passing in an IndexHolder. The responsibility of the IndexHolder
is to wrap the Lucene-specific search index (via org.apache.lucene.index.IndexWriter).
The IndexSource is responsible for taking this holder, and adding records to
it during the index process.
Let’s look at an example of how an IndexSource does this by looking at
the ThreadIndexSource.
ThreadIndexSource
This index source goes through the TSS database, and indexes the various threads
from all of our forums. If we are doing an incremental build, then the results
are simply limited by the SQL query that we issue to get the content.
When we get the data back from the database, we need to morph it into an instance
of a SearchContentHolder. If we don’t have a summary, then we simply crop
the body to a summary length governed by the configuration.
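A minimal sketch of that cropping logic (names assumed; the real code lives inside ThreadIndexSource and reads its length from the configuration):

```java
// Illustrative sketch: fall back to a cropped body when no summary exists.
public class SummaryCrop {
    /** Use the given summary if present; otherwise crop the body to maxLength. */
    public static String cropSummary(String summary, String body, int maxLength) {
        if (summary != null && summary.length() > 0) {
            return summary; // an explicit summary always wins
        }
        if (body.length() <= maxLength) {
            return body; // short bodies need no cropping
        }
        return body.substring(0, maxLength);
    }
}
```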
The main field that we search is “fullcontents”. To make sure that
a user of the system finds what they want, we make this field NOT only the body
of a thread message, but rather a concatenation of the title of the message,
the owner of the message, and then finally the message contents itself. You
could try to use Boolean queries to make sure that a search
finds a good match, but we found it a LOT simpler to put in a cheeky concatenation!
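Assuming a small helper along these lines (hypothetical names; the real concatenation happens inside ThreadIndexSource), the field would be built as:

```java
// Illustrative sketch: build the searchable "fullcontents" field as
// title + owner + body, so a default query matches all three.
public class FullContents {
    public static String build(String title, String owner, String body) {
        StringBuffer sb = new StringBuffer();
        sb.append(title).append(' ')
          .append(owner).append(' ')
          .append(body);
        return sb.toString();
    }
}
```

With this, a search for an author’s name or a thread title hits the same field as a search for the body text, with no Boolean query gymnastics.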
So, this should show how simple it is to create an IndexSource. We created
sources for articles and tech talks (and in fact a couple of versions to handle
an upgrade in content management facilities). If someone wants us to search
a new source, we create a new adapter and we are in business.
How to tweak the ranking of records
When we hand the IndexHolder a SearchContentHolder, it does the work of
adding it to the Lucene index. This is a fairly trivial task of taking the values
from the object and adding them to a Lucene document:
doc.add(Field.UnStored("fullcontents", theHolder.getFullContents()));
doc.add(Field.Keyword("owner", theHolder.getOwner()));
Listing 3: Adding a couple of fields from the SearchContentHolder
There is one piece of logic that goes above and beyond munging the data into
a Lucene-friendly form. It is in this class that we calculate any boosts that
we want to place on fields, or the document itself. It turns out that we end
up with the following boosters:
The date boost has been really important for us. We have data that
goes back a long time, and the index seemed to be returning “old reports”
too often. The date-based booster trick has gotten around this, allowing
the newest content to bubble up.
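The article doesn’t show the exact formula, but one plausible reading of the date-base-amount and date-boost-per-count settings from the configuration file is a linear bump per time period since some starting point, something like:

```java
// ASSUMPTION: a linear date boost. The real formula used on TheServerSide
// is not shown; this only illustrates how the two config values could combine.
public class DateBoost {
    /**
     * baseAmount     - boost for the oldest content (date-base-amount)
     * boostPerCount  - extra boost per elapsed period (date-boost-per-count)
     * periods        - how many periods newer this document is than the epoch
     */
    public static float boost(float baseAmount, float boostPerCount, int periods) {
        return baseAmount + boostPerCount * periods;
    }
}
```

The resulting value would then be handed to Lucene via Document.setBoost(), so newer documents score proportionally higher.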
The end result is that we now have a nice simple design which allows us to
add new sources to our index with minimal development time!
Searching the index
Now we have an index. It is built from the various sources of information that
we have, and is just waiting for someone to search it.
Lucene made this very simple for us to whip up. The innards of searching are
hidden behind the IndexSearch class, as mentioned in the high level overview.
The work is so simple that I can even paste it here:
public static SearchResults search(String inputQuery, int resultsStart,
int resultsCount) throws SearchException {
try {
Searcher searcher = new
IndexSearcher(SearchConfig.getIndexLocation());
String[] fields = { "title", "fullcontents" };
Hits hits = searcher.search(CustomQueryParser.parse(inputQuery,
fields, new StandardAnalyzer()));
SearchResults sr = new SearchResults(hits, resultsStart, resultsCount);
searcher.close();
return sr;
} catch (Exception e) {
throw new SearchException(e);
}
}
Listing 4: IndexSearch.search(…) method contents
As you can see, this method simply wraps around the Lucene IndexSearcher,
and in turn envelopes the results as our own SearchResults.
The only slightly different item to note is that we created our own simple
QueryParser. The CustomQueryParser
extends Lucene’s, and is built to allow a default search query to search
BOTH the title, and fullcontents fields. It
also disables the useful, yet expensive wildcard and fuzzy queries. The last
thing we want is for someone to do a bunch of queries such as ‘a*’,
causing a lot of work in the Lucene engine.
Here is the class in its entirety:
public class CustomQueryParser extends QueryParser
{
/**
* Static parse method which will query both the title and the
fullcontents fields via a BooleanQuery
*/
public static Query parse(String query, String[] fields, Analyzer
analyzer) throws ParseException {
BooleanQuery bQuery = new BooleanQuery();
for (int i = 0; i < fields.length; i++) {
QueryParser parser = new CustomQueryParser(fields[i], analyzer);
Query q = parser.parse(query);
bQuery.add(q, false, false); // combine the queries, neither
requiring or prohibiting matches
}
return bQuery;
}
public CustomQueryParser(String field, Analyzer analyzer) {
super(field, analyzer);
}
final protected Query getWildcardQuery(String field, String term) throws
ParseException {
throw new ParseException("Wildcard Query not allowed.");
}
final protected Query getFuzzyQuery(String field, String term) throws
ParseException {
throw new ParseException("Fuzzy Query not allowed.");
}
}
Listing 5: CustomQueryParser.java contents
That’s all folks. As you can see it is fairly trivial to get the ball
rolling on the search side of the equation.
Configuration: One place to rule them all
There have been settings in both the indexing and search processes
that were crying out for abstraction. Where should we put the index location,
the category lists, the boost values, and register the index sources? We didn’t
want to have this in code, and since the configuration was hierarchical we resorted
to using XML.
Now, I don’t know about you, but I am not a huge fan of the low level
APIs such as SAX and DOM (or even JDOM, DOM4j, and the like). In cases like
this we don’t care about parsing at this level. I really just want my
configuration information, and it would be perfect to have this information
given to me as an object model. This is where tools such as Castor-XML, JIBX,
JAXB, and Jakarta Commons Digester come in.
We opted for the Jakarta Digester in this case. We created the object model
to hold the configuration that we needed, all behind the SearchConfig façade.
This façade holds a Singleton instance which loads the configuration via:
/**
* Wrap around a Singleton instance which holds a ConfigHolder
* @return the loaded ConfigHolder
*/
public synchronized static ConfigHolder getConfig() {
if (ourConfig == null) {
try {
String configName = "/search-config.xml";
File input = new File( PortalConfig.getSearchConfig()
+ configName);
File rules = new File( PortalConfig.getSearchConfig()
+ "/digester-rules.xml" );
Digester digester = DigesterLoader.createDigester( rules.toURL() );
ourConfig = (ConfigHolder) digester.parse( input );
} catch (Exception e) {
... deal with exception ...
}
}
return ourConfig;
}
Listing 6: SearchConfig.getConfig() static method to load in the search
configuration
This method tells the tale of Digester. It takes the XML configuration file
(search-config.xml) and the rules for building the object model (digester-rules.xml),
throws them in a pot together, and you end up with the object model (ourConfig).
XML Configuration File
The config file drives the index process, and aids the search system. To “register”
a particular index source, simply add an entry under the <index-source>
element. Here is an example of our configuration:
<search-config>
<!-- The path to where the search index is kept -->
<index-location windows="/temp/tss-searchindex" unix="/tss/searchindex" />
<!-- Starting year of content which is indexed -->
<beginning-year>2000</beginning-year>
<!-- Information on search results -->
<search-results results-per-page="10" />
<!-- Index Plan Configuration -->
<index-plan name="production-build">
<optimize-frequency>400</optimize-frequency>
</index-plan>
<index-plan name="test-build">
<optimize-frequency>0</optimize-frequency>
</index-plan>
<index-plan name="daily-incremental">
<incremental-build>1</incremental-build>
<optimize-frequency>0</optimize-frequency>
</index-plan>
<!-- Category Config Mapping -->
<categories>
<category number="1" name="news" boost="1.3" />
<category number="2" name="discussions" boost="0.6" />
<category number="3" name="patterns" boost="1.1" />
<category number="4" name="reviews" boost="1.08"/>
<category number="5" name="articles" boost="1.1" />
<category number="6" name="talks" boost="1.0" />
</categories>
<!-- Boost Value Configuration -->
<boost date-base-amount="1.0" date-boost-per-count="0.02" title="2.0"
summary="1.4" />
<!-- List all of the Index Sources -->
<index-sources>
<thread-index-source summary-length="300"
class-name="com.portal.util.search.ThreadIndexSource">
<excluded-forums>
<forum>X</forum>
</excluded-forums>
</thread-index-source>
<article-index-source class-name="com.portal.util.search.ArticleIndexSource"
directory="web/tssdotcom/articles" category-name="articles"
path-prefix="/articles/article.jsp?l=" default-creation-date="today"
default-modified-date="today" />
</index-sources>
</search-config>
Listing 7: Sample search-config.xml file
If you peruse the file you see that now we can tweak the way that the index
is built via elements such as <boost>, the <categories>, and information
in <index-sources>. This flexibility allowed us to play with various boost
settings, until it “felt right”.
Digester Rules File
How does the Digester take the search-config.xml and KNOW how to build the
object model for us? This magic is done with a Digester Rules file. Here we
tell the Digester what to do when it comes across a given tag.
Normally you will tell the engine to do something like:
- Create a new object IndexPlan when you find an <index-plan>
- Take the attribute values and call set methods on the corresponding object
- E.g. category.setNumber(…), category.setName(…), etc
Here is a snippet of the rules that we employ:
<?xml version="1.0"?>
<digester-rules>
<!-- Top Level ConfigHolder Object -->
<pattern value="search-config">
<object-create-rule
classname="com.portal.util.search.config.ConfigHolder" />
<set-properties-rule/>
</pattern>
<!-- Search Results -->
<pattern value="search-config/search-results">
<set-properties-rule>
<alias attr-name="results-per-page"
prop-name="resultsPerPage" />
</set-properties-rule>
</pattern>
<!-- Index Plan -->
<pattern value="search-config/index-plan">
<object-create-rule
classname="com.portal.util.search.config.IndexPlan" />
<bean-property-setter-rule pattern="incremental-build"
propertyname="incrementalBuild" />
<bean-property-setter-rule pattern="optimize-frequency"
propertyname="optimizeFrequency" />
<set-properties-rule/>
<set-next-rule methodname="addIndexPlan" />
</pattern>
... more rules here ...
</digester-rules>
Listing 8: A snippet of the digester-rules.xml
All of the rules for the digester are out of scope of this paper, yet you can
probably guess a lot from this snippet. For more information check out <insert
links here>.
So, thanks to another open source tool we were able to create a fairly simple
yet powerful set of configuration rules for our particular search needs. We
didn’t have to use an XML configuration route, but it allows us to be
flexible. If we were REALLY good people, we would have refactored the system
to allow for programmatic configuration. To do that nicely would be fairly
trivial: we would have a configuration interface, and use Dependency Injection
(IoC) to allow the code to set up any implementation (one being the XML file
builder, the other coming from manual coding).
Web Tier: TheSeeeeeeeeeeeerverSide?
At this point we have a nice clean interface into building an index, and searching
on one. Since we need users to search the content via a web interface, the last
item on the development list was to create the web layer hook into the search
interface.
TheServerSide portal infrastructure uses a “home grown” MVC web
tier. It is home grown purely because it was developed before the likes of Struts,
WebWork, or Tapestry. Our system has the notion of Actions (or as we call them,
Assemblers), so to create the web glue we had to:
- Create a web action: SearchAssembler.java
- Create a web view: The search page and results
SearchAssembler Web Action
The web tier action is responsible for taking the input from the user, passing
through to IndexSearch.search(…), and packaging the results in a format
ready for the view.
There isn’t anything at all interesting in this code. We take the search
query input from the user and build the Lucene query, ready for the search infrastructure.
What do I mean by “build the query”? Simply put, we add all of the
query information given by the user into one Lucene query string.
For example, if the user typed “Lucene” in the search box,
selected a date “after Jan 1 2003”, and narrowed the search categories
to “news”, we would end up building:
Lucene AND category:news AND modifieddate_range:[20030101 TO 20100101]
So our code contains small snippets such as:
if (dateRangeType.equals("before")) {
querySB.append(" AND modifieddate_range:[19900101 TO " + dateRange + "]");
} else if (dateRangeType.equals("after")) {
querySB.append(" AND modifieddate_range:[" + dateRange + " TO 20100101]");
}
Listing 9: Example building part of the query string. In this case the
date range.
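Putting those pieces together, a rough sketch of the whole query-assembly step might look like this (the class and method names are ours, not from SearchAssembler.java):

```java
// Illustrative sketch: concatenate user input into one Lucene query string,
// mirroring the category and date-range snippets shown above.
public class QueryAssembler {
    public static String buildQuery(String userQuery, String category,
                                    String dateRangeType, String dateRange) {
        StringBuffer querySB = new StringBuffer(userQuery);
        if (category != null) {
            querySB.append(" AND category:").append(category);
        }
        if ("before".equals(dateRangeType)) {
            querySB.append(" AND modifieddate_range:[19900101 TO ")
                   .append(dateRange).append("]");
        } else if ("after".equals(dateRangeType)) {
            querySB.append(" AND modifieddate_range:[")
                   .append(dateRange).append(" TO 20100101]");
        }
        return querySB.toString();
    }
}
```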
Search View
The view technology that we use is JSP (again, for legacy reasons). We use
our MVC to make sure that Java code is kept out of the JSPs themselves. So,
what we see above is basically just HTML with a couple of JSP tags here and
there.
The one piece of real logic is when there are multiple results (see below).
Here we have to do some math to show the result pages, what page you are on,
etc. This should look familiar from pagination in Google and the like. The only
difference is that we always show the first page, as we have found that MOST
of the time, page 1 is really what you want. This is where we could have really
copied Google and placed TheSeeeeeeeeeerverside along the pages.
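The pagination math itself is simple; a sketch (illustrative names, not the real view helper):

```java
// Illustrative sketch of the result-page math described above.
public class Pagination {
    /** Total number of result pages, rounding up the last partial page. */
    public static int pageCount(int totalHits, int resultsPerPage) {
        if (totalHits <= 0) {
            return 0;
        }
        return (totalHits + resultsPerPage - 1) / resultsPerPage;
    }

    /** Which 1-based page a zero-based result offset falls on. */
    public static int currentPage(int resultsStart, int resultsPerPage) {
        return resultsStart / resultsPerPage + 1;
    }
}
```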
So, you have hopefully seen that the web tier is clean and kept as thin as
possible. We leverage the work done in the IndexBuilder and IndexSearch high level
interfaces to Lucene.
Conclusion
You have seen all of the parts and pieces of TheServerSide search subsystem.
We leveraged the power of Lucene, yet exposed an abstracted search view. If we
had to support another search system, we could plug it in behind the scenes,
and the “users” of the search packages wouldn’t be affected.
Having said that, we don’t see any reason to move away from Lucene. It
has been a pleasure to work with, and is one of the best pieces of open source
software that I have personally ever worked with.
TheServerSide search used to be a weak link on the site. Now it is a powerhouse.
I am constantly using it as Editor, and now manage to find exactly what I want.
Indexing our data is so fast that we don’t even need to run the incremental
build plan that we developed. At one point we mistakenly had an IndexWriter.optimize()
call every time we added a document. When we relaxed that to run less frequently
we brought down the index time to a matter of seconds. It used to take a LOT
longer, even as long as 45 minutes.
So to recap: We have gained relevance, speed, and power with this approach.
We can tweak the way we index and search our content with little effort.
Thanks SO much to the entire Lucene team.
Please go check out the Lucene in Action book.