November 23, 2004
Search is important! All too often, search amounts to WHERE thing LIKE '%that%'.
Users know Google, and quite a few even know its query language at this point.
Beyond our wanting to provide more functionality in search, users have come to
expect it. Google seems simple, doesn't it?
Enter Lucene.
I'll presume you've heard of it at least, if not used it. Lucene does full text
indexing, and that is it. It does this really well. The beauty (well, one) is
that you can index anything. In this case, I'll index an object being persisted
by OJB. The key is to embed, in each Lucene document, the information required
to retrieve the object being indexed.
Take a gander at a fairly simple Student
class (this is from an app I am doing for my little brother, who is a professor
(of such terrible subjects as rock climbing and white water kayaking, don't
get me started)).
The primary use case for this application is for a student co-op employee to
find a student in the system, then find gear and check the gear
out for the student. Finding the student is key, and that is best served by...
searching! So we have a database record for each student, and want a
convenient search facility that can search on name, student id (idNumber),
phone number, even address. Lucene makes this a snap. To do it, we just store
the id (the internal primary key) in an unindexed field when we add a student in the StudentIndexer:
    public void add(final Student student) throws ServiceException {
        final Document doc = new Document();
        // Searchable fields: tokenized, indexed, and stored.
        doc.add(Field.Text(NAME, student.getName()));
        doc.add(Field.Text(ID_NUMBER, student.getIdNumber()));
        doc.add(Field.Text(ADDRESS, student.getAddress()));
        doc.add(Field.Text(PHONE, student.getPhone()));
        // Primary key: stored with the document, but not indexed.
        doc.add(Field.UnIndexed(IDENTITY, student.getId().toString()));
        try {
            synchronized (mutex) {
                // false: append to the existing index rather than creating a new one
                final IndexWriter writer = new IndexWriter(index, analyzer, false);
                try {
                    writer.addDocument(doc);
                    writer.optimize();
                }
                finally {
                    // release the index lock even if adding the document fails
                    writer.close();
                }
            }
        }
        catch (IOException e) {
            throw new ServiceException("Unable to index student", e);
        }
    }
Notice the UnIndexed field on the Document? This tells Lucene to store the
field with the document, but not to index it or search on it. When you retrieve
the document you still get the field back, which makes it the perfect place to
stash the primary key.
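To make the stored-versus-indexed distinction concrete, here is a toy model in plain Java. It is not Lucene and says nothing about how Lucene is implemented; ToyIndex, newDocument, addIndexed, addStored, and search are names invented for this sketch, and it assumes a current JDK rather than the Java of 2004. Unlike Lucene, it also lumps the tokens of all fields together instead of indexing per field.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: NOT Lucene. It mimics the idea that a field can be
// searchable (indexed) or merely carried along with the document (stored).
class ToyIndex {
    // One entry per document: field name -> stored value.
    private final List<Map<String, String>> docs = new ArrayList<>();
    // Inverted index: lower-cased token -> numbers of the documents containing it.
    private final Map<String, List<Integer>> postings = new HashMap<>();

    /** Starts a new, empty document and returns its number. */
    int newDocument() {
        docs.add(new HashMap<>());
        return docs.size() - 1;
    }

    /** Stores the value and makes each whitespace-separated token searchable. */
    void addIndexed(int doc, String field, String value) {
        docs.get(doc).put(field, value);
        for (String token : value.toLowerCase().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            List<Integer> list = postings.computeIfAbsent(token, k -> new ArrayList<>());
            if (!list.contains(doc)) {
                list.add(doc);
            }
        }
    }

    /** Stores the value only; it never enters the inverted index. */
    void addStored(int doc, String field, String value) {
        docs.get(doc).put(field, value);
    }

    /** Returns the stored fields of every document containing the term. */
    List<Map<String, String>> search(String term) {
        List<Map<String, String>> hits = new ArrayList<>();
        for (Integer doc : postings.getOrDefault(term.toLowerCase(), new ArrayList<>())) {
            hits.add(docs.get(doc));
        }
        return hits;
    }
}
```

Add a name with addIndexed and an identity with addStored, and a search for the name returns a hit that carries the identity, while a search for the identity value itself finds nothing. That is the behaviour we rely on when stashing the primary key.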
When we look for the students, we don't want to get back Lucene Document instances,
though; we want to go ahead and get the nice domain model instances of Student.
What we'll do is query against the index, pull the pks of all the hits out,
then select the domain objects using those pks (from the StudentIndex):
    public List findStudents(final String search) throws ServiceException {
        return this.findStudents(search, Integer.MAX_VALUE);
    }

    public List findStudents(final String search, final int numberOfResults) throws ServiceException {
        // Parse the user's query; terms without a field prefix search the name field.
        final Query query;
        try {
            query = QueryParser.parse(search, StudentIndexer.NAME, analyzer);
        }
        catch (ParseException e) {
            throw new ServiceException("Unable to make any sense of the query", e);
        }

        // Collect the primary keys of the hits, in hit order.
        final ArrayList ids = new ArrayList();
        try {
            final IndexReader reader = IndexReader.open(index);
            final IndexSearcher searcher = new IndexSearcher(reader);
            final Hits hits = searcher.search(query);
            for (int i = 0; i != hits.length() && i != numberOfResults; ++i) {
                final Document doc = hits.doc(i);
                ids.add(new Integer(doc.getField(StudentIndexer.IDENTITY).stringValue()));
            }
            searcher.close();
            reader.close();
        }
        catch (IOException e) {
            throw new ServiceException("Error while reading student data from index", e);
        }

        // Fetch the domain objects, then put them back into hit order.
        final List students = dao.findStudentsWithIdsIn(ids);
        Collections.sort(students, new Comparator() {
            public int compare(final Object o1, final Object o2) {
                final Integer id_1 = ((Student) o1).getId();
                final Integer id_2 = ((Student) o2).getId();
                if (id_1.equals(id_2)) {
                    return 0;
                }
                // Whichever id appears first in the hit list sorts first.
                for (int i = 0; i != ids.size(); i++) {
                    final Integer integer = (Integer) ids.get(i);
                    if (integer.equals(id_1)) {
                        return -1;
                    }
                    if (integer.equals(id_2)) {
                        return 1;
                    }
                }
                return 0;
            }
        });
        return students;
    }
The findStudents(String, int): List method is a little more complex
than I like, as it does several things: it queries the Lucene index, extracts
the primary keys for the hits, queries for the students matching those pks (via
the StudentDAO),
and finally sorts the results (there is no way to specify the sort order in the
database query, as it depends on the order of the hits from the Lucene query). With that, though,
we support queries such as Tiffany, which is simple, or a more fun one, name:
Aching phone: ???-1234 or what not. Go look at the Lucene query
parser syntax. It is worth noting that the code above defaults to searching
the name field when the query does not name a field. This seems to make sense
to me =)
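The comparator in findStudents walks the id list on every comparison. Here is a minimal sketch of an alternative, assuming a current JDK (generics and lambdas, which the code in this article predates): build a map from id to its position in the hit list once, then sort by that position. HitOrder, sortByRank, and idOf are names made up for this sketch; they are not part of the application or of Lucene.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class HitOrder {
    /**
     * Returns a copy of items sorted so that their ids appear in the same
     * order as rankedIds. Items whose id is not in rankedIds go last.
     */
    static <T> List<T> sortByRank(List<T> items, List<Integer> rankedIds,
                                  Function<T, Integer> idOf) {
        // id -> position of its first occurrence in the hit list
        final Map<Integer, Integer> rank = new HashMap<>();
        for (int i = 0; i < rankedIds.size(); i++) {
            rank.putIfAbsent(rankedIds.get(i), i);
        }
        // Sort a copy so the caller's list is left untouched.
        List<T> sorted = new ArrayList<>(items);
        sorted.sort(Comparator.comparingInt(
                (T item) -> rank.getOrDefault(idOf.apply(item), Integer.MAX_VALUE)));
        return sorted;
    }
}
```

With the Student class from this article, the call would look something like HitOrder.sortByRank(students, ids, Student::getId), given suitably typed lists.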
If you look at the StudentIndex
and StudentIndexer
you will see there are also facilities for adding and removing documents from
the Lucene index. These become important on any insert/update/delete operation.
The update is important to catch, as you need to remove the old entry and insert
a new one in the index. This is best done (in my opinion) via an aspect which
picks these operations out. That is outside the scope of this article though
;-)
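If an aspect is more machinery than you want, one plain-Java alternative is a wrapper around the DAO that mirrors every write into the index. This is a sketch of that idea, not what the application does: EntityIndex, EntityStore, and IndexingStore are hypothetical names, and remove is assumed to locate the old index entry by the entity's primary key.

```java
// Hypothetical minimal view of a search index (the role StudentIndexer plays).
interface EntityIndex<T> {
    void add(T entity);
    void remove(T entity);
}

// Hypothetical persistence interface standing in for the DAO.
interface EntityStore<T> {
    void insert(T entity);
    void update(T entity);
    void delete(T entity);
}

// Wraps a store so that every successful write is mirrored into the index.
class IndexingStore<T> implements EntityStore<T> {
    private final EntityStore<T> delegate;
    private final EntityIndex<T> index;

    IndexingStore(EntityStore<T> delegate, EntityIndex<T> index) {
        this.delegate = delegate;
        this.index = index;
    }

    public void insert(T entity) {
        delegate.insert(entity);
        index.add(entity);
    }

    public void update(T entity) {
        delegate.update(entity);
        // An update in the index is a remove of the old entry plus a fresh add.
        index.remove(entity);
        index.add(entity);
    }

    public void delete(T entity) {
        delegate.delete(entity);
        index.remove(entity);
    }
}
```

The database write goes first, so a failed write throws before the index is touched. Nothing here keeps the two in step if the index call itself fails; that part is left out of the sketch.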
For a larger application with more things being indexed (this just has two
searchable domain types) I might generalize the search capability via a DocumentFactory
such as:
    public class BeanDocumentFactory implements DocumentFactory {
        public Document build(Object entity) {
            final Document document = new Document();
            try {
                // Stop at Object so the "class" property from getClass() is not indexed.
                final BeanInfo info = Introspector.getBeanInfo(entity.getClass(), Object.class);
                final PropertyDescriptor[] props = info.getPropertyDescriptors();
                for (int i = 0; i != props.length; ++i) {
                    final PropertyDescriptor prop = props[i];
                    final String name = prop.getName();
                    final Method reader = prop.getReadMethod();
                    if (reader == null) {
                        continue; // write-only property, nothing to read
                    }
                    final Object value = reader.invoke(entity, new Object[]{});
                    // Index every readable property as searchable text.
                    final Field field = Field.Text(name, String.valueOf(value));
                    document.add(field);
                }
            }
            catch (Exception e) {
                throw new RuntimeException("Handle these in real application", e);
            }
            return document;
        }
    }
But I have not needed to generalize it for a real project yet =)
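The introspection half of that idea can be tried without Lucene on the classpath. The sketch below, written against a current JDK, collects the readable properties of a bean into a map; a document factory would then turn each entry into a field. BeanFields, extract, and the Pupil sample bean are hypothetical names used only for illustration.

```java
import java.beans.BeanInfo;
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.lang.reflect.Method;
import java.util.Map;
import java.util.TreeMap;

class BeanFields {
    /** Hypothetical sample bean, here only so the method can be tried out. */
    public static class Pupil {
        public String getName() { return "Tiffany"; }
        public int getAge() { return 11; }
    }

    /** Returns property name -> String value for every readable, non-null property. */
    static Map<String, String> extract(Object bean) {
        Map<String, String> fields = new TreeMap<>();
        try {
            // Object.class as the stop class leaves out the "class" property.
            BeanInfo info = Introspector.getBeanInfo(bean.getClass(), Object.class);
            for (PropertyDescriptor prop : info.getPropertyDescriptors()) {
                Method reader = prop.getReadMethod();
                if (reader == null) {
                    continue; // write-only property
                }
                Object value = reader.invoke(bean);
                if (value != null) {
                    fields.put(prop.getName(), String.valueOf(value));
                }
            }
        } catch (IntrospectionException | ReflectiveOperationException e) {
            throw new IllegalStateException("Could not read bean properties", e);
        }
        return fields;
    }
}
```

Calling BeanFields.extract(new BeanFields.Pupil()) yields one entry for name and one for age. Skipping null values is a choice made for this sketch; the factory in this article indexes them as the string "null" instead.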
Speaking of Lucene (which rocks), I am eagerly anticipating Erik
Hatcher's new book, Lucene in Action. If it is anything like Erik and
Steve Loughran's
Java Development
with Ant, Lucene will be a lucky project to have it in circulation.
About the author
Brian McCallister
Blog: http://kasparov.skife.org/blog/
Brian McCallister doesn't particularly like writing bios or writing about himself in the third person. He does love programming and systems work though, and tends to find himself doing a lot of both. Brian has also quite enjoyed giving presentations and seminars in the past, which isn't too hard as he loves teaching and exploring new ideas.