Choice Part 6: Lucene in the sky with diamonds

Search is one of the key ways that visitors find what they’re looking for on our websites. A good search engine can quickly and acurately direct the user to the right place and make for a more efficient and productive experience.

In the past we’ve used Novell’s QuickFinder search service to spider the site, supplemented by a couple of custom search systems for things like courses. I’ve never been entirely happy with the results that QuickFinder provides.

Recently in Higher Education and beyond, there has been a trend towards Google’s search appliance and their hosted solutions. Both of these are excellent in terms of raw power – they will happily index every page on a site and searches are quick and mostly relevant. But there’s more to a good search engine than the size of the index – they must provide the results you’re looking for and present them in an easy to understand way. Here’s a fairly typical example of the top search result for a search for “Computing” (I’ve removed identifying names!):

The University of Somewhere

For Edge Hill it’s important that prospective students are able to find what they’re looking for. So in the above example it’s good that it has picked a page about the academic department rather than what at Edge Hill would be IT Services, but it’s actually the Faculty page giving the briefest of details. The summary doesn’t help at all – the spider has picked up details from the page header including the alternative text from the logo and the breadcrumb trail.

What we want are relevant results which allow the visitor to quickly identify what pages have been found with information that’s relevant to the results, not just scraped text. Some search engines are starting to do this – when Google finds videos it will show a thumbnail and allow you to play the video inline – so we can use some of these ideas when creating our own search system. Now let’s get a bit more technical!

Our website can be split into two types of information – structured and unstructured. When I say unstructured, I don’t mean that it’s hundreds of pages put online without any consideration – I’m talking about web pages of content that aren’t stored in a database. Structured information is pulled out from one of our databases – things like news, events or courses. Structured content is what most search engines find difficult because they don’t “know” what a page is all about, but we do, so we can tell our search engine what information is important and how we should represent it.

For our new website, we’ve introduced a new search system based on Zend Lucene. Lucene isn’t a full blown search engine, but it’s a library you can build on to provide full text indexing of almost anything you want. We’re using a symfony plugin which packages a lot of search functionality to allow us to index news, events, courses and other information directly from the database. We have control over what information is indexed for each type and the weightings applied to them. For example we give courses a slightly higher weighting than news.

For static content we have a custom spider which trawls all the other pages on the site and adds them to the index. This work like any other search engine, following links and determining which text is relevant. We try to exclude the header, footer and navigation from the index as this contains text which is common to many pages and adds little to the value of the page.

Edge Hill’s computing search resultWe can also do a lot with the search interface itself. Firstly, different types of result show different information. For example a course result shows the UCAS code, qualification, which campuses it runs at and allows the course to be added to the My Courses basket for comparison. News and events shows similar custom results while static pages show the usual snippet of text from the page, but without irrelevant text from outside the content area creeping in.

Overall the new search seems to be working quite well – we’re able to embed it into the rest of the site more than we’ve done in the past and provide custom search boxes for courses and news. There’s still work to do on it though to improve the accuracy of results, so if you’ve tried the search and not found what you were looking for easily, please let us know.

2 thoughts on “Choice Part 6: Lucene in the sky with diamonds

  1. Steve Martin

    I found your blog via something I was searching for about Symfony and realised that the ‘University of Somewhere’ is my university 😉 I could recognise those awful Oracle Portal addresses from 50 paces! Fair comment about the summary etc, this is something I have been meaning to look at for some time but have never found that time… I like what you are doing with Lucene though – hats off.

  2. Mike Nolan Post author

    Hi Steve,

    You weren’t the worst example I came across in my research, you were just unlucky in showing some of the typical problems search engines suffer from in the queries I was trying!

    I’ve generally been impressed with Lucene, and although it does require more initial setting up, it provides a lot of flexibility to integrate tightly with your site.


Comments are closed.