Choice Part 6: Lucene in the sky with diamonds

Search is one of the key ways that visitors find what they’re looking for on our websites. A good search engine can quickly and accurately direct the user to the right place and make for a more efficient and productive experience.

In the past we’ve used Novell’s QuickFinder search service to spider the site, supplemented by a couple of custom search systems for things like courses. I’ve never been entirely happy with the results that QuickFinder provides.

Recently in Higher Education and beyond, there has been a trend towards Google’s search appliance and their hosted solutions. Both of these are excellent in terms of raw power – they will happily index every page on a site and searches are quick and mostly relevant. But there’s more to a good search engine than the size of the index – it must provide the results you’re looking for and present them in an easy-to-understand way. Here’s a fairly typical example of the top result for a search for “Computing” (I’ve removed identifying names!):

The University of Somewhere

For Edge Hill it’s important that prospective students are able to find what they’re looking for. So in the above example it’s good that it has picked a page about the academic department rather than what at Edge Hill would be IT Services, but it’s actually the Faculty page giving the briefest of details. The summary doesn’t help at all – the spider has picked up details from the page header including the alternative text from the logo and the breadcrumb trail.

What we want are relevant results which allow the visitor to quickly identify what pages have been found with information that’s relevant to the results, not just scraped text. Some search engines are starting to do this – when Google finds videos it will show a thumbnail and allow you to play the video inline – so we can use some of these ideas when creating our own search system. Now let’s get a bit more technical!

Our website can be split into two types of information – structured and unstructured. When I say unstructured, I don’t mean that it’s hundreds of pages put online without any consideration – I’m talking about web pages of content that aren’t stored in a database. Structured information is pulled out from one of our databases – things like news, events or courses. Structured content is what most search engines find difficult because they don’t “know” what a page is all about, but we do, so we can tell our search engine what information is important and how we should represent it.

For our new website, we’ve introduced a new search system based on Zend Lucene. Lucene isn’t a full-blown search engine, but it’s a library you can build on to provide full text indexing of almost anything you want. We’re using a symfony plugin which packages a lot of search functionality to allow us to index news, events, courses and other information directly from the database. We have control over what information is indexed for each type and the weightings applied to them. For example, we give courses a slightly higher weighting than news.
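To give a rough idea of how weighting by type can work, here’s a minimal sketch in Python. It is not the Zend Lucene API or the symfony plugin we use; the boost values, the `Document` class and the scoring formula (term counts, with title matches counted double) are all assumptions made up for illustration.

```python
import re
from collections import Counter
from dataclasses import dataclass

# Hypothetical weightings: courses rank slightly above news, as in the post.
TYPE_BOOST = {"course": 1.2, "event": 1.0, "news": 1.0, "page": 0.8}


@dataclass
class Document:
    doc_type: str  # "course", "news", "event" or "page"
    title: str
    body: str


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(doc, query):
    """Count query terms (title matches count double), then apply the type boost."""
    title_counts = Counter(tokenize(doc.title))
    body_counts = Counter(tokenize(doc.body))
    raw = sum(2 * title_counts[t] + body_counts[t] for t in tokenize(query))
    return raw * TYPE_BOOST.get(doc.doc_type, 1.0)


def search(docs, query):
    """Return matching documents, best score first."""
    hits = [(score(d, query), d) for d in docs]
    hits = [(s, d) for s, d in hits if s > 0]
    return [d for s, d in sorted(hits, key=lambda pair: pair[0], reverse=True)]
```

With two otherwise identical documents, one a course and one a news story, the course comes out on top purely because of its boost. A real Lucene index scores terms in a more sophisticated way, but the idea of multiplying by a per-type weighting is the same.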

For static content we have a custom spider which trawls all the other pages on the site and adds them to the index. This works like any other search engine, following links and determining which text is relevant. We try to exclude the header, footer and navigation from the index as these contain text which is common to many pages and adds little to the value of the page.
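As a sketch of the idea (not our actual spider), the Python below pulls the indexable text and the links out of a page while ignoring anything inside the header, footer or navigation. It assumes the page marks those regions with `header`, `footer` and `nav` elements; a site that uses `div`s with particular ids would need to match on attributes instead. The names `ContentExtractor` and `extract` are made up for the example.

```python
from html.parser import HTMLParser

# Assumption: these elements wrap text that should not be indexed.
SKIP_TAGS = {"header", "footer", "nav", "script", "style"}


class ContentExtractor(HTMLParser):
    """Collects text outside the skipped regions, plus every link on the page."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than zero while inside a skipped region
        self.chunks = []     # text fragments to index
        self.links = []      # hrefs for the spider to follow

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        if tag == "a":
            # Links are followed even in the navigation; only their text is skipped.
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract(html):
    """Return (indexable text, list of links) for one page."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks), parser.links
```

The spider would add the returned text to the index and queue the returned links for crawling, so navigation still helps it discover pages without polluting the summaries.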

Edge Hill’s computing search result

We can also do a lot with the search interface itself. Firstly, different types of result show different information. For example a course result shows the UCAS code, qualification, which campuses it runs at and allows the course to be added to the My Courses basket for comparison. News and events show similar custom results while static pages show the usual snippet of text from the page, but without irrelevant text from outside the content area creeping in.

Overall the new search seems to be working quite well – we’re able to embed it into the rest of the site more than we’ve done in the past and provide custom search boxes for courses and news. There’s still work to do on it though to improve the accuracy of results, so if you’ve tried the search and not found what you were looking for easily, please let us know.

PHP London 2008

A belated writeup on last week’s PHP London Conference. Andy’s already written a post so I don’t feel too bad!

As it turned out we split the sessions so I’ll just cover those Andy’s not mentioned. First up was Stefan Esser’s PHP Binary Analysis. It looked at using compiled PHP bytecode to debug and audit your code. Probably of more use to people doing detailed security audits but some interesting ideas that I’d like to look into when I get a bit more time.

After lunch Marcus Bointon presented Mail(); & Life after Mail(). He started early on by quoting a blog post from Haacked:

I Knew How To Validate An Email Address Until I Read The RFC

Anyone who’s ever tried to send email using PHP’s mail() function will know the lengths you go to to get things working. Even then you’re probably doing it wrong. The solution is to use a library to handle all the standards compliance for you, something that symfony provides through the PHPMailer library.

Marcus went through a bunch more libraries and compared some of the features they provide so it will be interesting to look into what’s best for our needs.

More interesting for me was finding out about return paths. The return path is the address a bounce is sent to when an email can’t be delivered, and with a bit of server side magic it is possible to handle errors better. It’s quite a complex task to do properly so I’m interested in a good hosted service which can be used for both one shot emails like user registrations and batch mailshots. Apparently there are a few services out there but I’ve not seen any with a really good API.
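One common form of that server side magic is to give every message its own return path with the recipient encoded in it, a technique usually called VERP (variable envelope return path). I’m not claiming this is what Marcus showed; it’s a minimal Python sketch of the idea, and the `bounces.example.org` domain, the `bounce` prefix and both function names are invented for the example.

```python
def make_return_path(recipient, bounce_domain="bounces.example.org", prefix="bounce"):
    """Build a per-recipient return path, e.g. bounce+user=host@bounce_domain."""
    local, _, domain = recipient.rpartition("@")
    if not local or not domain:
        raise ValueError("not an email address: %r" % recipient)
    return f"{prefix}+{local}={domain}@{bounce_domain}"


def recipient_from_return_path(return_path, prefix="bounce"):
    """Recover the original recipient from a bounce address, or None if it isn't ours."""
    local, _, _ = return_path.rpartition("@")
    tag = prefix + "+"
    if not local.startswith(tag):
        return None
    # A domain never contains "=", so the last "=" separates user and domain.
    user, sep, domain = local[len(tag):].rpartition("=")
    if not sep or not user or not domain:
        return None
    return f"{user}@{domain}"
```

The generated address is used as the envelope sender when the message is handed to the mail server. When a bounce later arrives at that address, decoding it tells you which recipient failed without having to parse the bounce message itself.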

The final session I went to alone was My Framework Is Better Than Yours? presented by Rob Allen, Toby Beresford and Ian P. Christian. Each gave a short presentation on their framework of choice – Zend, Code Igniter and symfony – followed by a panel discussion. It was clear that each has its advantages and disadvantages:

  • Zend is good for components, letting you pick and choose which aspects of a framework you need. It can often be used with other frameworks too. This can also be a downside as they’re maybe not quite as integrated as other systems.
  • Code Igniter is lightweight and some might like that it runs under PHP4. Personally I think this is a disadvantage. Someone in the audience suggested there was a way of turning on PHP5 mode but I can’t believe this does more than activate a few extra features. Coding for PHP5 is an attitude shift and I don’t see how they’ve done this while retaining compatibility.
  • symfony, well I knew a bit about that already 😉 Pookey did a pretty good job of presenting it.

During the panel discussion there was a comment about the criminal use of the term MVC to describe the frameworks. It got the attention of the room and there’s quite a lot of talk about this on the interweb. My view is that it doesn’t really matter whether a framework sticks rigidly to some design pattern if it provides the features that you need. I’m interested in getting things done, not in the theory of system design.

That’s all from me – check out Andy’s summary of the other sessions.
