Building an Anti-CMS

At last weekend’s PHP North West Conference I delivered a talk titled Building an Anti-CMS (and how it’s changed our web team).  Feedback has generally been pretty positive so I thought I’d open it up to a bit of constructive criticism from inside the sector (because every web team reads our blog, right?!).

Video from the talk itself is due out within the next month but I re-recorded some audio to turn it into a slidecast to make it a bit more useful:

I’ve given a number of talks before at Edge Hill, at BarCamps and at IWMW but for PHPNW I’ve tried to further develop my style of presentation. Over the last 6 weeks I’ve watched quite a few “Lessig style” talks – making use of lots of short sentences and pictures and not being afraid to have nothing on the screen.

It leads to a massive slidedeck – 86 slides for 13 minutes – and there’s far less room to ad lib but it gets away from some of the things that annoy me about regular death by powerpoint. I’ll let you make up your own mind whether it’s worked!

Create a better search engine than Google

The findings and opinions contained within this post are entirely mine and do not necessarily reflect those of Edge Hill University. The research and write up was done in my own time and is only posted here because it may be of interest to the HE community.

At BarCamp Liverpool I gave a talk about site search. It covered the same sort of topics as Martin Belam’s Euro IA Summit presentation Taking the ‘Ooh’ out of Google and I recommend that anyone interested in site search read his series of blog posts. Go read it now – it’s way better than I’m going to write here! While Martin reviewed news websites from across Europe, I’ve turned my attention to university websites and look at what institutions are doing now and how it can be improved.

Do we really need site search?

Before we get too tied up in what to do, it’s perfectly valid to question whether we even need a site search engine any more. Every internet user knows how to search – specifically, eveyone knows how “to Google”. The average query length has almost doubled over the last few years and there is some evidence to show people are using site names to restrict their search – for example “english pgce edge hill”.

Google have also introduced “search this site” boxes into search results which mean that when you search for “Edge Hill” you can filter down your results by making an extra query. Putting “english pgce” into this search box gives results for the query “english pgce site:edgehill.ac.uk”. This feature is already present for most HEIs:

Google Site Search for Edge Hill University

So how many people are using this feature? Our Google Analytics reports show that in the last six months, just 150 visitors to our site came from searches including “site:edgehill.ac.uk”. A much higher number of people are including “Edge Hill” in queries. Excluding searches which are looking diretly for us – for example “Edge Hill University”, “www.edgehill.ac.uk”, or worse or all “edgehill uni” – around 10% of Google referals come with some form of restriction to search our site.

How does this compare to our own site search? We’ve had 270,000 searches in the last six months, with 7% of visitors using it. Additionally, the point where people are wanting to search is usually after they’ve left Google and come to our site. Someone looking for courses for example isn’t going to go back to Google to search – they need seach within the context of the pages they’re looking at.

State of the Union Universities

One weekend in November I spent a few hours going through HEI websites and testing their search engines. On each site I searched for “computing” – most universities offer some form of computing course and it would typically find their IT Services department. I noted down which system they used for search, and whether they provided the ability to search just courses, news and events. All quite basic but it has still offered up some interesting findings.

Google Search Appliance by Adriaan BloemOver the last few years most universities have looked to Google to provide search. 63% of HEIs now use some form of Google-powered search engine. Most of those have bought a Google Search Appliance (or Google Mini) – a server you install locally which provides your own miniture Google Search engine. The interface and results are familiar for users – it’s clean and quick.

Others are using either Google Custom Search Engine or what I’ve recorded as “Google Syndicated Search” – both similar services where Google will spider your site remotely with no requirements to run your own servers. CSE allows you to embed the results within your own page while Syndicated Search allows basic branding to be applied to results displayed on a Google domain.

In a distant second place is ht://Dig – a system that’s been around for years. When I first came across it many years go I was amazed that I could run my own search engine. It was one of the first widely deployed site search systems and so a number of universities were early adoptors. It hasn’t moved with the times – the latest release of ht://Dig was in June 2004.

The short tail includes Ultraseek, Egothor and Novell’s QuickFinder – the system that powered Edge Hill’s site search until April this year. A few sites use search engines built into their CMS or custom built but it’s hard to find out much about these.

This is all very good but in many ways, the engine doesn’t matter – it’s what you do with it that counts!

The ugly, the bad and the good

Google has led the way with clean, uncluttered results pages but some search engines haven’t learnt that less is more and seem compelled to pack in every feature they can. Star ratings or other indicators of the quality of the page add little and can be distracting. Users know to expect that if a page is higher up the list then it’s a closer match to their query so it may not be necessary. On our own search we have a percentage relevance but I’m not sure it adds much value to the visitor so it might be getting cut soon.

The above example from Keele also shows another problem with some search engine blindly indexing the full content of pages. We know exactly which parts of a page are relevant and which are navigation bars, headers or footers. Including these in the index can distort results while adding little value. A search for “Freedom of Information” might return 10,000 pages simply because it’s linked from the footer of every page.

Worse than indexing these areas is displaying the text as part of the result summary. Here the first line of every result is the navigation.

Sussex University searchReturning an unfamiliar layout for the results page can lead to confusion. The University of Sussex have a combined results page showing “top matches”, news, events, and “full text” stacked in one page.

Advanced search

Many site search engines offer some form of “advanced” search and this can be a very useful feature for power users to track down more precisely what they’re looking for. This extra functionality inevitably comes with the risks of extra complexity. With Google Search Appliances the advanced search page is similar to the one on the main Google site and provides some options that might not be necessary. If the entire site is on a single domain or the only language is English, why do you need the option to restrict searches by either of these (maybe this is a question Google should be answering because they can tell from the index if there’s multiple domains or languages).

Topic specific searches provide potential for a whole extra level of complexity. Nottigham Trent University’s course finder requires you to enter level, mode and year of study, subject area and keywords. Why not just a simple box?

Advanced search can be very powerful and easy to use. The University of Warwick have just a few additional options, specifying dates using a drop down rather than a datepicker, and allowing searches to be limited to certain types of document or areas of the site. Edge Hill’s course search has just one “advanced” option – course type.

Tabbed Search

This is something Martin Belam urges caution over. The user may not understand the structure behind this scoping of search and thus limit the results more than they intended. Edge Hill uses tabs to provide scoped search for courses, news and events. Warwick also has a tabbed interface to pull in search results from their blogs and people finder. Care must be taken to ensure that scoping isn’t confused with site navigation and it makes sense to show visually where there are more results. Warwick use some nifty Ajax to load up the number of search results for each tab – an idea I liked so much I implemented it on the Edge Hill too!

More neat ideas

News search on the University of Oxford is integrated right into their news system – results are presented in the same format as the news homepage showing the latest stories making the results pages familiar. Each summary is accompanied by a thumbnail and icons flag up news stories with audio or video attached.

Everyone has a custom 404 page these days to help direct users to what they’re looking for (you do, don’t you?!) so why are most search engines unhelpful when there’s no results? At best they’ll correct your spelling and provide some hints on rewording your query – at worst they’ll just dump “No results found” to screen. This is turning visitors away when they most need our help. Most sites have an A to Z list or sitemap; prospective students can be helped by linking to the full list of courses.

Auto suggest for popular queries can be a good way to encourage users to search more acurately. I’ve not seen anything similar on HEI sites yet but Facebook’s search does a neat job of integrating specific people, groups and organisations with wider search – selecting these takes yuu direct to that page – a really handy quick links feature.

Northumbria’s search tag cloud is a little out of the way but it would be possible to add in a top searchs list somewhere a bit more visible. Storing a user’s search history would be very easy to do (even client side) and with a little logging enabled it would also be possible to implement “people who searched for X also searched for Y” functionality.

The final neat feature I came across was a few sites making use of Google’s KeyMatch feature. This is similar to Google AdWords but allows the site owner to specify what should show up for specified keywords. During my presentation someone suggested that users would blank these results because they see them as adverts. I’m not so sure – I think people will click the links if it’s clear where it’s going to. KeyMatch is a good way of making sure that important pages are at the top of the results even when the search algorithm doesn’t rank them.

All talk?

By now I’m sure you’re fed up of my holier-than-thou ranting about site search and I want to stress that what we’ve done at Edge Hill isn’t perfect by any means! The algorithm used for indexing and searching is quite crude in places but I think we’ve been able to improve the user experience by adding a few neat features and trying to keep in mind what the user is likely to search for, not just what a spider can find.

The search system we use is built on Zend Lucene and has had around a week of development hours over the last year but not everyone will be able to do that, so what can be done with existing resources? In most systems, changing templates to remove unnecessary features of alter the way results are formatted is quite trivial. Salford University have just launched a new site search powered by a Google Search Appliance and their results pages show what can be done with some clever XSLT – gone is the advanced search and look out for the nice file type icons using image replacement.

Finally, one last plug for Martin Belam’s Taking the ‘Ooh’ out of Google – he shows loads of simple ways to improve your results and it’s essential reading for anyone interested in site search.

Choice Part 6: Lucene in the sky with diamonds

Search is one of the key ways that visitors find what they’re looking for on our websites. A good search engine can quickly and acurately direct the user to the right place and make for a more efficient and productive experience.

In the past we’ve used Novell’s QuickFinder search service to spider the site, supplemented by a couple of custom search systems for things like courses. I’ve never been entirely happy with the results that QuickFinder provides.

Recently in Higher Education and beyond, there has been a trend towards Google’s search appliance and their hosted solutions. Both of these are excellent in terms of raw power – they will happily index every page on a site and searches are quick and mostly relevant. But there’s more to a good search engine than the size of the index – they must provide the results you’re looking for and present them in an easy to understand way. Here’s a fairly typical example of the top search result for a search for “Computing” (I’ve removed identifying names!):

The University of Somewhere

For Edge Hill it’s important that prospective students are able to find what they’re looking for. So in the above example it’s good that it has picked a page about the academic department rather than what at Edge Hill would be IT Services, but it’s actually the Faculty page giving the briefest of details. The summary doesn’t help at all – the spider has picked up details from the page header including the alternative text from the logo and the breadcrumb trail.

What we want are relevant results which allow the visitor to quickly identify what pages have been found with information that’s relevant to the results, not just scraped text. Some search engines are starting to do this – when Google finds videos it will show a thumbnail and allow you to play the video inline – so we can use some of these ideas when creating our own search system. Now let’s get a bit more technical!

Our website can be split into two types of information – structured and unstructured. When I say unstructured, I don’t mean that it’s hundreds of pages put online without any consideration – I’m talking about web pages of content that aren’t stored in a database. Structured information is pulled out from one of our databases – things like news, events or courses. Structured content is what most search engines find difficult because they don’t “know” what a page is all about, but we do, so we can tell our search engine what information is important and how we should represent it.

For our new website, we’ve introduced a new search system based on Zend Lucene. Lucene isn’t a full blown search engine, but it’s a library you can build on to provide full text indexing of almost anything you want. We’re using a symfony plugin which packages a lot of search functionality to allow us to index news, events, courses and other information directly from the database. We have control over what information is indexed for each type and the weightings applied to them. For example we give courses a slightly higher weighting than news.

For static content we have a custom spider which trawls all the other pages on the site and adds them to the index. This work like any other search engine, following links and determining which text is relevant. We try to exclude the header, footer and navigation from the index as this contains text which is common to many pages and adds little to the value of the page.

Edge Hill’s computing search resultWe can also do a lot with the search interface itself. Firstly, different types of result show different information. For example a course result shows the UCAS code, qualification, which campuses it runs at and allows the course to be added to the My Courses basket for comparison. News and events shows similar custom results while static pages show the usual snippet of text from the page, but without irrelevant text from outside the content area creeping in.

Overall the new search seems to be working quite well – we’re able to embed it into the rest of the site more than we’ve done in the past and provide custom search boxes for courses and news. There’s still work to do on it though to improve the accuracy of results, so if you’ve tried the search and not found what you were looking for easily, please let us know.

>