Create a better search engine than Google

The findings and opinions contained within this post are entirely mine and do not necessarily reflect those of Edge Hill University. The research and write-up were done in my own time and are only posted here because they may be of interest to the HE community.

At BarCamp Liverpool I gave a talk about site search. It covered the same sort of topics as Martin Belam’s Euro IA Summit presentation Taking the ‘Ooh’ out of Google and I recommend that anyone interested in site search read his series of blog posts. Go and read it now – it’s far better than anything I’m going to write here! While Martin reviewed news websites from across Europe, I’ve turned my attention to university websites, looking at what institutions are doing now and how it could be improved.

Do we really need site search?

Before we get too tied up in what to do, it’s perfectly valid to question whether we even need a site search engine any more. Every internet user knows how to search – specifically, everyone knows how “to Google”. The average query length has almost doubled over the last few years and there is some evidence to show people are using site names to restrict their search – for example “english pgce edge hill”.

Google have also introduced “search this site” boxes into search results which mean that when you search for “Edge Hill” you can filter down your results by making an extra query. Putting “english pgce” into this search box gives results for the query “english pgce site:edgehill.ac.uk”. This feature is already present for most HEIs:

Google Site Search for Edge Hill University

So how many people are using this feature? Our Google Analytics reports show that in the last six months, just 150 visitors to our site came from searches including “site:edgehill.ac.uk”. A much higher number of people are including “Edge Hill” in queries. Excluding searches which are looking directly for us – for example “Edge Hill University”, “www.edgehill.ac.uk”, or worst of all “edgehill uni” – around 10% of Google referrals come with some form of restriction to search our site.

How does this compare to our own site search? We’ve had 270,000 searches in the last six months, with 7% of visitors using it. Additionally, the point where people want to search is usually after they’ve left Google and come to our site. Someone looking for courses, for example, isn’t going to go back to Google to search – they need search within the context of the pages they’re looking at.

State of the Union Universities

One weekend in November I spent a few hours going through HEI websites and testing their search engines. On each site I searched for “computing” – most universities offer some form of computing course and it would typically find their IT Services department. I noted down which system they used for search, and whether they provided the ability to search just courses, news and events. All quite basic but it has still offered up some interesting findings.

Google Search Appliance by Adriaan Bloem

Over the last few years most universities have looked to Google to provide search. 63% of HEIs now use some form of Google-powered search engine. Most of those have bought a Google Search Appliance (or Google Mini) – a server you install locally which provides your own miniature Google search engine. The interface and results are familiar to users – it’s clean and quick.

Others are using either Google Custom Search Engine or what I’ve recorded as “Google Syndicated Search” – both similar services where Google will spider your site remotely with no requirement to run your own servers. CSE allows you to embed the results within your own page while Syndicated Search allows basic branding to be applied to results displayed on a Google domain.

In a distant second place is ht://Dig – a system that’s been around for years. When I first came across it many years ago I was amazed that I could run my own search engine. It was one of the first widely deployed site search systems and so a number of universities were early adopters. It hasn’t moved with the times – the latest release of ht://Dig was in June 2004.

The short tail includes Ultraseek, Egothor and Novell’s QuickFinder – the system that powered Edge Hill’s site search until April this year. A few sites use search engines built into their CMS or custom built but it’s hard to find out much about these.

This is all very good but in many ways, the engine doesn’t matter – it’s what you do with it that counts!

The ugly, the bad and the good

Google has led the way with clean, uncluttered results pages but some search engines haven’t learnt that less is more and seem compelled to pack in every feature they can. Star ratings or other indicators of the quality of the page add little and can be distracting – users already expect that a page higher up the list is a closer match to their query, so they may not be necessary. On our own search we have a percentage relevance score but I’m not sure it adds much value for the visitor, so it might be getting cut soon.

The above example from Keele also shows another problem, with some search engines blindly indexing the full content of pages. We know exactly which parts of a page are relevant and which are navigation bars, headers or footers. Including these in the index can distort results while adding little value. A search for “Freedom of Information” might return 10,000 pages simply because it’s linked from the footer of every page.

Worse than indexing these areas is displaying the text as part of the result summary. Here the first line of every result is the navigation.
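None of the systems I looked at publish their indexing code, but the general fix is straightforward: strip the page furniture before the text reaches the indexer. Here’s a minimal TypeScript sketch, assuming the indexer is handed raw HTML and a DOM implementation is available (a browser DOMParser, or jsdom on a server); the selectors are only examples of how a site might mark up its navigation and footer.

```typescript
// Sketch: strip navigation, header and footer before handing a page to the
// indexer, so "Freedom of Information" in the footer doesn't match every page.
// Assumes a DOM implementation (browser DOMParser, or jsdom server-side);
// the selectors are illustrative – use whatever your templates actually emit.
function extractIndexableText(html: string): string {
  const doc = new DOMParser().parseFromString(html, "text/html");

  // Remove the page furniture that appears on every page.
  doc.querySelectorAll("nav, header, footer, .breadcrumb, #site-nav")
     .forEach((el) => el.remove());

  // Prefer an explicit content region if the templates provide one.
  const main = doc.querySelector("main, #content, .article-body");
  const text = (main ?? doc.body).textContent ?? "";

  return text.replace(/\s+/g, " ").trim();
}
```

It also solves the summary problem for free: if the navigation never makes it into the index, it can’t be shown as the first line of every result either.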

Sussex University search

Returning an unfamiliar layout for the results page can lead to confusion. The University of Sussex have a combined results page showing “top matches”, news, events, and “full text” stacked in one page.

Advanced search

Many site search engines offer some form of “advanced” search and this can be a very useful feature for power users to track down more precisely what they’re looking for. That extra functionality inevitably comes with the risk of extra complexity. With Google Search Appliances the advanced search page is similar to the one on the main Google site and provides some options that might not be necessary. If the entire site is on a single domain, or the only language is English, why do you need the option to restrict searches by either? (Maybe this is a question Google should be answering – they can tell from the index whether there are multiple domains or languages.)

Topic specific searches provide potential for a whole extra level of complexity. Nottingham Trent University’s course finder requires you to enter level, mode and year of study, subject area and keywords. Why not just a simple box?

Advanced search can be very powerful and easy to use. The University of Warwick have just a few additional options, specifying dates using a drop down rather than a datepicker, and allowing searches to be limited to certain types of document or areas of the site. Edge Hill’s course search has just one “advanced” option – course type.

Tabbed Search

This is something Martin Belam urges caution over. The user may not understand the structure behind this scoping of search and thus limit the results more than they intended. Edge Hill uses tabs to provide scoped search for courses, news and events. Warwick also has a tabbed interface to pull in search results from their blogs and people finder. Care must be taken to ensure that scoping isn’t confused with site navigation and it makes sense to show visually where there are more results. Warwick use some nifty Ajax to load up the number of search results for each tab – an idea I liked so much I implemented it on the Edge Hill site too!
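For anyone wanting to try the same trick, here’s a rough sketch of the idea – not Warwick’s or Edge Hill’s actual code. The /search/count endpoint, the scope names and the element IDs are all assumptions; substitute whatever your own search backend and templates provide.

```typescript
// Sketch: fetch result counts for each scoped tab so users can see at a glance
// where the matches are. The "/search/count" endpoint, scope names and element
// IDs are hypothetical – adapt them to your own system.
const scopes = ["courses", "news", "events"];

async function updateTabCounts(query: string): Promise<void> {
  await Promise.all(scopes.map(async (scope) => {
    const res = await fetch(`/search/count?scope=${scope}&q=${encodeURIComponent(query)}`);
    const { total } = await res.json() as { total: number };

    const tab = document.querySelector(`#tab-${scope} .count`);
    if (tab) tab.textContent = `(${total})`;   // e.g. "News (12)"
  }));
}

void updateTabCounts("computing");
```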

More neat ideas

News search on the University of Oxford site is integrated right into their news system – results are presented in the same format as the news homepage showing the latest stories, which makes the results pages feel familiar. Each summary is accompanied by a thumbnail, and icons flag up news stories with audio or video attached.

Everyone has a custom 404 page these days to help direct users to what they’re looking for (you do, don’t you?!), so why are most search engines so unhelpful when there are no results? At best they’ll correct your spelling and provide some hints on rewording your query – at worst they’ll just dump “No results found” to the screen. This is turning visitors away when they most need our help. Most sites have an A to Z list or sitemap; prospective students can be helped by linking to the full list of courses.
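A friendlier zero-results page doesn’t need much: a spelling suggestion if you have one, and links to the places people can browse instead. A small sketch – the links, markup and helper name are purely illustrative:

```typescript
// Sketch: when a search returns nothing, show useful escape routes instead of
// a bare "No results found". The links here are assumptions – point them at
// your own A to Z, sitemap and course list.
function renderNoResults(query: string, suggestion?: string): string {
  const didYouMean = suggestion
    ? `<p>Did you mean <a href="/search?q=${encodeURIComponent(suggestion)}">${suggestion}</a>?</p>`
    : "";

  return `
    <h2>No results for "${query}"</h2>
    ${didYouMean}
    <p>Try a broader term, or start from one of these:</p>
    <ul>
      <li><a href="/atoz">A to Z of departments and services</a></li>
      <li><a href="/courses">Full list of courses</a></li>
      <li><a href="/sitemap">Sitemap</a></li>
    </ul>`;
}
```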

Auto suggest for popular queries can be a good way to encourage users to search more accurately. I’ve not seen anything similar on HEI sites yet but Facebook’s search does a neat job of integrating specific people, groups and organisations with wider search – selecting one of these takes you directly to that page – a really handy quick links feature.
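The mechanics are simple enough: keep a list of popular queries from your search logs and match them against what the user has typed so far. A minimal sketch – the query list and element ID are made up for illustration:

```typescript
// Sketch: suggest popular past queries as the user types. The list would come
// from your own search logs; here it's hard-coded for illustration.
const popularQueries = ["computing", "english pgce", "open days", "accommodation", "term dates"];

function suggest(prefix: string, limit = 5): string[] {
  const p = prefix.trim().toLowerCase();
  if (!p) return [];
  return popularQueries.filter((q) => q.startsWith(p)).slice(0, limit);
}

// Wire it to the search box (the "#search-box" ID is an assumption).
document.querySelector<HTMLInputElement>("#search-box")
  ?.addEventListener("input", (e) => {
    const matches = suggest((e.target as HTMLInputElement).value);
    console.log(matches);   // in practice, render these in a dropdown
  });
```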

Northumbria’s search tag cloud is a little out of the way but it would be possible to add a top searches list somewhere a bit more visible. Storing a user’s search history would be very easy to do (even client side) and with a little logging enabled it would also be possible to implement “people who searched for X also searched for Y” functionality.
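As a rough illustration of how that logging could be used, here’s a sketch that counts which queries appear together in the same visitor session. The log format is an assumption, not how any of these sites actually store their data.

```typescript
// Sketch: "people who searched for X also searched for Y" from per-session
// search logs. The log format (one array of queries per visitor session) is an
// assumption – adapt it to whatever your logging actually captures.
type SessionLog = string[][];   // each inner array = queries from one session

function relatedQueries(logs: SessionLog, query: string, limit = 3): string[] {
  const counts = new Map<string, number>();

  for (const session of logs) {
    if (!session.includes(query)) continue;
    for (const other of session) {
      if (other === query) continue;
      counts.set(other, (counts.get(other) ?? 0) + 1);
    }
  }

  // Most frequently co-occurring queries first.
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit)
    .map(([q]) => q);
}

// e.g. relatedQueries(logs, "computing") might return ["computer science", "it services"]
```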

The final neat feature I came across was a few sites making use of Google’s KeyMatch feature. This is similar to Google AdWords but allows the site owner to specify what should show up for particular keywords. During my presentation someone suggested that users would blank these results because they see them as adverts. I’m not so sure – I think people will click the links if it’s clear where they’re going. KeyMatch is a good way of making sure that important pages are at the top of the results even when the search algorithm doesn’t rank them there.
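KeyMatch itself is configured within the Google appliances, but the underlying idea – pinning a hand-picked page to the top for certain queries – is easy to replicate on any system. A sketch, with made-up mappings:

```typescript
// Sketch of the general KeyMatch idea: pin a hand-picked result to the top of
// the page for certain queries, regardless of how the algorithm ranks it.
// The mappings below are illustrative, not a real configuration.
interface PinnedResult { title: string; url: string; }

const keyMatches: Record<string, PinnedResult> = {
  "open day":   { title: "Open Days",            url: "/opendays" },
  "prospectus": { title: "Order a prospectus",   url: "/prospectus" },
  "computing":  { title: "BSc (Hons) Computing", url: "/courses/computing" },
};

function pinnedFor(query: string): PinnedResult | undefined {
  const q = query.trim().toLowerCase();
  // Exact match first, then fall back to a simple substring check.
  return keyMatches[q] ?? Object.entries(keyMatches)
    .find(([key]) => q.includes(key))?.[1];
}
```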

All talk?

By now I’m sure you’re fed up of my holier-than-thou ranting about site search and I want to stress that what we’ve done at Edge Hill isn’t perfect by any means! The algorithm used for indexing and searching is quite crude in places but I think we’ve been able to improve the user experience by adding a few neat features and trying to keep in mind what the user is likely to search for, not just what a spider can find.

The search system we use is built on Zend Lucene and has had around a week of development hours over the last year, but not everyone will be able to do that, so what can be done with existing resources? In most systems, changing templates to remove unnecessary features or alter the way results are formatted is quite trivial. Salford University have just launched a new site search powered by a Google Search Appliance and their results pages show what can be done with some clever XSLT – gone is the advanced search, and look out for the nice file type icons using image replacement.

Finally, one last plug for Martin Belam’s Taking the ‘Ooh’ out of Google – he shows loads of simple ways to improve your results and it’s essential reading for anyone interested in site search.

9 thoughts on “Create a better search engine than Google”

  1. I’d be interested to hear some of the feedback that you were giving on the Warwick experience during your talk – from the slides it seems there are a few points that come up:

    * We didn’t fare well for ‘Computing’ since our degree is called Computer Science, though that’s been fixed now and added as a synonym in our backend (a big win for building your own engine rather than use an OTS solution)
    * Searching for a name that has an exact result isn’t a great experience, since it leaves you on the ‘web’ tab when the ‘people’ tab is better, and you have to have experience of the system to notice the ‘people’ tab at all… we’ve been trying to find a nice solution to this for a while.
    * We don’t have a course search – maybe this is something that’s hurting us, since a lot of the queries we get from people who aren’t satisfied with their search experience are for courses that don’t exist, or they’re frustrated because they can’t find them – something else to look at, maybe.

    As an aside, you seemed to think that our range of Advanced options was a little baffling, but they’re hidden away unless you choose to expose them – the vast majority of people just want a box to type into. It’s difficult to give people the refinement options they want while keeping it simple enough for the masses to use, unfortunately!

    One thing that I’d like to add is that it’s relatively trivial to start enriching the search experience. Searches for the ‘library’, for example, will show a link to a map of the library on campus as well as the standard web results:

    http://search.warwick.ac.uk/website?q=library

    This is particularly important since the Library have decided to make their homepage just 6 image links – and we do it for a number of other pages, because we think it enriches the user experience.

  2. Hi Mat,

    I didn’t go into the specifics about whether each HEI’s results matched expectations – I just used “computing” as one of my standard queries when testing because it often showed up the differences between academic and support content.

    In terms of Warwick’s performance – it was almost all positive. I was actually using your advanced search as an example of it done better than Google, by providing relevant options and simple to use interfaces (e.g. the date picker).

    I’d be interested to know a bit more about how your search works, if you’d be willing to share?!

  3. The KeyMatch vs. ads is something we talked about in Bath the other week. In the end we decided to remove the special styling and just make it look like the normal top search result.

  4. I thought I’d share these rather than emailing you them, since there might be other people here who’d find them useful – you can contact me at M.Mannion@warwick.ac.uk if you have questions (though the guy I sent them to never got back to me so maybe they’re not useful at all…)

    Our search service is backed by Apache Lucene (http://lucene.apache.org/java/) and indexes the content from our CMS directly. We’ve made a few optimisations on the basic indexing features of Lucene – I’m taking this from the code so bear with me!

    * We boost the title of the page by a factor of 1.5
    * We index a number of fields: Title, description, keywords, parts of the URL, etc.
    * We index directly out of our CMS rather than indexing the HTML output, which lets us get more metadata (for example for news and events) and means we don’t have to strip out the standard stuff like the Privacy statement
    * In our CMS we have a concept of a site, so under http://www.warwick.ac.uk, /services/its (for example) is a site. We boost the results for “Site Roots” (the top page of a site) so they appear higher in results
    * We boost images by 0.5, so they’re half as likely to come up
    * We use ffmpeg to get preview images from video files (and prefer videos with previews to ones without)
    * We use Porter stemming to show hardcoded results for popular queries, so we make sure that searches for “library” come up with the library’s homepage at the top, despite the fact that the homepage has no real content
    * We also show a snippet of hardcoded HTML for specific results, showing a map, opening hours etc. as well as crawled information like upcoming shows at the Arts Centre, so the most useful information is easily available
    * We look for patterns and show relevant information based on those – if you search for an email address you might be looking for people search, or your own inbox; if you search for a login code, you probably want to know how to log in; if you search for a room number you probably want to know where it is (e.g. http://search.warwick.ac.uk/website?q=r0.21)
    * We open all our APIs for searching specific or crawled information in RSS, Atom and JSON so that users can use it to create custom pages (such as showing the latest video uploaded under a certain site with certain keywords, for example) – I blogged about this here: http://blogs.warwick.ac.uk/mmannion/entry/providing_json_data/

    When we do queries, we have a Lucene FunctionQuery that weights newer documents higher using a cosine function, so there is a small tail-off (documents updated in the last three months or so are unaffected) that regresses fairly quickly, so documents not updated in over a year require fairly specific searches to find.

    Hope some of that is useful :)

Comments are closed.