Monthly Archives: December 2008

Internet Explorer Security Alert

So the BBC have finally picked up the news and jumped on the bandwagon. Mass media are now telling you to switch to a more secure web browser (you know, the thing you’re using to view this web page).

From the BBC:

The flaw in Microsoft’s Internet Explorer could allow criminals to take control of people’s computers and steal their passwords, internet experts say.

As many as 10,000 websites have been compromised since last week to take advantage of the security flaw, said antivirus software maker Trend Micro.

Are you ready to make the switch? I certainly don’t want my passwords or bank account details stolen and my bank account emptied, do you?

For your home computers and laptops: Get Firefox now!

Steve Daniels

Create a better search engine than Google

The findings and opinions contained within this post are entirely mine and do not necessarily reflect those of Edge Hill University. The research and write up was done in my own time and is only posted here because it may be of interest to the HE community.

At BarCamp Liverpool I gave a talk about site search. It covered the same sort of topics as Martin Belam’s Euro IA Summit presentation Taking the ‘Ooh’ out of Google and I recommend that anyone interested in site search read his series of blog posts. Go read it now – it’s way better than what I’m going to write here! While Martin reviewed news websites from across Europe, I’ve turned my attention to university websites, looking at what institutions are doing now and how it can be improved.

Do we really need site search?

Before we get too tied up in what to do, it’s perfectly valid to question whether we even need a site search engine any more. Every internet user knows how to search – specifically, everyone knows how “to Google”. The average query length has almost doubled over the last few years and there is some evidence to show people are using site names to restrict their search – for example “english pgce edge hill”.

Google have also introduced “search this site” boxes into search results which mean that when you search for “Edge Hill” you can filter down your results by making an extra query. Putting “english pgce” into this search box gives results for the query “english pgce site:edgehill.ac.uk”. This feature is already present for most HEIs:
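As a rough illustration of what that extra query amounts to, here is a minimal Python sketch that builds a site-restricted Google search URL. The function name is made up for this example; the only things taken from the post are the "site:" operator and the example query.

```python
from urllib.parse import urlencode


def site_restricted_url(query: str, domain: str) -> str:
    """Build a Google search URL limited to one domain via the site: operator."""
    # Hypothetical helper: appends "site:<domain>" to whatever the user typed.
    restricted = f"{query} site:{domain}"
    return "https://www.google.com/search?" + urlencode({"q": restricted})


print(site_restricted_url("english pgce", "edgehill.ac.uk"))
```

The same trick works from any search box on your own pages: append the operator server side and hand the visitor over to Google.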

Google Site Search for Edge Hill University

So how many people are using this feature? Our Google Analytics reports show that in the last six months, just 150 visitors to our site came from searches including “site:edgehill.ac.uk”. A much higher number of people are including “Edge Hill” in queries. Excluding searches which are looking directly for us – for example “Edge Hill University”, “www.edgehill.ac.uk”, or worst of all “edgehill uni” – around 10% of Google referrals come with some form of restriction to search our site.

How does this compare to our own site search? We’ve had 270,000 searches in the last six months, with 7% of visitors using it. Additionally, the point where people want to search is usually after they’ve left Google and come to our site. Someone looking for courses, for example, isn’t going to go back to Google to search – they need search within the context of the pages they’re looking at.

State of the Union Universities

One weekend in November I spent a few hours going through HEI websites and testing their search engines. On each site I searched for “computing” – most universities offer some form of computing course and it would typically find their IT Services department. I noted down which system they used for search, and whether they provided the ability to search just courses, news and events. All quite basic but it has still offered up some interesting findings.

Google Search Appliance by Adriaan Bloem

Over the last few years most universities have looked to Google to provide search. 63% of HEIs now use some form of Google-powered search engine. Most of those have bought a Google Search Appliance (or Google Mini) – a server you install locally which provides your own miniature Google search engine. The interface and results are familiar for users – it’s clean and quick.

Others are using either Google Custom Search Engine or what I’ve recorded as “Google Syndicated Search” – both similar services where Google will spider your site remotely with no requirements to run your own servers. CSE allows you to embed the results within your own page while Syndicated Search allows basic branding to be applied to results displayed on a Google domain.

In a distant second place is ht://Dig – a system that’s been around for years. When I first came across it many years ago I was amazed that I could run my own search engine. It was one of the first widely deployed site search systems and so a number of universities were early adopters. It hasn’t moved with the times – the latest release of ht://Dig was in June 2004.

The short tail includes Ultraseek, Egothor and Novell’s QuickFinder – the system that powered Edge Hill’s site search until April this year. A few sites use search engines built into their CMS or custom built but it’s hard to find out much about these.

This is all very good but in many ways, the engine doesn’t matter – it’s what you do with it that counts!

The ugly, the bad and the good

Google has led the way with clean, uncluttered results pages but some search engines haven’t learnt that less is more and seem compelled to pack in every feature they can. Star ratings or other indicators of the quality of the page add little and can be distracting. Users know to expect that if a page is higher up the list then it’s a closer match to their query so it may not be necessary. On our own search we have a percentage relevance but I’m not sure it adds much value to the visitor so it might be getting cut soon.

The above example from Keele also shows another problem: some search engines blindly index the full content of pages. We know exactly which parts of a page are relevant and which are navigation bars, headers or footers. Including these in the index can distort results while adding little value. A search for “Freedom of Information” might return 10,000 pages simply because it’s linked from the footer of every page.
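One way to keep page furniture out of the index is to strip it before the text reaches the indexer. The sketch below uses Python's standard html.parser and assumes, purely for illustration, that the templates mark navigation, headers and footers with nav, header and footer elements; many sites will need to match on class or id attributes instead. The class and function names are invented for this example.

```python
from html.parser import HTMLParser

# Assumption: these elements never hold content worth indexing.
SKIP_TAGS = {"nav", "header", "footer", "script", "style"}


class ContentExtractor(HTMLParser):
    """Collect page text, ignoring anything nested inside a skipped element."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # how many skipped elements we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.skip_depth == 0 and text:
            self.chunks.append(text)


def indexable_text(html: str) -> str:
    """Return only the text that should be passed to the search index."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    "<html><body><header>Edge Hill</header>"
    "<nav><a href='/'>Home</a></nav>"
    "<div><h1>Computing</h1><p>BSc Computing course.</p></div>"
    "<footer><a href='/foi'>Freedom of Information</a></footer>"
    "</body></html>"
)
print(indexable_text(page))
```

With the footer stripped, a query for “Freedom of Information” only matches pages that actually discuss it.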

Worse than indexing these areas is displaying the text as part of the result summary. Here the first line of every result is the navigation.

Sussex University search

Returning an unfamiliar layout for the results page can lead to confusion. The University of Sussex have a combined results page showing “top matches”, news, events, and “full text” stacked in one page.

Advanced search

Many site search engines offer some form of “advanced” search and this can be a very useful feature for power users to track down more precisely what they’re looking for. This extra functionality inevitably comes with the risk of extra complexity. With Google Search Appliances the advanced search page is similar to the one on the main Google site and provides some options that might not be necessary. If the entire site is on a single domain or the only language is English, why do you need the option to restrict searches by either of these? (Maybe this is a question Google should be answering, because they can tell from the index if there are multiple domains or languages.)

Topic specific searches provide potential for a whole extra level of complexity. Nottingham Trent University’s course finder requires you to enter level, mode and year of study, subject area and keywords. Why not just a simple box?

Advanced search can be very powerful and easy to use. The University of Warwick have just a few additional options, specifying dates using a drop down rather than a datepicker, and allowing searches to be limited to certain types of document or areas of the site. Edge Hill’s course search has just one “advanced” option – course type.

Tabbed Search

This is something Martin Belam urges caution over. The user may not understand the structure behind this scoping of search and thus limit the results more than they intended. Edge Hill uses tabs to provide scoped search for courses, news and events. Warwick also has a tabbed interface to pull in search results from their blogs and people finder. Care must be taken to ensure that scoping isn’t confused with site navigation and it makes sense to show visually where there are more results. Warwick use some nifty Ajax to load up the number of search results for each tab – an idea I liked so much I implemented it on the Edge Hill site too!
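The server side of those per-tab counts can be very small. The following Python sketch is not how Warwick or Edge Hill do it; it assumes a hypothetical list of documents, each tagged with a scope, and uses a naive substring match to return the JSON an Ajax call could use to label each tab.

```python
import json

SCOPES = ("courses", "news", "events")  # hypothetical tab names


def tab_counts(query: str, documents: list[dict]) -> str:
    """Return a JSON object mapping each tab to its number of matching documents."""
    needle = query.strip().lower()
    counts = {"all": 0, **{scope: 0 for scope in SCOPES}}
    for doc in documents:
        # Naive substring match stands in for a real search engine query.
        if needle and needle in doc["text"].lower():
            counts["all"] += 1
            if doc["scope"] in SCOPES:
                counts[doc["scope"]] += 1
    return json.dumps(counts)


docs = [
    {"scope": "courses", "text": "BSc Computing"},
    {"scope": "courses", "text": "MSc Computing and Networks"},
    {"scope": "news", "text": "New computing lab opens"},
    {"scope": "events", "text": "Open day"},
]
print(tab_counts("computing", docs))
```

A tab showing zero results can then be greyed out, so visitors are not tempted to scope themselves into an empty page.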

More neat ideas

News search on the University of Oxford is integrated right into their news system – results are presented in the same format as the news homepage showing the latest stories making the results pages familiar. Each summary is accompanied by a thumbnail and icons flag up news stories with audio or video attached.

Everyone has a custom 404 page these days to help direct users to what they’re looking for (you do, don’t you?!) so why are most search engines unhelpful when there are no results? At best they’ll correct your spelling and provide some hints on rewording your query – at worst they’ll just dump “No results found” to screen. This is turning visitors away when they most need our help. Most sites have an A to Z list or sitemap; prospective students can be helped by linking to the full list of courses.

Auto suggest for popular queries can be a good way to encourage users to search more accurately. I’ve not seen anything similar on HEI sites yet but Facebook’s search does a neat job of integrating specific people, groups and organisations with wider search – selecting these takes you direct to that page – a really handy quick links feature.
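A basic version of auto suggest needs nothing more than the existing search log. This Python sketch (the function name and sample log are made up) ranks previously seen queries that start with whatever the visitor has typed so far, most popular first.

```python
from collections import Counter


def suggest(prefix: str, query_log: list[str], limit: int = 5) -> list[str]:
    """Return the most frequently logged queries that start with the prefix."""
    prefix = prefix.strip().lower()
    if not prefix:
        return []
    # Normalise case and whitespace so "Computing" and "computing" count together.
    counts = Counter(q.strip().lower() for q in query_log)
    matches = [(q, n) for q, n in counts.items() if q.startswith(prefix)]
    # Most popular first; ties broken alphabetically for a stable order.
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [q for q, _ in matches[:limit]]


log = ["computing", "computing", "computer science", "Computing", "courses", "compsci"]
print(suggest("comp", log))
```

On a real site the counts would be precomputed rather than rebuilt on every keystroke, and it is worth only suggesting queries that are known to return results.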

Northumbria’s search tag cloud is a little out of the way but it would be possible to add in a top searches list somewhere a bit more visible. Storing a user’s search history would be very easy to do (even client side) and with a little logging enabled it would also be possible to implement “people who searched for X also searched for Y” functionality.
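The “also searched for” idea boils down to counting which queries turn up in the same visit. Here is a minimal Python sketch, assuming the log can already be grouped into one list of queries per visitor session (that grouping, and the names used, are assumptions for this example).

```python
from collections import Counter, defaultdict
from itertools import permutations


def also_searched(sessions: list[list[str]]) -> dict[str, Counter]:
    """Count, for each query, which other queries appeared in the same session."""
    related = defaultdict(Counter)
    for queries in sessions:
        # De-duplicate within a session so repeated searches count once.
        unique = {q.strip().lower() for q in queries if q.strip()}
        for x, y in permutations(unique, 2):
            related[x][y] += 1
    return related


sessions = [
    ["pgce", "teacher training"],
    ["pgce", "teacher training", "fees"],
    ["fees", "accommodation"],
]
related = also_searched(sessions)
print(related["pgce"].most_common(1))
```

Calling most_common on the counter for a query gives the list to show alongside its results.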

The final neat feature I came across was a few sites making use of Google’s KeyMatch feature. This is similar to Google AdWords but allows the site owner to specify what should show up for specified keywords. During my presentation someone suggested that users would blank these results because they see them as adverts. I’m not so sure – I think people will click the links if it’s clear where it’s going to. KeyMatch is a good way of making sure that important pages are at the top of the results even when the search algorithm doesn’t rank them.
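For sites without a Google Search Appliance, the same effect can be approximated in the results template. The Python sketch below is not Google's KeyMatch configuration or API, just an illustration of the idea: a hand-maintained mapping of keywords to pages (hypothetical here) is checked against the query and any matches are placed above the normal results.

```python
def with_promoted(query: str, results: list[str], key_matches: dict[str, str]) -> list[str]:
    """Put hand-picked pages for matching keywords ahead of the ranked results."""
    words = set(query.lower().split())
    promoted = [url for keyword, url in key_matches.items() if keyword.lower() in words]
    seen = set()
    ordered = []
    # Promoted pages first, then the engine's own ranking, without duplicates.
    for url in promoted + results:
        if url not in seen:
            seen.add(url)
            ordered.append(url)
    return ordered


key_matches = {"prospectus": "/study/prospectus"}  # hypothetical keyword mapping
results = ["/news/prospectus-launch", "/study/prospectus"]
print(with_promoted("order prospectus", results, key_matches))
```

Labelling the promoted link clearly, as suggested above, should help it read as a shortcut rather than an advert.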

All talk?

By now I’m sure you’re fed up of my holier-than-thou ranting about site search and I want to stress that what we’ve done at Edge Hill isn’t perfect by any means! The algorithm used for indexing and searching is quite crude in places but I think we’ve been able to improve the user experience by adding a few neat features and trying to keep in mind what the user is likely to search for, not just what a spider can find.

The search system we use is built on Zend Lucene and has had around a week of development hours over the last year but not everyone will be able to do that, so what can be done with existing resources? In most systems, changing templates to remove unnecessary features or alter the way results are formatted is quite trivial. Salford University have just launched a new site search powered by a Google Search Appliance and their results pages show what can be done with some clever XSLT – gone is the advanced search and look out for the nice file type icons using image replacement.

Finally, one last plug for Martin Belam’s Taking the ‘Ooh’ out of Google – he shows loads of simple ways to improve your results and it’s essential reading for anyone interested in site search.