Hacks meet Hackers

Francis Irving

It’s been a busy week, with the Institutional Web Management Workshop in Sheffield from Monday to Wednesday and WordCamp UK happening in Manchester this weekend, but on Friday I took a day off to pop down to a hack day in Liverpool.

The event was hosted by LJMU’s Open Labs at the Art and Design Academy in partnership with ScraperWiki and Trinity Mirror Merseyside (think Liverpool Daily Post and Echo, Ormskirk Advertiser, Southport Visiter and more!). The idea was that hacks (journalists) would meet hackers (coders, not to be confused with crackers, who break into systems!) for a day of working on datasets to produce something by the end of it.

The basic format was splitting into teams comprising a few hacks and a few hackers with an interest in a particular subject, being put into a “booth” for six hours and seeing what happened. The group I was in focused on Liverpool datasets – think doctors’ surgeries, educational statistics and so on.

I’d come across ScraperWiki before, at Liver and Mash, and while there was no requirement to use their system, it has recently added support for PHP alongside Python so I thought I’d give it a try. We found some data to scrape on the NHS website and set about building a scraper.
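
There’s nothing clever about the scraper itself; here’s a minimal sketch of the general shape. The URL and markup are placeholders rather than the real NHS pages, and I’m assuming ScraperWiki’s PHP helpers scraperwiki::scrape() and scraperwiki::save(), which mirror their Python equivalents:

    <?php
    // Rough sketch of the scraper. The URL and table markup below are
    // placeholders, not the real NHS pages.
    $html = scraperwiki::scrape("http://www.nhs.uk/example-listing");

    // Real-world HTML is rarely valid, so silence the parser warnings.
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    // One record per table row (hypothetical markup).
    foreach ($xpath->query("//table[@class='results']//tr") as $row) {
        $cells = $row->getElementsByTagName("td");
        if ($cells->length < 2) {
            continue; // skip header rows
        }
        scraperwiki::save(
            array("name"), // unique key
            array(
                "name"     => trim($cells->item(0)->textContent),
                "postcode" => trim($cells->item(1)->textContent),
            )
        );
    }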

The chaps at ScraperWiki would be the first to admit that their support for PHP is still very much beta, so it was a little harder than I expected. Eventually I got it scraping a set of data and used Yahoo! Pipes to add location data to allow it to be mapped. Here’s what it looks like in Google Earth alongside school and transport datasets:

Google Earth Three Layers

Okay, so not terribly exciting but it was useful to have a go at ScraperWiki and get an idea of some of the things that can be done with it. You can find my scraper on the ScraperWiki site; the Pipe is also available.

I think it was also very interesting to get journalists to meet coders. A few weeks ago I heard someone (possibly Alison Gow) say that you can’t get a job at the Guardian without talking about data, and it’s becoming an increasingly important part of journalism. No longer is it enough to simply report the news or spout opinion – being open about where your data comes from can be just as important. So it was really good to see Trinity Mirror taking this so seriously.

When I raised the idea of using DBpedia (and hence Wikipedia) data, someone in my team asked how reliable it was and whether it could be trusted. My response was to point out that most Wikipedia articles cite their sources, and to ask how many news stories do the same!

I’m getting off topic now so I’ll leave it there! ScraperWiki are running a series of Hack Days across the UK (and beyond) so if you’re interested, make sure you sign up!

Social Media Café Liverpool

Last night was Liverpool’s first Social Media Café at Static. SMCs are nothing new – they’ve been running in cities around the UK, and the world, for a while – but it’s good to see one happening closer to home.

The format for the evening was three speakers with generous breaks between to grab a beverage and “network”. The organisers got some great talks:

Alison Gow: Data and the art of storytelling

From Alison we learn that you can’t get a job at the Guardian without talking about data! Alison has written up a blog post about her talk so go read it!

Josh: How to win Foursquare friends and influence people

How to win @foursquare friends and influence people by @technicalfault

Josh is involved in organising Social Media Café Manchester and popped down the road to talk about Foursquare. Once again, Josh has blogged about the subject so go read that.

I’ve got a blog post in draft (which has fallen foul of my 48-hour rule) about Foursquare and how we might be able to use it as a University. Hopefully I’ll be inspired to look at it again and publish it in the next week or two.

David Coveney: Social media and work

@davecoveney #smvliv #smday

Final talk of the night was Dave Coveney talking about how work and social media mix. Once again his slides – as a Prezi – are online. They probably make about as much sense as Dave’s talk, and I say that as a compliment! It was a very engaging walk through the history of social media (anyone remember CIX?) and how he makes use of social media personally, with the business as a side effect.

So overall a great first SMC Liverpool. There was some discussion about the direction to take the events but it will probably be a monthly thing. I’ve added the hashtag #smcliv to TwapperKeeper so you should be able to read through the archive of tweets there as it fills up.

Liver and Mash

I’ve already blogged about my own Mashed Library Liverpool talk but I promised to say something about the rest of the event, so here goes!

Mandy Phillips and Owen Stephens

Mandy Phillips kicks off Liver and Mash

The day kicked off with welcome and introductions from Mandy and Owen. I’d heard bits about Mashed Library events before and I knew the basics of mashups, but I didn’t really know who would be there or what to expect. There was a good mix of attendees and speakers presenting “lightning talks”, “Pecha Kucha 20:20” talks and workshops. The thing that persuaded me to agree to speak – and convinced me that it wouldn’t just be a bunch of librarians (!!) – was the scattering of local speakers…

Alison Gow

Alison Gow

Alison is Executive Editor (Digital) for Trinity Mirror Merseyside, publishers of the Liverpool Daily Post and Echo. Despite “knowing” her through the Twitter, Friday’s Mashed Library event was the first time I’d met her IRL! The slides of her talk, “Open Curation of Data”, are online, covering some of the things journalists and the newspaper industry have had to deal with since the superinterweb came along.

Aidan McGuire and Julian Todd

Julian Todd and Aidan McGuire on ScraperWiki

Aidan and Julian demonstrated ScraperWiki, a project supported by 4iP which aims to free data from inaccessible sources and make it available to those who wish to use it in new and innovative ways, for example in mashups. “Screen scraping” isn’t a new idea, but typically it’s done by individuals and embedded into their own systems. If the scraped website changes then the feed breaks, and there’s no way for others to build on the work done.

ScraperWiki aims to change that by providing a community-driven home for storing scrapers. It’s like Wikipedia for code, allowing you to take a scraper I’ve written and modify it for your own purposes.

There are already dozens of scraped data sources and more are being added every day. It currently supports Python but my language of choice – PHP – will be added soon so I’ll be giving it a go then.

John McKerrell

John McKerrell on Mapping

John’s talk about mapping attracted the most interest so he presented it to all attendees, briefly covering mapping APIs, OpenStreetMap and tracking your location with mapme.at.

Phil Bradley

The first Pecha Kucha 20:20 talk was about social media search tools. I wasn’t writing down the links, so check Phil’s Slideshare page for the presentation when it comes out. I will say that Google’s support for Twitter is now much better than he seemed to suggest – for example, allowing you to drill into tweets from a particular time. It can also be more reliable than search.twitter.com when using shared IP addresses at a conference.

Gary Green

Gary Green 20/20 talk

Gary mentioned that this was his first presentation, so I’m not sure a 20:20 talk was the best idea, but he handled it pretty well!

Tony Hirst

Tony Hirst talking about Yahoo! Pipes

The afternoon was dedicated to one of three workshops – Arduino with Adrian McEwen, Mapping with John McKerrell or Mashups with Tony Hirst. I’ve done a bit of each before, so I sat at the back of Tony’s session to try to soak up some new tips.

After a final cake break there was the prize giving for the mashup suggestions competition.

@briankelly, @m8nd1 and @ostephens presenting prizes

So all in all a really interesting day! Congratulations to Mandy Phillips and all the organising team for an excellent event.

Create a better search engine than Google

The findings and opinions contained within this post are entirely mine and do not necessarily reflect those of Edge Hill University. The research and write up was done in my own time and is only posted here because it may be of interest to the HE community.

At BarCamp Liverpool I gave a talk about site search. It covered the same sort of topics as Martin Belam’s Euro IA Summit presentation Taking the ‘Ooh’ out of Google and I recommend that anyone interested in site search read his series of blog posts. Go read it now – it’s way better than what I’m going to write here! While Martin reviewed news websites from across Europe, I’ve turned my attention to university websites, looking at what institutions are doing now and how it could be improved.

Do we really need site search?

Before we get too tied up in what to do, it’s perfectly valid to question whether we even need a site search engine any more. Every internet user knows how to search – specifically, everyone knows how “to Google”. The average query length has almost doubled over the last few years, and there is some evidence to show people are using site names to restrict their search – for example “english pgce edge hill”.

Google have also introduced “search this site” boxes into search results, which means that when you search for “Edge Hill” you can filter down your results by making an extra query. Putting “english pgce” into this search box gives results for the query “english pgce site:edgehill.ac.uk”. This feature is already present for most HEIs:

Google Site Search for Edge Hill University

So how many people are using this feature? Our Google Analytics reports show that in the last six months, just 150 visitors to our site came from searches including “site:edgehill.ac.uk”. A much higher number of people are including “Edge Hill” in queries. Excluding searches which are looking directly for us – for example “Edge Hill University”, “www.edgehill.ac.uk”, or worst of all “edgehill uni” – around 10% of Google referrals come with some form of restriction to search our site.

How does this compare to our own site search? We’ve had 270,000 searches in the last six months, with 7% of visitors using it. Additionally, the point where people want to search is usually after they’ve left Google and come to our site. Someone looking for courses, for example, isn’t going to go back to Google to search – they need search within the context of the pages they’re looking at.

State of the Union Universities

One weekend in November I spent a few hours going through HEI websites and testing their search engines. On each site I searched for “computing” – most universities offer some form of computing course, and it would typically find their IT Services department too. I noted down which system they used for search, and whether they provided the ability to search just courses, news and events. All quite basic, but it still offered up some interesting findings.

Google Search Appliance by Adriaan Bloem

Over the last few years most universities have looked to Google to provide search. 63% of HEIs now use some form of Google-powered search engine. Most of those have bought a Google Search Appliance (or Google Mini) – a server you install locally which provides your own miniature Google search engine. The interface and results are familiar to users – it’s clean and quick.

Others are using either Google Custom Search Engine or what I’ve recorded as “Google Syndicated Search” – both similar services where Google spiders your site remotely with no requirement to run your own server. CSE allows you to embed the results within your own page, while Syndicated Search allows basic branding to be applied to results displayed on a Google domain.

In a distant second place is ht://Dig – a system that’s been around for years. When I first came across it many years ago I was amazed that I could run my own search engine. It was one of the first widely deployed site search systems, so a number of universities were early adopters. It hasn’t moved with the times, though – the latest release of ht://Dig was in June 2004.

The short tail includes Ultraseek, Egothor and Novell’s QuickFinder – the system that powered Edge Hill’s site search until April this year. A few sites use search engines built into their CMS or custom-built systems, but it’s hard to find out much about these.

This is all very good, but in many ways the engine doesn’t matter – it’s what you do with it that counts!

The ugly, the bad and the good

Google has led the way with clean, uncluttered results pages, but some search engines haven’t learnt that less is more and seem compelled to pack in every feature they can. Star ratings or other indicators of the quality of a page add little and can be distracting. Users already expect that a page higher up the list is a closer match to their query, so an explicit score may not be necessary. On our own search we show a percentage relevance, but I’m not sure it adds much value for the visitor so it might be getting cut soon.

The above example from Keele also shows another problem: some search engines blindly index the full content of pages. We know exactly which parts of a page are relevant and which are navigation bars, headers or footers. Including these in the index can distort results while adding little value. A search for “Freedom of Information” might return 10,000 pages simply because it’s linked from the footer of every page.

Worse than indexing these areas is displaying the text as part of the result summary. Here the first line of every result is the navigation.
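
If your indexer lets you pre-process pages, stripping this furniture out before indexing takes only a few lines. A minimal sketch in PHP, assuming pages mark up their navigation, header and footer with ids (the ids here are hypothetical; substitute whatever your own templates use):

    <?php
    // Strip page furniture before handing content to the indexer, so
    // navigation, headers and footers can't distort rankings or leak
    // into result summaries.
    function extract_main_content($html)
    {
        $dom = new DOMDocument();
        @$dom->loadHTML($html); // silence warnings from real-world markup
        $xpath = new DOMXPath($dom);

        $furniture = $xpath->query("//*[@id='nav' or @id='header' or @id='footer']");
        // Copy the node list first so removals don't disturb iteration.
        foreach (iterator_to_array($furniture) as $node) {
            $node->parentNode->removeChild($node);
        }

        $body = $dom->getElementsByTagName("body")->item(0);
        return $body !== null ? trim($body->textContent) : "";
    }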

Sussex University search

Returning an unfamiliar layout for the results page can lead to confusion. The University of Sussex have a combined results page showing “top matches”, news, events, and “full text” stacked in one page.

Advanced search

Many site search engines offer some form of “advanced” search, and this can be a very useful feature for power users to track down more precisely what they’re looking for. This extra functionality inevitably comes with the risk of extra complexity. With Google Search Appliances the advanced search page is similar to the one on the main Google site and provides some options that might not be necessary. If the entire site is on a single domain, or the only language is English, why do you need the option to restrict searches by either of them? (Maybe this is a question Google should be answering – they can tell from the index whether there are multiple domains or languages.)

Topic-specific searches provide potential for a whole extra level of complexity. Nottingham Trent University’s course finder requires you to enter level, mode and year of study, subject area and keywords. Why not just a simple box?

Advanced search can be both powerful and easy to use. The University of Warwick have just a few additional options, specifying dates using a drop-down rather than a datepicker, and allowing searches to be limited to certain types of document or areas of the site. Edge Hill’s course search has just one “advanced” option – course type.

Tabbed Search

This is something Martin Belam urges caution over. The user may not understand the structure behind this scoping of search and may thus limit the results more than they intended. Edge Hill uses tabs to provide scoped search for courses, news and events. Warwick also has a tabbed interface to pull in search results from their blogs and people finder. Care must be taken to ensure that scoping isn’t confused with site navigation, and it makes sense to show visually where there are more results. Warwick use some nifty Ajax to load up the number of search results for each tab – an idea I liked so much I implemented it on the Edge Hill site too!
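
The server-side half of that trick is tiny: a single endpoint returning the count for each scope as JSON, which the tabs can fetch in one Ajax call. A rough sketch (search_count() is a hypothetical helper wrapping whatever search backend you use, and the scope names are illustrative):

    <?php
    // Return the number of results per search scope as JSON, so tabs
    // can be annotated without running every full search up front.
    $query  = isset($_GET['q']) ? $_GET['q'] : '';
    $scopes = array('site', 'courses', 'news', 'events');

    $counts = array();
    foreach ($scopes as $scope) {
        $counts[$scope] = search_count($query, $scope);
    }

    header('Content-Type: application/json');
    echo json_encode($counts);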

More neat ideas

News search on the University of Oxford site is integrated right into their news system – results are presented in the same format as the news homepage, showing the latest stories and making the results pages familiar. Each summary is accompanied by a thumbnail, and icons flag up news stories with audio or video attached.

Everyone has a custom 404 page these days to help direct users to what they’re looking for (you do, don’t you?!), so why are most search engines unhelpful when there are no results? At best they’ll correct your spelling and provide some hints on rewording your query – at worst they’ll just dump “No results found” to the screen. This turns visitors away when they most need our help. Most sites have an A to Z list or sitemap; prospective students can be helped by linking to the full list of courses.

Auto-suggest for popular queries can be a good way to encourage users to search more accurately. I’ve not seen anything similar on HEI sites yet, but Facebook’s search does a neat job of integrating specific people, groups and organisations with wider search – selecting one of these takes you directly to that page, a really handy quick-links feature.

Northumbria’s search tag cloud is a little out of the way, but it would be possible to add a top searches list somewhere a bit more visible. Storing a user’s search history would be very easy to do (even client side) and, with a little logging enabled, it would also be possible to implement “people who searched for X also searched for Y” functionality – see the sketch below.
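
For the curious, a rough sketch of what that logging could look like. The table and column names are hypothetical, the SQL is MySQL-flavoured, and $db is assumed to be an existing PDO connection:

    <?php
    // Log each query against the visitor's session, then surface the
    // queries that most often co-occur in the same sessions.
    session_start();

    // 1. Record the search.
    $stmt = $db->prepare(
        "INSERT INTO search_log (session_id, query, searched_at)
         VALUES (?, ?, NOW())"
    );
    $stmt->execute(array(session_id(), $query));

    // 2. "People who searched for X also searched for Y": the queries
    //    most often appearing in the same sessions as the current one.
    $stmt = $db->prepare(
        "SELECT other.query, COUNT(*) AS freq
           FROM search_log AS this
           JOIN search_log AS other
             ON other.session_id = this.session_id
            AND other.query <> this.query
          WHERE this.query = ?
          GROUP BY other.query
          ORDER BY freq DESC
          LIMIT 5"
    );
    $stmt->execute(array($query));
    $related = $stmt->fetchAll(PDO::FETCH_ASSOC);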

The final neat feature I came across was a few sites making use of Google’s KeyMatch feature. This is similar to Google AdWords but allows the site owner to specify what should show up for specified keywords. During my presentation someone suggested that users would blank these results because they see them as adverts. I’m not so sure – I think people will click the links if it’s clear where they lead. KeyMatch is a good way of making sure that important pages are at the top of the results even when the search algorithm doesn’t rank them highly.

All talk?

By now I’m sure you’re fed up of my holier-than-thou ranting about site search and I want to stress that what we’ve done at Edge Hill isn’t perfect by any means! The algorithm used for indexing and searching is quite crude in places but I think we’ve been able to improve the user experience by adding a few neat features and trying to keep in mind what the user is likely to search for, not just what a spider can find.

The search system we use is built on Zend Lucene and has had around a week of development hours over the last year, but not everyone will be able to do that, so what can be done with existing resources? In most systems, changing templates to remove unnecessary features or alter the way results are formatted is quite trivial. Salford University have just launched a new site search powered by a Google Search Appliance, and their results pages show what can be done with some clever XSLT – gone is the advanced search, and look out for the nice file type icons using image replacement.
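
Back on the Zend Lucene side, part of the appeal is how little code it needs to get going. A minimal sketch of indexing and searching – the paths, URL and field names are illustrative only, not our actual code:

    <?php
    // Indexing and searching with Zend_Search_Lucene (Zend Framework 1).
    require_once 'Zend/Search/Lucene.php';

    // Create (or overwrite) an index on disk.
    $index = Zend_Search_Lucene::create('/tmp/site-index');

    $doc = new Zend_Search_Lucene_Document();
    // Store the URL without tokenising it; index title and body as text.
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('url', 'http://example.ac.uk/courses/computing'));
    $doc->addField(Zend_Search_Lucene_Field::Text('title', 'BSc Computing'));
    $doc->addField(Zend_Search_Lucene_Field::UnStored('body', 'Full page text goes here...'));
    $index->addDocument($doc);
    $index->commit();

    // Query it: each hit carries a relevance score plus the stored fields.
    $hits = Zend_Search_Lucene::open('/tmp/site-index')->find('computing');
    foreach ($hits as $hit) {
        echo $hit->score, ' ', $hit->title, ' ', $hit->url, "\n";
    }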

Finally, one last plug for Martin Belam’s Taking the ‘Ooh’ out of Google – he shows loads of simple ways to improve your results and it’s essential reading for anyone interested in site search.

Ian Forrester on Backstage

Last night I went down to my second GeekUp meet at 3345 Parr Street in Liverpool. If you don’t know what GeekUp is, here’s a quote from the website:

GeekUp is a community of web designers, web developers, and other tech-minded folk from the North West. Our socials take place once a month in Leeds, Liverpool, Manchester, Preston and Sheffield; they are always a lively place to share ideas and spread a little knowledge.

Ian Forrester. Creative Commons licenced by Gavin Bell

There’s usually a couple of talks before moving to the bar for chat and beer, and this month’s talk was by Ian Forrester, the man behind backstage.bbc.co.uk.

Backstage is a community built around data made available by the BBC. It encourages the public to make use of the data for cool stuff and highlights what the Beeb is offering. I’m not going to go into all the prototypes which have come out of backstage or list the feeds and APIs they advertise – you can find that out from the website – but there are other interesting things going on as well!

Since backstage started, its focus has been on feeds and APIs, but that seems to be changing now. They’ll soon be starting a fortnightly online show featuring interviews with the tech community, introducing the work people are doing and explaining the web in a bit more detail than BBC Webwise. This will be done on a shoestring, but with help from other areas of the BBC (such as Click) they hope to maintain high production standards.

At a slightly larger scale, backstage are joining up with IT Conversations to record speakers at UK-based conferences. Traditionally there’s been a notable US bias in this kind of material, so it will be great to see a bit more variety among the speakers.

The final thing (that I’ll talk about) is the support backstage, and Ian himself, are giving to tech events. While living in London, Ian organised BarCamps and GeekDinners and supported dozens of other events. With his move to Manchester, he’ll be shifting some of his attention to what the North can offer. There was discussion of starting GeekDinners in Manchester (not as a direct competitor to GeekUp, it should be noted), and other web/tech events in the North are getting backstage support and sponsorship.

So a very interesting and informative GeekUp Liverpool this month, and very different to the last one I attended. There’s a great community of web developers and designers in and around Liverpool, and GeekUp can play an important part in bringing people together, so I’d encourage anyone who works with the web professionally (or just has a strong interest in technology) to pop along and see what it’s all about.
