Shorter URLs

Recently we successfully registered an additional domain name – ehu.ac.uk – for the University. Rather than simply using this as an additional alias for the main website addresses, we’re using it to provide a URL shortening service.

URL shortening services are nothing new – TinyURL was launched in 2002 – but while for years they were used to shorten web addresses in emails, with the advent of Twitter and its 140-character limit these services have gained new popularity.

These services do have some major problems, however. Notably, what happens if a service goes out of business, either through running out of money or through the owner of its top-level domain cancelling it? This has led many people to consider running their own service, and now that we have a nice short URL, we’re following suit.

We are using the popular YOURLS system, written in PHP, with some custom plugins:

  • Lowercase URLs: we want short URLs to be case insensitive so that it doesn’t matter how people type them in (see the sketch after this list)
  • Top level URLs keep their keyword for our main domain name, so www.edgehill.ac.uk/english maps to ehu.ac.uk/english
  • For our own domain names we add Google Analytics campaign keywords, allowing us to determine where traffic comes from
  • URLs can be modified with just three extra characters to include a source, which is then passed through as a Google Analytics medium
  • QR codes are available for all short URLs by simply adding .qr to the keyword
  • Certain keywords relate to the type of content; for example, undergraduate courses have been seeded with their UCAS code, e.g. ehu.ac.uk/g401 is BSc Computing
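To give a flavour of how lightweight these plugins are, here’s a minimal sketch of the lowercasing one. It assumes YOURLS’s plugin hook API and a get_request filter on the requested keyword – the hook usage and function names here are illustrative rather than our production code:

<?php
/*
Plugin Name: Lowercase Keywords
Description: Treat short URL keywords as case insensitive
*/

// Lowercase the requested keyword before YOURLS looks it up, so that
// ehu.ac.uk/English and ehu.ac.uk/english resolve to the same long URL.
yourls_add_filter( 'get_request', 'ehu_lowercase_keyword' );

function ehu_lowercase_keyword( $request ) {
    return strtolower( $request );
}

The other plugins work along similar lines.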

This service is currently in beta for use with the new prospectus, but we’ll be making further use of it in the near future – for example, exposing short URLs for pages within GO.

Let us know if you have any ideas for other things we can do with this service!

Bad URLs Part 4: Choose your future

So we’ve looked at examples of bad and good URLs, and at the current situation at Edge Hill, but what are we doing about it?

Lots, you’ll be pleased to know! As part of the development work we’ve been doing for the new website design, I’ve been taking a long hard look at how our website is structured and plan to make some changes. There are two areas to the changes – ensuring our new developments are done properly and changing existing areas of the site to fit in with the new structure.

First, the new developments. We’re currently working on three systems – news, events and prospectus. News was one of the examples I gave last time where we could make some improvements, so let’s look at how things might change.

Firstly, all our new developments are being brought onto the main Edge Hill domain – www.edgehill.ac.uk – and each “site” placed in a top level address:

http://info.edgehill.ac.uk/EHU_news/
becomes:
http://www.edgehill.ac.uk/news

News articles will drop references to the technology used and the database IDs:

http://info.edgehill.ac.uk/EHU_news/story.asp?storyid=765
becomes:
http://www.edgehill.ac.uk/news/2008/01/the-performance-that-is-more-canal-than-chanel

In this example the new URL is actually longer than the old one, but I can live with that because it’s more search engine friendly and the structure is human-readable. For example we can guess that the monthly archive page will be:

http://www.edgehill.ac.uk/news/2008/01
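Producing those addresses means turning each headline into a URL-safe slug. Here’s a minimal sketch of the idea in PHP (the exact rules we end up using may differ):

<?php
// Turn a headline into a lowercase, hyphen-separated slug
// suitable for use in a URL.
function slugify( $title ) {
    $slug = strtolower( $title );
    $slug = preg_replace( '/[^a-z0-9]+/', '-', $slug );
    return trim( $slug, '-' );
}

echo slugify( 'The performance that is more Canal than Chanel' );
// prints: the-performance-that-is-more-canal-than-chanel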

None of this is new – for the first few years of the web most sites had a pretty logical structure – but it’s something that has been lost in the move to Content Management Systems.

The online prospectus is getting similar treatment. Where courses are currently referenced by an ID number, the URL will instead contain the course title:

http://info.edgehill.ac.uk/EHU_eprospectus/details/BS0041.asp
becomes:
http://www.edgehill.ac.uk/study/courses/computing

As part of our JISC-funded mini-project, we’ll be outputting XCRI feeds from the online prospectus. The URLs for these will be really simple – just add /xcri to the end of the address:

http://www.edgehill.ac.uk/study/courses/xcri
http://www.edgehill.ac.uk/study/courses/computing/2009/xcri

In the news site, feeds of articles, tags, comments and much more will be available simply by adding /feed to the URL. The same will apply to search results.
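To illustrate how a suffix view can work – this is a plain PHP sketch, not the actual symfony implementation – the format is simply split off the end of the path before the normal page lookup happens:

<?php
// Split a trailing view suffix (/xcri, /feed) off the request path so
// /study/courses/computing/xcri serves the same resource as
// /study/courses/computing, just rendered in a different format.
function split_view( $path, $views = array( 'xcri', 'feed' ) ) {
    $parts = explode( '/', trim( $path, '/' ) );
    $view  = 'html';
    if ( count( $parts ) > 1 && in_array( end( $parts ), $views ) ) {
        $view = array_pop( $parts );
    }
    return array( implode( '/', $parts ), $view );
}

list( $resource, $view ) = split_view( '/study/courses/computing/xcri' );
// $resource = 'study/courses/computing', $view = 'xcri'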

All this is great for the new developments, but we do have a lot of static pages that won’t be replaced. Instead, these pages will move to a flatter URL structure. For example, the Faculty of Education site will be available directly through what is currently its vanity URL, meaning that most subpages also get a nice URL:

http://www.edgehill.ac.uk/Faculties/Education/Research/index.htm
becomes:
http://www.edgehill.ac.uk/education/research

Areas of the site which were previously hidden away three or four levels deep will be made more accessible through shorter URLs.

How are we doing this? The core of the new site is a brand new symfony-based application. This allows us to embed our dynamic applications – news, events and prospectus – more closely into the site than has previously been possible. symfony lets you define routing rules which, although they look complex on the backend because of the way they mix pages from different applications together, produce a uniform structure across the site.
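In symfony itself these rules live in a routing configuration file rather than in code like this, but the idea, roughly sketched in PHP with made-up patterns and action names, is that each rule maps a public URL pattern onto an internal module and action:

<?php
// Each routing rule maps a public URL pattern onto an internal
// module/action, so three separate applications still present one
// uniform URL structure to visitors.
$routes = array(
    '#^/news/(\d{4})/(\d{2})/([a-z0-9-]+)$#' => 'news/article',
    '#^/news/(\d{4})/(\d{2})$#'              => 'news/archive',
    '#^/study/courses/([a-z0-9-]+)$#'        => 'prospectus/course',
);

$path = parse_url( $_SERVER['REQUEST_URI'], PHP_URL_PATH );
foreach ( $routes as $pattern => $action ) {
    if ( preg_match( $pattern, $path, $params ) ) {
        // dispatch to $action, passing the captured $params
        break;
    }
}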

For our existing content we’re using a combination of lookup tables in our symfony application and some Apache mod_rewrite rules to detect requests for old addresses. All the existing pages will be redirected to their new locations, so any bookmarks will continue to work and search engines will quickly find the new versions of pages.
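The lookup-table half is conceptually very simple – something along these lines, where the table maps each old location to its new home (only the Education example below is real):

<?php
// Map old static page locations to their new, flatter addresses and
// issue a permanent (301) redirect so bookmarks keep working and
// search engines transfer their rankings to the new pages.
$moved = array(
    '/Faculties/Education/Research/index.htm' => '/education/research',
    // ... one entry per relocated page
);

$path = parse_url( $_SERVER['REQUEST_URI'], PHP_URL_PATH );
if ( isset( $moved[ $path ] ) ) {
    header( 'Location: http://www.edgehill.ac.uk' . $moved[ $path ], true, 301 );
    exit;
}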

That’s all for this little series of posts about URLs. Hopefully it has helped explain some of my thinking behind the changes. If you’ve got any questions then drop me an email or post a comment.

Bad URLs Part 3: Confessions time

Over the last couple of posts I’ve been looking at URLs, good and bad. Now it’s time to examine what we do at Edge Hill and see how we fare!

Most of our website is currently made up of static pages so it looks something like this:

http://www.edgehill.ac.uk/Faculties/Education/index.html

Other areas of the site aren’t quite so great:

http://www.edgehill.ac.uk/Faculties/FAS/English/History/index.html

Not terribly bad, but it’s a little bit long – I wouldn’t like to read it out over the phone – and because the URL is structured to mirror the department hierarchy, when names change the URL could change as well.

The site structure is quite deep, which has led to some quite strange locations for pages. For example, the copyright page linked to from every page on the site sits within the Web Services area:

http://www.edgehill.ac.uk/Sites/ITServices/WebServ/Copyright.htm

For use in publications, there’s a whole bunch of “vanity URLs” like this:

http://www.edgehill.ac.uk/education

And that will redirect you to the page you’re looking for. These are great – easy to read over the phone or type in – but since most of them force a redirect to the actual page, most people don’t know about them: if you copy and paste into a document you’re producing, you’ll get the long URL. They’re also not universal – short URLs exist for some departments and services but not others – and sometimes it can be hard to pick a good vanity URL.

When we look at some of our dynamic content, however, things take a turn for the worse.

http://info.edgehill.ac.uk/EHU_news/article.asp?id=4786

What’s wrong with this?

  • Mystery “info” server – splitting page rankings on Google
  • It tells the user what language we’re using for the page (ASP)
  • Meaningless ID numbers
  • EHU_news – why not just news?
  • Nothing that search engines can pick up on for keywords

The first site I worked on for Edge Hill – Education Partnership – is a bit of a mixed bag:

https://go.edgehill.ac.uk/ep/static/primary-mentors

It’s fairly readable, with words describing the pages rather than ID numbers, but it’s hosted as part of the GO website despite being in the main website template. There’s also the “static” in there, which is a by-product of the implementation rather than being relevant to the URL. I’ve learnt my lesson and it won’t happen again.

Overall I think we score 5/10 – no nightmare URLs, but lots of scope for improvement. Next time I’ll be looking at some of the plans to change how our sites are structured, and maybe get a little technical about how we’re implementing it!

Bad URLs Part 2: The Beauty of URLs

Last time I gave some examples of awful URLs, but not everyone gets it wrong. Let me give you some examples of truly beautiful URL structures and explain their benefits!

Ask Auntie

If you ask almost any UK-based web developer for a list of the best-produced websites, the Beeb will be pretty high up. They do a lot of things very well – and you’d expect so with their budget! URLs are just one example. Think of a major TV programme on the BBC, add its name after bbc.co.uk, and 95% of the time that’s the address of the website. Try it out…

http://www.bbc.co.uk/newsnight

Considering the size of the BBC site, they seem to have a very well organised structure. Not too many levels deep – usually only one or two – and URLs stay around for a very long time. Check out Politics 97, or the Diana Remembered website. See how even when names change the content follows – bbc.co.uk/childrens now takes you to an index page linking to CBeebies and CBBC.

A new development from the BBC, still in beta, is even more impressive. Their new BBC Programmes site is an index of every TV and radio programme shown on BBC stations. For each series it lists episodes and scheduling information. Great, but didn’t the channel listings do this already? No – those only showed the next week and didn’t contain an archive; the new site gives every series, episode and showing a unique, permanent URL.

Programmes are represented by a short alphanumeric identifier rather than their full name:

http://www.bbc.co.uk/programmes/b006mk25

This has the advantage of being short but is hard to predict. In one of the comments on their introduction to the programmes site (and some other cool stuff they do with URLs), Michael Smethurst explains the reasoning behind their chosen structure:

We thought long and hard about the best way to make programmes addressable and, as ever, there’s no perfect solution. So…

…no channel cos not only do episodes get broadcast on multiple channels they can also change “channel ownership” over time.
[…]
and no brand > series > episode cos so many programmes don’t fit this model.
[…]
We’d love to have made human readable/hackable AND persistent urls (and have on the aggregation pages) but it just wasn’t possible

There’s another cool feature of BBC Programmes mentioned in that post:

We’re also working on additional views so that in the near future by adding .json, .mobile, .rss, .atom, .iCal or .yaml to the end of the URL will give you that resource in that format.

You might not know (or care!) what each of those formats is, but what it means for every user is that they’re free to take the information that the BBC provide and use it within their own system. Already there is microformatted information embedded into every page.

Train Times done right

Another fantastic example of beautiful URL structure comes from traintimes.org.uk. This site is an alternative to the awful official site which provides rail information. It offers a fully accessible interface to train times and fares in a format much easier to browse and navigate than National Rail. But alongside the forms letting you search is some URL magic. Say you want to travel from Liverpool to London – simply tag it on to the end of the URL:

http://traintimes.org.uk/liverpool/london

Not leaving right now. Okay…

http://traintimes.org.uk/liverpool/london/20:30

Not leaving today? That’s fine too:

http://traintimes.org.uk/liverpool/london/08:00/wednesday

Want the price?

http://traintimes.org.uk/liverpool/london/08:00/wednesday/fares

The Train Times site has so much flexibility – you can use station codes instead of the full name and it will recognise a variety of date formats. National Rail could learn a lot!
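Under the hood, a scheme like this simply treats path segments as positional search parameters. Roughly, in PHP – a guess at the idea, not their actual code:

<?php
// Read the path segments in order: from, to, time, day, view.
// Missing segments are padded with null and fall back to defaults.
list( $from, $to, $time, $day, $view ) = array_pad(
    explode( '/', trim( '/liverpool/london/08:00/wednesday/fares', '/' ) ),
    5,
    null
);
// $from = 'liverpool', $to = 'london', $time = '08:00',
// $day = 'wednesday', $view = 'fares'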

That’s enough examples for now, but there will be more later on. Next time I’ll be looking at Edge Hill’s URLs and seeing what we’re doing right, but more importantly where we can improve.

Bad URLs Part 1: When URLs go bad

The humble URL is one of the most unloved things on the internet, yet without it there wouldn’t be a World Wide Web.

For the less techie out there, URLs are web addresses such as http://www.edgehill.ac.uk/. They identify every web site, page, image and video on the internet, and on the whole they’ve done a pretty good job since the web began.

In the beginning things were simple. You put a bunch of web pages in some directories on your server and there they were on the interweb. When you uploaded a page it would likely stay there forever. As the web grew, content moved from being static to dynamically generated and this is where it all started to go wrong.

Developers created ways of generating pages using scripts to pull information out of databases or from user input. But, as developers have a habit of doing, they got caught up in the technology and lost sight of the user.

Have you ever looked at a web address and thought it was a foreign language? PHP, ASP, JSP, .do at the end of file names – these all indicate the scripting language used to create the website. I might find this interesting, but I bet 99% of people don’t!

Then there’s the query string – that’s the bit after the question mark in a URL. It tells the script extra information it might need about the page you want. Very important, and certainly not bad in itself, but too often useless extra information is passed in, which means URLs end up too long and several subtly different URLs might actually return the same result.
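For example, story.asp?id=42&session=abc and story.asp?session=xyz&id=42 (hypothetical parameters) could easily return exactly the same page. Here’s a sketch of how a script might canonicalise its query string so the duplicates collapse into one address:

<?php
// Keep only the parameters that actually select the page, and sort
// them, so equivalent query strings collapse to a single address.
// The parameter names are made up for illustration.
function canonicalise( $query, $keep = array( 'id' ) ) {
    parse_str( $query, $params );
    $params = array_intersect_key( $params, array_flip( $keep ) );
    ksort( $params );
    return http_build_query( $params );
}

echo canonicalise( 'session=xyz&id=42&utm_source=email' ); // prints: id=42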

Ugly, long and overly complex URLs are something that’s bothered me for quite a while. In the past I’ve created sites with some truly awful URL structures and it’s not big or clever – now I’m committed to doing things right. This is a topic that’s been discussed for a very long time – TBL’s Cool URIs don’t change is a decade old; more recently, Search Engine Optimisation rather than the idealistic goal of a pure site structure has been the main driver for clean URLs.

Let me give a few examples of Bad URLs. First up is Auto Trader:

http://search.autotrader.co.uk/es-uk/www/cars/FORD+KA/Ne-2-4-5-6-7-8-27-44-49-53-61-64-67-103-133-146,N-19-29-4294966844-4294967194/advert.action?R=200804302411772&distance=24&postcode=L39+4QP&channel=CARS&make=FORD&model=KA&min_pr=500&max_pr=5000&max_mileage=

You won’t be able to see the full link, but it contains loads of pointless extra information when all I want is to see the details of a car.

Often Content Management Systems – which are designed to make the creation of websites easier – are one of the main culprits in creating bad URLs. Brent Simmons has it pinned with this insightful comment:

Brent’s Law of CMS URLs: the more expensive the CMS, the crappier the URLs.

The example given is StoryServer by Vignette, which produces the bizarre-looking:

http://news.sky.com/skynews/article/0,,30200-1303092,00.html

I’m fairly sure they don’t have 302,001,303,092 stories on Sky News!

That’s all for now – next time I’ll be looking at some things being done right and the benefits they bring. If you have any examples of really bad URLs, post them in a comment (that’s not an invitation to spammers!) and see who can find an example with the most bad features.
