For me, regular expressions are like magic. I can appreciate the wonderful things they can do, but I’ve never really understood how the trick works. As I develop primarily in PHP, there are plenty of PHP string functions which, when combined, can get me the desired result. However, yesterday I had to extract two values from a single string, so I thought I’d give regular expressions another go. Over the years of sniffing at them I realised that I’d picked up a few things.
- Delimiters – These look different depending on what flavour of regex you’re using. I’m using PHP’s PCRE functions, so my patterns are wrapped in forward slashes.
- Characters – I knew about Character Classes and I know what some of them do, though not all. For example, I know that [A-Za-z] will find a single alphabetical character and \d will find a single digit.
- Repetition – I knew about greediness (the + character). This quantifier matches the preceding token one or more times: after it’s found the first occurrence, it keeps matching for as long as it can.
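As a minimal sketch of the last two points, here is how a character class behaves with and without the + quantifier. The sample string abc123 and the variable names are made up for illustration only.

```php
<?php
// Hypothetical sample string, chosen only for this illustration.
$subject = 'abc123';

// A character class on its own matches exactly one character.
preg_match('/[A-Za-z]/', $subject, $oneLetter);

// With +, the same class greedily matches the whole run of letters.
preg_match('/[A-Za-z]+/', $subject, $letters);

// \d on its own matches a single digit.
preg_match('/\d/', $subject, $oneDigit);

echo $oneLetter[0], "\n"; // a
echo $letters[0], "\n";   // abc
echo $oneDigit[0], "\n";  // 1
```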
So when I was confronted with the following string: item_newsArticle1138, I decided that it would extend my knowledge just enough to investigate using regex to extract “newsArticle” and “1138”.
Getting the number off the end of the string was easy enough. I’m using the preg_match() function in PHP to pass the matched text into a variable:
preg_match('/\d+/', 'item_newsArticle1138', $modelid);
The forward slashes act as regex delimiters, so all we need is \d+, which finds the first digit in the string; the plus sign (+) then continues matching the digits that follow it: 1138.
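A slightly fuller sketch of that call is below. Checking the return value and casting to an integer are my additions (the names $subject and $id are hypothetical); preg_match() returns 1 when the pattern matches and 0 when it does not.

```php
<?php
$subject = 'item_newsArticle1138';

// preg_match() returns 1 on a match, 0 on no match, false on error.
if (preg_match('/\d+/', $subject, $modelid) === 1) {
    // Index 0 of the matches array holds the full matched text.
    echo $modelid[0], "\n"; // 1138

    // Optional: cast to an integer if a number is needed.
    $id = (int) $modelid[0];
}
```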
Then came the tough one. If I use /[A-Za-z]+/ as my regex:
preg_match('/[A-Za-z]+/', 'item_newsArticle1138', $model);
The regex engine reports finding “item” in the string and stops at the underscore, ignoring the part I needed. What I needed the regex to do was find the underscore and then get all the alphabetic characters after it. To do this I needed to learn how to use positive and negative lookahead and lookbehind.
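To see the problem for yourself, here is a small sketch. The preg_match_all() call is my addition, used only to list every run of letters in the string; it is not part of the approach described in this post.

```php
<?php
$subject = 'item_newsArticle1138';

// preg_match() stops at the first match, which ends at the underscore.
preg_match('/[A-Za-z]+/', $subject, $model);
echo $model[0], "\n"; // item

// preg_match_all() collects every match; the part we want is the second run.
preg_match_all('/[A-Za-z]+/', $subject, $all);
print_r($all[0]); // item, newsArticle
```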
So what I needed to add to my regex was a positive lookbehind: (?<=_). The brackets and the question mark provide the code for the lookahead/lookbehind. The inclusion of a less than symbol makes it a lookbehind (N.B. omit the less than symbol for lookahead - don't swap it for a greater than symbol). The equals sign makes it positive, telling the regex "to look for" whatever follows, and then I pass in the underscore character, as that's what I'm looking for in the string.
That gives us the following regex:
preg_match('/(?<=_)[A-Za-z]+/', 'item_newsArticle1138', $model);
Be careful: the PHP function preg_match() returns its results as an array in the variable passed as the last argument ($model), so to use the value you need $model[0], which returns “newsArticle”.
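Putting both patterns together, a minimal end-to-end sketch looks like this (the $subject variable is just a name I've picked for the example):

```php
<?php
$subject = 'item_newsArticle1138';

// Positive lookbehind: only start matching letters right after an underscore.
preg_match('/(?<=_)[A-Za-z]+/', $subject, $model);

// One or more digits: the number at the end of the string.
preg_match('/\d+/', $subject, $modelid);

// Both results are arrays; index 0 is the full match.
echo $model[0], "\n";   // newsArticle
echo $modelid[0], "\n"; // 1138
```

Note that the underscore itself is not part of the match, because a lookbehind only checks what precedes the current position without consuming it.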
Hey presto, the magic is revealed. I’m no expert, so somebody’s bound to tell me a better way in the comments, hello?