Quiet Mumblings

Tuesday, August 15, 2006

Regular Expressions for an SQL programmer

Regular expressions are one of these things that I've managed to put off for the whole of my career until now. Mainly of use in Linux/Unix environments, regex syntax is at first glance highly complex, unfriendly and not easily readable. For example:

[-+]?[0-9]*\.?[0-9]+

Like an SQL query's LIKE clause but much more powerful, regular expressions allow you to match strings on very complex criteria. Also like SQL, there is a whole science and art to optimising them to perform quickly and efficiently.

An example of an SQL query's WHERE clause might be:

first_name LIKE 'mich%l'

This query will of course match "michael", "micheal", "michel", "michekevekolol", "mich12l" etc. In a very simple equivalent statement (but not the most efficient) you could write the regular expression:

mich.+l

The . is used to represent any character (but only one instance of it), while the + is used to represent a "repetition", which means that any number of the previous characters are ok. This is effectively the same as writing a regular expression for:

mich.l
mich..l
mich...l
mich....l
mich.....l
etc.


The great thing about regular expressions is you can be much more specific than that, by telling the expression exactly what you want to match. To do this, you can use option blocks, which are denoted by square brackets: []

For example the regular expression "[abc]" would match any string that contained the letters a, b or c. It would match "michael", "christopher" and "barry", but it would not match "helen". Within these option blocks you can also specify ranges, for example "[a-zA-Z0-1]" would match anything that contained a number of a letter. It would match "michael", "123455", but not "!!!!!".

Combine this with the repetition control + and you have a powerful tool. For example our previous regular expression could now become:

mich[a-z]+l

This time it would match "michael", "micheal" "michel" and "michekevekolol", but not "mich12l". Alternatively, you could write:

mich[a-z][a-z]l

Which would match "michael", "micheal" and not "michel", "michekevekolol" and "mich12l".

But what if you wanted the expression to match "michael", "micheal" *and* "michel" but *not* "michekevekolol"? You could do this with the question mark, which like the + repetition character affects the character directly before it, but makes it optional. The regular expression:

micha?el

Would match "michael" and "michel". So our query:

mich[a-z][a-z]?l

Would equal "michael", "micheal", "michel". It would also of course match "michzxl", "michil" and "michpl". If you really just wanted "michael", "micheal" and "michel" you could use:

mich(ae|ea|e)l

The | pipe character is like a logical OR statement, or a switch. It will select any instance out of the multiple options ae, ea or e.

There is much more clever stuff you can do (there are books written about regular expressions), but here's a few good resources from the web:


Good luck - also check out the REGEXP function in MySQL to take your SQL to another level!

Technorati tags:

0 Comments:

Post a Comment

<< Home