A Meek Defense of regex

This was originally a comment on the post "The road to hell is paved with regular expressions."

It’s kinda popular to bash regex. There's that JWZ Classic:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

and there are plenty of modern equivalents. But I think that occasionally, just occasionally, it’s the best tool for the job. If you're trying to dole out sage advice to beginners, I think it's better to give a balanced view than to instill in them a "regex is evil" dogma that they will take ages to outgrow.

Let’s consider a function which extracts the digits from a string that may contain other characters, say for detecting a phone number. We can make two strings, one short and one long, to see how regex and non-regex approaches perform in each scenario:

s = "My phone number is (123)456-7890."
ls = "My phone number is" * 5000 + "(123)456-7890."
len(s), len(ls)
# (33, 90014)

Let’s make four different functions, to see how different things might cause overhead here. The first uses isdigit(), the second compares each character against '0' and '9' and so relies on the way the string is encoded, the third uses a pre-compiled regex, and the fourth compiles its regex on the fly.

import re
def pextract(s):
    return ''.join(c for c in s if c.isdigit())

def pextract_n(s):
    return ''.join(c for c in s if c >= '0' and c <= '9')

pat = re.compile(r"(\d+)")
def pextract_re(s):
    return ''.join(pat.findall(s))

def pextract_rec(s):
    return ''.join(re.findall(r"(\d+)", s))
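
One caveat before timing them: the variants only agree on plain ASCII input. Here is a small sketch of where they part ways, assuming Python 3 str semantics (the timings in this post may well come from a different interpreter version). The `mixed` string is my own made-up test input, not something from the original comment.

```python
import re

def pextract(s):
    return ''.join(c for c in s if c.isdigit())

def pextract_n(s):
    return ''.join(c for c in s if c >= '0' and c <= '9')

def pextract_rec(s):
    return ''.join(re.findall(r"(\d+)", s))

ascii_only = "My phone number is (123)456-7890."
# Hypothetical input: SUPERSCRIPT TWO and ARABIC-INDIC DIGIT THREE.
mixed = "x\u00b2 and \u0663"

# On plain ASCII text every variant returns the same digits.
print(pextract(ascii_only) == pextract_n(ascii_only) == pextract_rec(ascii_only))

# On non-ASCII "digits" they disagree (Python 3 behaviour):
print(ascii(pextract(mixed)))      # isdigit() accepts both characters
print(ascii(pextract_n(mixed)))    # the range check accepts neither
print(ascii(pextract_rec(mixed)))  # \d accepts only the Arabic-Indic digit
```

For phone numbers typed in ASCII none of this matters, but it is worth knowing which definition of "digit" you are actually getting.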

Let's see what timeit shows us:

timeit pextract(s)
100000 loops, best of 3: 4.78 us per loop
timeit pextract(ls)
100 loops, best of 3: 8.01 ms per loop

timeit pextract_n(s)
100000 loops, best of 3: 4.11 us per loop
timeit pextract_n(ls)
100 loops, best of 3: 7.19 ms per loop

timeit pextract_re(s)
100000 loops, best of 3: 2.18 us per loop
timeit pextract_re(ls)
100 loops, best of 3: 2.98 ms per loop

timeit pextract_rec(s)
100000 loops, best of 3: 3.19 us per loop
timeit pextract_rec(ls)
100 loops, best of 3: 2.89 ms per loop
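
If you want to try this outside an interactive shell, here is a rough sketch using the standard timeit module. The `bench` helper and the loop counts are my own choices for illustration, not what produced the numbers above, and the lambda wrapper adds a little per-call overhead, so expect the absolute figures to differ on your machine.

```python
import re
import timeit

s = "My phone number is (123)456-7890."
ls = "My phone number is" * 5000 + "(123)456-7890."

def pextract(s):
    return ''.join(c for c in s if c.isdigit())

def pextract_n(s):
    return ''.join(c for c in s if c >= '0' and c <= '9')

pat = re.compile(r"(\d+)")
def pextract_re(s):
    return ''.join(pat.findall(s))

def pextract_rec(s):
    return ''.join(re.findall(r"(\d+)", s))

def bench(func, text, number):
    """Hypothetical helper: best of 3 runs, in seconds per call."""
    timer = timeit.Timer(lambda: func(text))
    return min(timer.repeat(repeat=3, number=number)) / number

results = {}
for func in (pextract, pextract_n, pextract_re, pextract_rec):
    # Many calls on the short string, only a few on the long one.
    results[func.__name__] = (bench(func, s, 2000), bench(func, ls, 5))

for name, (short_t, long_t) in results.items():
    print("%-13s short: %.2f us   long: %.2f ms"
          % (name, short_t * 1e6, long_t * 1e3))
```

Taking the minimum of several repeats mirrors the "best of 3" reporting above.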

It shouldn’t be surprising to people who have lots of experience with Python that the regex module is faster: looping over things in Python is slow, and since the regex code does its looping inside the re module rather than in Python, it runs noticeably faster. It might be surprising that even with the compilation overhead, re outperforms doing this task manually (and it was surprising even to me that the on-the-fly version fared better on the longer string; I think this might be the cost of looking up the pattern outside the function's scope, and re also caches recently compiled patterns, so the on-the-fly version does not pay for compilation on every call).

If I were writing this code, I’d write the first function. Unless it was absolutely required, I would not resort to the regex versions. However, I don’t think the regex version is particularly ugly, and this is a really straightforward filter.

Now imagine a more complex but equally contrived example: say you wanted to get the sum of the possibly-floating-point numbers mentioned in the body of a text. This calls for a more complex regex, which might appear ugly to many, but in fact a naive but still quite decent version ends up being rather short and concise:

def sum_floats(s):
    return sum(map(float, re.findall(r"(\d+(?:\.\d+)?)", s)))

This code has faults: it won't recognize negative numbers, it will consider any digits in the text to be numbers, it will read "3.4.3" as "3.4" and "3", and more. In some ways, it's the poster child for why regex is evil; if these issues must be fixed, the regex will become monstrously complex. This is the reason you should shy away from regex.
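
To make those faults concrete, here is a quick check of the function above on a few inputs I made up for the purpose:

```python
import re

def sum_floats(s):
    return sum(map(float, re.findall(r"(\d+(?:\.\d+)?)", s)))

# "3.4.3" is split into two separate numbers.
print(re.findall(r"(\d+(?:\.\d+)?)", "3.4.3"))  # ['3.4', '3']

# The minus sign is ignored, so -5 counts as 5.
print(sum_floats("it was -5 degrees"))          # 5.0

# Any run of digits counts, including the room number.
print(sum_floats("room 101 costs 2.50"))        # 103.5
```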

However, the equivalent function that twiddles with str.split and has to keep track of the run of digits and decimals and such is much longer, less clear, and far slower. The regex version embodies all of the matching logic within the regex language; provided you can read it, you would grok the above expression a lot faster than you would the equivalent manual Python code.
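
For comparison, here is one way the hand-rolled version might look. This is only a sketch: `sum_floats_manual` is a name I made up, it scans character by character rather than using str.split, and it assumes ASCII digits. It is meant to mirror the behaviour of the regex, faults included.

```python
import re

def sum_floats(s):
    return sum(map(float, re.findall(r"(\d+(?:\.\d+)?)", s)))

def _is_digit(c):
    # Assumption: only ASCII digits count.
    return '0' <= c <= '9'

def sum_floats_manual(s):
    total = 0
    i, n = 0, len(s)
    while i < n:
        if not _is_digit(s[i]):
            i += 1
            continue
        # Start of a number: consume the run of digits.
        start = i
        while i < n and _is_digit(s[i]):
            i += 1
        # Optional fraction: a dot followed by at least one digit.
        if i + 1 < n and s[i] == '.' and _is_digit(s[i + 1]):
            i += 1
            while i < n and _is_digit(s[i]):
                i += 1
        total += float(s[start:i])
    return total

text = "a 12.5 b .7 c 8. d 3.4.3"
print(sum_floats(text) == sum_floats_manual(text))  # True
```

Even this fairly tidy version needs a loop, an index, and two nested scans to say what the regex says in one line.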

For little one-offs, or when quick and dirty really is good enough and anything better is YAGNI, regex can be an invaluable tool. If you can't read the regex above comfortably (save perhaps the non-capturing group syntax) and think that it is obscure, that is your weakness, just the same as it would be if you found the code using the genexprs above confusing. You must learn your tools better before you can understand and then, with experience, judge them.
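
If the (?:...) part is the bit that trips you up, the same pattern can be spelled out piece by piece with the re.VERBOSE flag. This is just an annotated sketch of the expression used in sum_floats; the name `FLOAT_RE` is mine.

```python
import re

FLOAT_RE = re.compile(r"""
    (               # capturing group: this is what findall returns
      \d+           # one or more digits
      (?:           # non-capturing group: groups without adding a result
        \.\d+       # a literal dot followed by one or more digits
      )?            # the whole fractional part is optional
    )
""", re.VERBOSE)

text = "pi is about 3.14 and e is about 2.718, give or take 1"
print(FLOAT_RE.findall(text))  # ['3.14', '2.718', '1']

# Same matches as the compact one-line pattern.
print(FLOAT_RE.findall(text) == re.findall(r"(\d+(?:\.\d+)?)", text))  # True
```

Without the ?: the inner group would also capture, and findall would hand back tuples instead of plain strings.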

Aug 10 2012

jmoiron plays the blues
