Regular Expressions

Regular Expressions are way cool. Knowledge of regexes will allow you to save the day.

Definitions

In formal language theory, a regular expression (a.k.a. regex, regexp, or r.e.) is a string that represents a regular (type-3) language.

Huh??

Okay, in many programming languages, a regular expression is a pattern that matches strings or pieces of strings. (Incidentally, the set of strings they are capable of matching goes way beyond what regular expressions from language theory can describe.)

Basic Examples

Rather than start with technical details, we'll start with a bunch of examples.

Regex Matches any string that
hello contains {hello}
gray|grey contains {gray, grey}
gr(a|e)y contains {gray, grey}
gr[ae]y contains {gray, grey}
b[aeiou]bble contains {babble, bebble, bibble, bobble, bubble}
[b-chm-pP]at|ot contains {bat, cat, hat, mat, nat, oat, pat, Pat, ot}
colou?r contains {color, colour}
rege(x(es)?|xps?) contains {regex, regexes, regexp, regexps}
[01]|no?|y(es)?|o(ff|n)|false|true contains {0, 1, n, no, y, yes, off, on, false, true}
go*gle contains {ggle, gogle, google, gooogle, goooogle, ...}
go+gle contains {gogle, google, gooogle, goooogle, ...}
g(oog)+le contains {google, googoogle, googoogoogle, googoogoogoogle, ...}
z{3} contains {zzz}
z{3,6} contains {zzz, zzzz, zzzzz, zzzzzz}
z{3,} contains {zzz, zzzz, zzzzz, ...}
[Bb]rainf\*\*k contains {Brainf**k, brainf**k}
\d contains {0,1,2,3,4,5,6,7,8,9}
\d{5}(-\d{4})? contains a United States zip code
1\d{10} contains an 11-digit string starting with a 1
[2-9]|[12]\d|3[0-6] contains an integer in the range 2..36 inclusive
Hello\nworld contains Hello followed by a newline followed by world
b..b contains a four-character (sub)string beginning and ending with a b (Note: depending on context, the dot stands either for "any character at all" or "any character except a newline")
\d+(\.\d\d)? contains a nonnegative integer or a decimal number with exactly two digits after the decimal point.
sh[^io]t contains sh followed by any character other than an i or o, followed by t
//[^\r\n]*[\r\n] contains a Java or C# slash-slash comment
^dog begins with "dog"
dog$ ends with "dog"
^dog$ is exactly "dog"
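
A few rows of the table can be checked mechanically. The sketch below uses Python's re module; the helper names (contains, check_cases) and the sample strings are made up for illustration and are not part of the notes above.

```python
import re

def contains(pattern, text):
    """Return True if some substring of text matches pattern."""
    return re.search(pattern, text) is not None

# Each entry: (pattern, strings that should match, strings that should not)
CASES = [
    (r"gr[ae]y",         ["gray", "a grey cat"],        ["groy"]),
    (r"[b-chm-pP]at|ot", ["bat", "Pat", "ot", "hot"],   ["at"]),
    (r"colou?r",         ["color", "colour"],           ["colr"]),
    (r"go+gle",          ["gogle", "google"],           ["ggle"]),
    (r"z{3,6}",          ["zzz", "buzzzz"],             ["zz"]),
    (r"\d{5}(-\d{4})?",  ["90045", "90045-2659"],       ["9004"]),
    (r"^dog$",           ["dog"],                       ["hot dog", "dogs"]),
]

def check_cases(cases=CASES):
    """Return the (pattern, text) pairs whose result was unexpected."""
    failures = []
    for pattern, good, bad in cases:
        failures += [(pattern, t) for t in good if not contains(pattern, t)]
        failures += [(pattern, t) for t in bad if contains(pattern, t)]
    return failures

print(check_cases())   # an empty list means every row behaved as described
```

Note that "hot" matches [b-chm-pP]at|ot only because the alternation splits the pattern into "[b-chm-pP]at" and "ot"; the character class does not apply to the second alternative.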

Notation

There are many different syntaxes for regular expressions, but in general you will see that

Using Regular Expressions

Many languages allow programmers to define regexes and then use them to validate text, extract the pieces that match, and substitute new text for those pieces.

Generally a regex is first compiled into some internal form that can be used for super fast validation, extraction, and replacing. Sometimes there is an explicit "compile" function or method, and sometimes special syntax is used to compile, such as the very common form /.../.

Validation

Example: find "color" or "colour" in a given string.

// Java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern p = Pattern.compile("colou?r");
Matcher m = p.matcher("The color green");
m.find();                           // returns true
m.start();                          // returns 4
m.end();                            // returns 9
m = p.matcher("abc");
m.find();                           // returns false

# Perl
$p = qr/colou?r/;                   # qr// compiles without matching
"The color green" =~ $p;            # returns 1 (Perl traditionally has no boolean type)
"abc" =~ $p;                        # returns "" (the empty string, Perl's usual false)

# Ruby
p = /colou?r/
"The color green" =~ p              # returns 4
"abc" =~ p                          # returns nil

# Python
import re
p = re.compile("colou?r")
m = p.search("The color green")
m.start()                           # returns 4
m = p.search("abc")                 # returns None

// JavaScript
var p = /colou?r/;
"The color green".search(p);        // returns 4
"abc".search(p);                    // returns -1

If you want to know if an entire string matches a pattern, define the pattern with ^ and $, or with \A and \Z. In Java, you can call matches() instead of find().
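
As a rough sketch of the difference between matching somewhere in a string and matching the whole string, here is the same pattern in Python. The function name whole_string_is_color is made up for this example; fullmatch() plays the role that matches() plays in Java.

```python
import re

p = re.compile(r"colou?r")

def whole_string_is_color(text):
    """True only if the entire string is 'color' or 'colour'."""
    return p.fullmatch(text) is not None

# search() looks for the pattern anywhere in the string.
found_anywhere = p.search("The color green") is not None

# Anchoring the pattern itself gives the same answer as fullmatch().
anchored = re.search(r"\Acolou?r\Z", "colour") is not None

print(found_anywhere)                            # True
print(whole_string_is_color("The color green"))  # False
print(whole_string_is_color("colour"))           # True
print(anchored)                                  # True
```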

Extraction

After doing a match against a pattern, most regex engines will return you a bundle of information, including the part of the text that matched the pattern, the index within the string where the match begins, the text before the matched text, and the text after the matched text.

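If you put parentheses around parts of the pattern, the engine will extract (capture) the pieces of text that those parts matched. Capturing is not free: the engine has to record every group it matches. If you need parentheses for grouping but do not need the captured text, use a non-capturing group, written (?:...) in most syntaxes.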

# Ruby
>> phone = /((\d{3})(?:\.|-))?(\d{3})(?:\.|-)(\d{4})/
=> /((\d{3})(?:\.|-))?(\d{3})(?:\.|-)(\d{4})/
>> phone =~ 'Call 555-1212 for info'
=> 5
>> [$`, $&, $', $1, $2, $3, $4, $5]
=> ["Call ", "555-1212", " for info", nil, nil, "555", "1212", nil]
>> phone =~ '800.221.9989'
=> 0
>> [$`, $&, $', $1, $2, $3, $4, $5]
=> ["", "800.221.9989", "", "800.", "800", "221", "9989", nil]
>> phone =~ '1800.221.9989'
=> 1
>> [$`, $&, $', $1, $2, $3, $4, $5]
=> ["1", "800.221.9989", "", "800.", "800", "221", "9989", nil]
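
For comparison, here is a sketch of the same extraction in Python. The names PHONE and describe are invented for this example; the pattern is the one used in the Ruby session above.

```python
import re

# (?:...) groups without capturing, so only four groups are numbered.
PHONE = re.compile(r"((\d{3})(?:\.|-))?(\d{3})(?:\.|-)(\d{4})")

def describe(text):
    """Return (before, matched, after, groups), or None if there is no match."""
    m = PHONE.search(text)
    if m is None:
        return None
    return (text[:m.start()], m.group(0), text[m.end():], m.groups())

print(describe("Call 555-1212 for info"))
print(describe("800.221.9989"))
print(describe("no number here"))
```

Groups that did not take part in the match come back as None, much like the nil entries in the Ruby output.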

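// JavaScript
var phone = /((\d{3})(?:\.|-))?(\d{3})(?:\.|-)(\d{4})/;
var m = phone.exec("Call 555-1212 for info");
m.index;                            // returns 5
m[0];                               // returns "555-1212"
m[3];                               // returns "555"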

// Java
TODO

Substitution
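
Substitution replaces each match with new text, optionally reusing the captured groups. A minimal sketch in Python, assuming re.sub; the variable names and sample strings are made up for illustration.

```python
import re

# Plain replacement text: every match is replaced.
american = re.sub(r"colour", "color", "colour me coloured")

# \1, \2, ... in the replacement refer to captured groups.
spaced = re.sub(r"(\d{3})[.-](\d{4})", r"\1 \2", "Call 555-1212 or 555.3434")

# The replacement may also be a function that receives the match object.
doubled = re.sub(r"\d+", lambda m: str(int(m.group(0)) * 2), "3 apples, 10 pears")

print(american)   # color me colored
print(spaced)     # Call 555 1212 or 555 3434
print(doubled)    # 6 apples, 20 pears
```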

Advanced Stuff

Character Classes

Groups

Qualifiers

Lookahead

Lookbehind

Backreferences

Modifiers

A modifier affects the way the rest of the regex is interpreted.

Modifier Meaning
g global
i ignore case
m multiple line
s single line (DOTALL): Means that the dot matches any character at all. Without this modifier, the dot matches any character except a newline.
x ignore whitespace in the pattern
d Unix line mode: Considers only U+000A as a line separator, rather than U+000D or the U+000D/U+000A combo or even U+2028.
u Unicode case: in this mode the case-insensitive modifier respects Unicode cases; outside of this mode that modifier only consolidates cases of US-ASCII characters.

Examples:

TODO
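
A few of these modifiers can be tried out in Python, where they are passed as flags (re.I, re.M, re.S, re.X) rather than written after the closing slash. This is only a sketch with made-up sample text. Python has no g flag; re.findall plays that role here.

```python
import re

text = "Dog\ndog food\nhot dog"

# i: ignore case
any_case = re.findall(r"dog", text, re.I)          # ['Dog', 'dog', 'dog']

# m: ^ and $ also match at the start and end of each line
start_plain = re.findall(r"^dog", text)            # []
start_multi = re.findall(r"^dog", text, re.M)      # ['dog']

# s: the dot also matches a newline
dot_plain = re.search(r"Dog.dog", text)            # None
dot_all = re.search(r"Dog.dog", text, re.S)        # matches 'Dog\ndog'

# x: whitespace and comments inside the pattern are ignored
zip_code = re.compile(r"""
    \d{5}        # five digits
    (-\d{4})?    # optional four-digit extension
""", re.X)

print(any_case, start_plain, start_multi)
print(dot_plain, dot_all.group(0) == "Dog\ndog")
print(zip_code.fullmatch("90045-2659") is not None)
```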

Performance

You should know some things about how your regex engine works since two "equivalent" regexes can have drastic differences in processing speed.

Here are some things that are generally evil:

Temporary Compilations

Some languages have convenience methods that let you use a string instead of a regex; these methods will compile a new pattern behind the scenes. While convenient, never use these shortcuts if you need to use the pattern over and over again!

// Java shortcut, should not be used in most circumstances
s.matches("colou?r");
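
The usual alternative is to compile the pattern once and reuse the compiled object. A sketch in Python follows; the names COLOR and count_matching_lines are invented for this example. Python's re module keeps a cache of recently compiled patterns, so the penalty for the string shortcut is smaller there than in some other languages, but compiling once still makes the intent explicit.

```python
import re

COLOR = re.compile(r"colou?r")   # compiled once, when the module is loaded

def count_matching_lines(lines):
    """Count the lines that contain 'color' or 'colour'."""
    # Reuses the precompiled pattern instead of passing a string each time.
    return sum(1 for line in lines if COLOR.search(line))

print(count_matching_lines(["The color green", "abc", "colourful"]))   # 2
```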

Nested Repetition

TODO
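
Nested repetition usually means a quantifier applied to a group that itself contains a quantifier, as in (a+)+. In a backtracking engine, a failed match against such a pattern can take time that grows exponentially with the input length. The sketch below, in Python, compares a nested pattern with a flat one that matches the same strings. The names and input sizes are made up, and the timings depend on the machine, so none are shown.

```python
import re
import time

NESTED = re.compile(r"(a+)+b")   # a quantifier inside a quantified group
FLAT = re.compile(r"a+b")        # finds a match in exactly the same strings

def time_failed_search(pattern, n):
    """Time a search that cannot succeed: n a's followed by a c."""
    text = "a" * n + "c"
    start = time.perf_counter()
    result = pattern.search(text)
    return result, time.perf_counter() - start

for n in (10, 14, 18):
    _, nested_time = time_failed_search(NESTED, n)
    _, flat_time = time_failed_search(FLAT, n)
    print(n, f"nested={nested_time:.5f}s", f"flat={flat_time:.5f}s")
```

Keep n small when experimenting: with this kind of pattern, each extra character can roughly double the running time.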

Dot-Star

TODO

TODO - show lots of examples and tips here

Language-Specific Notes

Java

Perl

Python

Ruby

There are three ways to construct a Regexp object:

    Regexp.new('z\d+\s*abc[xy]+$')

    /z\d+\s*abc[xy]+$/

    %r{z\d+\s*abc[xy]+$}

Exercise: Why is it best to use single quotes instead of double quotes in the argument to Regexp.new?

JavaScript

Two ways to create:

re = /a+bc/;
re = new RegExp("a+bc");

JavaScript doesn't have as rich a regex language as most other languages, but it has enough for most cases. The complete reference is here.

Methods:

regex.exec(str)
If there's a match, returns an array of match info. If no match, returns null.
regex.test(str)
Simply returns true if there's a match and false otherwise.
str.match(regex)
Without the g modifier, same as regex.exec(str). With the g modifier, returns an array of all matches.
str.search(regex)
Returns the index of the beginning of the match, or -1 if no match.
str.replace(regex, newTextOrFunction)
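Returns a new string in which the first match (or every match, with the g modifier) is replaced by the new text or by the function's return value. The original string is not changed.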
str.split(regex)
Returns an array of substrings split by the regex separator.

Details

Here are some good sources: