Regular Expressions

Definitions

In formal language theory, a regular expression (a.k.a. regex, regexp, or r.e.) is a string that represents a regular (type-3) language.

In many programming languages, a regular expression is a string that represents a set of strings. Such expressions are sometimes called patterns, because they almost always go well beyond what classic regular expressions can describe (backreferences, for example, can describe languages that are not regular).

Basic Examples

Rather than start with technical details, we'll start with a bunch of examples.

Regex               What it stands for
hello               {hello}
gray|grey           {gray, grey}
gr(a|e)y            {gray, grey}
gr[ae]y             {gray, grey}
b[aeiou]bble        {babble, bebble, bibble, bobble, bubble}
[b-chm-pP]at|a&b    {bat, cat, hat, mat, nat, oat, pat, Pat, a&b}
colou?r             {color, colour}
rege(x(es)?|xps?)   {regex, regexes, regexp, regexps}
go*gle              {ggle, gogle, google, gooogle, goooogle, ...}
go+gle              {gogle, google, gooogle, goooogle, ...}
g(oog)+le           {google, googoogle, googoogoogle, googoogoogoogle, ...}
z{3}                {zzz}
z{3,6}              {zzz, zzzz, zzzzz, zzzzzz}
z{3,}               {zzz, zzzz, zzzzz, ...}
[Bb]rainf\*\*k      {Brainf**k, brainf**k}
\d                  {0,1,2,3,4,5,6,7,8,9}
\d{5}(-\d{4})?      a United States zip code
1\d{10}             An 11-digit string starting with a 1
b..b                A four-character string beginning and ending with a b (Note: depending on context, the dot stands either for "any character at all" or "any character except a newline")
Hello\nworld        Hello followed by a newline followed by world
sh[^io]t            sh followed by any character other than an i or o, followed by t
//[^\r\n]*[\r\n]    A Java or C# slash-slash comment
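A few of the rows above can be checked mechanically. The following is a minimal sketch in Python, assuming the standard re module; the sample strings and the helper name check are illustrative choices, not part of the original table.

```python
import re

# (pattern, strings that should match in full, strings that should not)
CASES = [
    (r"gr[ae]y", ["gray", "grey"], ["groy", "graey"]),
    (r"colou?r", ["color", "colour"], ["colouur"]),
    (r"go*gle", ["ggle", "gogle", "google"], ["goggle"]),
    (r"g(oog)+le", ["google", "googoogle"], ["gle", "gooogle"]),
    (r"z{3,6}", ["zzz", "zzzzzz"], ["zz", "zzzzzzz"]),
    (r"\d{5}(-\d{4})?", ["90045", "90045-2659"], ["9004", "90045-26"]),
    (r"sh[^io]t", ["shut", "shat"], ["shot", "sht"]),
]

def check(cases):
    """Return the (pattern, string) pairs whose result was unexpected."""
    failures = []
    for pattern, accept, reject in cases:
        for s in accept:
            if re.fullmatch(pattern, s) is None:
                failures.append((pattern, s))
        for s in reject:
            if re.fullmatch(pattern, s) is not None:
                failures.append((pattern, s))
    return failures

print(check(CASES))  # prints [] when every row behaves as described
```

re.fullmatch is used so that each pattern has to describe the whole string, which matches the "set of strings" reading of the table.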

A Few Technical Remarks

There are many different syntaxes for regular expressions, but in general you will see that

Using Regular Expressions

Programmers use regexes primarily to validate strings, extract parts of strings, and substitute text.

Validation Examples

Example: return whether the string s is 0, 1, n, y, no, yes, off, on, false, or true (case insensitive).

// Java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern p = Pattern.compile("(?i:[01]|no?|y(?:es)?|o(?:ff|n)|false|true)");
Matcher m = p.matcher(s);
m.matches(); // true only if the entire string matches

# Perl
# The alternatives are grouped so that the anchors apply to all of them
$s =~ m/\A(?:[01]|no?|y(?:es)?|o(?:ff|n)|false|true)\z/i;

# Ruby
# \A and \z anchor to the whole string; ^ and $ would anchor to lines
s =~ /\A(?:[01]|no?|y(?:es)?|o(?:ff|n)|false|true)\z/i

# Python
import re
p = re.compile(r"[01]|no?|y(?:es)?|o(?:ff|n)|false|true", re.I)
p.fullmatch(s) is not None  # fullmatch requires the entire string to match

// JavaScript
/^(?:[01]|no?|y(?:es)?|o(?:ff|n)|false|true)$/i.test(s);
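One pitfall is worth checking in any validator of this kind: alternation has the lowest precedence, so a leading ^ and a trailing $ attach only to the first and last alternatives unless the alternatives are grouped. The sketch below, in Python, contrasts the two forms; the names ALTERNATIVES, loose, and strict are illustrative.

```python
import re

ALTERNATIVES = r"[01]|no?|y(?:es)?|o(?:ff|n)|false|true"

# Without a group, ^ belongs to "[01]" only and $ belongs to "true" only.
loose = re.compile("^" + ALTERNATIVES + "$", re.I)
# With a non-capturing group, the anchors apply to every alternative.
strict = re.compile("^(?:" + ALTERNATIVES + ")$", re.I)

for s in ["yes", "OFF", "nonsense", "0abc"]:
    print(s, bool(loose.search(s)), bool(strict.search(s)))
# yes True True
# OFF True True
# nonsense True False
# 0abc True False
```

The ungrouped pattern accepts "nonsense" because the alternative no? matches its first two characters. In Python, calling fullmatch on the unanchored pattern is another way to require that the whole string match.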

Some languages have convenience methods that let you use a string instead of a regex; these methods compile a new pattern behind the scenes each time they are called. They are convenient, but avoid them when you need to use the same pattern over and over again!

// Java shortcut
s.matches("(?i:[01]|no?|y(?:es)?|o(?:ff|n)|false|true)");

# Perl shortcut
$s =~ m/\A(?:[01]|no?|y(?:es)?|o(?:ff|n)|false|true)\z/i;

# Ruby shortcut
s =~ /\A(?:[01]|no?|y(?:es)?|o(?:ff|n)|false|true)\z/i

// JavaScript shortcut
s.search(/^(?:[01]|no?|y(?:es)?|o(?:ff|n)|false|true)$/i) !== -1;
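The advice about reuse can be sketched in Python as follows: compile the pattern once and keep the resulting object, rather than passing the pattern string on every call. The names BOOLEAN_WORD and is_boolean_word are illustrative.

```python
import re

# Compiled once at import time and reused by every call below.
BOOLEAN_WORD = re.compile(r"[01]|no?|y(?:es)?|o(?:ff|n)|false|true", re.I)

def is_boolean_word(s):
    """True if the whole string is one of the accepted words."""
    return BOOLEAN_WORD.fullmatch(s) is not None

inputs = ["Yes", "off", "maybe", "TRUE", "10"]
print([s for s in inputs if is_boolean_word(s)])  # ['Yes', 'off', 'TRUE']
```

Python's module-level functions such as re.fullmatch(pattern, s) keep a small cache of compiled patterns, so the penalty for the string form is smaller there than in some other languages, but holding on to the compiled object still states the intent clearly.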

Extraction Examples

// Java

# Perl
my $phone = qr/((\d{3})(?:\.|-))?(\d{3})(?:\.|-)(\d{4})/;
'Call 555-1212 for info' =~ $phone;   # $1 and $2 are undef, $3 is "555", $4 is "1212"

# Ruby
>> phone = /((\d{3})(?:\.|-))?(\d{3})(?:\.|-)(\d{4})/
=> /((\d{3})(?:\.|-))?(\d{3})(?:\.|-)(\d{4})/
>> phone =~ 'Call 555-1212 for info'
=> 5
>> [$`, $&, $', $1, $2, $3, $4, $5]
=> ["Call ", "555-1212", " for info", nil, nil, "555", "1212", nil]
>> phone =~ '800.221.9989'
=> 0
>> [$`, $&, $', $1, $2, $3, $4, $5]
=> ["", "800.221.9989", "", "800.", "800", "221", "9989", nil]
>> phone =~ '1800.221.9989'
=> 1
>> [$`, $&, $', $1, $2, $3, $4, $5]
=> ["1", "800.221.9989", "", "800.", "800", "221", "9989", nil]

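# Python
import re
p = re.compile(r'((\d{3})(?:\.|-))?(\d{3})(?:\.|-)(\d{4})')
m = p.search('Call 555-1212 for info')  # search, not match: the number is not at the start
m.groups()  # (None, None, '555', '1212')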

// JavaScript
'Call 555-1212 for info'.match(/((\d{3})(?:\.|-))?(\d{3})(?:\.|-)(\d{4})/);
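When a text may contain several phone numbers, extraction usually means iterating over all matches. The following Python sketch assumes a slightly simplified version of the phone pattern, with named groups in place of numbered ones; the group names area, prefix, and line are hypothetical labels, not taken from the examples above.

```python
import re

# Optional area code, then a three-digit prefix and a four-digit line number,
# separated by dots or hyphens.
PHONE = re.compile(r"(?:(?P<area>\d{3})[.-])?(?P<prefix>\d{3})[.-](?P<line>\d{4})")

text = "Call 555-1212 for info, or 800.221.9989 after hours."
for m in PHONE.finditer(text):
    print(m.group(0), m.groupdict())
# 555-1212 {'area': None, 'prefix': '555', 'line': '1212'}
# 800.221.9989 {'area': '800', 'prefix': '221', 'line': '9989'}
```

A group that did not take part in the match is reported as None, just as the Ruby session reports nil.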

Substitution Examples
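
As a starting point, here is a minimal substitution sketch in Python using re.sub. The pattern, the sample strings, and the replacement formats are illustrative assumptions.

```python
import re

PHONE = re.compile(r"(\d{3})[.-](\d{3})[.-](\d{4})")

# \1, \2 and \3 in the replacement refer to the captured groups.
formatted = PHONE.sub(r"(\1) \2-\3", "Call 800.221.9989 today")
print(formatted)  # Call (800) 221-9989 today

# The replacement can also be a function that receives each match object.
doubled = re.sub(r"\d+", lambda m: str(int(m.group(0)) * 2), "3 apples and 10 pears")
print(doubled)  # 6 apples and 20 pears
```

By default sub replaces every non-overlapping match; its count argument limits the number of replacements.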

Details

Characters

Character Classes

Groups

Qualifiers

Lookahead

Lookbehind

Modifiers

A modifier affects the way the rest of the regex is interpreted.

Modifier  Meaning
g         global
i         ignore case
m         multiple line
s         single line
x         ignore whitespace in the pattern
d
u

Examples:
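
A minimal sketch of the i, m, s, and x modifiers, using Python, where they are spelled re.I, re.M, re.S, and re.X. Python has no g modifier; functions such as findall and sub already work on every match. The sample text is an illustrative assumption.

```python
import re

text = "First line\nsecond LINE"

print(re.findall(r"line", text))                    # ['line']
print(re.findall(r"line", text, re.I))              # ['line', 'LINE']  (i: ignore case)
print(re.findall(r"^\w+", text))                    # ['First']
print(re.findall(r"^\w+", text, re.M))              # ['First', 'second']  (m: ^ matches at each line)
print(re.search(r"First.*LINE", text))              # None, because . stops at the newline
print(bool(re.search(r"First.*LINE", text, re.S)))  # True  (s: . also matches a newline)

# x: whitespace and comments inside the pattern are ignored.
verbose = re.compile(r"""
    \d{3}   # three digits
    [.-]    # separator
    \d{4}   # four digits
""", re.X)
print(verbose.search("555-1212").group(0))          # 555-1212
```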

TODO

Performance

It pays to know how your regex engine works, since two "equivalent" regexes can differ drastically in processing speed.
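
One well-known illustration of this, sketched in Python under the assumption of a backtracking engine such as the one in the standard re module: the patterns a+$ and (a+)+$ accept exactly the same strings, but on an input that almost matches, the nested quantifier forces the engine to try a number of ways of splitting the run of a's that grows exponentially with its length. The helper time_match and the input length are illustrative, and actual timings depend on the machine.

```python
import re
import time

def time_match(pattern, text):
    """Return (match result, elapsed seconds) for one match attempt at position 0."""
    start = time.perf_counter()
    result = re.match(pattern, text)
    return result, time.perf_counter() - start

# A run of a's followed by a character that makes the overall match fail.
text = "a" * 20 + "b"

# Both patterns accept exactly the strings made of one or more a's.
fast_result, fast_time = time_match(r"a+$", text)
slow_result, slow_time = time_match(r"(a+)+$", text)

print(fast_result, slow_result)  # None None
print(f"a+$: {fast_time:.6f}s   (a+)+$: {slow_time:.6f}s")
```

Each additional a in the input roughly doubles the time taken by the nested form, so it is best to keep the input short when experimenting.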

TODO - show lots of examples and tips here

Advice

TODO

Language-Specific Notes

Java

Perl

Python

Ruby

Three ways to construct a Regexp object:

    Regexp.new('z\d+\s*abc[xy]+$')

    /z\d+\s*abc[xy]+$/

    %r{z\d+\s*abc[xy]+$}

Exercise: Why is it best to use single quotes instead of double quotes in the argument to Regexp.new?

JavaScript