In formal language theory, a regular expression (a.k.a. regex, regexp, or r.e.) is a string that represents a regular (type-3) language.
In many programming languages, a regular expression is a string that represents a set of strings. These are sometimes called patterns, because they almost always go well beyond what classic regular expressions can describe.
Rather than start with technical details, we'll start with a bunch of examples:
| Regex | What it stands for |
|---|---|
| hello | {hello} |
| gray|grey | {gray, grey} |
| gr(a|e)y | {gray, grey} |
| gr[ae]y | {gray, grey} |
| b[aeiou]bble | {babble, bebble, bibble, bobble, bubble} |
| [b-chm-pP]at|a&b | {bat, cat, hat, mat, nat, oat, pat, Pat, a&b} |
| colou?r | {color, colour} |
| rege(x(es)?|xps?) | {regex, regexes, regexp, regexps} |
| go*gle | {ggle, gogle, google, gooogle, goooogle, ...} |
| go+gle | {gogle, google, gooogle, goooogle, ...} |
| g(oog)+le | {google, googoogle, googoogoogle, googoogoogoogle, ...} |
| z{3} | {zzz} |
| z{3,6} | {zzz, zzzz, zzzzz, zzzzzz} |
| z{3,} | {zzz, zzzz, zzzzz, ...} |
| [Bb]rainf\*\*k | {Brainf**k, brainf**k} |
| \d | {0,1,2,3,4,5,6,7,8,9} |
| \d{5}(-\d{4})? | a United States zip code |
| 1\d{10} | An 11-digit string starting with a 1 |
| b..b | A four-character string beginning and ending with a b (Note: depending on context, the dot stands either for "any character at all" or "any character except a newline") |
| Hello\nworld | Hello followed by a newline followed by world |
| sh[^io]t | sh followed by any character other than an i or o, followed by t |
| //[^\r\n]*[\r\n] | A Java or C# slash-slash comment |
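A few rows of the table can be checked mechanically. The following is a minimal Python sketch, not part of the original notes; the `cases` list and the `check` helper are illustrative names, and the rejected strings are examples chosen here. It uses `re.fullmatch`, which succeeds only when the whole string is in the set the regex stands for.

```python
import re

# (pattern from the table, strings it should accept, strings it should reject)
cases = [
    (r"gr[ae]y",           ["gray", "grey"],                          ["groy", "graey"]),
    (r"colou?r",           ["color", "colour"],                       ["colouur"]),
    (r"rege(x(es)?|xps?)", ["regex", "regexes", "regexp", "regexps"], ["regexe"]),
    (r"go*gle",            ["ggle", "gogle", "google"],               ["gugle"]),
    (r"z{3,6}",            ["zzz", "zzzzzz"],                         ["zz", "zzzzzzz"]),
    (r"\d{5}(-\d{4})?",    ["90210", "90210-1234"],                   ["9021", "90210-12"]),
    (r"sh[^io]t",          ["shut"],                                  ["shot"]),
]

def check(pattern, accepted, rejected):
    """True if every accepted string fully matches and no rejected string does."""
    all_accepted = all(re.fullmatch(pattern, s) for s in accepted)
    none_rejected = not any(re.fullmatch(pattern, s) for s in rejected)
    return all_accepted and none_rejected

results = [check(*case) for case in cases]
```

Every entry of `results` should be `True` if the table rows are read as sets of whole strings.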
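There are many different syntaxes for regular expressions, but in general you will see that the following characters have special meanings and must be escaped with a backslash to match themselves:
( ) [ { ^ $ . \ ? * + |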
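To illustrate what escaping does, here is a small Python sketch (an illustration added here, with `matches_literally` as a hypothetical helper name). It contrasts an unescaped dot with an escaped one and uses the standard `re.escape` function to escape arbitrary text.

```python
import re

def matches_literally(text, candidate):
    """True if candidate is exactly text, with every character of text taken literally."""
    return re.fullmatch(re.escape(text), candidate) is not None

# Unescaped, the dot is special and matches any single character.
dot_as_wildcard = re.fullmatch(r"a.b", "axb") is not None

# Escaped with a backslash, the dot matches only a literal ".".
dot_as_literal = re.fullmatch(r"a\.b", "axb") is not None
```

Here `dot_as_wildcard` is `True` and `dot_as_literal` is `False`, while `matches_literally("1+1=2 (really?)", "1+1=2 (really?)")` succeeds even though the text is full of special characters.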
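Programmers use regexes primarily to test whether a string matches a pattern, and to find or extract the parts of a string that match a pattern.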
Example: return whether the string s is 0, 1, n, y, no, yes, off, on, false, or true (case insensitive)
// Java
Pattern p = Pattern.compile("(?i:[01]|no?|y(?:es)?|o(?:ff|n)|false|true)");
Matcher m = p.matcher(s);
m.matches(); // true only if the entire string s matches the pattern
# Perl
$s =~ m/^(?:[01]|no?|y(?:es)?|o(?:ff|n)|false|true)$/i;
# Ruby
s =~ /\A(?:[01]|no?|y(?:es)?|o(?:ff|n)|false|true)\z/i
# Python
import re
p = re.compile(r"[01]|no?|y(?:es)?|o(?:ff|n)|false|true", re.I)
p.fullmatch(s) is not None  # fullmatch requires the entire string to match
// JavaScript
/^(?:[01]|no?|y(?:es)?|o(?:ff|n)|false|true)$/i.test(s);
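One detail worth checking in anchored patterns like these: alternation binds more loosely than the anchors, so `^a|b$` means "starts with a, or ends with b" unless the alternatives are grouped. The sketch below uses a shortened version of the pattern for illustration; `loose`, `tight`, and `accepts` are names made up here.

```python
import re

# Without a group, ^ applies only to the first alternative and $ only to the last.
loose = re.compile(r"^[01]|no?|true$", re.I)

# With a non-capturing group, both anchors apply to every alternative.
tight = re.compile(r"^(?:[01]|no?|true)$", re.I)

def accepts(pattern, s):
    """True if the pattern is found anywhere in s (anchors restrict where)."""
    return pattern.search(s) is not None
```

With these definitions, `accepts(loose, "banana")` is `True` because the unanchored middle alternative finds an `n`, whereas `accepts(tight, "banana")` is `False`.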
Some languages have convenience methods that let you use a string instead of a regex; these methods compile a new pattern behind the scenes. They are convenient, but avoid them if you need to use the same pattern over and over again!
// Java shortcut
s.matches("(?i:[01]|no?|y(?:es)?|o(?:ff|n)|false|true)");
# Perl shortcut
$s =~ m/^(?:[01]|no?|y(?:es)?|o(?:ff|n)|false|true)$/i;
# Ruby shortcut
s =~ /\A(?:[01]|no?|y(?:es)?|o(?:ff|n)|false|true)\z/i
// JavaScript shortcut
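// A string argument is converted with new RegExp(...), so no flags can be given; lowercase s first
s.toLowerCase().search("^(?:[01]|no?|y(?:es)?|o(?:ff|n)|false|true)$") !== -1;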
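For comparison, here is how the two styles might look in Python. This is a sketch with hypothetical function names (`is_boolean_word`, `is_boolean_word_shortcut`). Note that CPython keeps a small cache of recently compiled patterns, so the cost of the shortcut there is usually smaller than in Java, where `String.matches` compiles the pattern on every call.

```python
import re

# Compiled once, at import time, and reused on every call.
ACCEPTED = re.compile(r"[01]|no?|y(?:es)?|o(?:ff|n)|false|true", re.IGNORECASE)

def is_boolean_word(s):
    """Precompiled style: fullmatch requires the whole string to match."""
    return ACCEPTED.fullmatch(s) is not None

def is_boolean_word_shortcut(s):
    """Shortcut style: the pattern string is handed to the re module on each call."""
    return re.fullmatch(r"[01]|no?|y(?:es)?|o(?:ff|n)|false|true", s, re.IGNORECASE) is not None
```

Both functions accept the same strings; the difference is only in where the compilation work happens.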
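Example: find a phone number inside a string and capture its parts
// Java
Pattern phone = Pattern.compile("((\\d{3})(?:\\.|-))?(\\d{3})(?:\\.|-)(\\d{4})");
Matcher m = phone.matcher("Call 555-1212 for info");
m.find(); // true; m.start() is 5 and m.group() is "555-1212"
# Perl
my $phone = qr/((\d{3})(?:\.|-))?(\d{3})(?:\.|-)(\d{4})/;
'Call 555-1212 for info' =~ $phone; # true; $& is "555-1212", $3 is "555", $4 is "1212"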
# Ruby
>> phone = /((\d{3})(?:\.|-))?(\d{3})(?:\.|-)(\d{4})/
=> /((\d{3})(?:\.|-))?(\d{3})(?:\.|-)(\d{4})/
>> phone =~ 'Call 555-1212 for info'
=> 5
>> [$`, $&, $', $1, $2, $3, $4, $5]
=> ["Call ", "555-1212", " for info", nil, nil, "555", "1212", nil]
>> phone =~ '800.221.9989'
=> 0
>> [$`, $&, $', $1, $2, $3, $4, $5]
=> ["", "800.221.9989", "", "800.", "800", "221", "9989", nil]
>> phone =~ '1800.221.9989'
=> 1
>> [$`, $&, $', $1, $2, $3, $4, $5]
=> ["1", "800.221.9989", "", "800.", "800", "221", "9989", nil]
# Python
import re
p = re.compile(r'((\d{3})(?:\.|-))?(\d{3})(?:\.|-)(\d{4})')
p.search('Call 555-1212 for info')  # search, not match: the number is not at the start of the string
// JavaScript
const phone = /((\d{3})(?:\.|-))?(\d{3})(?:\.|-)(\d{4})/;
'Call 555-1212 for info'.search(phone); // 5
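The Ruby session above inspects the match position, the matched text, and the numbered groups. A rough Python counterpart might look like this; `describe` is a hypothetical helper introduced for illustration, and groups that did not participate in the match come back as `None` rather than `nil`.

```python
import re

phone = re.compile(r"((\d{3})(?:\.|-))?(\d{3})(?:\.|-)(\d{4})")

def describe(text):
    """Return (start index, matched text, captured groups), or None if nothing matches."""
    m = phone.search(text)   # search scans the whole string for the first match
    if m is None:
        return None
    return m.start(), m.group(0), m.groups()
```

Calling `describe` on the three strings from the Ruby session gives the same start indexes (5, 0, and 1) and the same captured pieces.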
A modifier affects the way the rest of the regex is interpreted.
| Modifier | Meaning |
|---|---|
| g | global |
| i | ignore case |
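| m | multiple line: in most syntaxes, ^ and $ match at the start and end of each line |
| s | single line: in most syntaxes, the dot also matches a newline |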
| x | ignore whitespace in the pattern |
| d | |
| u | |
Examples:
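As one possible illustration, the sketch below shows the Python spellings of the i, m, s, and x modifiers, which are passed as flags rather than written after the pattern. Python has no g modifier; `re.findall` plays that role here. The sample text and variable names are assumptions made for this example.

```python
import re

text = "Hello\nWORLD"

# i: ignore case
ignore_case = re.findall(r"world", text, re.IGNORECASE)

# m: ^ and $ match at the start and end of each line, not just of the whole string
multi_line = re.findall(r"^\w+$", text, re.MULTILINE)

# s: the dot also matches a newline
single_line = re.search(r"Hello.WORLD", text, re.DOTALL)

# x: whitespace and comments inside the pattern are ignored
verbose = re.fullmatch(r"""
    \d{5}        # five digits
    (-\d{4})?    # optional four-digit extension
""", "90210-1234", re.VERBOSE)
```

Without `re.MULTILINE` the second pattern finds nothing in this text, and without `re.DOTALL` the third does not match.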
TODO
You should know some things about how your regex engine works, since two "equivalent" regexes can have drastic differences in processing speed.
TODO - show lots of examples and tips here
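Until those examples are written, here is one commonly cited illustration, sketched in Python under the assumption of a backtracking engine such as the one in CPython's `re` module. The two patterns accept the same strings when used with `match`, but the nested quantifier in the first gives the engine exponentially many ways to split a run of a's before it can report failure. The names `slow`, `fast`, and `timed_match` are made up for this sketch.

```python
import re
import time

slow = re.compile(r"(a+)+$")   # nested quantifiers: many ways to split a run of a's
fast = re.compile(r"a+$")      # accepts and rejects the same strings when used with match()

def timed_match(pattern, text):
    """Return (matched?, elapsed seconds) for pattern.match(text)."""
    start = time.perf_counter()
    matched = pattern.match(text) is not None
    return matched, time.perf_counter() - start

# A string that almost matches forces the engine to try every alternative before failing.
text = "a" * 18 + "!"
```

Comparing `timed_match(slow, text)` with `timed_match(fast, text)` typically shows the same answer reached in very different amounts of time, and the gap roughly doubles with each extra `a` in the input.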
Three ways to construct a Regexp object in Ruby:
Regexp.new('z\d+\s*abc[xy]+$')
/z\d+\s*abc[xy]+$/
%r{z\d+\s*abc[xy]+$}