In formal language theory, a regular expression (a.k.a. regex, regexp, or r.e.) is a string that represents a regular (type-3) language.
Huh??
Okay, in many programming languages, a regular expression is a pattern that matches strings or pieces of strings. (Incidentally, the set of strings they are capable of matching goes way beyond what regular expressions from language theory can describe.)
Rather than start with technical details, we'll start with a bunch of examples.
| Regex | Matches any string that |
|---|---|
| hello | contains {hello} |
| gray\|grey | contains {gray, grey} |
| gr(a\|e)y | contains {gray, grey} |
| gr[ae]y | contains {gray, grey} |
| b[aeiou]bble | contains {babble, bebble, bibble, bobble, bubble} |
| [b-chm-pP]at\|ot | contains {bat, cat, hat, mat, nat, oat, pat, Pat, ot} |
| colou?r | contains {color, colour} |
| rege(x(es)?\|xps?) | contains {regex, regexes, regexp, regexps} |
| [01]\|no?\|y(es)?\|o(ff\|n)\|false\|true | contains {0, 1, n, no, y, yes, off, on, false, true} |
| go*gle | contains {ggle, gogle, google, gooogle, goooogle, ...} |
| go+gle | contains {gogle, google, gooogle, goooogle, ...} |
| g(oog)+le | contains {google, googoogle, googoogoogle, googoogoogoogle, ...} |
| z{3} | contains {zzz} |
| z{3,6} | contains {zzz, zzzz, zzzzz, zzzzzz} |
| z{3,} | contains {zzz, zzzz, zzzzz, ...} |
| [Bb]rainf\*\*k | contains {Brainf**k, brainf**k} |
| \d | contains {0,1,2,3,4,5,6,7,8,9} |
| \d{5}(-\d{4})? | contains a United States zip code |
| 1\d{10} | contains an 11-digit string starting with a 1 |
| [2-9]\|[12]\d\|3[0-6] | contains an integer in the range 2..36 inclusive |
| Hello\nworld | contains Hello followed by a newline followed by world |
| b..b | contains a four-character (sub)string beginning and ending with a b (Note: depending on context, the dot stands either for "any character at all" or "any character except a newline") |
| \d+(\.\d\d)? | contains an unsigned integer or a decimal number with exactly two digits after the decimal point |
| sh[^io]t | contains sh followed by any character other than an i or o, followed by t |
| //[^\r\n]*[\r\n] | contains a Java or C# slash-slash comment |
| ^dog | begins with "dog" |
| dog$ | ends with "dog" |
| ^dog$ | is exactly "dog" |
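A few rows of the table can be checked mechanically. The following is a minimal sketch in Python, assuming the standard `re` module; the sample texts and the helper name `contains` are made up for illustration.

```python
import re

# (pattern, a text that should match, a text that should not match)
# The sample texts are invented for this illustration.
CASES = [
    (r"gr[ae]y", "a grey cat", "a groy cat"),
    (r"colou?r", "The colour green", "The colr green"),
    (r"go+gle", "try gooogle", "try ggle"),
    (r"z{3,6}", "zzzz", "zz"),
    (r"\d{5}(-\d{4})?", "Los Angeles, CA 90045-2659", "CA 9004"),
    (r"^dog$", "dog", "hotdog"),
]

def contains(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None

for pattern, good, bad in CASES:
    # Each line prints the pattern followed by True False
    print(pattern, contains(pattern, good), contains(pattern, bad))
```

Note that `re.search` looks for a match anywhere in the text, which corresponds to the "contains" wording used in the table.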
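There are many different syntaxes for regular expressions, but in general you will see that most characters stand for themselves, while certain metacharacters have special meaning and must be escaped (usually with a backslash) to be matched literally. The usual metacharacters are:
( ) [ { ^ $ . \ ? * + |
Some languages have more of these.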
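Because the characters listed above are special in most regex syntaxes, matching them literally requires escaping. Here is a small sketch in Python, assuming the standard `re` module; the sample text is invented for illustration.

```python
import re

text = "Total: $5.00 (approx.)"

# Unescaped, "$" anchors the end of the string and "." matches any character,
# so this pattern does not mean "a literal $5.00" and finds nothing here.
unescaped = re.search(r"$5.00", text)

# Escaped by hand with backslashes.
by_hand = re.search(r"\$5\.00", text)

# re.escape inserts the backslashes needed to match a string literally.
generated = re.search(re.escape("$5.00"), text)

print(unescaped)          # None
print(by_hand.group())    # $5.00
print(generated.group())  # $5.00
```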
Many languages allow programmers to define regexes and then use them to validate, extract, and replace text.
Generally a regex is first compiled into some internal form that can be used for super fast validation, extraction, and replacing. Sometimes there is an explicit "compile" function or method, and sometimes special syntax is used to compile, such as the very common form /.../.
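As a rough sketch of the three uses mentioned above (validation, extraction, and replacing), here is how they might look in Python with an explicitly compiled pattern. The zip code pattern comes from the table of examples; the sample text and the `[zip]` placeholder are assumptions made for illustration.

```python
import re

# Compile once, then reuse the compiled pattern for each kind of task.
zip_code = re.compile(r"\d{5}(-\d{4})?")

# Validation: does the whole string match the pattern?
print(zip_code.fullmatch("90045-2659") is not None)  # True
print(zip_code.fullmatch("9004") is not None)        # False

# Extraction: pull out every matching fragment.
text = "Offices in 90045 and 10001-0001."
print([m.group() for m in zip_code.finditer(text)])  # ['90045', '10001-0001']

# Replacement: substitute each match with other text.
print(zip_code.sub("[zip]", text))  # Offices in [zip] and [zip].
```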
Example: find "color" or "colour" in a given string.
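// Java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
Pattern p = Pattern.compile("colou?r");
Matcher m = p.matcher("The color green");
m.find(); // returns true
m.start(); // returns 4
m.end(); // returns 9
m = p.matcher("abc"); // matcher() is a method of the compiled Pattern
m.find(); // returns false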
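# Perl
$p = qr/colou?r/; # qr// compiles the pattern; a bare /.../ would match against $_
"The color green" =~ $p; # returns 1 (Perl has no dedicated true value)
"abc" =~ $p; # returns the empty string (Perl has no dedicated false value)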
# Ruby
p = /colou?r/
"The color green" =~ p # returns 4
"abc" =~ p # returns nil
# Python
import re
p = re.compile(r"colou?r")
m = p.search("The color green")
m.start() # returns 4
m = p.search("abc") # returns None
// JavaScript
var p = /colou?r/;
"The color green".search(p); // returns 4
"abc".search(p); // returns -1
If you want to know if an entire string matches a pattern, define the pattern with ^ and $, or with \A and \Z. In Java, you can call matches() instead of find().
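In Python, the counterpart of Java's `matches()` is `fullmatch()`. The sketch below contrasts searching with whole-string matching, using both `fullmatch()` and explicit `\A` and `\Z` anchors; the sample strings are chosen for illustration.

```python
import re

p = re.compile(r"colou?r")

# search() succeeds if the pattern matches anywhere in the string.
print(p.search("The color green") is not None)     # True

# fullmatch() succeeds only if the entire string matches.
print(p.fullmatch("The color green") is not None)  # False
print(p.fullmatch("colour") is not None)           # True

# The same effect with explicit anchors in the pattern itself.
anchored = re.compile(r"\Acolou?r\Z")
print(anchored.search("colour") is not None)       # True
print(anchored.search("colours") is not None)      # False
```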
After doing a match against a pattern, most regex engines will return you a bundle of information, including the part of the text that matched the pattern, the index within the string where the match begins, the text before the matched text, and the text after the matched text.
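In Python this bundle is a match object. It has no dedicated "text before" or "text after" accessors, so this sketch derives them by slicing the original string.

```python
import re

text = "The color green"
m = re.search(r"colou?r", text)

matched = m.group()         # the part of the text that matched
start, end = m.span()       # where the match begins and ends
before = text[:m.start()]   # the text before the match
after = text[m.end():]      # the text after the match

print(repr(matched), start, end, repr(before), repr(after))
# 'color' 4 9 'The ' ' green'
```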
If you put parentheses around parts of the pattern, the engine will capture the text matched by each parenthesized group so that you can extract it. Extraction is very expensive! If you need parentheses for grouping but do not need to capture, you should use a non-capturing group, written (?:...) in most syntaxes.
# Ruby
>> phone = /((\d{3})(?:\.|-))?(\d{3})(?:\.|-)(\d{4})/
=> /((\d{3})(?:\.|-))?(\d{3})(?:\.|-)(\d{4})/
>> phone =~ 'Call 555-1212 for info'
=> 5
>> [$`, $&, $', $1, $2, $3, $4, $5]
=> ["Call ", "555-1212", " for info", nil, nil, "555", "1212", nil]
>> phone =~ '800.221.9989'
=> 0
>> [$`, $&, $', $1, $2, $3, $4, $5]
=> ["", "800.221.9989", "", "800.", "800", "221", "9989", nil]
>> phone =~ '1800.221.9989'
=> 1
>> [$`, $&, $', $1, $2, $3, $4, $5]
=> ["1", "800.221.9989", "", "800.", "800", "221", "9989", nil]
// JavaScript
s.search(/((\d{3})(?:\.|-))?(\d{3})(?:\.|-)(\d{4})/);
TODO
// Java
TODO
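For comparison, here is a sketch of the same phone number pattern in Python. The capturing groups are numbered by the position of their opening parenthesis, and the non-capturing `(?:...)` groups do not show up in the results.

```python
import re

phone = re.compile(r"((\d{3})(?:\.|-))?(\d{3})(?:\.|-)(\d{4})")

m = phone.search("Call 555-1212 for info")
print(m.start())   # 5
print(m.group())   # 555-1212
print(m.groups())  # (None, None, '555', '1212')

m = phone.search("800.221.9989")
print(m.start())   # 0
print(m.groups())  # ('800.', '800', '221', '9989')
```

Groups that did not participate in the match, such as the optional area code in the first call, come back as `None`.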
A modifier affects the way the rest of the regex is interpreted.
| Modifier | Meaning |
|---|---|
| g | global |
| i | ignore case |
| m | multiple line |
| s | single line (DOTALL): Means that the dot matches any character at all. Without this modifier, the dot matches any character except a newline. |
| x | ignore whitespace in the pattern |
| d | Unix line mode: Considers only U+000A as a line separator, rather than U+000D or the U+000D/U+000A combo or even U+2028. |
| u | Unicode case: in this mode the case-insensitive modifier respects Unicode cases; outside of this mode that modifier only consolidates cases of US-ASCII characters. |
Examples:
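Here is a sketch of how some of these modifiers look in Python, where they are passed as flags to the `re` functions or written inline as `(?i)`, `(?m)`, and so on. Python has no `g` flag (methods such as `findall` cover that use), and the `d` and `u` modifiers described above are not shown here.

```python
import re

# i: ignore case
print(re.search(r"colou?r", "COLOUR", re.IGNORECASE) is not None)  # True

# m: ^ and $ also match at the start and end of each line
print(re.findall(r"^dog$", "cat\ndog\nbird", re.MULTILINE))  # ['dog']

# s: the dot matches newlines as well
print(re.search(r"b..b", "b\n\nb") is not None)              # False
print(re.search(r"b..b", "b\n\nb", re.DOTALL) is not None)   # True

# x: whitespace and comments inside the pattern are ignored
zip_code = re.compile(r"""
    \d{5}        # five digits
    (-\d{4})?    # optional four-digit extension
""", re.VERBOSE)
print(zip_code.search("CA 90045-2659").group())  # 90045-2659

# The same modifier written inline at the start of the pattern
print(re.search(r"(?i)colou?r", "Colour") is not None)  # True
```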
TODO
You should know some things about how your regex engine works, since two "equivalent" regexes can have drastic differences in processing speed.
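One well-known illustration of this is nested quantifiers in a backtracking engine. The sketch below assumes Python's `re` module, which uses backtracking; the two patterns accept the same strings, but the nested one explores vastly more ways to split up the run of a's before giving up. The input length and the helper name `time_search` are arbitrary choices for illustration, and actual timings will vary by machine and engine.

```python
import re
import time

# A run of a's followed by a character that makes the match fail.
text = "a" * 18 + "!"

def time_search(pattern, text):
    """Return the search result and the elapsed time in seconds."""
    start = time.perf_counter()
    result = re.search(pattern, text)
    return result, time.perf_counter() - start

fast_result, fast_time = time_search(r"a+$", text)
slow_result, slow_time = time_search(r"(a+)+$", text)

# Neither pattern matches, but the nested quantifier takes far longer to fail.
print(fast_result, slow_result)
print(f"a+$    : {fast_time:.6f} s")
print(f"(a+)+$ : {slow_time:.6f} s")
```

Each additional `a` in the input roughly doubles the work for the nested pattern, so lengthen the input with care.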
Here are some things that are generally evil:
Some languages have convenience methods that let you use a string instead of a regex; these methods will compile a new pattern behind the scenes. While convenient, never use these shortcuts if you need to use the pattern over and over again!
// Java shortcut, should not be used in most circumstances
s.matches("colou?r");
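The alternative is to compile the pattern once and reuse it. Here is a minimal sketch in Python; the names `COLOR` and `count_matching_lines` are made up for illustration. Python's `re` module keeps a cache of recently compiled patterns, so the penalty for the string shortcuts is smaller there than in Java, but compiling explicitly still states the intent and avoids the cache lookup.

```python
import re

# Compiled once, when the module is loaded, and reused for every line.
COLOR = re.compile(r"colou?r")

def count_matching_lines(lines):
    """Count the lines that contain "color" or "colour"."""
    return sum(1 for line in lines if COLOR.search(line))

lines = ["The color green", "abc", "What colour?"]
print(count_matching_lines(lines))  # 2
```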
TODO
TODO
TODO - show lots of examples and tips here
In Ruby, there are three ways to construct a Regexp object:
Regexp.new('z\d+\s*abc[xy]+$')
/z\d+\s*abc[xy]+$/
%r{z\d+\s*abc[xy]+$}
In JavaScript, there are two ways to create a regex:
re = /a+bc/;
re = new RegExp("a+bc");
JavaScript doesn't have as rich a regex language as most other languages, but it has enough for most cases. The complete reference is here.
Methods:
Here are some good sources: