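The name grep comes from an editor command. In vi:
    :g/re/p
That is: globally search for the regular expression re and print each matching line.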
Wikipedia says:
Regular expressions describe regular languages in formal language theory. They have the same expressive power as regular grammars.
    % grep -i "Cornelius" ~cs253/pub/hamlet.txt
    CORNELIUS |
    POLONIUS, LAERTES, VOLTIMAND, CORNELIUS, Lords,
    You, good Cornelius, and you, Voltimand,
    CORNELIUS |
    [Exeunt VOLTIMAND and CORNELIUS]
    [Re-enter POLONIUS, with VOLTIMAND and CORNELIUS]
    [Exeunt VOLTIMAND and CORNELIUS]
    const string home = getpwnam("cs253")->pw_dir;
    ifstream play(home+"/pub/hamlet.txt");
    string line;
    while (getline(play, line))
        if (line.find("Cornelius") != string::npos)
            cout << line << '\n';

    You, good Cornelius, and you, Voltimand,
That’s only one match. Didn’t we find more than that?
    const string home = getpwnam("cs253")->pw_dir;
    ifstream play(home+"/pub/hamlet.txt");
    string line;
    while (getline(play, line))
        if (line.find("Cornelius") != string::npos ||
            line.find("cornelius") != string::npos ||
            line.find("CORNELIUS") != string::npos)
            cout << line << '\n';

    CORNELIUS |
    POLONIUS, LAERTES, VOLTIMAND, CORNELIUS, Lords,
    You, good Cornelius, and you, Voltimand,
    CORNELIUS |
    [Exeunt VOLTIMAND and CORNELIUS]
    [Re-enter POLONIUS, with VOLTIMAND and CORNELIUS]
    [Exeunt VOLTIMAND and CORNELIUS]
    const string home = getpwnam("cs253")->pw_dir;
    ifstream play(home+"/pub/hamlet.txt");
    string line;
    const regex r("Cornelius");         // Create the pattern
    while (getline(play, line))
        if (regex_search(line, r))      // Search the line
            cout << line << '\n';

    You, good Cornelius, and you, Voltimand,
OK, but it’s not case-independent.
    const string home = getpwnam("cs253")->pw_dir;
    ifstream play(home+"/pub/hamlet.txt");
    string line;
    const regex r("Cornelius", regex_constants::icase);
    while (getline(play, line))
        if (regex_search(line, r))
            cout << line << '\n';

    CORNELIUS |
    POLONIUS, LAERTES, VOLTIMAND, CORNELIUS, Lords,
    You, good Cornelius, and you, Voltimand,
    CORNELIUS |
    [Exeunt VOLTIMAND and CORNELIUS]
    [Re-enter POLONIUS, with VOLTIMAND and CORNELIUS]
    [Exeunt VOLTIMAND and CORNELIUS]
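The loop above is welded to one file. As a minimal sketch, the same search could be packaged as a function that reads from any input stream. The names `grep_icase` and `demo` are hypothetical, not part of the lecture, and the three sample lines are only stand-ins for the play.

```cpp
#include <istream>
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: return every line of `in` that contains a
// case-independent match for `pattern` anywhere in the line.
std::vector<std::string> grep_icase(std::istream &in, const std::string &pattern) {
    const std::regex r(pattern, std::regex_constants::icase);
    std::vector<std::string> hits;
    std::string line;
    while (std::getline(in, line))
        if (std::regex_search(line, r))
            hits.push_back(line);
    return hits;
}

// In-memory usage: an istringstream stands in for the file.
std::vector<std::string> demo() {
    std::istringstream play("You, good Cornelius, and you, Voltimand,\n"
                            "KING CLAUDIUS\n"
                            "[Exeunt VOLTIMAND and CORNELIUS]\n");
    return grep_icase(play, "cornelius");
}
```

Passing an `ifstream` instead of the `istringstream` should give the file-based behavior shown above.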
You’ve all learned English; did you learn:
Well, same with regular expressions. There are dialects.
| What | Description | What | Description |
|---|---|---|---|
| . | any one char but \n | \| | alternation |
| [a-fxy0-9] | any one of these | (…) | grouping |
| [^a-fxy0-9] | not one of these | \b | word boundary |
| * | 0–∞ of previous | \d or \D | [0-9] or not |
| + | 1–∞ of previous | \s or \S | [ \n\r…] or not |
| ? | 0–1 of previous | \w or \W | [0-9a-zA-Z_] or not |
| {17} | 17 of previous | ^ | beginning of line |
| {3,8} | 3–8 of previous | $ | end of line |
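A few rows of the table, tried out in code. This is only a sketch: the function names and the sample patterns are made up for illustration, and `has_word` assumes its `word` argument contains no regex metacharacters.

```cpp
#include <regex>
#include <string>

// \b: true if `text` contains `word` as a whole word.
// Assumes `word` has no special regex characters in it.
bool has_word(const std::string &text, const std::string &word) {
    const std::regex r("\\b" + word + "\\b");
    return std::regex_search(text, r);
}

// | and ( ) and ?: "cat" or "dog", optionally followed by "s".
bool is_pet(const std::string &s) {
    const std::regex r("(cat|dog)s?");
    return std::regex_match(s, r);
}

// [ ] and {3,8}: three to eight lowercase hex digits.
bool is_short_hex(const std::string &s) {
    const std::regex r("[0-9a-f]{3,8}");
    return std::regex_match(s, r);
}
```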
Let’s try to match a number:
    const regex r("[0-9]");
    cout << boolalpha << regex_search("123", r) << '\n';

    true
Hooray, it worked!
Well, perhaps a bit more testing might be worthwhile …
Let’s try to match a number:
    const regex r("[0-9]");
    cout << boolalpha
         << regex_search("123", r) << '\n'
         << regex_search("ab45xy", r) << '\n'
         << regex_search("Jack", r) << '\n';

    true
    true
    false
Testing—what a concept! Not very DRY, though.
Let’s try to match a number:
    const regex r("[0-9]");
    for (auto s : {"123", "ab45xy", "Jack"})
        cout << setw(10) << left << s << boolalpha << regex_search(s, r) << '\n';

    123       true
    ab45xy    true
    Jack      false

OK, now it’s DRY. Why does ab45xy succeed?
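One way to answer that question is to ask the library what it matched. A minimal sketch, with `first_match` as a hypothetical helper name: `regex_search` fills in a `std::smatch`, and `str()` returns the matched text.

```cpp
#include <regex>
#include <string>

// Return the text that regex_search actually matched,
// or an empty string if there was no match at all.
std::string first_match(const std::string &s, const std::string &pattern) {
    const std::regex r(pattern);
    std::smatch m;
    if (std::regex_search(s, m, r))
        return m.str();
    return "";
}
```

For `"ab45xy"` and the pattern `[0-9]`, this returns `"4"`: `regex_search` is happy with a match anywhere in the string, and one digit is all the pattern asks for.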
Add *:

    const regex r("[0-9]*");
    for (auto s : {"123", "ab45xy", "Jack"})
        cout << setw(10) << left << s << boolalpha << regex_search(s, r) << '\n';

    123       true
    ab45xy    true
    Jack      true

Huh, that got worse. Why did "Jack" succeed?
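A sketch that makes the culprit visible: report where the match starts and how long it is. The struct `Found` and the function `locate` are hypothetical names for this illustration.

```cpp
#include <cstddef>
#include <regex>
#include <string>

struct Found {
    bool matched;          // did regex_search succeed?
    std::size_t position;  // where the match starts
    std::size_t length;    // how many characters it covers
};

// Run regex_search and report where and how much it matched.
Found locate(const std::string &s, const std::string &pattern) {
    const std::regex r(pattern);
    std::smatch m;
    if (!std::regex_search(s, m, r))
        return {false, 0, 0};
    return {true,
            static_cast<std::size_t>(m.position()),
            static_cast<std::size_t>(m.length())};
}
```

`locate("Jack", "[0-9]*")` reports a match at position 0 with length 0. Zero digits is a perfectly good match for `*`, and there are zero digits at the start of every string.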
Add +:

    const regex r("[0-9]+");
    for (auto s : {"123", "ab45xy", "Jack"})
        cout << setw(10) << left << s << boolalpha << regex_search(s, r) << '\n';

    123       true
    ab45xy    true
    Jack      false
At least we got rid of Jack.
Anchored:
    const regex r("^[0-9]+$");
    for (auto s : {"123", "ab45xy", "Jack"})
        cout << setw(10) << left << s << boolalpha << regex_search(s, r) << '\n';

    123       true
    ab45xy    false
    Jack      false
How about floating-point?
    const regex r("^[0-9]+$");
    for (auto s : {"123", "45.67", "ab45xy", "Jack"})
        cout << setw(10) << left << s << boolalpha << regex_search(s, r) << '\n';

    123       true
    45.67     false
    ab45xy    false
    Jack      false
Need to add the decimal point:
    const regex r("^[0-9.]+$");
    for (auto s : {"123", "45.67", "ab45xy", "Jack"})
        cout << setw(10) << left << s << boolalpha << regex_search(s, r) << '\n';

    123       true
    45.67     true
    ab45xy    false
    Jack      false
We might be too liberal, now:
    const regex r("^[0-9.]+$");
    for (auto s : {"123", "45.67", "78.", ".89", ".", "127.0.0.1", "ab45xy", "Jack"})
        cout << setw(10) << left << s << boolalpha << regex_search(s, r) << '\n';

    123       true
    45.67     true
    78.       true
    .89       true
    .         true
    127.0.0.1 true
    ab45xy    false
    Jack      false
Let’s insist on digits point digits:
    const regex r("^[0-9]+\\.[0-9]+$");
    for (auto s : {"123", "45.67", "78.", ".89", ".", "127.0.0.1", "ab45xy", "Jack"})
        cout << setw(10) << left << s << boolalpha << regex_search(s, r) << '\n';

    123       false
    45.67     true
    78.       false
    .89       false
    .         false
    127.0.0.1 false
    ab45xy    false
    Jack      false
Why the double backslash?
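A sketch of what is going on: the C++ compiler processes backslashes in a string literal before the regex library ever sees the pattern, so `"\\."` arrives as the two characters `\` and `.`, which the regex then reads as a literal point. The function names below are made up for this illustration.

```cpp
#include <regex>
#include <string>

// What the regex library receives for the literal "\\.": two characters.
std::string escaped_point() { return "\\."; }

// Escaped point: only a real '.' may separate the digits.
bool digits_point_digits(const std::string &s) {
    const std::regex r("^[0-9]+\\.[0-9]+$");
    return std::regex_search(s, r);
}

// Unescaped point: '.' means "any one char", so anything may separate them.
bool digits_any_digits(const std::string &s) {
    const std::regex r("^[0-9]+.[0-9]+$");
    return std::regex_search(s, r);
}
```

Without the backslash, `"45x67"` is accepted, and so is `"123"`, with the `2` playing the part of the point.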
No, the parts should be optional:
    const regex r("^[0-9]*\\.?[0-9]*$");
    for (auto s : {"123", "45.67", "78.", ".89", ".", "127.0.0.1", "ab45xy", "Jack"})
        cout << setw(10) << left << s << boolalpha << regex_search(s, r) << '\n';

    123       true
    45.67     true
    78.       true
    .89       true
    .         true
    127.0.0.1 false
    ab45xy    false
    Jack      false
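Let’s stop hacking and design. The digits on either side of the . are optional, but a naked . is bad, so here are the possibilities:
- digits
- digits, a point, optional digits
- optional digits, a point, digits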
We express alternation with |.

    const regex r("^([0-9]+|[0-9]+\\.[0-9]*|[0-9]*\\.[0-9]+)$");
    for (auto s : {"123", "45.67", "78.", ".89", ".", "127.0.0.1", "ab45xy", "Jack"})
        cout << setw(10) << left << s << boolalpha << regex_search(s, r) << '\n';

    123       true
    45.67     true
    78.       true
    .89       true
    .         false
    127.0.0.1 false
    ab45xy    false
    Jack      false
Combine the first two cases:
    const regex r("^([0-9]+(\\.[0-9]*)?|[0-9]*\\.[0-9]+)$");
    for (auto s : {"123", "45.67", "78.", ".89", ".", "127.0.0.1", "ab45xy", "Jack"})
        cout << setw(10) << left << s << boolalpha << regex_search(s, r) << '\n';

    123       true
    45.67     true
    78.       true
    .89       true
    .         false
    127.0.0.1 false
    ab45xy    false
    Jack      false
Let’s use \d instead of [0-9]:

    const regex r("^(\\d+(\\.\\d*)?|\\d*\\.\\d+)$");
    for (auto s : {"123", "45.67", "78.", ".89", ".", "127.0.0.1", "ab45xy", "Jack"})
        cout << setw(10) << left << s << boolalpha << regex_search(s, r) << '\n';

    123       true
    45.67     true
    78.       true
    .89       true
    .         false
    127.0.0.1 false
    ab45xy    false
    Jack      false
Those double backslashes are hideous. Use a raw string:
    const regex r(R"(^(\d+(\.\d*)?|\d*\.\d+)$)");
    for (auto s : {"123", "45.67", "78.", ".89", ".", "127.0.0.1", "ab45xy", "Jack"})
        cout << setw(10) << left << s << boolalpha << regex_search(s, r) << '\n';

    123       true
    45.67     true
    78.       true
    .89       true
    .         false
    127.0.0.1 false
    ab45xy    false
    Jack      false
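A raw string changes only how the literal is spelled, not what the regex library receives. The sketch below writes the same pattern both ways. The second pair is an extra illustration that is not from the lecture: a pattern that itself contains `)"` needs a custom delimiter (here `re`, an arbitrary choice) so that the raw string doesn't end early.

```cpp
#include <string>

// The number pattern, spelled with escapes and as a raw string.
const std::string escaped = "^(\\d+(\\.\\d*)?|\\d*\\.\\d+)$";
const std::string raw     = R"(^(\d+(\.\d*)?|\d*\.\d+)$)";

// A pattern for a quoted, parenthesized number such as "(42)".
// It contains )" so the raw version uses the delimiter re.
const std::string quoted_escaped = "\"\\((\\d+)\\)\"";
const std::string quoted_raw     = R"re("\((\d+)\)")re";
```

Each pair compares equal, character for character.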
Should’ve used regex_match instead of regex_search; regex_match matches the entire string. Now we don’t need ^ and $, or the outer parentheses:

    const regex r(R"(\d+(\.\d*)?|\d*\.\d+)");
    for (auto s : {"123", "45.67", "78.", ".89", ".", "127.0.0.1", "ab45xy", "Jack"})
        cout << setw(10) << left << s << boolalpha << regex_match(s, r) << '\n';

    123       true
    45.67     true
    78.       true
    .89       true
    .         false
    127.0.0.1 false
    ab45xy    false
    Jack      false
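To see the difference side by side, here is a small sketch that runs the same unanchored pattern through both functions. The two function names are hypothetical.

```cpp
#include <regex>
#include <string>

// The final pattern from above, with no ^ or $.
const std::regex number(R"(\d+(\.\d*)?|\d*\.\d+)");

// regex_match: the pattern must account for the entire string.
bool whole_string_is_number(const std::string &s) {
    return std::regex_match(s, number);
}

// regex_search: the pattern may match any part of the string.
bool contains_number(const std::string &s) {
    return std::regex_search(s, number);
}
```

With `regex_search` and no anchors, `"127.0.0.1"` and `"ab45xy"` both succeed, because each contains a number somewhere. `regex_match` rejects them.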
    const string s = "Can't feed y'all before three o'clock!";
    const regex r("[a-z]+'[a-z]+");
    cout << boolalpha << regex_search(s, r) << '\n';

    true
    const string s = "Can't feed y'all before three o'clock!";
    const regex r("[a-z]+'[a-z]+");
    sregex_iterator iter(s.begin(), s.end(), r);
    sregex_iterator end;
    for (; iter!=end; ++iter)
        cout << iter->position() << ": " << iter->str() << '\n';

    1: an't
    11: y'all
    30: o'clock
What is iter iterating over?
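It iterates over matches, not characters: dereferencing an `sregex_iterator` gives a `std::smatch` describing one match. A minimal sketch that spells this out by collecting each match's position and text; `all_matches` is a hypothetical name.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Collect (position, text) for every non-overlapping match of `r` in `s`.
std::vector<std::pair<std::size_t, std::string>>
all_matches(const std::string &s, const std::regex &r) {
    std::vector<std::pair<std::size_t, std::string>> out;
    // A default-constructed sregex_iterator serves as the end marker.
    for (std::sregex_iterator it(s.begin(), s.end(), r), end; it != end; ++it) {
        const std::smatch &m = *it;     // one match per step
        out.emplace_back(static_cast<std::size_t>(m.position()), m.str());
    }
    return out;
}
```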
Let’s add regex_constants::icase:

    const string s = "Can't feed y'all before three o'clock!";
    const regex r("[a-z]+'[a-z]+", regex_constants::icase);
    sregex_iterator iter(s.begin(), s.end(), r);
    sregex_iterator end;
    for (; iter!=end; ++iter)
        cout << iter->position() << ": " << iter->str() << '\n';

    0: Can't
    11: y'all
    30: o'clock
Good, that found Can’t. But the flag only hides the real problem.
[a-z] is a problem. What does it mean?
    const string s = "Can't feed y'all before three o'clock!";
    const regex r("[[:alpha:]]+'[[:alpha:]]+");     // no more icase
    sregex_iterator iter(s.begin(), s.end(), r);
    sregex_iterator end;
    for (; iter!=end; ++iter)
        cout << iter->position() << ": " << iter->str() << '\n';

    0: Can't
    11: y'all
    30: o'clock
Note that [[:alpha:]] is not [:alpha:]. There are two sets of square brackets.
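A sketch of why the second set of brackets matters. With only one set, `[:alpha:]` is just an ordinary bracket expression listing the characters `:`, `a`, `l`, `p`, and `h`. The helper `matches` is a hypothetical name, and the single-bracket results are what I would expect from the default ECMAScript grammar.

```cpp
#include <regex>
#include <string>

// Does the whole of `s` match `pattern`?
bool matches(const std::string &s, const std::string &pattern) {
    return std::regex_match(s, std::regex(pattern));
}
```

So `matches("Q", "[[:alpha:]]")` is true, while `matches("Q", "[:alpha:]")` is false and `matches("p", "[:alpha:]")` is true. Classes can share a bracket expression with other things: `[[:alpha:][:digit:]]+` accepts `"x7"`.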
See http://www.cplusplus.com/reference/regex/ECMAScript/ for other such things.