CS253: Software Development with C++

Spring 2021

Regular Expressions

Nomenclature

Regular expressions describe regular languages in formal language theory. They have the same expressive power as regular grammars.

Pattern Matching

% grep -i "Cornelius" ~cs253/pub/hamlet.txt
CORNELIUS	|
		POLONIUS, LAERTES, VOLTIMAND, CORNELIUS, Lords,
		You, good Cornelius, and you, Voltimand,
CORNELIUS	|
		[Exeunt VOLTIMAND and CORNELIUS]
		[Re-enter POLONIUS, with VOLTIMAND and CORNELIUS]
		[Exeunt VOLTIMAND and CORNELIUS]

In C++

const string home = getpwnam("cs253")->pw_dir;
ifstream play(home+"/pub/hamlet.txt");
for (string line; getline(play, line); )
    if (line.find("Cornelius") != string::npos)
        cout << line << '\n';
		You, good Cornelius, and you, Voltimand,

That’s only one match. Didn’t we see more than that?

Case-independence

const string home = getpwnam("cs253")->pw_dir;
ifstream play(home+"/pub/hamlet.txt");
for (string line; getline(play, line); )
    if (line.find("Cornelius") != string::npos ||
        line.find("cornelius") != string::npos ||
        line.find("CORNELIUS") != string::npos)
        cout << line << '\n';
CORNELIUS	|
		POLONIUS, LAERTES, VOLTIMAND, CORNELIUS, Lords,
		You, good Cornelius, and you, Voltimand,
CORNELIUS	|
		[Exeunt VOLTIMAND and CORNELIUS]
		[Re-enter POLONIUS, with VOLTIMAND and CORNELIUS]
		[Exeunt VOLTIMAND and CORNELIUS]

Not satisfied

Regular expressions

const string home = getpwnam("cs253")->pw_dir;
ifstream play(home+"/pub/hamlet.txt");
const regex r("Cornelius");     // Create the pattern

for (string line; getline(play, line); )
    if (regex_search(line, r))  // Search the line
        cout << line << '\n';
		You, good Cornelius, and you, Voltimand,

OK, but it’s not case-independent.

Regular expressions

const string home = getpwnam("cs253")->pw_dir;
ifstream play(home+"/pub/hamlet.txt");
const regex r("Cornelius", regex_constants::icase);

for (string line; getline(play, line); )
    if (regex_search(line, r))
        cout << line << '\n';
CORNELIUS	|
		POLONIUS, LAERTES, VOLTIMAND, CORNELIUS, Lords,
		You, good Cornelius, and you, Voltimand,
CORNELIUS	|
		[Exeunt VOLTIMAND and CORNELIUS]
		[Re-enter POLONIUS, with VOLTIMAND and CORNELIUS]
		[Exeunt VOLTIMAND and CORNELIUS]

Dialects

You’ve all learned English; did you learn:

Well, same with regular expressions. There are dialects.

Regular expression dialects

The second argument to the regex ctor is a bitmask of flags. regex_constants::icase indicates a case-independent pattern. You can also specify the regular expression dialect:

Flag                          Explanation
regex_constants::ECMAScript   ECMAScript (JavaScript) (default)
regex_constants::basic        Basic POSIX
regex_constants::extended     Extended POSIX
regex_constants::awk          Awk POSIX
regex_constants::grep         Grep POSIX
regex_constants::egrep        Egrep POSIX

Not filename patterns

I mean it!

Regular expressions are NOT filename patterns!

Regex components: Character classes

What          Description
.             any one char but \n
[a-fxy0-9]    any one of these, where - means a range
[^a-fxy0-9]   any char but one of these
\d or \D      digit: [0-9] or [^0-9]
\w or \W      word: [0-9a-zA-Z_] or [^0-9a-zA-Z_]
\s or \S      space: [ \t\n\r\f\v] or [^ \t\n\r\f\v]

These all match exactly one character. \w does not match an entire word; you need \w+ for that. \d matches a digit (6), not a number (42).

Regex components: Repetition

What     Description
*        0–∞ of previous (any number)
+        1–∞ of previous (many)
?        0–1 of previous (optional)
{17}     17 of previous
{3,8}    3–8 of previous
{0,9}    0–9 of previous (the default ECMAScript dialect requires the lower bound)
{12,}    12–∞ of previous

These modify what came before. * on its own doesn’t match anything, but a* matches any number of a characters.

Regex components: Grouping

What   Description
|      alternation
()     grouping & capturing

These are used for choices. Consider (Abe|Abraham) Lincoln. Without the (), the pattern Abe|Abraham Lincoln would match either “Abe” or “Abraham Lincoln”, but not the whole string “Abe Lincoln”.

You can also refer back to the text captured by () with \1, \2, …. For example, ([a-z])\1 matches doubled letters. This is called a backreference.

Regex components: Assertions

What       Description
\b or \B   word boundary or not
^          beginning of line
$          end of line

These match a zero-length string, but only at certain places. ^ matches a zero-length string at the start of a line (string). It does not match the first character.

\b matches the beginning or end of a word, that is, the transition between \w and \W, or between \W and \w.

Regex components: Inherited from string syntax

What        Description
\t          tab
\n          newline
\v          vertical tab
\f          form feed
\r          carriage return
\0digits    octal number
\xdigits    hexadecimal number
\udigits    Unicode code point

Examples

The matched text is shown in [brackets].

Pattern   What it matches   Explanation
b         a[b]racadabra     Take the first match
ac        abr[ac]adabra     A plain-text string matches itself
^abra     [abra]cadabra     ^ matches start of string/line
abra$     abracad[abra]     $ matches end of string/line
ca.       abra[cad]abra     Any single character
r.*b      ab[racadab]ra     * modifies . to match any string (greedy)
ac.+a     abr[acadabra]     + must match at least one
cx?a      abra[ca]dabra     ? matches zero or one

Examples

The matched text is shown in [brackets].

Pattern               What it matches
[a-fXY0-9]            My [d]og has fleas.
[^a-fXY0-9]           Y[o]ur dog has fleas.
flea|tick             My dog has [flea]s.
(My|Your) (dog|cat)   [My dog] has fleas.
\bDogg\b              Snoop Doggy [Dogg] has fleas.
\d                    File your [1]040 form!
\s                    File[ ]your 1040 form!
\w+                   [File] your 1040 form!

Construction

To use a regular expression, construct a regex object:

regex r("^(Ben(jamin)?\\s+)?Franklin$");  // double \ to get it into string

If your regular expression is syntactically incorrect, it lets you know:

regex r("abc(def");
terminate called after throwing an instance of 'std::regex_error'
  what():  Parenthesis is not closed.
SIGABRT: Aborted

Match a number

Let’s try to match a number:

const regex r("[0-9]");

cout << boolalpha << regex_search("123", r) << '\n';
true

Hooray, it worked!

Well, perhaps a bit more testing might be worthwhile …

Match a number

Let’s try to match a number:

const regex r("[0-9]");

cout << boolalpha
     << regex_search("123",    r) << '\n'
     << regex_search("ab45xy", r) << '\n'
     << regex_search("Bjarne", r) << '\n';
true
true
false

Testing—what a concept! Not very DRY, though.

Match a number

Let’s try to match a number:

const regex r("[0-9]");

for (auto s : {"123", "ab45xy", "Bjarne"})
    cout << setw(10) << left << s
         << boolalpha << regex_search(s, r) << '\n';
123       true
ab45xy    true
Bjarne    false

OK, now it’s DRY. Why does ab45xy succeed?

Match a number

Add *:

const regex r("[0-9]*");

for (auto s : {"123", "ab45xy", "Bjarne"})
    cout << setw(10) << left << s
         << boolalpha << regex_search(s, r) << '\n';
123       true
ab45xy    true
Bjarne    true

Huh—that got worse. Why did "Bjarne" succeed?

Match a number

Add +:

const regex r("[0-9]+");

for (auto s : {"123", "ab45xy", "Bjarne"})
    cout << setw(10) << left << s
         << boolalpha << regex_search(s, r) << '\n';
123       true
ab45xy    true
Bjarne    false

At least we got rid of Bjarne.

Problem is, we haven’t told the regex that it has to match the whole line. It’s happy just matching part of the line.

Match a number

Anchored:

const regex r("^[0-9]+$");

for (auto s : {"123", "ab45xy", "Bjarne"})
    cout << setw(10) << left << s
         << boolalpha << regex_search(s, r) << '\n';
123       true
ab45xy    false
Bjarne    false

Now it has to match the entire line, since ^ only matches at the start of the string, and $ only matches at the end of the string.

Match a number

How about floating-point?

const regex r("^[0-9]+$");

for (auto s : {"123", "45.67", "ab45xy", "Bjarne"})
    cout << setw(10) << left << s
         << boolalpha << regex_search(s, r) << '\n';
123       true
45.67     false
ab45xy    false
Bjarne    false

Match a number

Need to add the decimal point:

const regex r("^[0-9.]+$");

for (auto s : {"123", "45.67", "ab45xy", "Bjarne"})
    cout << setw(10) << left << s
         << boolalpha << regex_search(s, r) << '\n';
123       true
45.67     true
ab45xy    false
Bjarne    false

Match a number

We might be too liberal, now:

const regex r("^[0-9.]+$");

for (auto s : {"123", "45.67", "78.", ".89", ".",
               "127.0.0.1", "ab45xy", "Bjarne"})
    cout << setw(10) << left << s
         << boolalpha << regex_search(s, r) << '\n';
123       true
45.67     true
78.       true
.89       true
.         true
127.0.0.1 true
ab45xy    false
Bjarne    false

Match a number

Let’s insist on digits point digits:

const regex r("^[0-9]+\\.[0-9]+$");

for (auto s : {"123", "45.67", "78.", ".89", ".",
               "127.0.0.1", "ab45xy", "Bjarne"})
    cout << setw(10) << left << s
         << boolalpha << regex_search(s, r) << '\n';
123       false
45.67     true
78.       false
.89       false
.         false
127.0.0.1 false
ab45xy    false
Bjarne    false

Why the double backslash?

Match a number

No, the parts should be optional:

const regex r("^[0-9]*\\.?[0-9]*$");

for (auto s : {"123", "45.67", "78.", ".89", ".",
               "127.0.0.1", "ab45xy", "Bjarne"})
    cout << setw(10) << left << s
         << boolalpha << regex_search(s, r) << '\n';
123       true
45.67     true
78.       true
.89       true
.         true
127.0.0.1 false
ab45xy    false
Bjarne    false

Match a number

Let’s stop hacking and design. A number is one of:

digits
digits, a point, and optional digits
a point followed by digits

We express alternation with |.

Match a number

const regex r("^([0-9]+|[0-9]+\\.[0-9]*|\\.[0-9]+)$");

for (auto s : {"123", "45.67", "78.", ".89", ".",
               "127.0.0.1", "ab45xy", "Bjarne"})
    cout << setw(10) << left << s
         << boolalpha << regex_search(s, r) << '\n';
123       true
45.67     true
78.       true
.89       true
.         false
127.0.0.1 false
ab45xy    false
Bjarne    false

Match a number

Combine the first two cases:

const regex r("^([0-9]+(\\.[0-9]*)?|\\.[0-9]+)$");

for (auto s : {"123", "45.67", "78.", ".89", ".",
               "127.0.0.1", "ab45xy", "Bjarne"})
    cout << setw(10) << left << s
         << boolalpha << regex_search(s, r) << '\n';
123       true
45.67     true
78.       true
.89       true
.         false
127.0.0.1 false
ab45xy    false
Bjarne    false

Match a number

Let’s use \d instead of [0-9]:

const regex r("^(\\d+(\\.\\d*)?|\\.\\d+)$");

for (auto s : {"123", "45.67", "78.", ".89", ".",
               "127.0.0.1", "ab45xy", "Bjarne"})
    cout << setw(10) << left << s
         << boolalpha << regex_search(s, r) << '\n';
123       true
45.67     true
78.       true
.89       true
.         false
127.0.0.1 false
ab45xy    false
Bjarne    false

Match a number

Those double backslashes are hideous. Use a raw string, which works like this:
R"(stuff-taken-literally-even-backslashes)"

const regex r(R"(^(\d+(\.\d*)?|\.\d+)$)");

for (auto s : {"123", "45.67", "78.", ".89", ".",
               "127.0.0.1", "ab45xy", "Bjarne"})
    cout << setw(10) << left << s
         << boolalpha << regex_search(s, r) << '\n';
123       true
45.67     true
78.       true
.89       true
.         false
127.0.0.1 false
ab45xy    false
Bjarne    false

Match a number

Should’ve used regex_match() instead of regex_search(); regex_match() matches the entire string. Now we don’t need ^$ and the parentheses:

const regex r(R"(\d+(\.\d*)?|\.\d+)");

for (auto s : {"123", "45.67", "78.", ".89", ".",
               "127.0.0.1", "ab45xy", "Bjarne"})
    cout << setw(10) << left << s
         << boolalpha << regex_match(s, r) << '\n';
123       true
45.67     true
78.       true
.89       true
.         false
127.0.0.1 false
ab45xy    false
Bjarne    false

Capturing Match

string in = "My dog Kokopelli is a Chihuahua-terror";
regex r("(\\S+) is a (.*)");

if (smatch sm; regex_search(in, sm, r))
    cout << "All:   " << sm[0] << '\n'
         << "Name:  " << sm[1] << '\n'
         << "Breed: " << sm[2] << '\n';
else
    cout << "No match\n";
All:   Kokopelli is a Chihuahua-terror
Name:  Kokopelli
Breed: Chihuahua-terror

Contractions

Match contractions

const string s = "Can't feed y'all before three o'clock!";
const regex r("[a-z]+'[a-z]+");

cout << boolalpha
     << regex_search(s, r) << '\n';
true

Match contractions

const string s = "Can't feed y'all before three o'clock!";
const regex r("[a-z]+'[a-z]+");

sregex_iterator iter(s.begin(), s.end(), r);
sregex_iterator end;

for (; iter!=end; ++iter)
    cout << iter->position() << ": " << iter->str() << '\n';
1: an't
11: y'all
30: o'clock

Match contractions

Let’s add regex_constants::icase:

const string s = "Can't feed y'all before three o'clock!";
const regex r("[a-z]+'[a-z]+", regex_constants::icase);

sregex_iterator iter(s.begin(), s.end(), r);
sregex_iterator end;

for (; iter!=end; ++iter)
    cout << iter->position() << ": " << iter->str() << '\n';
0: Can't
11: y'all
30: o'clock

Match contractions

const string s = "Can't feed y'all before three o'clock!";
const regex r("[[:alpha:]]+'[[:alpha:]]+");  // no more icase

sregex_iterator iter(s.begin(), s.end(), r);
sregex_iterator end;

for (; iter!=end; ++iter)
    cout << iter->position() << ": " << iter->str() << '\n';
0: Can't
11: y'all
30: o'clock

Note that [[:alpha:]] is not [:alpha:]. There are two sets of square brackets. See https://cplusplus.com/reference/regex/ECMAScript/ for [[:upper:]], [[:xdigit:]], [[:punct:]], and other such character classes.

Crossword Puzzle

The website

has, believe it or not, regular expression crossword puzzles. It has to be seen to be believed!