CS253: Software Development with C++

Spring 2018

Regular Expressions

CS253 Regular Expressions

Nomenclature

The Unix command grep takes its name from the ed/ex editor command
    :g/re/p
which means to do a global match of all lines that match a given regular expression, and print those lines.

Wikipedia says:

Regular expressions describe regular languages in formal language theory. They have the same expressive power as regular grammars.

Pattern Matching

% grep -i "Cornelius" ~cs253/pub/hamlet.txt
CORNELIUS	|
		POLONIUS, LAERTES, VOLTIMAND, CORNELIUS, Lords,
		You, good Cornelius, and you, Voltimand,
CORNELIUS	|
		[Exeunt VOLTIMAND and CORNELIUS]
		[Re-enter POLONIUS, with VOLTIMAND and CORNELIUS]
		[Exeunt VOLTIMAND and CORNELIUS]

In C++

ifstream play("/s/bach/a/class/cs253/pub/hamlet.txt");
string line;
while (getline(play, line))
    if (line.find("Cornelius") != string::npos)
        cout << line << '\n';
		You, good Cornelius, and you, Voltimand,

That’s only one match. Didn’t we find more than that?

Case-independence

ifstream play("/s/bach/a/class/cs253/pub/hamlet.txt");
string line;
while (getline(play, line))
    if (line.find("Cornelius") != string::npos ||
        line.find("cornelius") != string::npos ||
        line.find("CORNELIUS") != string::npos)
        cout << line << '\n';
CORNELIUS	|
		POLONIUS, LAERTES, VOLTIMAND, CORNELIUS, Lords,
		You, good Cornelius, and you, Voltimand,
CORNELIUS	|
		[Exeunt VOLTIMAND and CORNELIUS]
		[Re-enter POLONIUS, with VOLTIMAND and CORNELIUS]
		[Exeunt VOLTIMAND and CORNELIUS]

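Not satisfied

Three hard-coded spellings still miss every other capitalization, such as "cORNELIUS", and listing each variant by hand doesn’t scale.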

Regular expressions

ifstream play("/s/bach/a/class/cs253/pub/hamlet.txt");
string line;
const regex r("Cornelius");     // Create the pattern

while (getline(play, line))
    if (regex_search(line, r))  // Search the line
        cout << line << '\n';
		You, good Cornelius, and you, Voltimand,

OK, but it’s not case-independent.

Regular expressions

ifstream play("/s/bach/a/class/cs253/pub/hamlet.txt");
string line;
const regex r("Cornelius", regex_constants::icase);

while (getline(play, line))
    if (regex_search(line, r))
        cout << line << '\n';
CORNELIUS	|
		POLONIUS, LAERTES, VOLTIMAND, CORNELIUS, Lords,
		You, good Cornelius, and you, Voltimand,
CORNELIUS	|
		[Exeunt VOLTIMAND and CORNELIUS]
		[Re-enter POLONIUS, with VOLTIMAND and CORNELIUS]
		[Exeunt VOLTIMAND and CORNELIUS]

Dialects

You’ve all learned English, but which dialect did you learn?

Well, same with regular expressions. There are dialects.

Basic components of regular expressions:

What          Description            What        Description
.             any one char but \n    |           alternation
[a-fxy0-9]    any one of these       ()          grouping
[^a-fxy0-9]   not one of these       \b          word boundary
*             0–∞ of previous        \d or \D    [0-9] or not
+             1–∞ of previous        \s or \S    [ \n\r…] or not
?             0–1 of previous        \w or \W    [0-9a-zA-Z_] or not
{17}          17 of previous         ^           beginning of line
{3,8}         3–8 of previous        $           end of line

Match a number

Let’s try to match a number:

const regex r("[0-9]");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';
true
true
false

Match a number

Add *:

const regex r("[0-9]*");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';
true
true
true

Huh—that got worse. Why did "Jack" succeed?

Match a number

Add +:

const regex r("[0-9]+");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';
true
true
false

At least we got rid of Jack.

Match a number

Anchored:

const regex r("^[0-9]+$");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';
true
false
false

Match a number

How about floating-point?

const regex r("^[0-9]+$");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("45.67",     r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';
true
false
false
false

Match a number

Need to add the decimal point:

const regex r("^[0-9.]+$");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("45.67",     r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';
true
true
false
false

Match a number

We might be too liberal, now:

const regex r("^[0-9.]+$");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("45.67",     r) << '\n'
     << regex_search("78.",       r) << '\n'
     << regex_search(".89",       r) << '\n'
     << regex_search(".",         r) << '\n'
     << regex_search("127.0.0.1", r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';
true
true
true
true
true
true
false
false

Match a number

Let’s insist on digits point digits:

const regex r("^[0-9]+\\.[0-9]+$");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("45.67",     r) << '\n'
     << regex_search("78.",       r) << '\n'
     << regex_search(".89",       r) << '\n'
     << regex_search(".",         r) << '\n'
     << regex_search("127.0.0.1", r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';
false
true
false
false
false
false
false
false

Why the double backslash?

Match a number

No, the parts should be optional:

const regex r("^[0-9]*\\.?[0-9]*$");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("45.67",     r) << '\n'
     << regex_search("78.",       r) << '\n'
     << regex_search(".89",       r) << '\n'
     << regex_search(".",         r) << '\n'
     << regex_search("127.0.0.1", r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';
true
true
true
true
true
false
false
false

Match a number

We express alternation with |.

Match a number

const regex r("^([0-9]+|[0-9]+\\.[0-9]*|[0-9]*\\.[0-9]+)$");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("45.67",     r) << '\n'
     << regex_search("78.",       r) << '\n'
     << regex_search(".89",       r) << '\n'
     << regex_search(".",         r) << '\n'
     << regex_search("127.0.0.1", r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';
true
true
true
true
false
false
false
false

Match a number

Combine the first two cases:

const regex r("^([0-9]+(\\.[0-9]*)?|[0-9]*\\.[0-9]+)$");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("45.67",     r) << '\n'
     << regex_search("78.",       r) << '\n'
     << regex_search(".89",       r) << '\n'
     << regex_search(".",         r) << '\n'
     << regex_search("127.0.0.1", r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';
true
true
true
true
false
false
false
false

Match a number

Let’s use \d instead of [0-9]:

const regex r("^(\\d+(\\.\\d*)?|\\d*\\.\\d+)$");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("45.67",     r) << '\n'
     << regex_search("78.",       r) << '\n'
     << regex_search(".89",       r) << '\n'
     << regex_search(".",         r) << '\n'
     << regex_search("127.0.0.1", r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';
true
true
true
true
false
false
false
false

Match a number

Those double backslashes are hideous. Use a raw string:

const regex r(R"(^(\d+(\.\d*)?|\d*\.\d+)$)");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("45.67",     r) << '\n'
     << regex_search("78.",       r) << '\n'
     << regex_search(".89",       r) << '\n'
     << regex_search(".",         r) << '\n'
     << regex_search("127.0.0.1", r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';
true
true
true
true
false
false
false
false

Match a number

Should’ve used regex_match instead of regex_search; regex_match matches the entire string. Now we don’t need ^$ and the parentheses:

const regex r(R"(\d+(\.\d*)?|\d*\.\d+)");

cout << boolalpha
     << regex_match("123",       r) << '\n'
     << regex_match("45.67",     r) << '\n'
     << regex_match("78.",       r) << '\n'
     << regex_match(".89",       r) << '\n'
     << regex_match(".",         r) << '\n'
     << regex_match("127.0.0.1", r) << '\n'
     << regex_match("abc123def", r) << '\n'
     << regex_match("Jack",      r) << '\n';
true
true
true
true
false
false
false
false

Change of topic

Match contractions

const string s = "Can't feed y'all before three o'clock!";
const regex r("[a-z]+'[a-z]+");

cout << boolalpha
     << regex_search(s, r) << '\n';
true

Match contractions

const string s = "Can't feed y'all before three o'clock!";
const regex r("[a-z]+'[a-z]+");

sregex_iterator iter(s.begin(), s.end(), r);
sregex_iterator end;

for (; iter!=end; ++iter)
    cout << iter->position() << ": " << iter->str() << '\n';
1: an't
11: y'all
30: o'clock

Exactly what is iter iterating over?

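Match contractions

The first match was "an't", not "Can't": [a-z] doesn’t match the uppercase C. A character class such as [[:alpha:]] matches letters of either case.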

Match contractions

const string s = "Can't feed y'all before three o'clock!";
const regex r("[[:alpha:]]+'[[:alpha:]]+");

sregex_iterator iter(s.begin(), s.end(), r);
sregex_iterator end;

for (; iter!=end; ++iter)
    cout << iter->position() << ": " << iter->str() << '\n';
0: Can't
11: y'all
30: o'clock

Modified: 2018-04-25T14:34

Colorado State University, Fort Collins, CO 80523 USA
© 2018 Colorado State University