CS253: Software Development with C++

Spring 2017

Regular Expressions

Pattern Matching

    % grep "Osric" ~cs253/pub/hamlet.txt
            Osric, who brings back to him that you attend him in
    KING CLAUDIUS	Give them the foils, young Osric. Cousin Hamlet,
    LAERTES	Why, as a woodcock to mine own springe, Osric;

    % grep -c "Osric" ~cs253/pub/hamlet.txt
    3

    % grep -ic "Osric" ~cs253/pub/hamlet.txt
    32

In C++

ifstream play("/s/bach/a/class/cs253/pub/hamlet.txt");
string line;
while (getline(play, line))
    if (line.find("Osric") != string::npos)
        cout << line << '\n';
		Osric, who brings back to him that you attend him in
KING CLAUDIUS	Give them the foils, young Osric. Cousin Hamlet,
LAERTES		Why, as a woodcock to mine own springe, Osric;

That’s only three matches. Didn’t grep -ic report 32?

Case-independence

ifstream play("/s/bach/a/class/cs253/pub/hamlet.txt");
string line;
while (getline(play, line))
    if (line.find("Osric") != string::npos ||
        line.find("osric") != string::npos ||
        line.find("OSRIC") != string::npos)
        cout << line << '\n';
OSRIC		|
		[Enter OSRIC]
OSRIC		Your lordship is right welcome back to Denmark.
OSRIC		Sweet lord, if your lordship were at leisure, I
OSRIC		I thank your lordship, it is very hot.
OSRIC		It is indifferent cold, my lord, indeed.
OSRIC		Exceedingly, my lord; it is very sultry,--as
OSRIC		Nay, good my lord; for mine ease, in good faith.
OSRIC		Your lordship speaks most infallibly of him.
OSRIC		Sir?
OSRIC		Of Laertes?
OSRIC		I know you are not ignorant--
OSRIC		You are not ignorant of what excellence Laertes is--
OSRIC		I mean, sir, for his weapon; but in the imputation
OSRIC		Rapier and dagger.
OSRIC		The king, sir, hath wagered with him six Barbary
OSRIC		The carriages, sir, are the hangers.
OSRIC		The king, sir, hath laid, that in a dozen passes
OSRIC		I mean, my lord, the opposition of your person in trial.
OSRIC		Shall I re-deliver you e'en so?
OSRIC		I commend my duty to your lordship.
		[Exit OSRIC]
		Osric, who brings back to him that you attend him in
		Lords, OSRIC, and Attendants with foils, &c]
KING CLAUDIUS	Give them the foils, young Osric. Cousin Hamlet,
OSRIC		Ay, my good lord.
OSRIC		A hit, a very palpable hit.
OSRIC		Nothing, neither way.
OSRIC				  Look to the queen there, ho!
OSRIC		How is't, Laertes?
LAERTES		Why, as a woodcock to mine own springe, Osric;
OSRIC		Young Fortinbras, with conquest come from Poland,

Not satisfied

That’s better, but it’s not truly case-independent. What about “OsRiC” or “oSRIc”? Each of the five letters can be upper or lower case, so there are 2⁵, or 32, combinations, making the code quite tedious:

if (line.find("osric") != string::npos || line.find("osriC") != string::npos ||
    line.find("osrIc") != string::npos || line.find("osrIC") != string::npos ||
    line.find("osRic") != string::npos || line.find("osRiC") != string::npos ||
    line.find("osRIc") != string::npos || line.find("osRIC") != string::npos ||
    line.find("oSric") != string::npos || line.find("oSriC") != string::npos ||
    line.find("oSrIc") != string::npos || line.find("oSrIC") != string::npos ||
    line.find("oSRic") != string::npos || line.find("oSRiC") != string::npos ||
    line.find("oSRIc") != string::npos || line.find("oSRIC") != string::npos ||
    line.find("Osric") != string::npos || line.find("OsriC") != string::npos ||
    line.find("OsrIc") != string::npos || line.find("OsrIC") != string::npos ||
    line.find("OsRic") != string::npos || line.find("OsRiC") != string::npos ||
    line.find("OsRIc") != string::npos || line.find("OsRIC") != string::npos ||
    line.find("OSric") != string::npos || line.find("OSriC") != string::npos ||
    line.find("OSrIc") != string::npos || line.find("OSrIC") != string::npos ||
    line.find("OSRic") != string::npos || line.find("OSRiC") != string::npos ||
    line.find("OSRIc") != string::npos || line.find("OSRIC") != string::npos)

Regular expressions

ifstream play("/s/bach/a/class/cs253/pub/hamlet.txt");
string line;
const regex r("Osric");         // Create the pattern

while (getline(play, line))
    if (regex_search(line, r))  // Search the line
        cout << line << '\n';
		Osric, who brings back to him that you attend him in
KING CLAUDIUS	Give them the foils, young Osric. Cousin Hamlet,
LAERTES		Why, as a woodcock to mine own springe, Osric;

OK, but it’s not case-independent.

Regular expressions

ifstream play("/s/bach/a/class/cs253/pub/hamlet.txt");
string line;
const regex r("Osric", regex_constants::icase);

while (getline(play, line))
    if (regex_search(line, r))
        cout << line << '\n';
OSRIC		|
		[Enter OSRIC]
OSRIC		Your lordship is right welcome back to Denmark.
OSRIC		Sweet lord, if your lordship were at leisure, I
OSRIC		I thank your lordship, it is very hot.
OSRIC		It is indifferent cold, my lord, indeed.
OSRIC		Exceedingly, my lord; it is very sultry,--as
OSRIC		Nay, good my lord; for mine ease, in good faith.
OSRIC		Your lordship speaks most infallibly of him.
OSRIC		Sir?
OSRIC		Of Laertes?
OSRIC		I know you are not ignorant--
OSRIC		You are not ignorant of what excellence Laertes is--
OSRIC		I mean, sir, for his weapon; but in the imputation
OSRIC		Rapier and dagger.
OSRIC		The king, sir, hath wagered with him six Barbary
OSRIC		The carriages, sir, are the hangers.
OSRIC		The king, sir, hath laid, that in a dozen passes
OSRIC		I mean, my lord, the opposition of your person in trial.
OSRIC		Shall I re-deliver you e'en so?
OSRIC		I commend my duty to your lordship.
		[Exit OSRIC]
		Osric, who brings back to him that you attend him in
		Lords, OSRIC, and Attendants with foils, &c]
KING CLAUDIUS	Give them the foils, young Osric. Cousin Hamlet,
OSRIC		Ay, my good lord.
OSRIC		A hit, a very palpable hit.
OSRIC		Nothing, neither way.
OSRIC				  Look to the queen there, ho!
OSRIC		How is't, Laertes?
LAERTES		Why, as a woodcock to mine own springe, Osric;
OSRIC		Young Fortinbras, with conquest come from Poland,
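
The same search can be tried without access to hamlet.txt. This is only a sketch: it reads from any std::istream, and both the helper name count_matching_lines and the in-memory sample text (three lines copied from the output above, plus one made-up filler line) are assumptions for illustration.

```cpp
#include <cstddef>
#include <istream>
#include <regex>
#include <sstream>
#include <string>

// Hypothetical helper: count the lines of `in` that contain a match for `r`.
std::size_t count_matching_lines(std::istream &in, const std::regex &r) {
    std::size_t count = 0;
    std::string line;
    while (std::getline(in, line))
        if (std::regex_search(line, r))   // a match anywhere in the line counts
            ++count;
    return count;
}

// Hypothetical sample: three lines from the output above and one filler line.
const char *const sample =
    "OSRIC\t\tSir?\n"
    "\t\t[Exit OSRIC]\n"
    "KING CLAUDIUS\tGive them the foils, young Osric. Cousin Hamlet,\n"
    "a filler line that does not mention him\n";
```

Wrapping sample in a std::istringstream and passing regex("Osric") counts only the KING CLAUDIUS line. Adding regex_constants::icase counts all three lines that name him.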

Basic components of regular expressions:

What          Description            What        Description
.             any one char but \n    |           alternation
[a-fxy0-9]    any one of these       ()          grouping
[^a-fxy0-9]   not one of these       \b          word boundary
*             0–∞ of previous        \d or \D    [0-9] or not
+             1–∞ of previous        \s or \S    [ \n\r…] or not
?             0–1 of previous        \w or \W    [0-9a-zA-Z_] or not
{17}          17 of previous         ^           beginning of line
{3,8}         3–8 of previous        $           end of line
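
A few of these components can be tried out with a small helper. This is a minimal sketch that assumes the default (ECMAScript) grammar of std::regex; the names found and show_examples are hypothetical, chosen only for this illustration.

```cpp
#include <iostream>
#include <regex>
#include <string>

// Hypothetical helper: true if `pattern` matches somewhere in `text`.
bool found(const std::string &text, const std::string &pattern) {
    return std::regex_search(text, std::regex(pattern));
}

// Print one result per component tried.
void show_examples() {
    std::cout << std::boolalpha
              << found("cat", "c.t") << '\n'                  // . stands for any one char
              << found("x7", "^[a-fxy0-9]{2}$") << '\n'       // character class, exactly 2
              << found("a cat sat", R"(\bcat\b)") << '\n'     // word boundaries around cat
              << found("concatenate", R"(\bcat\b)") << '\n'   // cat is inside a word here
              << found("color", "^colou?r$") << '\n'          // ? makes the u optional
              << found("ab", "^(a|b){3,8}$") << '\n';         // only 2 chars; needs 3 to 8
}
```

Only the concatenate and "ab" cases fail to match: cat has no word boundary on either side inside concatenate, and "ab" is one character short of the {3,8} minimum.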

Match a number

Let’s try to match a number:

const regex r("[0-9]");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';
true
true
false

Match a number

Add *:

const regex r("[0-9]*");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';
true
true
true

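Huh, that got worse. * accepts zero digits, so the pattern matches the empty string, and every input contains the empty string.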
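
To see what [0-9]* actually matched, ask for the match itself rather than just true or false. This sketch uses the std::smatch overload of regex_search; the helper name first_match_length is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: length of the first match of `pattern` in `text`,
// or -1 if there is no match at all.
long first_match_length(const std::string &text, const std::string &pattern) {
    std::smatch m;
    if (!std::regex_search(text, m, std::regex(pattern)))
        return -1;
    return static_cast<long>(m.length(0));   // group 0 is the whole match
}
```

With "[0-9]*", the match in "Jack" has length 0. So does the match in "abc123def": the search stops at the leftmost match, which is the empty string before the a. With "[0-9]+", "Jack" gives -1.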

Match a number

Add +:

const regex r("[0-9]+");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';
true
true
false

Match a number

Anchored:

const regex r("^[0-9]+$");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';
true
false
false

Match a number

How about floating-point?

const regex r("^[0-9]+$");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("45.67",     r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';
true
false
false
false

Match a number

Need to add the decimal point:

const regex r("^[0-9.]+$");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("45.67",     r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';
true
true
false
false

Match a number

We might be too liberal now:

const regex r("^[0-9.]+$");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("45.67",     r) << '\n'
     << regex_search("78.",       r) << '\n'
     << regex_search(".89",       r) << '\n'
     << regex_search(".",         r) << '\n'
     << regex_search("127.0.0.1", r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';
true
true
true
true
true
true
false
false

Match a number

Let’s insist on digits, a point, then digits:

const regex r("^[0-9]+\\.[0-9]+$");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("45.67",     r) << '\n'
     << regex_search("78.",       r) << '\n'
     << regex_search(".89",       r) << '\n'
     << regex_search(".",         r) << '\n'
     << regex_search("127.0.0.1", r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';
false
true
false
false
false
false
false
false

Why the double backslash?
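
One way to answer that: the C++ compiler handles backslashes in an ordinary string literal before the regex library ever sees the pattern. "\\." in the source becomes the two characters \ and . in the string, which the regex then reads as a literal period. The sketch below checks this and contrasts it with an unescaped period; the names escaped_dot, is_digits_point_digits and is_digits_any_digits are hypothetical.

```cpp
#include <regex>
#include <string>

// The literal "\\." holds two characters: a backslash and a period.
const std::string escaped_dot = "\\.";

// Hypothetical helper: digits, a literal period, then digits.
bool is_digits_point_digits(const std::string &text) {
    static const std::regex r("^[0-9]+\\.[0-9]+$");
    return std::regex_search(text, r);
}

// For contrast: an unescaped period matches any one character.
bool is_digits_any_digits(const std::string &text) {
    static const std::regex r("^[0-9]+.[0-9]+$");
    return std::regex_search(text, r);
}
```

The unescaped version accepts "45x67" and even "123", where the 2 plays the part of the period.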

Match a number

No, the parts should be optional:

const regex r("^[0-9]*\\.?[0-9]*$");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("45.67",     r) << '\n'
     << regex_search("78.",       r) << '\n'
     << regex_search(".89",       r) << '\n'
     << regex_search(".",         r) << '\n'
     << regex_search("127.0.0.1", r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';
true
true
true
true
true
false
false
false

Match a number

We express alternation with |.

Match a number

const regex r("^([0-9]+|[0-9]+\\.[0-9]*|[0-9]*\\.[0-9]+)$");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("45.67",     r) << '\n'
     << regex_search("78.",       r) << '\n'
     << regex_search(".89",       r) << '\n'
     << regex_search(".",         r) << '\n'
     << regex_search("127.0.0.1", r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';
true
true
true
true
false
false
false
false

Match a number

Combine the first two cases:

const regex r("^([0-9]+(\\.[0-9]*)?|[0-9]*\\.[0-9]+)$");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("45.67",     r) << '\n'
     << regex_search("78.",       r) << '\n'
     << regex_search(".89",       r) << '\n'
     << regex_search(".",         r) << '\n'
     << regex_search("127.0.0.1", r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';
true
true
true
true
false
false
false
false

Match a number

Let’s use \d instead of [0-9]:

const regex r("^(\\d+(\\.\\d*)?|\\d*\\.\\d+)$");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("45.67",     r) << '\n'
     << regex_search("78.",       r) << '\n'
     << regex_search(".89",       r) << '\n'
     << regex_search(".",         r) << '\n'
     << regex_search("127.0.0.1", r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';
true
true
true
true
false
false
false
false

Match a number

Those double backslashes are hideous. Use a raw string:

const regex r(R"(^(\d+(\.\d*)?|\d*\.\d+)$)");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("45.67",     r) << '\n'
     << regex_search("78.",       r) << '\n'
     << regex_search(".89",       r) << '\n'
     << regex_search(".",         r) << '\n'
     << regex_search("127.0.0.1", r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';
true
true
true
true
false
false
false
false
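
A quick check that the raw string changes only the spelling, not the pattern. In R"(...)", everything between the opening R"( and the closing )" is taken literally, so backslashes need no doubling. The names escaped, raw and same_pattern are hypothetical.

```cpp
#include <string>

// The same pattern written two ways.
const std::string escaped = "^(\\d+(\\.\\d*)?|\\d*\\.\\d+)$";
const std::string raw     = R"(^(\d+(\.\d*)?|\d*\.\d+)$)";

// Hypothetical helper: true if both spellings produce identical text.
bool same_pattern() {
    return escaped == raw;
}
```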

Match a number

We should’ve used regex_match instead of regex_search: regex_match succeeds only if the pattern matches the entire string. Now we need neither the ^ and $ anchors nor the outer parentheses:

const regex r(R"(\d+(\.\d*)?|\d*\.\d+)");

cout << boolalpha
     << regex_match("123",       r) << '\n'
     << regex_match("45.67",     r) << '\n'
     << regex_match("78.",       r) << '\n'
     << regex_match(".89",       r) << '\n'
     << regex_match(".",         r) << '\n'
     << regex_match("127.0.0.1", r) << '\n'
     << regex_match("abc123def", r) << '\n'
     << regex_match("Jack",      r) << '\n';
true
true
true
true
false
false
false
false
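
The difference between the two functions shows up when the same unanchored pattern is given to both. This is a sketch; the names number, contains_number and is_number are hypothetical.

```cpp
#include <regex>
#include <string>

// The unanchored pattern from the example above.
const std::regex number(R"(\d+(\.\d*)?|\d*\.\d+)");

// True if a number appears anywhere in `text`.
bool contains_number(const std::string &text) {
    return std::regex_search(text, number);
}

// True only if all of `text` is a number.
bool is_number(const std::string &text) {
    return std::regex_match(text, number);
}
```

Both "abc123def" and "127.0.0.1" contain a number, so contains_number accepts them, but neither is a number as a whole, so is_number rejects them.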

Change of topic

Match contractions

const string s = "I can't feed y'all before three o'clock!";
const regex r("[a-z]+'[a-z]+");

cout << boolalpha
     << regex_search(s, r) << '\n';
true

That was useless. Where are the contractions? I want a list!

Match contractions

const string s = "I can't feed y'all before three o'clock!";
const regex r("[a-z]+'[a-z]+");

sregex_iterator iter(s.begin(), s.end(), r);
sregex_iterator end;

for (; iter!=end; ++iter)
    cout << iter->position() << ": " << iter->str() << '\n';
2: can't
13: y'all
32: o'clock

Exactly what is iter iterating over?
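
One way to look at it: dereferencing an sregex_iterator yields a std::smatch, one per non-overlapping match, and a default-constructed sregex_iterator serves as the end marker. The sketch below collects each match's position and text into a vector; the helper name find_contractions is hypothetical.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: return (position, text) for every contraction in `s`.
std::vector<std::pair<std::size_t, std::string>>
find_contractions(const std::string &s) {
    static const std::regex r("[a-z]+'[a-z]+");
    std::vector<std::pair<std::size_t, std::string>> found;
    for (std::sregex_iterator iter(s.begin(), s.end(), r), end; iter != end; ++iter) {
        const std::smatch &m = *iter;   // each element is a whole match object
        found.emplace_back(static_cast<std::size_t>(m.position()), m.str());
    }
    return found;
}
```

Because each element is a full smatch, anything that regex_search can report about a single match (position, length, capture groups) is available for every match in the sequence.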

Modified: 2017-04-24T14:32
