CS253: Software Development with C++

Spring 2018

Regular Expressions

Pattern Matching

% grep "Osric" ~cs253/pub/hamlet.txt
	Osric, who brings back to him that you attend him in
KING CLAUDIUS	Give them the foils, young Osric. Cousin Hamlet,
LAERTES	Why, as a woodcock to mine own springe, Osric;
% grep -c "Osric" ~cs253/pub/hamlet.txt
3
% grep -ic "Osric" ~cs253/pub/hamlet.txt
32

In C++

ifstream play("/s/bach/a/class/cs253/pub/hamlet.txt");
string line;
while (getline(play, line))
    if (line.find("Osric") != string::npos)
        cout << line << '\n';
	Osric, who brings back to him that you attend him in
KING CLAUDIUS	Give them the foils, young Osric. Cousin Hamlet,
LAERTES	Why, as a woodcock to mine own springe, Osric;

That’s only three matches, the same as grep without -i. Didn’t grep -ic count 32?

Case-independence

ifstream play("/s/bach/a/class/cs253/pub/hamlet.txt");
string line;
while (getline(play, line))
    if (line.find("Osric") != string::npos ||
        line.find("osric") != string::npos ||
        line.find("OSRIC") != string::npos)
        cout << line << '\n';
OSRIC	|
	[Enter OSRIC]
OSRIC	Your lordship is right welcome back to Denmark.
OSRIC	Sweet lord, if your lordship were at leisure, I
OSRIC	I thank your lordship, it is very hot.
OSRIC	It is indifferent cold, my lord, indeed.
OSRIC	Exceedingly, my lord; it is very sultry,--as
OSRIC	Nay, good my lord; for mine ease, in good faith.
OSRIC	Your lordship speaks most infallibly of him.
OSRIC	Sir?
OSRIC	Of Laertes?
OSRIC	I know you are not ignorant--
OSRIC	You are not ignorant of what excellence Laertes is--
OSRIC	I mean, sir, for his weapon; but in the imputation
OSRIC	Rapier and dagger.
OSRIC	The king, sir, hath wagered with him six Barbary
OSRIC	The carriages, sir, are the hangers.
OSRIC	The king, sir, hath laid, that in a dozen passes
OSRIC	I mean, my lord, the opposition of your person in trial.
OSRIC	Shall I re-deliver you e'en so?
OSRIC	I commend my duty to your lordship.
	[Exit OSRIC]
	Osric, who brings back to him that you attend him in
	Lords, OSRIC, and Attendants with foils, &c]
KING CLAUDIUS	Give them the foils, young Osric. Cousin Hamlet,
OSRIC	Ay, my good lord.
OSRIC	A hit, a very palpable hit.
OSRIC	Nothing, neither way.
OSRIC	                  Look to the queen there, ho!
OSRIC	How is't, Laertes?
LAERTES	Why, as a woodcock to mine own springe, Osric;
OSRIC	Young Fortinbras, with conquest come from Poland,

Not satisfied

That’s better, but it’s not truly case-independent. What about “OsRiC”, or “oSRIc”? Each of the five letters can be upper or lower case, so there are 2⁵, or 32, combinations, making the code quite tedious:

if (line.find("osric") != string::npos || line.find("osriC") != string::npos ||
    line.find("osrIc") != string::npos || line.find("osrIC") != string::npos ||
    line.find("osRic") != string::npos || line.find("osRiC") != string::npos ||
    line.find("osRIc") != string::npos || line.find("osRIC") != string::npos ||
    line.find("oSric") != string::npos || line.find("oSriC") != string::npos ||
    line.find("oSrIc") != string::npos || line.find("oSrIC") != string::npos ||
    line.find("oSRic") != string::npos || line.find("oSRiC") != string::npos ||
    line.find("oSRIc") != string::npos || line.find("oSRIC") != string::npos ||
    line.find("Osric") != string::npos || line.find("OsriC") != string::npos ||
    line.find("OsrIc") != string::npos || line.find("OsrIC") != string::npos ||
    line.find("OsRic") != string::npos || line.find("OsRiC") != string::npos ||
    line.find("OsRIc") != string::npos || line.find("OsRIC") != string::npos ||
    line.find("OSric") != string::npos || line.find("OSriC") != string::npos ||
    line.find("OSrIc") != string::npos || line.find("OSrIC") != string::npos ||
    line.find("OSRic") != string::npos || line.find("OSRiC") != string::npos ||
    line.find("OSRIc") != string::npos || line.find("OSRIC") != string::npos)

Regular expressions

ifstream play("/s/bach/a/class/cs253/pub/hamlet.txt");
string line;
const regex r("Osric");         // Create the pattern

while (getline(play, line))
    if (regex_search(line, r))  // Search the line
        cout << line << '\n';
	Osric, who brings back to him that you attend him in
KING CLAUDIUS	Give them the foils, young Osric. Cousin Hamlet,
LAERTES	Why, as a woodcock to mine own springe, Osric;

OK, but it’s not case-independent.

Regular expressions

ifstream play("/s/bach/a/class/cs253/pub/hamlet.txt");
string line;
const regex r("Osric", regex_constants::icase);

while (getline(play, line))
    if (regex_search(line, r))
        cout << line << '\n';
OSRIC	|
	[Enter OSRIC]
OSRIC	Your lordship is right welcome back to Denmark.
OSRIC	Sweet lord, if your lordship were at leisure, I
OSRIC	I thank your lordship, it is very hot.
OSRIC	It is indifferent cold, my lord, indeed.
OSRIC	Exceedingly, my lord; it is very sultry,--as
OSRIC	Nay, good my lord; for mine ease, in good faith.
OSRIC	Your lordship speaks most infallibly of him.
OSRIC	Sir?
OSRIC	Of Laertes?
OSRIC	I know you are not ignorant--
OSRIC	You are not ignorant of what excellence Laertes is--
OSRIC	I mean, sir, for his weapon; but in the imputation
OSRIC	Rapier and dagger.
OSRIC	The king, sir, hath wagered with him six Barbary
OSRIC	The carriages, sir, are the hangers.
OSRIC	The king, sir, hath laid, that in a dozen passes
OSRIC	I mean, my lord, the opposition of your person in trial.
OSRIC	Shall I re-deliver you e'en so?
OSRIC	I commend my duty to your lordship.
	[Exit OSRIC]
	Osric, who brings back to him that you attend him in
	Lords, OSRIC, and Attendants with foils, &c]
KING CLAUDIUS	Give them the foils, young Osric. Cousin Hamlet,
OSRIC	Ay, my good lord.
OSRIC	A hit, a very palpable hit.
OSRIC	Nothing, neither way.
OSRIC	                  Look to the queen there, ho!
OSRIC	How is't, Laertes?
LAERTES	Why, as a woodcock to mine own springe, Osric;
OSRIC	Young Fortinbras, with conquest come from Poland,

Basic components of regular expressions:

What          Description            What        Description
.             any one char but \n    |           alternation
[a-fxy0-9]    any one of these       ()          grouping
[^a-fxy0-9]   not one of these       \b          word boundary
*             0–∞ of previous        \d or \D    [0-9] or not
+             1–∞ of previous        \s or \S    [ \n\r…] or not
?             0–1 of previous        \w or \W    [0-9a-zA-Z_] or not
{17}          17 of previous         ^           beginning of line
{3,8}         3–8 of previous        $           end of line

Match a number

Let’s try to match a number:

const regex r("[0-9]");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';
true
true
false

Match a number

Add *:

const regex r("[0-9]*");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';
true
true
true

Huh, that got worse: * accepts zero digits, so the empty string matches, even in “Jack”.

Match a number

Add +:

const regex r("[0-9]+");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';
true
true
false

Match a number

Anchored:

const regex r("^[0-9]+$");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';
true
false
false

Match a number

How about floating-point?

const regex r("^[0-9]+$");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("45.67",     r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';
true
false
false
false

Match a number

Need to add the decimal point:

const regex r("^[0-9.]+$");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("45.67",     r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';
true
true
false
false

Match a number

We might be too liberal, now:

const regex r("^[0-9.]+$");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("45.67",     r) << '\n'
     << regex_search("78.",       r) << '\n'
     << regex_search(".89",       r) << '\n'
     << regex_search(".",         r) << '\n'
     << regex_search("127.0.0.1", r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';
true
true
true
true
true
true
false
false

Match a number

Let’s insist on digits, a point, then digits:

const regex r("^[0-9]+\\.[0-9]+$");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("45.67",     r) << '\n'
     << regex_search("78.",       r) << '\n'
     << regex_search(".89",       r) << '\n'
     << regex_search(".",         r) << '\n'
     << regex_search("127.0.0.1", r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';
false
true
false
false
false
false
false
false

Why the double backslash?

Match a number

No, the parts should be optional:

const regex r("^[0-9]*\\.?[0-9]*$");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("45.67",     r) << '\n'
     << regex_search("78.",       r) << '\n'
     << regex_search(".89",       r) << '\n'
     << regex_search(".",         r) << '\n'
     << regex_search("127.0.0.1", r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';
true
true
true
true
true
false
false
false

Match a number

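That accepts a lone “.”, which has no digits at all. A number is really one of three things: digits alone, digits and a point and optional digits, or optional digits and a point and digits. We express alternation with |.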

Match a number

const regex r("^([0-9]+|[0-9]+\\.[0-9]*|[0-9]*\\.[0-9]+)$");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("45.67",     r) << '\n'
     << regex_search("78.",       r) << '\n'
     << regex_search(".89",       r) << '\n'
     << regex_search(".",         r) << '\n'
     << regex_search("127.0.0.1", r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';
true
true
true
true
false
false
false
false

Match a number

Combine the first two cases:

const regex r("^([0-9]+(\\.[0-9]*)?|[0-9]*\\.[0-9]+)$");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("45.67",     r) << '\n'
     << regex_search("78.",       r) << '\n'
     << regex_search(".89",       r) << '\n'
     << regex_search(".",         r) << '\n'
     << regex_search("127.0.0.1", r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';
true
true
true
true
false
false
false
false

Match a number

Let’s use \d instead of [0-9]:

const regex r("^(\\d+(\\.\\d*)?|\\d*\\.\\d+)$");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("45.67",     r) << '\n'
     << regex_search("78.",       r) << '\n'
     << regex_search(".89",       r) << '\n'
     << regex_search(".",         r) << '\n'
     << regex_search("127.0.0.1", r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';
true
true
true
true
false
false
false
false

Match a number

Those double backslashes are hideous. Use a raw string:

const regex r(R"(^(\d+(\.\d*)?|\d*\.\d+)$)");

cout << boolalpha
     << regex_search("123",       r) << '\n'
     << regex_search("45.67",     r) << '\n'
     << regex_search("78.",       r) << '\n'
     << regex_search(".89",       r) << '\n'
     << regex_search(".",         r) << '\n'
     << regex_search("127.0.0.1", r) << '\n'
     << regex_search("abc123def", r) << '\n'
     << regex_search("Jack",      r) << '\n';
true
true
true
true
false
false
false
false

Match a number

We should’ve used regex_match instead of regex_search: regex_match succeeds only if the pattern matches the entire string. Now we don’t need ^, $, or the outer parentheses:

const regex r(R"(\d+(\.\d*)?|\d*\.\d+)");

cout << boolalpha
     << regex_match("123",       r) << '\n'
     << regex_match("45.67",     r) << '\n'
     << regex_match("78.",       r) << '\n'
     << regex_match(".89",       r) << '\n'
     << regex_match(".",         r) << '\n'
     << regex_match("127.0.0.1", r) << '\n'
     << regex_match("abc123def", r) << '\n'
     << regex_match("Jack",      r) << '\n';
true
true
true
true
false
false
false
false

Change of topic

Match contractions

const string s = "I can't feed y'all before three o'clock!";
const regex r("[a-z]+'[a-z]+");

cout << boolalpha
     << regex_search(s, r) << '\n';
true

That was useless. Where are the contractions? I want a list!

Match contractions

const string s = "I can't feed y'all before three o'clock!";
const regex r("[a-z]+'[a-z]+");

sregex_iterator iter(s.begin(), s.end(), r);
sregex_iterator end;

for (; iter!=end; ++iter)
    cout << iter->position() << ": " << iter->str() << '\n';
2: can't
13: y'all
32: o'clock

Exactly what is iter iterating over?

Modified: 2017-12-29T23:01
