CS301
Foundations of Computer Science

Department of Computer Science

How to Test Your Regular Expressions in Unix

The Unix command egrep selects the lines of a file that contain a string matching a given regular expression. You can therefore use egrep to test your answer to a regular expression exercise: first create a file containing some subset of the strings over the given alphabet, one string per line, then use egrep to select the lines that match your regular expression. Looking through the resulting list will help you decide whether your regular expression is correct. If you see strings that are not part of the language you are trying to define, or if strings that belong to the language are missing, your regular expression is not correct.

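Take some time now to read the Unix man page for egrep. On current systems egrep is equivalent to grep -E, so the grep man page may be the one that documents it.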

To help you get started, here is a file named sigmaStar.cpp that produces all strings of length 8 or less from the language {a,b}*, ordered by length, starting with the null string. You can compile it with

% g++ -o sigmaStar sigmaStar.cpp
and run it with (type ./sigmaStar instead if the current directory is not in your search path):
% sigmaStar | more

a
b
aa
ab
ba
bb
aaa
aab
aba
abb
baa
bab
bba
bbb
aaaa
aaab

and so on.
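
If you do not have sigmaStar.cpp at hand, the following is a rough shell stand-in, not the course's program. It assumes a POSIX shell and awk; the function name sigma_star and its maximum-length argument are made up for this sketch. It prints the null string first, then every string over {a,b} in order of length.

```shell
# sigma_star N: print every string over {a,b} of length 0..N,
# one per line, shortest first (hypothetical helper, not sigmaStar.cpp).
sigma_star() {
    awk -v max="$1" 'BEGIN {
        n = 1; cur[1] = ""      # level 0 holds only the null string
        print ""                # the null string shows up as a blank line
        for (len = 1; len <= max; len++) {
            m = 0
            # extend every string of the previous length by a, then by b
            for (i = 1; i <= n; i++) {
                nxt[++m] = cur[i] "a"
                nxt[++m] = cur[i] "b"
            }
            for (i = 1; i <= m; i++) { cur[i] = nxt[i]; print cur[i] }
            n = m
        }
    }'
}

sigma_star 3
```

With an argument of 8 this should give the same kind of list as above, which you can pipe into more or redirect into a file.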

Now let's try to specify a regular expression for the set of all strings containing an even number of symbols. In regular set notation, this would be {aa,ab,ba,bb}*. In the regular expression syntax of Unix, in particular of egrep, this would be (aa|ab|ba|bb)*. However, egrep selects a line whenever any substring of the line matches, and this expression matches even the empty substring, so it would select every line in the file. We want to select only the lines for which the entire string is made up of an even number of symbols. We can specify this by telling egrep to apply the regular expression to the entire line: add the beginning-of-line anchor, ^, and the end-of-line anchor, $, to our regular expression to get ^(aa|ab|ba|bb)*$.
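
To see the effect of the anchors, you can run both forms on a handful of strings. This is a small sketch that uses grep -E, which accepts the same expressions as egrep; the sample function and its four strings are made up for illustration.

```shell
# Four sample strings, one per line, with lengths 1, 2, 3 and 4.
sample() { printf '%s\n' a ab aba abab; }

# Unanchored: the expression can match an empty substring,
# so every line is selected, whatever its length.
sample | grep -E '(aa|ab|ba|bb)*'

# Anchored: the whole line must be built from two-symbol blocks,
# so only the even-length lines are selected.
sample | grep -E '^(aa|ab|ba|bb)*$'
```

The first command prints all four strings; the second prints only ab and abab.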

Let's try it. We can run sigmaStar and redirect the result into a file and apply egrep to it. Here are the steps and the result:

% sigmaStar > output
% egrep '^(aa|ab|ba|bb)*$' output | more

aa
ab
ba
bb
aaaa
aaab
aaba
aabb
abaa
abab
abba
abbb
baaa
baab
baba
babb
bbaa
bbab
bbba
bbbb
aaaaaa
aaaaab

and so on. Notice that the first line is blank. It represents the null string, which has length zero and is therefore of even length.
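
Reading through a long list by eye is error-prone, so when the language has a simple independent description you can also let the machine compare. The sketch below checks the regular expression against a plain length test written in awk. It uses grep -E in place of egrep, and the strings function with its short list is a made-up stand-in for the output file.

```shell
# A few test strings, including the null string (the first, empty argument).
strings() { printf '%s\n' '' a b aa ab ba bb aaa abab; }

# Lines selected by the regular expression.
by_regex=$(strings | grep -E '^(aa|ab|ba|bb)*$')

# Lines selected by an independent test: the length is even.
by_length=$(strings | awk 'length($0) % 2 == 0')

if [ "$by_regex" = "$by_length" ]; then
    echo "regular expression agrees with the length test"
else
    echo "mismatch: check your regular expression"
fi
```

Agreement on a sample does not prove the expression correct, but a mismatch shows at once that something is wrong.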

Now try the odd-length strings. Start with the regular expression for even-length strings and append one more symbol from our alphabet.

% egrep '^(aa|ab|ba|bb)*(a|b)$' output | more
a
b
aaa
aab
aba
abb
baa
bab
bba
bbb
aaaaa
aaaab
and so on.
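
Since every string in the file has either even or odd length, the odd-length expression should select exactly the lines that the even-length expression rejects. A minimal sketch of that check follows, assuming grep -E in place of egrep and using the -v option, which selects the non-matching lines. The strings function and its contents are made up for illustration.

```shell
# A few test strings over {a,b}, including the null string.
strings() { printf '%s\n' '' a b aa ab ba bb aaa aab abab ababa; }

# Lines matched by the odd-length expression.
odd=$(strings | grep -E '^(aa|ab|ba|bb)*(a|b)$')

# Lines NOT matched by the even-length expression.
not_even=$(strings | grep -E -v '^(aa|ab|ba|bb)*$')

if [ "$odd" = "$not_even" ]; then
    echo "odd-length expression is the complement of the even-length one"
else
    echo "the two expressions do not partition the strings"
fi
```

This comparison only makes sense because the input contains nothing but strings over {a,b}; a line with any other symbol would be rejected by both expressions.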

You can skip the step of creating a file with all the strings by piping the output of sigmaStar directly into egrep:

% sigmaStar | egrep '^(aa|ab|ba|bb)*$' | more

aa
ab
ba
bb
aaaa
aaab

and so on.