Expression Lexing and Regular Expressions

Objectives
  • Introduce the topic of lexical analysis in a programming language such as Java.

  • Develop a robust lexer that is successful regardless of the whitespace.

    • The lexer should be able to tokenize both "(6 * a) + (b / 4)" and "(6*a)+(b/4)".

Getting Started

Your directory should look like this:

L14/
└── src
    ├── Lexer.java
    └── TestCode.java

Description

Lexical analysis is the first phase of a compiler. It takes a sequence of characters and breaks it into tokens, discarding whitespace and comments along the way.

The Lexer has several different versions of a lexing method for identifying tokens within an expression.

  • The first method is called scannerLexer, and uses a Scanner object.

  • The second method is called splitLexer, which uses the method String.split().
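As a rough illustration of the Scanner-based approach, the sketch below pads every operator and parenthesis with spaces and then lets a Scanner split on whitespace. The class name ScannerLexerSketch, the method signature, and the operator set (+ - * / and parentheses) are assumptions made for this example; the javadoc in Lexer.java defines what the lab actually requires.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ScannerLexerSketch {

    // Hypothetical signature; the real one is given by the javadoc in Lexer.java.
    static List<String> scannerLexer(String expression) {
        // Surround each operator and parenthesis with spaces so that
        // "(6*a)" becomes " ( 6 * a ) " and whitespace separates every token.
        String padded = expression.replaceAll("([-+*/()])", " $1 ");

        List<String> tokens = new ArrayList<>();
        // A Scanner splits on whitespace by default.
        try (Scanner scanner = new Scanner(padded)) {
            while (scanner.hasNext()) {
                tokens.add(scanner.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Both spellings should produce the same eleven tokens.
        System.out.println(scannerLexer("(6 * a) + (b / 4)"));
        System.out.println(scannerLexer("(6*a)+(b/4)"));
    }
}
```

Note that this sketch treats every - as its own token, so it does not distinguish a negative number from a subtraction.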

Instructions

Use the javadoc to implement the methods in the Lexer class.

HINT: After a token is returned from each of the different lexing methods, call the String.trim() method to remove extra whitespace from the beginning and end of the string. If the token is empty, do not add it to the ArrayList.
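The hint above matters most for a split-based lexer, because String.split() can hand back pieces that are blank or padded with spaces. The following is a minimal sketch of that idea, not the required solution: the class name SplitLexerSketch, the method signature, and the choice of operators are assumptions for illustration. It splits at the zero-width positions just before and just after each operator or parenthesis, then trims each piece and skips the empty ones.

```java
import java.util.ArrayList;
import java.util.List;

class SplitLexerSketch {

    // Hypothetical signature; the real one is given by the javadoc in Lexer.java.
    static List<String> splitLexer(String expression) {
        // Lookbehind (?<=...) and lookahead (?=...) match positions, not characters,
        // so the operators themselves are kept as separate pieces.
        String[] pieces = expression.split("(?<=[-+*/()])|(?=[-+*/()])");

        List<String> tokens = new ArrayList<>();
        for (String piece : pieces) {
            String token = piece.trim();   // "6 " becomes "6", " " becomes ""
            if (!token.isEmpty()) {        // drop pieces that were only whitespace
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Both spellings should produce the same eleven tokens.
        System.out.println(splitLexer("(6 * a) + (b / 4)"));
        System.out.println(splitLexer("(6*a)+(b/4)"));
    }
}
```

Without the trim() and isEmpty() steps, the spaced input would yield pieces such as "6 " and " ", which is exactly the problem the hint describes.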

Regular Expressions

This portion of the lab must be completed on a Linux machine (such as the lab machines).

Regular expressions are invaluable for pattern matching, filtering strings, and finding occurrences of phrases in large projects.

First, open a terminal and navigate to your Eclipse workspace for this semester.

cd ~/<path to your workspace>

Run the following command, and see what output you get.

grep -r -P --include="*.java" 'print(f|ln)?\(' .

Let’s break down this command.

  • grep is a command which searches one or more files for lines which match a string pattern, and prints each matching line.

  • -r this flag instructs grep to search recursively, including all files in subdirectories.

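  • -P this flag instructs grep to interpret the pattern as a Perl-compatible regular expression. Without it, grep defaults to basic regular expressions, whose syntax for grouping and alternation differs.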

  • --include="*.java" this instructs grep to include only files which end with the .java extension.

  • 'print(f|ln)?\(' this is the pattern grep searches for. It matches the string 'print', optionally followed by either 'f' or 'ln' (the ? makes the group optional), followed by a literal '('. The escape character '\' is needed to match certain special characters, such as (, literally.
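The same pattern can be tried out in Java with java.util.regex, which may help when experimenting before running grep. This is only a sketch: the class PrintPatternDemo and the method containsPrintCall are hypothetical names, and the only change to the pattern is the doubled backslash that a Java string literal requires.

```java
import java.util.regex.Pattern;

class PrintPatternDemo {

    // Same pattern as the grep command; "\\(" in Java source is the regex \(
    static final Pattern PRINT_CALL = Pattern.compile("print(f|ln)?\\(");

    // find() looks for the pattern anywhere in the line, as grep does.
    static boolean containsPrintCall(String line) {
        return PRINT_CALL.matcher(line).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "System.out.println(total);",   // matches: print + ln + (
            "System.out.printf(\"%d\", n);", // matches: print + f + (
            "System.out.print(total);",     // matches: print + (
            "String printer = null;",       // no match: no ( after print
            "printAll(items);"              // no match: "All" sits between print and (
        };
        for (String line : samples) {
            System.out.println(containsPrintCall(line) + "  " + line);
        }
    }
}
```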

Now, run the following command to see how many lines of code you’ve written which include a print statement.

grep -r -P --include="*.java" 'print(f|ln)?\(' . | wc -l
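For comparison, here is a rough Java equivalent of the pipeline above. It is not part of the lab, and the names PrintLineCounter, countMatchingLines, and countInTree are hypothetical. Like wc -l applied to grep output, it counts matching lines, so a line containing two print calls is counted once. It assumes the source files are UTF-8 text.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Stream;

class PrintLineCounter {

    static final Pattern PRINT_CALL = Pattern.compile("print(f|ln)?\\(");

    // Number of lines that contain at least one match.
    static long countMatchingLines(List<String> lines) {
        return lines.stream()
                    .filter(line -> PRINT_CALL.matcher(line).find())
                    .count();
    }

    // Walks the directory tree (like -r) and keeps only .java files (like --include).
    static long countInTree(Path root) {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .filter(p -> p.toString().endsWith(".java"))
                        .mapToLong(p -> countMatchingLines(readLines(p)))
                        .sum();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads a file as UTF-8; fails if the file is not valid UTF-8 text.
    static List<String> readLines(Path file) {
        try {
            return Files.readAllLines(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println(countInTree(root));
    }
}
```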

Try running the first command, replacing the pattern with '[a-zA-Z]+[0-9]'. This should show you every line where letters are immediately followed by a digit, which includes every line where you’ve referenced a variable that ends in a number. [a-zA-Z] matches any character in the range a to z or A to Z. + signifies that one or more characters in this category must be present. [0-9] matches any character in the range 0 to 9. Think about how you could improve this pattern to match only variable names.
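To see exactly which text this pattern picks out of a line, it can help to list the matches in Java. The sketch below uses hypothetical names (LetterDigitDemo, findAll) and the pattern unchanged from the text above. It shows two things worth noticing: the match stops after the first digit, and text inside a string literal matches just as readily as a variable name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LetterDigitDemo {

    // One or more letters followed by exactly one digit.
    static final Pattern LETTERS_THEN_DIGIT = Pattern.compile("[a-zA-Z]+[0-9]");

    // Collects every non-overlapping match in the line, left to right.
    static List<String> findAll(String line) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = LETTERS_THEN_DIGIT.matcher(line);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(findAll("int total2 = x1 + 10;"));     // [total2, x1]
        System.out.println(findAll("value12 = 0;"));              // [value1]
        System.out.println(findAll("String s = \"utf8\";"));      // [utf8]
    }
}
```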

Using this same command format modifying only the pattern, and the tutorials below for reference, write regular expressions to answer the following questions:

  • How many times have you called the print command (not printf or println)?

  • How many times have you written a single line comment (one that begins with //)?

    • How many times have you written a single line comment on the same line as a line of code (such as int foo = 5; // assign 5 to foo)?

  • How many times have you written a for loop or a while loop?

    • How many times have you written a for loop or while loop and not included a space between the keyword (for or while) and the first (?

  • How many times have you written a for each loop (such as for (Movie m : movies) {)?

  • OPTIONAL BONUS: How many times have you written a for loop that used i as the loop counter (such as for (int i = 0; i < 10; i++) {)?

Use the following resources to learn more about Regular Expressions and to answer questions.

regular_expressions.png

Submission

To receive credit for this recitation, show your TA or helper that your program passes the TestCode tests, and show your answers to the questions above along with the grep commands and output you used to find them.