CS453 Colorado State University
==========================================
Lexical Analysis with JLex
==========================================

-------------
Announcements

    - PA2 is due Wednesday night (2/16), tomorrow
    - PA3 will be posted tomorrow

------------------------------------
Structure of the full compiler

    - We are starting out in PA3 with a one pass compiler.
    - Later we will add multiple passes over the AST.

------------------------------------
Other approaches to lexical analysis

    Use a lexer generator!

    Reprise of the 15min example with JLex and JavaCUP.
    Posted for first week's recitation.

----------------------------
Using JLex with JavaCUP in general

    Points to get across about .lex and .cup files
    1) README files are important.  Share what you did!
    2) Review the syntax in the .lex and .cup files.
    3) Show that the Java files are put into packages.
    4) For PA3, start with lex and cup files that work and
       incrementally add functionality.

    To generate a lexer:
        % java -jar JLex.jar myfile.lex
        % mv myfile.lex.java Yylex.java

    To create sym.java and parser.java:
        % java -jar java-cup-11a.jar myfile.cup

====================
Outline of .lex file
====================
package mypackage;
import java_cup.runtime.Symbol;
%%
%line
%char
%cup
%public
%eofval{
  return new Symbol(sym.EOF, new TokenValue("EOF", yyline, yychar));
%eofval}

EOL=\r|\n|\r\n
NOT_EOL=[^\r\n]
%%
"&&" { return new Symbol(sym.AND, new TokenValue(yytext(), yyline+1, yychar)); }
.    { System.out.println("Illegal character: "+yytext()); }
====================

====================
Outline of .cup file
====================
package mjparser;
import java_cup.runtime.*;

terminal AND, BOOLEAN, SEMI;
// more terminal declarations

non terminal garbage;

// Grammar rule with terminals to avoid error message.
garbage ::= AND BOOLEAN SEMI;
====================

------------------------------
JLex syntax (slide 3 for some examples)

    how to express regular expressions to JLex
        -page 121-124 in book, look in JLex manual as well

    single characters and strings in quotes
        "a", "b", "\n", "\t", "hello"

    character classes
        [a-z], [a-zA-Z]

    union
        a|b

    JLex: set of characters that does not include x
        [^x]

    JLex: intermediate regular expressions, or regular definitions
        MYRE=a|b
        HELLO={MYRE}*

    Important notes
        -no spaces allowed in a regular expression definition
        -suggest adding one regular expression at a time and
         regenerating the Java file with JLex after each addition
        -can use parentheses in a regular expression specification
        -cannot nest union and * in the same regular expression definition

----------------------------
MeggyJava token regular expressions

    -what are some of the regular expressions for MeggyJava?
        operators
        keywords
        id and num
        comments

----------------------------
Conflict resolution (selecting amongst >1 possible tokens)

    How can the lexer tell the difference between "if" and "ifmyvar"?

    (1) longest match
        The generated lexer will pick the longest possible token.

    How can the lexer tell the difference between the keyword "if"
    and an identifier "if"?

    (2) priority rule
        The generated lexer will select the token of highest priority
        from amongst a set of possible tokens of the same length.
        Priority is indicated by the order in which the regular
        expressions for the tokens are specified.

--------------------------
How do lexer generators work

    (1) first it takes each regular expression and turns it into an NFA
    (2) then it combines all of the NFAs into one NFA
    (3) then it turns the NFA into a DFA

----------------------------------
Nondeterministic Finite Automata

    See slides 4 through 24, which illustrate the concept of
    nondeterministic choice in a finite automaton.

    Epsilon transitions
        In an NFA, edges can be labeled with epsilon.
        Two examples done on the board.
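
    One way to make nondeterministic choice and epsilon edges concrete
    is to simulate an NFA by tracking the set of states it could be in.
    The sketch below is not how JLex is implemented. The class name
    TinyNfa, the use of '\0' as a stand-in epsilon label, and the
    method names are assumptions made for illustration. The state
    numbers 6 through 11 are borrowed from the EOL tables later in
    these notes; the edge layout is inferred from those tables, and
    treating state 11 as the accepting state is also an assumption.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class TinyNfa {
    // Stand-in label for epsilon edges (assumption: input never contains NUL).
    static final char EPS = '\0';

    // edges.get(state).get(label) is the set of target states.
    private final Map<Integer, Map<Character, Set<Integer>>> edges = new HashMap<>();
    private final int start;
    private final int accept;

    TinyNfa(int start, int accept) {
        this.start = start;
        this.accept = accept;
    }

    void addEdge(int from, char label, int to) {
        edges.computeIfAbsent(from, k -> new HashMap<>())
             .computeIfAbsent(label, k -> new TreeSet<>())
             .add(to);
    }

    private Set<Integer> targets(int state, char label) {
        Map<Character, Set<Integer>> out = edges.get(state);
        if (out == null || !out.containsKey(label)) {
            return new TreeSet<>();
        }
        return out.get(label);
    }

    // All states reachable from 'states' using only epsilon edges.
    Set<Integer> epsilonClosure(Set<Integer> states) {
        Set<Integer> closure = new TreeSet<>(states);
        Deque<Integer> work = new ArrayDeque<>(states);
        while (!work.isEmpty()) {
            int s = work.pop();
            for (int t : targets(s, EPS)) {
                if (closure.add(t)) {
                    work.push(t);
                }
            }
        }
        return closure;
    }

    // All states reachable from 'states' by one edge labeled c.
    Set<Integer> move(Set<Integer> states, char c) {
        Set<Integer> result = new TreeSet<>();
        for (int s : states) {
            result.addAll(targets(s, c));
        }
        return result;
    }

    // Follow every possible choice at once; accept if any path ends in 'accept'.
    boolean accepts(String input) {
        Set<Integer> current = epsilonClosure(Collections.singleton(start));
        for (char c : input.toCharArray()) {
            current = epsilonClosure(move(current, c));
        }
        return current.contains(accept);
    }

    // NFA for EOL=\r|\n|\r\n (edge layout inferred from the notes' tables).
    static TinyNfa eol() {
        TinyNfa nfa = new TinyNfa(6, 11);
        nfa.addEdge(6, '\n', 7);   // \n alone
        nfa.addEdge(6, '\r', 8);   // \r alone
        nfa.addEdge(6, '\r', 9);   // \r that begins \r\n (nondeterministic choice)
        nfa.addEdge(9, '\n', 10);
        nfa.addEdge(7, EPS, 11);
        nfa.addEdge(8, EPS, 11);
        nfa.addEdge(10, EPS, 11);
        return nfa;
    }

    public static void main(String[] args) {
        TinyNfa nfa = eol();
        System.out.println(nfa.accepts("\n"));    // true
        System.out.println(nfa.accepts("\r"));    // true
        System.out.println(nfa.accepts("\r\n"));  // true
        System.out.println(nfa.accepts("\n\r"));  // false
    }
}
```

    On input \r the simulation holds both state 8 and state 9 at once,
    which is the nondeterministic choice: it does not have to guess
    whether a \n follows.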
----------------------------------
Detailed steps for lexer generator

(1) Create NFAs for plus, if, id, and ws tokens.

    Using JLex syntax, the regular expressions are as follows:

    /* Regular definitions, notational convenience */
    LETTER=[A-Za-z]
    DIGIT=[0-9]
    UNDERSCORE="_"
    LETT_DIG_UND=({LETTER}|{DIGIT}|{UNDERSCORE})
    ID={LETTER}{LETT_DIG_UND}*
    EOL=(\n|\r|\r\n)
    WS=([ \t])+

    %%

    /* regular expressions for tokens and their associated actions */
    "+"     { return new Symbol(sym.PLUS, ...); }
    "if"    { return new Symbol(sym.IF, ...); }
    {ID}    { return new Symbol(sym.ID, ...); }
    {EOL}   { /* reset yychar, yychar indicates position in line */ }
    {WS}    { /* ignore */ }

    Convert each of the above to an NFA
        -what is the NFA for a single character?
            e.g. "+"
        -how do we concatenate two NFAs?
            e.g. "if"
        -how do we alternate between two NFAs?
            e.g. DIGIT
            e.g. EOL
        -zero or more of the same NFA?
            e.g. ID
            (Do alternation to create LETTER and LETT_DIG_UND teardrops.
             Then show how to do zero or more.)

(2) Create a single NFA out of all individual NFAs.

    Main IDEA: connect points of all teardrops into a start state
    to create NFA for all of the tokens.

(3) (Suggested Exercise: create DFA from subsets of the above NFA on slide 26).

    Some smaller examples:

    1) EOL NFA to DFA, subset construction

    - epsilon-closure(T) "is the set of states that can be reached
      from any state s in T without consuming any input"

        EOL
        T           epsilon-closure(T)
        --------------------------
        {6}         {6}
        {7}         {7,11}
        {8}         {8,11}
        {9}         {9}
        {10}        {10,11}
        {8,9}       {8,9,11}

    - move(T,c) is "the set of all NFA states reachable by following
      any single edge with label c from a state s in T"

        T           c       move(T,c)
        ------------------------------
        {6}         \n      {7}
        {6}         \r      {8,9}
        {9}         \n      {10}

        NOTE
            \n is ASCII code 10
            \r is ASCII code 13
            What will those look like in a hexdump?
    - Dtran transition function,
          Dtran[T, a] = epsilon-closure( move(T,a) )

        T           \n          \r
        -----------------------------------
        {6}         {7,11}      {8,9,11}
        {7,11}      [write this out]
        {8,9,11}    {10,11}
        {10,11}

        -> draw DFA

    2) ID and "if" combined NFAs to DFA

        -> look at subgraph with states 1, 3, 4, and 5

        T           epsilon-closure(T)
        ---------------------------


        T           c       move(T,c)
        ------------------------------


        Dtran table

        DFA State   \n          \r
        -----------------------------------
        {1}

----------------------------
Recognizing tokens using a DFA

    (use the DFA we just constructed)
    -> have the students indicate which state is next and where the
       last-final, lexeme-begin, and forward character in input are.

    - when the DFA goes to the error state, it might have to push the
      last character back onto the input
        - true for all accept states in IF and ID DFA
        - not needed for single character tokens such as ")" or tokens
          that end in a specific character, but can still be
          implemented as such

    - longest match
        - keep processing as long as the next character does not lead
          to the error state

    - priority rule
        - return the token for the highest priority rule even if two
          tokens match in the current state

----------------------------
C comment regular expression

    NOT_STAR=[^*]
    NOT_STAR_OR_SLASH=[^*/]
    C_COMMENT="/*"{NOT_STAR}*("*"({NOT_STAR_OR_SLASH}{NOT_STAR}*)?)*"*/"

    The C comment, why is the regular expression for it so complicated?

    - have the students come up with a regular expression allowing
      no stars or slashes in the comment

    - what about '/*' (not_star_slash* star* not_star_slash*)* '*/' ?
        - doesn't allow any forward slashes

    - we want anything but the string '*/'

        not_star* (star not_slash)* not_star*
            doesn't work because
            1) can get '*/'
            2) must have a not_slash character after star
            3) can't have something like *abcd*

        not_star* (star not_slash*)* not_star_slash*
            1) Close, but can't get '*x/'

--------------------
2/15/11, mstrout@cs.colostate.ed
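
    The C_COMMENT definition above can be sanity-checked outside JLex
    by transcribing it into java.util.regex syntax and trying a few
    strings. This is only a sketch: the class name CCommentCheck, the
    helper isComment, and the sample strings are made up for
    illustration, and the Java pattern is a hand translation of the
    JLex definition, not output from JLex.

```java
import java.util.regex.Pattern;

final class CCommentCheck {
    // Hand translation of:
    //   "/*"{NOT_STAR}*("*"({NOT_STAR_OR_SLASH}{NOT_STAR}*)?)*"*/"
    // with NOT_STAR=[^*] and NOT_STAR_OR_SLASH=[^*/].
    static final Pattern C_COMMENT =
        Pattern.compile("/\\*[^*]*(\\*([^*/][^*]*)?)*\\*/");

    // True only if the whole string is exactly one C comment.
    static boolean isComment(String s) {
        return C_COMMENT.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "/* hello */",      // plain comment
            "/**/",             // empty comment
            "/***/",            // a lone star inside
            "/* a*b */",        // star followed by an ordinary character
            "/* *x/ */",        // the '*x/' case from the notes
            "/* x*/ y */",      // contains an earlier '*/', so not one comment
            "/* unterminated"   // never closed
        };
        for (String s : samples) {
            System.out.println(isComment(s) + "  " + s);
        }
    }
}
```

    The first five samples match and the last two do not. The
    "/* x*/ y */" case is the one the simpler attempts above get
    wrong: a star may only be followed by a non-star, non-slash
    character, by another star, or by the closing slash.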