CS453 Colorado State University
============================================
Lexical Analysis for MiniSVG
============================================

Before Next time:
- Tuesday night Meggy Jr assembly fun
- HW1 is due
    "<>" does not start with a letter and "<" is going to be a token,
    now what do you do?
    how do you define nested parentheses?
- read Ch. 2.1, 2.2, 4 thru 4.2, context free grammars
- start working on PA1, on Friday will need to show Kiley in about
  30 seconds how your lexer works on an example input file with about
  half the tokens

--------------
Outline
- Finish discussion of transition tables
- MiniSVG token specifications
- lexical analysis for MiniSVG

------------------
Transition diagrams

Now have a concept of peeking at the next character and keeping track
of the current lexeme.

Longest match is the concept that the algorithm using the transition
diagram will continue to run through transitions until no outgoing
edge is found and therefore "if8" is an ID and not the keyword IF.

Priority is the concept that some tokens like keywords have higher
priority than other tokens like ID.  Therefore, "if" is the keyword IF
and not an ID.  The labeling of accept states in the DFA implements this.

--------------
MiniSVG tokens

-> Show regular expression definitions for MiniSVG tokens.
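
The longest match and priority rules from the transition diagram discussion can be sketched in a few lines of Java. This is only an illustration, not the MiniSVGStart code: the class name LongestMatchDemo, the scan(String) helper, the string form of the tokens, and the assumption that "if" is the only keyword are all made up for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class; not part of MiniSVGStart.
class LongestMatchDemo {
    // Reserved words. Assumption: "if" is the only keyword in this toy language.
    private static final Map<String, String> KEYWORDS = new HashMap<>();
    static {
        KEYWORDS.put("if", "IF");
    }

    // Returns one string per token: "IF" for the keyword, "ID(lexeme)" otherwise.
    static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;                       // skip white space between tokens
                continue;
            }
            if (!Character.isLetter(c)) {
                throw new IllegalArgumentException(
                    "unexpected character '" + c + "' at position " + i);
            }
            int start = i;
            // Longest match: keep taking transitions while the next character
            // still has an outgoing edge (a letter or a digit).
            while (i < input.length()
                    && Character.isLetterOrDigit(input.charAt(i))) {
                i++;
            }
            String lexeme = input.substring(start, i);
            // Priority: only after the whole lexeme is known do we check the
            // keyword table, and a keyword wins over ID.
            String tag = KEYWORDS.get(lexeme);
            tokens.add(tag != null ? tag : "ID(" + lexeme + ")");
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scan("if if8 iffy"));
    }
}
```

Checking the keyword table only after the loop stops is what keeps "if8" from being split into IF followed by 8: the lexeme is already "if8" by the time the lookup happens, so the lookup fails and the token is an ID.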
    token     terminal symbol or Tag     lexeme

- keywords can have lexeme be string for keyword, but in general can
  be identified by Tag alone
- id and num have the terminal symbol ID or NUM and then the lexeme
  holds the associated string or number

-> have pairs of students determine which MiniSVG tokens have extra
   information in the lexeme

-> show them how the Token class is defined in MiniSVGStart
    - token tags as enums instead of static ints, note that book was
      written before more recent versions of Java

------------------------------------------
Lexical analysis for MiniSVG

- recommend conceptual approach described in Ch 2.6
    -> draw stream
    -> peek refers to next character in stream
    -> have a separate buffer for lexeme

- overview of algorithm for MiniSVG

    Token scan() {
        skip white space
        handle keyword tokens
        handle start tags
        handle end tag
        handle NUM      // in quotes
        handle COLOR    // in quotes
        handle EQ
        skip comments by calling self recursively
    }

-> step through an example:
    COLOR \t\n"BLUE""RED"\nEOF

------------------------------------------
Implementation Details

- interface
    Lexer class
        line and pos public members
        lexeme private member
        mapping between string and Token.Tag
            reserve(string, Token.Tag)
            Token.Tag lookup(string)
        int nextChar()
            -notice returning integer value for character because
             even though using ASCII have -1 for EOF
            -maintains position
            -appends character to lexeme
        int skipWhiteSpace()
            -returns character after white space
            -maintains line and position
        match(int)
            -pass in next expected character
            -throws ParseException if get unexpected character
        restartLexeme()
            -empties the lexeme buffer
        String getLexeme()
            -returns current lexeme string

- errors, will cover more next week
    - for now keeping track of line and position for each token in
      nextChar() and skipWhiteSpace

- converting numbers
    - they are in quotes
    - book suggests multiplying what have so far by 10 and adding
      next digit, other ideas?
      [Integer.parseInt()]

- string table in book is suggesting one token instance per keyword
    - starting implementation maps strings to Token.Tag enumerated
      type and you will have to call Word constructor.  You can
      change this of course.

- pushing character back on the buffer
    - don't have a peek character
    - instead draw lexeme and filebuf and show what nextChar() and
      pushChar(), restartLexeme(), and getLexeme() do

------------------------------------------
More challenging tokens

    Hello, out there

- problem: how does the lexer know to NOT skip white space?
- do we have a TEXT_START token with multiple lexemes?
  problem is this breaks the token abstraction
- another level of logic
    - go into a different mode when see
    - in third mode put all white space in lexeme for text string
    - go back to original mode when see

------------------------
mstrout@cs.colostate.edu, 1/24/11