CS453 Colorado State University
============================================
Lexical Analysis for MiniSVG
============================================
Before Next time:
- Tuesday night Meggy Jr assembly fun
- HW1 is due
    "<>" does not start with a letter and "<" is going to be a token, now what do you do?
    how do you define nested parentheses?
- read Ch. 2.1, 2.2, and 4 through 4.2 (context-free grammars)
- start working on PA1; on Friday you will need to show Kiley, in about 30
  seconds, how your lexer works on an example input file that uses about half
  of the tokens
--------------
Outline
- Finish discussion of transition tables
- MiniSVG token specifications
- lexical analysis for MiniSVG
------------------
Transition diagrams
We now have the concept of peeking at the next character and of keeping track
of the current lexeme.
Longest match is the concept that the algorithm using the transition diagram
will continue to run through transitions until no outgoing edge is found and
therefore "if8" is an ID and not the keyword IF.
Priority is the concept that some tokens like keywords have
higher priority than other tokens like ID. Therefore, "if" is the keyword
IF and not an ID. The labeling of accept states in the DFA implements
this.
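As a small illustration of longest match and priority, here is a minimal Java sketch. The names WordScanner, Tag, and scanWord are hypothetical and are not taken from MiniSVGStart. It also swaps in a keyword table lookup in place of labeled DFA accept states; the effect on "if" and "if8" is the same.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical names for illustration; not the MiniSVGStart classes.
class WordScanner {
    enum Tag { IF, ID }

    // Keywords are reserved ahead of time; any other word is an ID.
    private static final Map<String, Tag> KEYWORDS = new HashMap<>();
    static {
        KEYWORDS.put("if", Tag.IF);
    }

    // Scans one word starting at index start (assumed to be a letter)
    // and returns the tag for that word.
    static Tag scanWord(String input, int start) {
        int end = start;
        // Longest match: keep consuming while a letter/digit edge exists.
        while (end < input.length()
                && Character.isLetterOrDigit(input.charAt(end))) {
            end++;
        }
        String lexeme = input.substring(start, end);
        // Priority: a lexeme that is exactly a keyword gets the keyword tag.
        return KEYWORDS.getOrDefault(lexeme, Tag.ID);
    }
}
```

With this sketch, scanWord("if8", 0) consumes all three characters before looking anything up, so the keyword table is never consulted for the prefix "if".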
--------------
MiniSVG tokens
-> Show regular expression definitions for MiniSVG tokens.
token
    terminal symbol or Tag
    lexeme
        - for keywords the lexeme can hold the keyword string, but in general
          a keyword can be identified by its Tag alone
        - id and num have the terminal symbol ID or NUM, and the lexeme
          holds the associated string or number
-> have pairs of students determine which MiniSVG tokens have extra
information in the lexeme
-> show them how the Token class is defined in MiniSVGStart
- token tags as enums instead of static ints, note that book was
written before more recent versions of Java
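One possible shape for a token class with enum tags is sketched below. This is an assumption for illustration only: the actual Token class in MiniSVGStart may differ, and the tag list here is deliberately partial. The Num and Word subclasses follow the book's idea that only some tokens carry extra information.

```java
// Hypothetical sketch; the real Token class in MiniSVGStart may differ.
class Token {
    // Partial, assumed tag list; the full MiniSVG set is larger.
    enum Tag { EQ, NUM, COLOR, EOF }

    final Tag tag;

    Token(Tag tag) {
        this.tag = tag;
    }

    @Override
    public String toString() {
        return tag.toString();
    }
}

// A token whose lexeme has been converted to a number.
class Num extends Token {
    final int value;

    Num(int value) {
        super(Tag.NUM);
        this.value = value;
    }

    @Override
    public String toString() {
        return tag + "(" + value + ")";
    }
}

// A token that keeps its lexeme string, such as a color name.
class Word extends Token {
    final String lexeme;

    Word(Tag tag, String lexeme) {
        super(tag);
        this.lexeme = lexeme;
    }

    @Override
    public String toString() {
        return tag + "(" + lexeme + ")";
    }
}
```

An enum gives the compiler a closed set of tags to check, which a set of static ints cannot do.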
------------------------------------------
Lexical analysis for MiniSVG
- recommend conceptual approach described in Ch 2.6
-> draw stream
-> peek refers to next character in stream
-> have a separate buffer for lexeme
- overview of algorithm for MiniSVG
Token scan() {
    skip white space
    handle keyword tokens
    handle start tags
    handle end tag
    handle NUM      // in quotes
    handle COLOR    // in quotes
    handle EQ
    skip comments by calling self recursively
}
-> step through an example: COLOR
\t\n"BLUE""RED"\nEOF
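The COLOR example above can be traced with a small Java sketch. It covers only white space, quoted words, and EOF, and it returns plain strings instead of Token objects to stay short. The class name ColorScanDemo and the helper scanAll are hypothetical, and treating every quoted word as a COLOR is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: handles only white space, quoted words, and EOF.
class ColorScanDemo {
    private final String input;
    private int pos = 0;
    private final StringBuilder lexeme = new StringBuilder();

    ColorScanDemo(String input) {
        this.input = input;
    }

    // Returns the next character as an int, or -1 at end of input.
    private int nextChar() {
        return pos < input.length() ? input.charAt(pos++) : -1;
    }

    // Returns "COLOR(<name>)" for a quoted word, or "EOF" at end of input.
    String scan() {
        int c = nextChar();
        // Skip white space in front of the token.
        while (c == ' ' || c == '\t' || c == '\n' || c == '\r') {
            c = nextChar();
        }
        if (c == -1) {
            return "EOF";
        }
        if (c != '"') {
            throw new IllegalStateException("unexpected character: " + (char) c);
        }
        lexeme.setLength(0);            // restart the lexeme buffer
        c = nextChar();
        while (c != '"' && c != -1) {   // collect up to the closing quote
            lexeme.append((char) c);
            c = nextChar();
        }
        if (c == -1) {
            throw new IllegalStateException("unterminated quoted string");
        }
        return "COLOR(" + lexeme + ")";
    }

    // Calls scan() repeatedly until EOF and collects the results.
    static List<String> scanAll(String text) {
        ColorScanDemo lexer = new ColorScanDemo(text);
        List<String> tokens = new ArrayList<>();
        String t;
        do {
            t = lexer.scan();
            tokens.add(t);
        } while (!t.equals("EOF"));
        return tokens;
    }
}
```

Calling scanAll("\t\n\"BLUE\"\"RED\"\n") yields COLOR(BLUE), COLOR(RED), EOF. Note that no white space separates the two quoted words, so the second scan() starts directly on a quote.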
------------------------------------------
Implementation Details
- interface
    Lexer class
        line and pos public members
        lexeme private member
        mapping between string and Token.Tag
            reserve(string, Token.Tag)
            Token.Tag lookup(string)
        int nextChar()
            - returns an int rather than a char because, even though the
              input is ASCII, EOF is reported as -1
            - maintains position
            - appends character to lexeme
        int skipWhiteSpace()
            - returns character after white space
            - maintains line and position
        match(int)
            - pass in next expected character
            - throws ParseException if it gets an unexpected character
        restartLexeme()
            - empties the lexeme buffer
        String getLexeme()
            - returns current lexeme string
- errors, will cover more next week
    - for now, keep track of line and position for each token in
      nextChar() and skipWhiteSpace()
- converting numbers
    - they are in quotes
    - the book suggests multiplying the value so far by 10 and adding the
      next digit; other ideas? [Integer.parseInt()]
- the string table in the book suggests one token instance per keyword
    - the starting implementation maps strings to the Token.Tag enumerated
      type, and you will have to call the Word constructor. You can change
      this, of course.
- pushing a character back on the buffer
    - we don't have a peek character
    - instead, draw lexeme and filebuf and show what nextChar(), pushChar(),
      restartLexeme(), and getLexeme() do
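The interplay of lexeme and filebuf can be sketched in Java as follows. The method names come from the notes, but every body here is an assumption, as is the class name LexerSketch and the helper scanDigits. Line tracking, skipWhiteSpace(), and match() are left out to keep the sketch short, and the input is held in a String instead of a file.

```java
// Hypothetical sketch; method bodies are assumptions, not MiniSVGStart code.
class LexerSketch {
    public int pos = 0;               // count of characters consumed so far
    private final String filebuf;     // stands in for the input file buffer
    private int next = 0;             // index of the next unread character
    private final StringBuilder lexeme = new StringBuilder();

    LexerSketch(String text) {
        this.filebuf = text;
    }

    // Returns the next character as an int, or -1 for EOF.
    // Each character read is also appended to the lexeme.
    int nextChar() {
        if (next >= filebuf.length()) {
            return -1;
        }
        char c = filebuf.charAt(next++);
        pos++;
        lexeme.append(c);
        return c;
    }

    // Undoes the most recent successful nextChar().
    void pushChar() {
        if (lexeme.length() > 0) {
            next--;
            pos--;
            lexeme.setLength(lexeme.length() - 1);
        }
    }

    void restartLexeme() {
        lexeme.setLength(0);
    }

    String getLexeme() {
        return lexeme.toString();
    }

    // Converts digits the way the book suggests: value * 10 + digit.
    // Overflow is not checked in this sketch.
    int scanDigits() {
        int value = 0;
        int c = nextChar();
        while (c >= '0' && c <= '9') {
            value = value * 10 + (c - '0');
            c = nextChar();
        }
        if (c != -1) {
            pushChar();   // the non-digit belongs to the next token
        }
        return value;
    }
}
```

Because pushChar() also trims the lexeme, getLexeme() holds only the digits after scanDigits() returns, so Integer.parseInt(getLexeme()) would give the same value as the multiply-by-10 loop.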
------------------------------------------
More challenging tokens
Hello, out there
- problem: how does the lexer know to NOT skip white space?
- do we have a TEXT_START token with multiple lexemes?
problem is this breaks the token abstraction
another level of logic
    - go into a different mode when the lexer sees the text start tag
    - in third mode put all white space in lexeme for text string
    - go back to original mode when the lexer sees the text end tag
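The mode idea can be sketched in Java as below. This is a simplified illustration under loud assumptions: it uses only two modes, it spells the tags as "<text>" and "</text>" with no attributes, and it returns raw lexeme strings. The class name TextModeDemo is hypothetical, and a MiniSVG lexer would need more than this.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical two-mode sketch; the tag spelling "<text>" is an assumption.
class TextModeDemo {
    enum Mode { NORMAL, TEXT }

    // Splits the input into tag lexemes and text lexemes.
    static List<String> scanAll(String input) {
        List<String> lexemes = new ArrayList<>();
        Mode mode = Mode.NORMAL;
        int i = 0;
        while (i < input.length()) {
            if (mode == Mode.TEXT) {
                // TEXT mode: white space is part of the lexeme.
                int end = input.indexOf('<', i);
                if (end < 0) {
                    end = input.length();
                }
                lexemes.add(input.substring(i, end));
                i = end;
                mode = Mode.NORMAL;   // the next '<' is scanned normally
            } else if (Character.isWhitespace(input.charAt(i))) {
                i++;                  // NORMAL mode: skip white space
            } else if (input.charAt(i) == '<') {
                int end = input.indexOf('>', i);
                if (end < 0) {
                    throw new IllegalStateException("unterminated tag");
                }
                String tag = input.substring(i, end + 1);
                lexemes.add(tag);
                i = end + 1;
                if (tag.equals("<text>")) {
                    mode = Mode.TEXT; // stop skipping white space
                }
            } else {
                throw new IllegalStateException(
                        "unexpected character: " + input.charAt(i));
            }
        }
        return lexemes;
    }
}
```

The white space before the start tag is skipped, while the white space inside the text string is kept, which is the behavior the mode switch is meant to provide.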
------------------------
mstrout@cs.colostate.edu, 1/24/11