Lexical analysis
P.J. Drongowski
22 October 2004

Compiler architecture

* Compilers are usually divided into passes where each pass carries out
  a particular responsibility
* The passes intercommunicate using intermediate files
* The intermediate files are deleted, so the developer doesn't see them
* A command like g++ is actually a driver that controls the compilation
  and linking process

      Source code
           |
           V
      Preprocessor --> Parsing and     --> Optimization --> Back-end
                       code generation                      code generation
                                                                 |
                                                                 V
                                                            Object module

* The parsing and code generation pass consists of four main components:
  + Lexical analyzer: breaks the source code into tokens, which are the
    symbols used by the parser
  + Parser: checks the source code against the language grammar
    - Identifies syntax errors
    - Produces an intermediate representation (IR) of the program; the
      IR is a semantic bridge from the language to machine code
  + Code generator: produces a first draft of the machine-level program,
    to be improved by the optimizer
  + Symbol table: a structured dictionary of reserved words, variables,
    and other names

Lexical analyzer

* Individual characters are too low-level a representation to parse
* The lexical analyzer scans the characters in the source code, breaks
  them into "lexemes," and then maps the lexemes into "tokens"
* Tokens are indivisible lexical units -- the most fundamental units of
  the language -- and are more easily used by the parser
* A lexeme is the original string from which the token was recognized
* Example: In the following sequence of characters:

      a = b + c ;

  the lexemes and tokens are:

      a  ->  IDENTIFIER
      =  ->  ASSIGNMENT_OP
      b  ->  IDENTIFIER
      +  ->  ADD_OP
      c  ->  IDENTIFIER
      ;  ->  STATEMENT_SEPARATOR

* Many different lexemes, like variable names, map to the same token
  (IDENTIFIER), while other lexemes, like ';', always map to one
  particular token
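The lexeme-to-token mapping above can be sketched in C++. The token
names follow the example; `classify` and `tokenize` are hypothetical
helper names, and splitting on whitespace is a simplification that
happens to work for this statement:

```cpp
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

enum TokenType { UNDEFINED, IDENTIFIER, ASSIGNMENT_OP, ADD_OP,
                 STATEMENT_SEPARATOR };

struct Token {
    TokenType type;
    std::string lexeme;   // the original string the token came from
};

// Map one whitespace-delimited lexeme to a token type.
TokenType classify(const std::string &lex) {
    if (lex == "=") return ASSIGNMENT_OP;
    if (lex == "+") return ADD_OP;
    if (lex == ";") return STATEMENT_SEPARATOR;
    if (!lex.empty() && std::isalpha((unsigned char)lex[0]))
        return IDENTIFIER;
    return UNDEFINED;
}

// Break a source string into (token, lexeme) pairs.
std::vector<Token> tokenize(const std::string &src) {
    std::vector<Token> out;
    std::istringstream in(src);
    std::string lex;
    while (in >> lex)
        out.push_back(Token{classify(lex), lex});
    return out;
}
```

Note how the three identifiers a, b, and c all map to the same token,
IDENTIFIER, while ';' always maps to STATEMENT_SEPARATOR.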
Token set

* The choice of token set is language dependent
* The mapping of symbols to tokens also affects the ease of parsing the
  language and the size of the parser
  + One token per input symbol: makes code generation easier, but
    results in a larger parser (more parse rules, etc.)
  + One token for several related input symbols: produces a smaller
    parser, but code generation is more complicated; the code generator
    must use the lexemes to disambiguate the token
  + Example: Relational operators  >, >=, ==, !=, <=, <

Source: Compiler Design in C, Allen I. Holub, Prentice-Hall, 1990

******************************************************
Interface and representation
******************************************************

* The interface to the lexical analyzer consists of:
  + An enumeration type (or symbolic constants) to represent the tokens
  + A method function to open and initialize the input token stream
  + A method function to advance the input token stream to the next
    token
  + One or more method functions to get the token, get the lexeme
    associated with the token, or match the lexeme against a pattern
    string
* Lexical analysis is assisted by the ability to peek ahead to the next
  character in the file input stream
  + To "peek" means to look at the next character without removing it
    from the input stream
  + Peeking is often called "lookahead"
  + This is needed to recognize multi-character tokens, e.g., <=, ==
  + Peeking also makes it easier to detect the end of long lexemes like
    identifiers and numbers
* Parsing is also assisted by lookahead
  + Method functions to get the token, etc. effectively peek at the
    current token without affecting the token input stream
  + The advance method function moves to the next token
  + Splitting the peek operations from the advance operation gives the
    parser explicit control over the input token stream -- nothing
    advances to the next token unless the parser says it's OK
* The interface to the lexical analyzer hides all details of file
  input, etc.
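Character-level peeking is exactly what lets a scanner recognize the
multi-character relational operators mentioned above. A minimal sketch
(`scanRelOp` is a hypothetical helper name, not part of any required
interface):

```cpp
#include <cstdio>    // for EOF
#include <sstream>
#include <string>

// Scan one operator lexeme from the front of the stream. peek() looks
// at the next character WITHOUT consuming it, so a '<' followed by '='
// becomes the two-character lexeme "<=", while "< x" yields just "<".
std::string scanRelOp(std::istream &in) {
    std::string lexeme;
    int c = in.get();                  // consume the first character
    if (c == EOF) return lexeme;
    lexeme += (char)c;
    if ((c == '<' || c == '>' || c == '=' || c == '!') &&
        in.peek() == '=')              // lookahead: is '=' next?
        lexeme += (char)in.get();      // consume it only after peeking
    return lexeme;
}
```

With the "one token for several related symbols" design, all six lexemes
would map to a single RELATIONAL_OP token, and the code generator would
look at the saved lexeme to tell them apart.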
Representation of tokens

* An enumeration type defines a set of symbolic constants where the
  value of each constant is unique
* This allows for the definition of a set of independent choices, like
  token type
* Example:

      enum TokenType {
          UNDEFINED, IDENTIFIER, NUMBER, DELIMITER, END_OF_INPUT
      } ;

Representation of lexemes

* The representation of a lexeme varies with the type of token to be
  returned:

      IDENTIFIER  -->  char identifier[256] ;
      NUMBER      -->  int number ;
      DELIMITER   -->  char delimiter ;

******************************************************
Example of a simple lexical analyzer interface
******************************************************

* Infrastructure method functions

      LexicalAnalyzer()       Constructor
      open(char *filename)    Opens the input file, advances to first token
      close()                 Closes the input file stream

* Advance to next token

      advance()               Advances token stream to next token

* Detect token type

      getToken()              Return token's TokenType value
      isEOI()                 Return true if END_OF_INPUT
      isDelimiter()           Return true if token is a DELIMITER
      isIdentifier()          Return true if token is an IDENTIFIER
      isNumber()              Return true if token is a NUMBER

* Pattern matching

      matches(char *pattern)  Return true if current lexeme matches pattern

* Retrieve lexeme

      getDelimiter(char *dest)     Copy delimiter lexeme to destination string
      getIdentifier(char *dest)    Copy identifier lexeme to destination string
      getNumber()                  Return integer value of number

* Debugging

      printToken()            Print information about current token

***********************************
Implementation style: Hard-coded
***********************************

* The key method is advance since it:
  + Recognizes the next token
  + Saves information about the token and lexeme (to be retrieved or
    tested by the other method functions)
* The advance method has three main phases:
  + Scan and ignore white space (including newline characters) up to
    the first non-space character
  + Use the first non-space character to recognize the start of an
    identifier, number, or delimiter
  + Scan and
    recognize the rest of the lexeme, and save any necessary
    information about the token type, etc.

* The lexical analyzer uses the C++ input stream function peek() to
  look ahead to the next character in the file input stream
  + peek() does not remove the next character from the file input
    stream
  + See http://www.cs.tufts.edu/comp/15/notes/cref/file.html for more
    information about reading from a file
* The lexical analyzer must recognize and return the end-of-file
  condition as the token END_OF_INPUT

==> This sounds complicated, but it's actually straightforward.

==> DO NOT copy the code from the handout (lex.c)
  + It's old style C that probably will not compile
  + It's really ugly code -- you can do much better!

* The standard C library cctype (#include <cctype>) contains helpful
  functions to determine the lexical type (e.g., letter, digit, space,
  punctuation, etc.) of ASCII characters. Please use this library. See
  http://www.cs.tufts.edu/comp/15/notes/cref/cref.html

***********************************
Implementation style: State table
***********************************

==> This approach is optional
==> You will probably have to do further independent investigation.

You will encounter FSMs in CS theory, language processing, and hardware
design. A finite state machine (FSM) consists of:

* A finite set of states
* A set of transitions from one state to another state
* A start state
* A set of final, accepting states

The states are labelled from 0 to N-1, where N is the total number of
states (i.e., including the start and accepting states). Each
transition is labelled with a character type (letter, digit, alphanum,
punctuation, etc.) An FSM starts operation in the start state and moves
to some other state of the machine as determined by the labels on the
outgoing transitions. A simple integer is all that is needed to keep
track of the current state of the machine.

It's often easier to visualize the behavior of a finite state machine
as a graph called a "state diagram."
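Before looking at state tables, here is a minimal sketch of the
hard-coded style described above: the three phases of advance, written
with peek() and the cctype functions, behind part of the interface
listed earlier. For brevity this sketch reads from a string through an
istringstream rather than wrapping the ifstream that open(filename)
would manage, and the copy-out lexeme methods are omitted:

```cpp
#include <cctype>
#include <cstdio>    // for EOF
#include <sstream>
#include <string>

enum TokenType { UNDEFINED, IDENTIFIER, NUMBER, DELIMITER, END_OF_INPUT };

class LexicalAnalyzer {
public:
    explicit LexicalAnalyzer(const std::string &source)
        : in(source), token(UNDEFINED) { advance(); }

    void advance() {
        lexeme.clear();
        while (std::isspace(in.peek()))      // phase 1: skip white space
            in.get();
        int c = in.peek();                   // phase 2: classify the
        if (c == EOF) {                      //   first non-space char
            token = END_OF_INPUT;
        } else if (std::isalpha(c)) {        // phase 3: scan the rest
            while (std::isalnum(in.peek()))  //   of the lexeme
                lexeme += (char)in.get();
            token = IDENTIFIER;
        } else if (std::isdigit(c)) {
            while (std::isdigit(in.peek()))
                lexeme += (char)in.get();
            token = NUMBER;
        } else {
            lexeme += (char)in.get();        // single-char delimiter
            token = DELIMITER;
        }
    }

    TokenType getToken() const  { return token; }
    bool isEOI() const          { return token == END_OF_INPUT; }
    bool isIdentifier() const   { return token == IDENTIFIER; }
    bool isNumber() const       { return token == NUMBER; }
    bool isDelimiter() const    { return token == DELIMITER; }
    bool matches(const std::string &pattern) const
                                { return lexeme == pattern; }
    int getNumber() const       { return std::stoi(lexeme); }

private:
    std::istringstream in;   // stand-in for the real file stream
    TokenType token;         // current token type
    std::string lexeme;      // current lexeme text
};
```

Note that nothing moves past the current token until the parser calls
advance() again; the other methods only inspect the saved state.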
The finite state machine can be represented in a table. The rows of the
table are selected by the current state of the machine. The columns of
the table are selected by the current lookahead character (or the type
of the current lookahead character).

The FSM transitions are encoded into the state table in the following
way:

1. Initialize all entries in the table to -1.
2. For each transition, find the element at the row and column position
   associated with the "from" state and the transition label. Set the
   value of that table entry to the number of the "destination" state.

Thus, the state table tells the FSM how to go from state to state using
the current state and the current input symbol. Any table entries still
containing a -1 after the second step are regarded as "error states,"
that is, the implicit destination of any disallowed sequence of input
symbols.

So, what does this have to do with lexical analysis? An FSM can be used
to recognize and build tokens. The states and state transitions
determine the allowed sequences of characters that form recognizable
tokens.

When the lexical scanner is called to advance the token stream, the FSM
starts in the start state. Using the state table, the FSM moves to a
new state as it scans each input character. When the FSM reaches a
final, accepting state, the lexical scanner has successfully recognized
a token. If an input symbol leads from the current state to an error
state, the token has a syntactic error and is not a valid token in the
grammar.

To make this scheme practical, you must also add "actions" to either
the transitions or the destination states. The actions store characters
in a char array for the lexeme and perform other bookkeeping chores.
Actions for the accepting states return to the caller after remembering
the token type, etc. Actions for the error states produce error
messages and attempt recovery.
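The encoding just described can be sketched with a tiny table that
recognizes identifiers and numbers. The particular states, character
classes, and the `run` helper are assumptions of this sketch; it runs
the FSM over a whole lexeme, treats any disallowed character as a
transition to the error state, and omits the lexeme-building actions:

```cpp
#include <cctype>
#include <string>

// Table columns: the type of the current input character.
enum CharClass { LETTER, DIGIT, OTHER, NCLASSES };

int classOf(int c) {
    if (std::isalpha(c)) return LETTER;
    if (std::isdigit(c)) return DIGIT;
    return OTHER;
}

// Table rows: 0 = start state, 1 = in identifier (accepting),
// 2 = in number (accepting). Entries left at -1 are error states.
const int NSTATES = 3;
const int ERROR_STATE = -1;
int table[NSTATES][NCLASSES] = {
    /* start      */ {           1,  2, ERROR_STATE },
    /* identifier */ {           1,  1, ERROR_STATE },
    /* number     */ { ERROR_STATE,  2, ERROR_STATE },
};

// Drive the FSM: a single integer tracks the current state, and each
// input character selects the next state from the table.
int run(const std::string &lexeme) {
    int state = 0;                               // start state
    for (char c : lexeme) {
        state = table[state][classOf((unsigned char)c)];
        if (state == ERROR_STATE) return ERROR_STATE;
    }
    return state;       // 1 or 2 means an accepted token
}
```

For example, "count" ends in state 1 (IDENTIFIER-like), "123" ends in
state 2 (NUMBER-like), and "12ab" falls into the error state because no
transition allows a letter after a digit in state 2.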