Lexical analysis
P.J. Drongowski
22 October 2004

Compiler architecture

* Compilers are usually divided into passes where each pass carries out
  a particular responsibility
* The passes intercommunicate using intermediate files
* The intermediate files are deleted, so the developer doesn't see them
* A command like g++ is actually a driver that controls the compilation
  and linking process

      Source code
           |
           V
      Preprocessor --> Parsing and     --> Optimization --> Back-end
                       code generation                      code generation
                                                                 |
                                                                 V
                                                            Object module

* The parsing and code generation pass consists of four main components:
  + Lexical analyzer: breaks the source code into tokens, which are the
    symbols used by the parser
  + Parser: checks the source code against the language grammar
    - Identifies syntax errors
    - Produces an intermediate representation (IR) of the program; the
      IR is a semantic bridge from the language to machine code
  + Code generator: produces a first draft of the machine-level program,
    to be improved by the optimizer
  + Symbol table: a structured dictionary of reserved words, variables,
    and other names

Lexical analyzer

* Individual characters are too low-level a representation to parse
* The lexical analyzer scans the characters in the source code, breaks
  them into "lexemes," and then maps the lexemes into "tokens"
* Tokens are indivisible lexical units -- the most fundamental units of
  the language -- and are more easily used by the parser
* A lexeme is the original string from which the token was recognized
* Example: In the following sequence of characters:

      a = b + c ;

  the lexemes and tokens are:

      a  ->  IDENTIFIER
      =  ->  ASSIGNMENT_OP
      b  ->  IDENTIFIER
      +  ->  ADD_OP
      c  ->  IDENTIFIER
      ;  ->  STATEMENT_SEPARATOR

* Many different lexemes, like variable names, map to the same token
  (IDENTIFIER), while other lexemes, like ';', always map to one
  particular token
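The lexeme-to-token mapping above can be sketched in C++. The token
names follow the example; `classify` and `tokenize` are hypothetical
helper names, and splitting on whitespace is a simplification that
happens to work for this statement:

```cpp
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

enum TokenType { UNDEFINED, IDENTIFIER, ASSIGNMENT_OP, ADD_OP,
                 STATEMENT_SEPARATOR };

struct Token {
    TokenType type;
    std::string lexeme;   // the original string the token came from
};

// Map one whitespace-delimited lexeme to a token type.
TokenType classify(const std::string &lex) {
    if (lex == "=") return ASSIGNMENT_OP;
    if (lex == "+") return ADD_OP;
    if (lex == ";") return STATEMENT_SEPARATOR;
    if (!lex.empty() && std::isalpha((unsigned char)lex[0]))
        return IDENTIFIER;
    return UNDEFINED;
}

// Break a source string into (token, lexeme) pairs.
std::vector<Token> tokenize(const std::string &src) {
    std::vector<Token> out;
    std::istringstream in(src);
    std::string lex;
    while (in >> lex)
        out.push_back(Token{classify(lex), lex});
    return out;
}
```

Note how the three identifiers a, b, and c all map to the same token,
IDENTIFIER, while ';' always maps to STATEMENT_SEPARATOR.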
Token set

* The choice of token set is language dependent
* The mapping of symbols to tokens also affects the ease of parsing the
  language and the size of the parser
  + One token per input symbol: makes code generation easier, but
    results in a larger parser (more parse rules, etc.)
  + One token for several related input symbols: produces a smaller
    parser, but code generation is more complicated; the code generator
    must use the lexemes to disambiguate the token
  + Example: Relational operators  >, >=, ==, !=, <=, <

Source: Compiler Design in C, Allen I. Holub, Prentice-Hall, 1990

******************************************************
Interface and representation
******************************************************

* The interface to the lexical analyzer consists of:
  + An enumeration type (or symbolic constants) to represent the tokens
  + A method function to open and initialize the input token stream
  + A method function to advance the input token stream to the next
    token
  + One or more method functions to get the token, get the lexeme
    associated with the token, or match the lexeme against a pattern
    string
* Lexical analysis is assisted by the ability to peek ahead to the next
  character in the file input stream
  + To "peek" means to look at the next character without removing it
    from the input stream
  + Peeking is often called "lookahead"
  + This is needed to recognize multi-character tokens, e.g., <=, ==
  + Peeking also makes it easier to detect the end of long lexemes like
    identifiers and numbers
* Parsing is also assisted by lookahead
  + Method functions to get the token, etc. effectively peek at the
    current token without affecting the token input stream
  + The advance method function moves to the next token
  + Splitting the peek operations from the advance operation gives the
    parser explicit control over the input token stream -- nothing
    advances to the next token unless the parser says it's OK
* The interface to the lexical analyzer hides all details of file
  input, etc.
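Character-level peeking is exactly what lets a scanner recognize the
multi-character relational operators mentioned above. A minimal sketch
(`scanRelOp` is a hypothetical helper name, not part of any required
interface):

```cpp
#include <cstdio>    // for EOF
#include <sstream>
#include <string>

// Scan one operator lexeme from the front of the stream. peek() looks
// at the next character WITHOUT consuming it, so a '<' followed by '='
// becomes the two-character lexeme "<=", while "< x" yields just "<".
std::string scanRelOp(std::istream &in) {
    std::string lexeme;
    int c = in.get();                  // consume the first character
    if (c == EOF) return lexeme;
    lexeme += (char)c;
    if ((c == '<' || c == '>' || c == '=' || c == '!') &&
        in.peek() == '=')              // lookahead: is '=' next?
        lexeme += (char)in.get();      // consume it only after peeking
    return lexeme;
}
```

With the "one token for several related symbols" design, all six lexemes
would map to a single RELATIONAL_OP token, and the code generator would
look at the saved lexeme to tell them apart.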
Representation of tokens

* An enumeration type defines a set of symbolic constants where the
  value of each constant is unique
* This allows for the definition of a set of independent choices, like
  token type
* Example:

      enum TokenType {
          UNDEFINED, IDENTIFIER, NUMBER, DELIMITER, END_OF_INPUT
      } ;

Representation of lexemes

* The representation of a lexeme varies with the type of token to be
  returned:

      IDENTIFIER  -->  char identifier[256] ;
      NUMBER      -->  int number ;
      DELIMITER   -->  char delimiter ;

******************************************************
Example of a simple lexical analyzer interface
******************************************************

* Infrastructure method functions

      LexicalAnalyzer()       Constructor
      open(char *filename)    Opens the input file, advances to first token
      close()                 Closes the input file stream

* Advance to next token

      advance()               Advances token stream to next token

* Detect token type

      getToken()              Return token's TokenType value
      isEOI()                 Return true if END_OF_INPUT
      isDelimiter()           Return true if token is a DELIMITER
      isIdentifier()          Return true if token is an IDENTIFIER
      isNumber()              Return true if token is a NUMBER

* Pattern matching

      matches(char *pattern)  Return true if current lexeme matches pattern

* Retrieve lexeme

      getDelimiter(char *dest)     Copy delimiter lexeme to destination string
      getIdentifier(char *dest)    Copy identifier lexeme to destination string
      getNumber()                  Return integer value of number

* Debugging

      printToken()            Print information about current token

***********************************
Implementation style: Hard-coded
***********************************

* The key method is advance since it:
  + Recognizes the next token
  + Saves information about the token and lexeme (to be retrieved or
    tested by the other method functions)
* The advance method has three main phases:
  + Scan and ignore white space (including newline characters) up to
    the first non-space character
  + Use the first non-space character to recognize the start of an
    identifier, number, or delimiter
  + Scan and
    recognize the rest of the lexeme, and save any necessary
    information about the token type, etc.

* The lexical analyzer uses the C++ input stream function peek() to
  look ahead to the next character in the file input stream
  + peek() does not remove the next character from the file input
    stream
  + See http://www.cs.tufts.edu/comp/15/notes/cref/file.html for more
    information about reading from a file
* The lexical analyzer must recognize and return the end-of-file
  condition as the token END_OF_INPUT

==> This sounds complicated, but it's actually straightforward.

==> DO NOT copy the code from the handout (lex.c)
  + It's old style C that probably will not compile
  + It's really ugly code -- you can do much better!

* The standard C library cctype (#include <cctype>) contains helpful
  functions to determine the lexical type (e.g., letter, digit, space,
  punctuation, etc.) of ASCII characters. Please use this library. See
  http://www.cs.tufts.edu/comp/15/notes/cref/cref.html

***********************************
Implementation style: State table
***********************************

==> This approach is optional
==> You will probably have to do further independent investigation.

You will encounter FSMs in CS theory, language processing, and hardware
design. A finite state machine (FSM) consists of:

* A finite set of states
* A set of transitions from one state to another state
* A start state
* A set of final, accepting states

The states are labelled from 0 to N-1, where N is the total number of
states (i.e., including the start and accepting states). Each
transition is labelled with a character type (letter, digit, alphanum,
punctuation, etc.) An FSM starts operation in the start state and moves
to some other state of the machine as determined by the labels on the
outgoing transitions. A simple integer is all that is needed to keep
track of the current state of the machine.

It's often easier to visualize the behavior of a finite state machine
as a graph called a "state diagram."
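Before looking at state tables, here is a minimal sketch of the
hard-coded style described above: the three phases of advance, written
with peek() and the cctype functions, behind part of the interface
listed earlier. For brevity this sketch reads from a string through an
istringstream rather than wrapping the ifstream that open(filename)
would manage, and the copy-out lexeme methods are omitted:

```cpp
#include <cctype>
#include <cstdio>    // for EOF
#include <sstream>
#include <string>

enum TokenType { UNDEFINED, IDENTIFIER, NUMBER, DELIMITER, END_OF_INPUT };

class LexicalAnalyzer {
public:
    explicit LexicalAnalyzer(const std::string &source)
        : in(source), token(UNDEFINED) { advance(); }

    void advance() {
        lexeme.clear();
        while (std::isspace(in.peek()))      // phase 1: skip white space
            in.get();
        int c = in.peek();                   // phase 2: classify the
        if (c == EOF) {                      //   first non-space char
            token = END_OF_INPUT;
        } else if (std::isalpha(c)) {        // phase 3: scan the rest
            while (std::isalnum(in.peek()))  //   of the lexeme
                lexeme += (char)in.get();
            token = IDENTIFIER;
        } else if (std::isdigit(c)) {
            while (std::isdigit(in.peek()))
                lexeme += (char)in.get();
            token = NUMBER;
        } else {
            lexeme += (char)in.get();        // single-char delimiter
            token = DELIMITER;
        }
    }

    TokenType getToken() const  { return token; }
    bool isEOI() const          { return token == END_OF_INPUT; }
    bool isIdentifier() const   { return token == IDENTIFIER; }
    bool isNumber() const       { return token == NUMBER; }
    bool isDelimiter() const    { return token == DELIMITER; }
    bool matches(const std::string &pattern) const
                                { return lexeme == pattern; }
    int getNumber() const       { return std::stoi(lexeme); }

private:
    std::istringstream in;   // stand-in for the real file stream
    TokenType token;         // current token type
    std::string lexeme;      // current lexeme text
};
```

Note that nothing moves past the current token until the parser calls
advance() again; the other methods only inspect the saved state.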
The finite state machine can be represented in a table. The rows of the
table are selected by the current state of the machine. The columns of
the table are selected by the current lookahead character (or the type
of the current lookahead character).

The FSM transitions are encoded into the state table in the following
way:

1. Initialize all entries in the table to -1.
2. For each transition, find the element at the row and column position
   associated with the "from" state and the transition label. Set the
   value of that table entry to the number of the "destination" state.

Thus, the state table tells the FSM how to go from state to state using
the current state and the current input symbol. Any table entries still
containing a -1 after the second step are regarded as "error states,"
that is, the implicit destination of any disallowed sequence of input
symbols.

So, what does this have to do with lexical analysis? An FSM can be used
to recognize and build tokens. The states and state transitions
determine the allowed sequences of characters that form recognizable
tokens.

When the lexical scanner is called to advance the token stream, the FSM
starts in the start state. Using the state table, the FSM moves to a
new state as it scans each input character. When the FSM reaches a
final, accepting state, the lexical scanner has successfully recognized
a token. If an input symbol leads from the current state to an error
state, the token has a syntactic error and is not a valid token in the
grammar.

To make this scheme practical, you must also add "actions" to either
the transitions or the destination states. The actions store characters
in a char array for the lexeme and perform other bookkeeping chores.
Actions for the accepting states return to the caller after remembering
the token type, etc. Actions for the error states produce error
messages and attempt recovery.
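The encoding just described can be sketched with a tiny table that
recognizes identifiers and numbers. The particular states, character
classes, and the `run` helper are assumptions of this sketch; it runs
the FSM over a whole lexeme, treats any disallowed character as a
transition to the error state, and omits the lexeme-building actions:

```cpp
#include <cctype>
#include <string>

// Table columns: the type of the current input character.
enum CharClass { LETTER, DIGIT, OTHER, NCLASSES };

int classOf(int c) {
    if (std::isalpha(c)) return LETTER;
    if (std::isdigit(c)) return DIGIT;
    return OTHER;
}

// Table rows: 0 = start state, 1 = in identifier (accepting),
// 2 = in number (accepting). Entries left at -1 are error states.
const int NSTATES = 3;
const int ERROR_STATE = -1;
int table[NSTATES][NCLASSES] = {
    /* start      */ {           1,  2, ERROR_STATE },
    /* identifier */ {           1,  1, ERROR_STATE },
    /* number     */ { ERROR_STATE,  2, ERROR_STATE },
};

// Drive the FSM: a single integer tracks the current state, and each
// input character selects the next state from the table.
int run(const std::string &lexeme) {
    int state = 0;                               // start state
    for (char c : lexeme) {
        state = table[state][classOf((unsigned char)c)];
        if (state == ERROR_STATE) return ERROR_STATE;
    }
    return state;       // 1 or 2 means an accepted token
}
```

For example, "count" ends in state 1 (IDENTIFIER-like), "123" ends in
state 2 (NUMBER-like), and "12ab" falls into the error state because no
transition allows a letter after a digit in state 2.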