Recursive descent parsing
P.J. Drongowski
3 November 2004

The parser

  * The parser recognizes a sentence (e.g., expression, program, ...)
    + It determines if the sentence is grammatically correct according
      to rules of a formal grammar
    + It translates the sentence (expression, program, ...) into an
      intermediate form, like a tree, which can be processed by the
      code generator
  * The parser obtains tokens from the lexical analyzer
    + The parser gets a new token by calling the ADVANCE method
    + The parser calls the lexical analyzer's accessor functions to
      get token type, get lexemes, test lexemes, etc.


Recursive descent parsing

  * Rules of the grammar are "hard coded" into the parser
  * Basic idea: Given a specification for a grammar, a recursive descent
    parser is a recursive program that recognizes sentences expressed in
    the grammar
  * The function call stack remembers where the parser is in the input
    stream of tokens; it remembers where we've been
  * The parser recognizes a sentence in a top-down manner
    + It progressively recognizes phrases that make up the sentence
    + Eventually, the recognition process decomposes the phrases/sentence
      into tokens
  * The intermediate form (e.g., expression tree) is built from the
    bottom up
    + As the levels of nesting unwind, each level returns a pointer to
      the intermediate form for a recognized phrase
    + Each level builds a node in the intermediate form that links together
      the substructures returned from below


How are grammars specified?

  * Productions, for example, Backus-Naur Form (BNF)
    + A production is a rule consisting of terminals and non-terminals
    + A terminal can be directly resolved to a token
    + A non-terminal is the result of applying a rule
    + A rule has a non-terminal on the left hand side of the definition
      operator, e.g., '::='
    + The right hand side of the definition operator consists of terminals
      and non-terminals
  * Warning - not all grammars are well-behaved!
    + Left recursion is bad for top-down parsing such as recursive descent
    + Ambiguities may exist and must be resolved
      - Example: The minus sign
      - Can be part of a number
      - Can be a unary operator to take the negative of an expression
      - Can be a binary subtract operator
      - Usually, the lexical analyzer disambiguates unary minus
      - The lexical analyzer must use context to perform disambiguation
    + Formal languages will be discussed in COMP80 Programming Languages


Example production:     E ::= E '+' T | E '-' T | T

  * 'E' and 'T' are non-terminals
  * '+' and '-' are terminals (actual characters)
  * The | grammar operator means alternation or choice (this or that)


Grammar for simple arithmetic expressions

        E ::=  E '+' T | E '-' T | T
        T ::=  T '*' F | T '/' F | F
        F ::= '(' E ')' | IDENTIFIER | NUMBER

  * "IDENTIFIER" and "NUMBER" are terminals (kinds of tokens returned by
    the lexical analyzer)
  * This grammar is left recursive (bad)
    + Naive, straightforward translation to code will recurse until
      call stack overlfows
    + Left recursion must be eliminated


Eliminating left recursion

  * Left recursion is eliminated by rewriting the grammar

       E  ::= T E'
       E' ::= '+' T E' | '-' T E' | EMPTY
       T  ::= F T'
       T' ::= '*' F T' | '/' F T' | EMPTY
       F  ::= NUMBER | IDENTIFIER | '(' E ')'

  * The theory behind the rewrite rules is beyond the scope of this course!
    + The parser must effectively lookahead to see what's coming next
    + Lookahead is supported by the behavior of ADVANCE in the lexical analyzer
  * However, you can see that each left recursive rule is rewritten in a
    systematic way
  * The rewrite adds a new terminal: the EMPTY string
    + The EMPTY string is not a token
    + The EMPTY string replaces the left hand side of a production when the
      current input symbol doesn't match a legal lookahead symbol
  * Left recursion is eliminated making possible a direct translation of
    the rules to code
  * The clarity of the left recursive grammar, unfortunately, is lost --
    it's not as easy to read and understand


********************************************************
Pseudo-code for handling '+', '*' and identifiers
********************************************************

  * This pseudo-code is the general outline for parsing the simple
    expression grammar given above
  * It shows how to eliminate left recursion
  * Work to be done
    + Extend the simple grammar
      + Handle '-' and '/' using the usual operator precedence
      + Handle positive integer numbers as expression operands (i.e.,
        both identifiers and positive integer numbers are operands)
    + Converted to a C++ class (e.g., Prefix)
    + Code must be added to build the parse tree from bottom up

        void E() {
          T();
          EPRIME();
        }
        void EPRIME() {
          if (input == '+') {
            ADVANCE() ;
            T() ;
            EPRIME() ;
          }
        }
        void T() {
          F();
          TPRIME();
        }
        void TPRIME() {
          if (input == '*') {
            ADVANCE() ;
            F() ;
            TPRIME() ;
          }
        }
        void F() {
          if (input == identifier) {
            ADVANCE() ;
          } else if (input == '(') {
            ADVANCE() ;
            E() ;
            if (input == ')') ADVANCE() ; else ERROR() ;
          } else {
            ERROR();
          }
        }


Source: "Compiler Design in C" by Allen I. Holub, Prentice Hall, 1990.