Recursive descent parsing P.J. Drongowski 3 November 2004 The parser * The parser recognizes a sentence (e.g., expression, program, ...) + It determines if the sentence is grammatically correct according to rules of a formal grammar + It translates the sentence (expression, program, ...) into an intermediate form, like a tree, which can be processed by the code generator * The parser obtains tokens from the lexical analyzer + The parser gets a new token by calling the ADVANCE method + The parser calls the lexical analyzer's accessor functions to get token type, get lexemes, test lexemes, etc. Recursive descent parsing * Rules of the grammar are "hard coded" into the parser * Basic idea: Given a specification for a grammar, a recursive descent parser is a recursive program that recognizes sentences expressed in the grammar * The function call stack remembers where the parser is in the input stream of tokens; it remembers where we've been * The parser recognizes a sentence in a top-down manner + It progressively recognizes phrases that make up the sentence + Eventually, the recognition process decomposes the phrases/sentence into tokens * The intermediate form (e.g., expression tree) is built from the bottom up + As the levels of nesting unwind, each level returns a pointer to the intermediate form for a recognized phrase + Each level builds a node in the intermediate form that links together the substructures returned from below How are grammars specified? * Productions, for example, Backus-Naur Form (BNF) + A production is a rule consisting of terminals and non-terminals + A terminal can be directly resolved to a token + A non-terminal is the result of applying a rule + A rule has a non-terminal on the left hand side of the definition operator, e.g., '::=' + The right hand side of the definition operator consists of terminals and non-terminals * Warning - not all grammars are well-behaved! + Left recursion is bad for top-down parsing such as recursive descent + Ambiguities may exist and must be resolved - Example: The minus sign - Can be part of a number - Can be a unary operator to take the negative of an expression - Can be a binary subtract operator - Usually, the lexical analyzer disambiguates unary minus - The lexical analyzer must use context to perform disambiguation + Formal languages will be discussed in COMP80 Programming Languages Example production: E ::= E '+' T | E '-' T | T * 'E' and 'T' are non-terminals * '+' and '-' are terminals (actual characters) * The | grammar operator means alternation or choice (this or that) Grammar for simple arithmetic expressions E ::= E '+' T | E '-' T | T T ::= T '*' F | T '/' F | F F ::= '(' E ')' | IDENTIFIER | NUMBER * "IDENTIFIER" and "NUMBER" are terminals (kinds of tokens returned by the lexical analyzer) * This grammar is left recursive (bad) + Naive, straightforward translation to code will recurse until call stack overlfows + Left recursion must be eliminated Eliminating left recursion * Left recursion is eliminated by rewriting the grammar E ::= T E' E' ::= '+' T E' | '-' T E' | EMPTY T ::= F T' T' ::= '*' F T' | '/' F T' | EMPTY F ::= NUMBER | IDENTIFIER | '(' E ')' * The theory behind the rewrite rules is beyond the scope of this course! + The parser must effectively lookahead to see what's coming next + Lookahead is supported by the behavior of ADVANCE in the lexical analyzer * However, you can see that each left recursive rule is rewritten in a systematic way * The rewrite adds a new terminal: the EMPTY string + The EMPTY string is not a token + The EMPTY string replaces the left hand side of a production when the current input symbol doesn't match a legal lookahead symbol * Left recursion is eliminated making possible a direct translation of the rules to code * The clarity of the left recursive grammar, unfortunately, is lost -- it's not as easy to read and understand ******************************************************** Pseudo-code for handling '+', '*' and identifiers ******************************************************** * This pseudo-code is the general outline for parsing the simple expression grammar given above * It shows how to eliminate left recursion * Work to be done + Extend the simple grammar + Handle '-' and '/' using the usual operator precedence + Handle positive integer numbers as expression operands (i.e., both identifiers and positive integer numbers are operands) + Converted to a C++ class (e.g., Prefix) + Code must be added to build the parse tree from bottom up void E() { T(); EPRIME(); } void EPRIME() { if (input == '+') { ADVANCE() ; T() ; EPRIME() ; } } void T() { F(); TPRIME(); } void TPRIME() { if (input == '*') { ADVANCE() ; F() ; TPRIME() ; } } void F() { if (input == identifier) { ADVANCE() ; } else if (input == '(') { ADVANCE() ; E() ; if (input == ')') ADVANCE() ; else ERROR() ; } else { ERROR(); } } Source: "Compiler Design in C" by Allen I. Holub, Prentice Hall, 1990.