Project #7 - Parse infix expression
P.J. Drongowski
COMP15 Fall 2004

*********************************************
More on the grammar
*********************************************

The original expression grammar is:

        E ::=  E '+' T | E '-' T | T
        T ::=  T '*' F | T '/' F | F
        F ::= '(' E ')' | IDENTIFIER | NUMBER

The first production says that an expression (E) is either:

  * An expression (E) followed by a '+' followed by a term (T), OR
  * An expression (E) followed by a '-' followed by a term (T), OR
  * A term (T.)

When the grammar is rewritten to eliminate left recursion, each production
is replaced by two productions (e.g., the original E is replaced by two
new production rules, E and E'.)

       E  ::= T E'
       E' ::= '+' T E' | '-' T E' | EMPTY
       T  ::= F T'
       T' ::= '*' F T' | '/' F T' | EMPTY
       F  ::= NUMBER | IDENTIFIER | '(' E ')'

EMPTY is a new non-terminal that replaces the terminal on the left hand side
of the BNF definition operator when the current input symbol (token) does not
match a legal lookahead symbol. So, the use of E' in the production rules:

       E  ::= T E'
       E' ::= '+' T E' | '-' T E' | EMPTY

makes an E' *optional*. In the first rule, an expression can be just a
term (T), OR it can be a term followed by an E'. Notice that the second
rule is now right recursive, which is conducive for recursive descent
(top-down) parsing. The rule handles expressions in which a '+' is followed
by a term (T) which may *optionally* be followed by an E'. These rules handle
expressions of the form:

      a + b + c

The identifier 'a' matches the term (T) in the first rule. The rest of
the expression is handled by the second rule. The first '+' matches the '+'
in rule for E', 'b' matches the term (T.) Then the E' rule is applied again
and the second '+' matches the '+' in the E' rule and 'c' matches a term (T.)
The third time an attempt is made to apply the E' rule, the parse succeeds
by matching the EMPTY string and a fourth try is not attempted.

*********************************************
Building the parser
*********************************************

The transformed rules are easily translated to code as shown in the
on-line note on recursive descent.

    http://www.cs.tufts.edu/comp/15/notes/recur_descent

Also, be sure to check out the on-line example that shows how to parse
prefix expressions.

    http://www.cs.tufts.edu/comp/15/notes/prefix

*********************************************
Building the tree
*********************************************

If we could directly apply the left recursive rules, it would be easy to
build the tree. When processing an operator, pointers to both subtrees would
be immediately available and it would be easy to create a node for the
operator, then link the left and right subtrees in.

Unfortunately, the rewritten rules complicate tree building a little bit.
Each private parse function (expr, expr_prime, term, term_prime, factor,
number, identifier) is still a little factory that creates a node when
the parse is successful. However, when calling the "prime" parse functions,
it is necessary to pass forward (downwards) a pointer to the left subtree.

Again, consider the expression: a + b + c.
                                           +
                                          / \
    a + b + c   =>  ((a + b) + c)  =>    +   c
                                        / \
                                       a   b
When applying the rules:

       E  ::= T E'
       E' ::= '+' T E' | '-' T E' | EMPTY

The 'a' matches the term (T) in the first rule. This will become the
left subtree of the final expression tree. A pointer to the node representing
'a' must be passed to the parse function for the E' rule. The rest of the
expression is handled by the E' rule which parses and creates nodes in the
usual recursive manner. Your code simply needs to test for the non-EMPTY
case in which another '+' T sequence has been detected. For the non-EMPTY
case, you will need to link in another subtree and return a pointer to the
larger subtree.

*********************************************
Incremental development style
*********************************************

I strongly urge the use of an incremental development style. Do not try
to write the whole program and debug it in its entirety from the start.
Here is a suggested development sequence.

  1. Build a parser that correctly recognizes well-formed expressions.
     Do not try to build the tree in this initial version.
  2. Use the "success" function as shown in the prefix expression example
     to trace execution of your parser. This will give you insight as to
     how the recursive parse functions are called.
  3. Implement the preorder expression display function. Use this function
     for debugging. Remember, it should be able to display any expression
     tree, so it should be able to display your intermediate results.
  4. Add code to generate nodes for the factor (F) rule. Producing nodes
     for an identifier, number and parenthesized subexpression should be
     trivial. Test your code with simple expressions consisting of an
     identifier, a number, an identifier in parens, and a number in parens.
  5. Attack the T (term) and T' (term_prime) rules next. Test with the
     expressions, a+b and a+b+c. Get this case right and you will be able
     to copy and modify your solution for the E (expr) and E' (expr_prime)
     rules.

*********************************************
Expressions for debugging
*********************************************

Here are the expressions that I used for debugging. They show how I used
an incremental development process to build up to the full program.

    80
    var
    (90)
    (var)
    3*4
    a*b*c
    a+b
    a+b+c
    a+b*c
    (a + b) * c
    (a - b) / c
    ((x2 - x1) * (x2 - x1)) + ((y2 - y1) * (y2 - y1))
    (x2 - x1) * (x2 - x1) + (y2 - y1) * (y2 - y1)