Precedence-Climbing Parser ========================== Introduction ------------ The following text and the code behind it was largely written by Angus Griffith. It also appears in the code in markdown format in the module ``mathics.code.parser``. The Mathics parser is an operator precedence parser that implements the precedence climbing method. The AST (Abstract Syntax Tree) produced after parsing is a kind of `M-expression `__ which is then used in evaluation. The Wolfram language is surprisingly complex and has quite a few subtleties; this document attempts to cover the major ones. The language is documented on `the Wolfram website `__ although there are a few errors. One primary feature of the Wolfram language is a large number of operators with precedences. It is this characteristic that makes operator precedence parsing well suited to the language. There are however a few special language features that cannot be parsed by the standard precedence climbing algorithm. Precedence Climbing ------------------- Instead of giving an introduction to precedence climbing, here are two references that explain the algorithm better than I could: * `Wikipedia: Operator-precedence parser `__ * `Eli Bendersky’s blog entry: Parsing expressions by precedence climbing `__ Precedence ~~~~~~~~~~ This algorithm has a natural application to the Wolfram language. Every operator has an inbuilt precedence level which can be accessed with e.g. ``Precedence[Plus]``. Grouping ~~~~~~~~ In addition to precedences, operators in the Wolfram language have a grouping attribute. Consider trying to parse the code snippet ``a + b + c``. There is some ambiguity in what order the operations should be applied. Should we use left associativity ``Plus[Plus[a + b] + c]`` or right associativity ``Plus[a, Plus[b + c]]``? Flat associativity ^^^^^^^^^^^^^^^^^^ In fact the addition operator is usually considered fully associative, that is ``Plus[a, b, c]``. In the Wolfram language this is referred to as ‘flat’ associativity because the expression tree is flat. Flat associativity is not a large issue in practice as the as flat operators can be first treated as left or right associative, whichever is more efficient, and flattened once the AST is constructed. One thing to be aware of is that parenthesis prevent the flattening of expressions at parse time, ``a * (b * c)`` parses as ``Times[a, Times[b, c]]`` but is flattened immediately afterwards. Non-associativity ^^^^^^^^^^^^^^^^^ In addition to left/right associativity some operators are non-associative. That is, an operator ``?`` is non-associative if ``a ? b ? c`` is not valid syntax. The only non associative operator in the Wolfram language is ``PatternTest``. Implementation of binary operators ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The relevant code snippet for binary operators is: .. code:: python def parse_binary(self, expr1, token, p): # called with expr1 """ Implemments grammar rule: expr <- expr1 BINARY expr2 """ tag = token.tag # name of BINARY token q = binary_ops[tag] # lookup precedence of operator if q < p: return None self.consume() # consume BINARY token if tag not in right_binary_ops: # left/right grouping q += 1 expr2 = self.parse_exp(q) # parse expr2 # handle nonassoc operators if tag in nonassoc_binary_ops and expr1.get_head_name() == tag and not expr1.parenthesized: self.tokeniser.sntx_message(token.pos) raise InvalidSyntaxError() result = Node(tag, expr1, expr2) # construct the result: `BINARY[expr1, expr2]` if tag in flat_binary_ops: # flatten the tree if required result.flatten() return result Special case: Implicit times ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ In the Wolfram language the expression ``a b`` should parse as ``Times[a b]``. This seemingly simple rule actually creates lots of headaches. In the context of precedence climbing we use the rule ``E_n <- E_n E_n`` where ``n = 400`` is the precedence of ``Times``. Unary operators ~~~~~~~~~~~~~~~ The Wolfram language has two types of unary operators, prefix and postfix. As the names suggest, prefix operators come before the expression, e.g. ``- 2`` and postfix operators come after e.g. ``10 !``. Special case: unary minus ^^^^^^^^^^^^^^^^^^^^^^^^^ Consider the code ``- - a``. One might expect that this parses as ``Minus[Minus[a]]`` but in fact the unary minus is handled at parse time as ``-x`` becomes ``Times[-1, x]``. One might expect that the answer is ``Times[-1, Times[-1, a]]`` but in fact this ``Times`` is flattened at parse time and the answer is ``Times[-1, -1, a]``. We can recover the expected answer by imposing parenthesis to prevent flattening, that is ``-(-a)``. Special case: unary plus ^^^^^^^^^^^^^^^^^^^^^^^^ Unary plus on the other hand is ignored. ``+a`` is the same syntactically as ``a``. Ternary operators ~~~~~~~~~~~~~~~~~ The last type of standard operator in the Wolfram language are ternary operators. There are two ternary operators, ``Infix`` and ``Span`` and both are special cases. In general, they are treated like binary operators. Special case: Infix ^^^^^^^^^^^^^^^^^^^ The simplest ternary operator in the Wolfram language is ``Infix``, for example, ``a ~ b ~ c`` becomes ``b[a, c]``. We treat this like a binary operator but override the rule to consume an extra ``~ E``. The relevant annotated code snippet is: .. code:: python def e_Infix(self, expr1, token, p): # called with expr1 """ Example: expr <- expr1 ~ expr2 ~ expr3 """ q = ternary_ops['Infix'] # lookup precedence of Infix if q < p: return None self.consume() # consume first '~' expr2 = self.parse_exp(q + 1) # consume expr2 self.expect('Infix') # consume second '~' expr3 = self.parse_exp(q + 1) # consume expr3 return Node(expr2, expr1, expr3) # return expr2[expr1, expr3] Special case: Span ^^^^^^^^^^^^^^^^^^ See the section on backtracking. Special case: Integrate ~~~~~~~~~~~~~~~~~~~~~~~ The ``Integrate`` rule is ``E <- Integrate expr1 DifferentialD expr2``. Similarly this can be handled by treating ``Integrate`` as a prefix operator that consumes an extra token and expression. Unlike ``Infix``, the precedences for the inner (``expr1``) and outer (``expr2``) expressions differ. To quote the Wolfram docs: Forms such as ``'Integral' expr1 'DifferentialD' expr2`` have an “outer” precedence just below ``Power``, as indicated in the table above, but an “inner” precedence just above ``Sum``. The outer precedence determines when ``expr2`` needs to be parenthesized; the inner precedence determines when ``expr1`` needs to be parenthesized. This is simple to handle with precedence climbing: .. code:: python def p_Integral(self, token): # `expr <- Integral expr1 DifferentialD expr2 self.consume() # consume 'Integral' inner_prec = all_ops['Sum'] + 1 # lookup inner outer_prec = all_ops['Power'] - 1 # and outer prec expr1 = self.parse_exp(inner_prec) # consume expr1 self.expect('DifferentialD') # consume 'DifferentialD' expr2 = self.parse_exp(outer_prec) # consume expr2 return Node('Integrate', expr1, expr2) Mathics3 implementation ----------------------- .. _precedence-climbing-1: Precedence Climbing Operators ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ All the Mathics operators are specified in ``mathics/core/parser/operators.py``. The precedence climbing algorithm is implemented in ``mathics/core/parser/parser.py``. All the special cases are implemented as additional rules. P Rules ^^^^^^^ Methods beginning with ``p_TAG`` declare what to do when the ``TAG`` token is encountered at the beginning of an expression. For example, the first token after an open parenthesis. These rules define all the atomics ``E <- A``, prefix operators, ``E <- PREFIX E`` and also brackets ``E <- ( E )``. E Rules ^^^^^^^ Methods beginning with ``e_TAG`` are called when one expression is already present and we encounter the ``TAG`` token. This covers binary operators ``E <- E BINARY E``, postfix operators ``E <- E POSTFIX``, ternary operators, ``E <- E TERNARY1 E TERNARY2 E``, and functions ``E <- E [ SEQUENCE ]``. B Rules ^^^^^^^ Methods beginning with ``b_TAG`` are used for parsing boxes and can be ignored on first reading of the parser code. Backtracking ^^^^^^^^^^^^ Most of the Wolfram language can be parsed with precedence climbing but there are a few special language features that require something more. The ``Span`` operator is one example. Both ``a ;; b`` and ``a ;;`` are valid syntax, the latter is equivalent to ``a ;; All``. Consider the example: ``a;;!b``. There are four options for parsing this: - ``Span[a, Not[b]]`` - ``Factorial[Span[a, Null]]`` - ``Times[Span[a, All], Not[b]]`` - ``Times[Factorial[Span[a, All]], b]`` ``a ! b`` parses as ``Times[Factorial[a], b]`` which might suggest option 4 is correct but the precedence of ``Span`` is lower than that of ``Times`` so it is parsed first, top down. We might then think that option 1 is correct, but ``Not`` has a lower precedence than ``Span`` so the ``E <- E ;; E`` rule cannot be applied here and we must use the postfix form of ``Span``. Since the precedence of ``Factorial`` is higher than that of ``Span`` we can apply the postfix rule for ``!`` and option 3 turns out to be correct. The problem with ``Span``, and also ``CompoundExpression`` is that they require arbitrary lookahead to see if the right hand side has lower precedence. It’s not a large issue for ``CompoundExpression`` which has very low precedence but nevertheless this language quirk requires backtracking in the parser. Recursive descent parsers in Python ----------------------------------- The lack of tail call optimization (TCO) causes some problems when writing recursive descent (RD) parsers in python. It’s possible to write RD parsers in an iterative style but it’s not nice. One can use trampolining to mitigate this limitation. In practice Python provides 1000 recursions by default which should be enough for most parse trees. Try parsing ``1 + 2 + ... 1000``! References ---------- 1. http://www.antlr.org/papers/Clarke-expr-parsing-1986.pdf Clarke 1986, ‘The top-down parsing of expressions’ 2. http://www.engr.mun.ca/~theo/Misc/exp_parsing.htm Recursive descent parsers, shunting yard algorithm, precedence climbing, and efficient parsers with large number of operator precedences. 3. http://eli.thegreenplace.net/2009/03/14/some-problems-of-recursive-descent-parsers How to handle right/left associativity in recursive descent parsers and how to write efficient of RD parsers. 4. https://reference.wolfram.com/language/tutorial/OperatorInputForms.html MMA docs with grouping and relative precedences. 5. http://home.pipeline.com/~hbaker1/CheneyMTA.html Henry Baker 1994, ‘CONS Should Not CONS Its Arguments, Part II: Cheney on the M.T.A.[1]’, unpublished note on implementing TCO with trampolines. Next: .. toctree:: :maxdepth: 1 parsing-a-string.rst