When developing a hand-written recursive descent parser it's not immediately obvious how to handle expressions. Well, unless the language you are parsing only has prefix operators (like LISP). But how do you handle infix operators without left recursion (which results in an infinite loop)? Turns out, it's ridiculously easy to do!
If you've read a book on compiler development, like the Dragon Book, they tend to spend a lot of time on parsing theory. For example, contrasting the capabilities of LL grammars and LR grammars and how parsers for the former cannot handle left recursion, which is needed to parse infix operations with the correct precedence. You can do post-processing on the parse tree to correct the precedence after the fact, but we'd like to parse it the right way to begin with.
The above, however, is theory. In practice, most modern compilers have hand-written recursive descent parsers. Recursive descent parsers, being plain old code written in a Turing-complete language, are not bound by the constraints of formal grammars.
Here's how you handle expressions in a recursive descent parser. You need four functions:

1. A function to parse prefix and atomic expressions, we'll call it parse_prefix.
2. A function to parse infix expressions, we'll call it parse_infix.
3. A function that returns a token's precedence, we'll call it get_precedence.
4. A function to put it all together, we'll call it parse_expression.
I’ll write these in Python, since it’s one of the most popular languages in the world and therefore statistically very likely to be understood by you, dear reader, but porting the code to any language of your choice should be trivial.
The first function, parse_prefix, is a simple recursive descent parsing function:
def parse_prefix(tokens):
    match tokens.peek().tag:
        case "INTEGER":
            return parse_integer(tokens)
        case "(":
            tokens.consume("(")
            expr = parse_expression(tokens)
            tokens.consume(")")
            return expr
        case "-":
            op = tokens.consume("-")
            right = parse_prefix(tokens)
            return UnaryNode(op, right)
        case "+":
            ...
This should be self-explanatory, I hope. We check the current token and handle it as necessary. If you’re writing a recursive descent parser, most of your code should already look like this.
For the unary minus (-) case it's better to call parse_prefix than parse_expression because you want to apply the unary minus very tightly to its right-hand side, not to a giant arbitrary expression. I.e., we want to avoid this:
- 2 + 3
Getting parsed as this:
- (2 + 3)
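To see the difference concretely, here's a tiny standalone sketch (the token-list API and names are my own simplification, not the post's code) that builds parenthesized strings instead of AST nodes. The tight flag switches the unary-minus operand between a parse_prefix call and a parse_expression call:

```python
PREC = {"+": 10}

def parse_prefix(tokens, pos, tight=True):
    if tokens[pos] == "-":
        if tight:
            # bind tightly: only the next prefix/atomic expression
            right, pos = parse_prefix(tokens, pos + 1, tight)
        else:
            # bind loosely: swallow the whole rest of the expression
            right, pos = parse_expression(tokens, pos + 1, 0, tight)
        return f"(-{right})", pos
    return tokens[pos], pos + 1

def parse_expression(tokens, pos=0, precedence=0, tight=True):
    left, pos = parse_prefix(tokens, pos, tight)
    while pos < len(tokens) and PREC.get(tokens[pos], 0) > precedence:
        op = tokens[pos]
        right, pos = parse_expression(tokens, pos + 1, PREC[op], tight)
        left = f"({left} {op} {right})"
    return left, pos

print(parse_expression("- 2 + 3".split())[0])               # ((-2) + 3)
print(parse_expression("- 2 + 3".split(), tight=False)[0])  # (-(2 + 3))
```

The tight version gives the grouping we want; the loose version produces exactly the bad parse shown above.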
Next comes parse_infix¹:
def parse_infix(tokens, left):
    match tokens.peek().tag:
        case "+" | "-" | "*" | "/":
            op = tokens.consume()  # consume whichever operator we peeked
            right = parse_expression(tokens, get_precedence(op.tag))
            return BinaryNode(left, op, right)
        case "**":
            op = tokens.consume("**")
            right = parse_expression(tokens, get_precedence(op.tag) - 1)
            return BinaryNode(left, op, right)
        case "(":
            tokens.consume("(")
            args = parse_arguments(tokens)
            tokens.consume(")")
            return FunctionCallNode(left, args)
        ...
There are a few important things here. The first is that, with the exception of the left parameter that the function expects to receive already parsed, parse_infix is also just a typical recursive descent parsing function. The parse_expression function that we'll see in a bit is responsible for filling in that left parameter.

The other major difference is that parse_infix calls parse_expression recursively, passing it a precedence value.
Precedence (and associativity) is used to resolve parsing ambiguities in expressions with infix operators.
How should the following expression, 2 + 3 * 5, be parsed?
1. → (2 + 3) * 5
2. → 2 + (3 * 5)
Multiplication has a higher precedence than addition, so the right answer is 2.
To get a right-associative operator like power, which binds like this: 2 ** (3 ** 4), it suffices to “weaken” its precedence in the recursive call, hence the -1 in the power operator case above.
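Here's a minimal standalone sketch of the precedence trick (a token list instead of a real lexer; the names are illustrative, not the post's code). It builds parenthesized strings so the grouping is visible, and the -1 makes ** right-associative while + and * stay left-associative:

```python
PREC = {"+": 10, "*": 20, "**": 30}
RIGHT_ASSOC = {"**"}

def parse(tokens, pos=0, precedence=0):
    left = tokens[pos]  # atoms only; this sketch has no prefix operators
    pos += 1
    while pos < len(tokens) and PREC.get(tokens[pos], 0) > precedence:
        op = tokens[pos]
        # a right-associative operator recurses with its precedence - 1,
        # so another "**" of equal precedence still binds on the right
        sub = PREC[op] - 1 if op in RIGHT_ASSOC else PREC[op]
        right, pos = parse(tokens, pos + 1, sub)
        left = f"({left} {op} {right})"
    return left, pos

print(parse("2 + 3 * 5".split())[0])    # (2 + (3 * 5))
print(parse("2 ** 3 ** 4".split())[0])  # (2 ** (3 ** 4))
print(parse("2 + 3 + 4".split())[0])    # ((2 + 3) + 4)
```

Note the last line: without the -1, an operator of equal precedence fails the > test and the loop returns, which is exactly what makes + group to the left.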
The job of the get_precedence function is just to map a token to its corresponding precedence. The numbers are completely arbitrary; just make sure to give a higher number to multiplication than addition, and so on:
def get_precedence(tag):
    match tag:
        ...
        case "+" | "-":
            return 10
        case "*" | "/":
            return 20
        case "**":
            return 30
        ...
You can use the precedence tables of other programming languages for inspiration.
Finally we get to the function where the magic happens, parse_expression:
def parse_expression(tokens, precedence=0):
    left = parse_prefix(tokens)
    while get_precedence(tokens.peek().tag) > precedence:
        left = parse_infix(tokens, left)
    return left
That's it... that's the whole thing.
This four-line function handles any silly expression-parsing challenge a recursive descent parser might face.
You don't need to “fix” the AST after the fact, you don't need an LR parser generator, you don't need to implement an improved Packrat parser, none of that.
Just these four beautiful, truly magical lines.
The beautiful parsing algorithm above is a Pratt parser, named after Vaughan Pratt who came up with it all the way back in 1973. You can read his paper, Top Down Operator Precedence, for free. I must also thank Bob Nystrom for introducing me to the algorithm. Technically the OG post on the magic of Pratt parsers came courtesy of Douglas Crockford, but I found out about them from Bob.
Unfortunately, both the original paper and Douglas Crockford's post obscure the simplicity of the technique with awkward naming choices and needlessly dynamic constructs. Bob also decided to overcomplicate things with Java-isms. I get why he did it, but I don't like it. This algorithm is so beautiful and simple that there's even a survey of articles about it.
Anyway, no harm in adding one more to the pile (hopefully mine will be added at some point). Here's a tiny but fully working Python script showing off a basic calculator, using Python's own lexer to get the tokens:
from tokenize import generate_tokens, TokenInfo, NUMBER
from io import StringIO

class TokenStream:
    def __init__(self, source: str):
        self.tokens = generate_tokens(StringIO(source).readline)
        self.current = next(self.tokens)

    def consume(self) -> TokenInfo:
        current = self.current
        self.current = next(self.tokens)
        return current

def get_precedence(op: str) -> int:
    match op:
        case "+" | "-":
            return 10
        case "*" | "/":
            return 20
        case "**":
            return 30
        case _:
            return 0  # anything else (e.g. NEWLINE) stops the loop

def parse_prefix(tokens: TokenStream):
    if tokens.current.type == NUMBER:
        return int(tokens.consume().string)
    match tokens.current.string:
        case "+":
            tokens.consume()
            return parse_prefix(tokens)
        case "-":
            tokens.consume()
            return -parse_prefix(tokens)
        case _:
            raise Exception(tokens.current.string)

def parse_infix(tokens: TokenStream, left):
    match tokens.current.string:
        case "+":
            op = tokens.consume()
            return left + parse_expression(tokens, get_precedence(op.string))
        case "-":
            op = tokens.consume()
            return left - parse_expression(tokens, get_precedence(op.string))
        case "*":
            op = tokens.consume()
            return left * parse_expression(tokens, get_precedence(op.string))
        case "/":
            op = tokens.consume()
            return left / parse_expression(tokens, get_precedence(op.string))
        case "**":
            op = tokens.consume()
            # right-associative: recurse with slightly lower precedence
            precedence = get_precedence(op.string) - 1
            return left ** parse_expression(tokens, precedence)
        case _:
            raise Exception(tokens.current.string)

def parse_expression(tokens: TokenStream, precedence=0):
    left = parse_prefix(tokens)
    while get_precedence(tokens.current.string) > precedence:
        left = parse_infix(tokens, left)
    return left

print(parse_expression(TokenStream("2 + -3 * 4 - 5 / 6 ** 7 ** 2")))
Hope it helps!
¹ parse_infix also handles postfix operators, ternary operators, function calls, indexing, etc.: anything that would call parse_expression left-recursively in its corresponding grammar rule.
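As a concrete illustration of that point, here's a tiny standalone sketch (a token list instead of a lexer, with made-up operators and precedences, not the post's code) where the infix arm also handles a postfix factorial and indexing. The only change is that some branches parse nothing, or something other than a plain right-hand expression, after the operator:

```python
PREC = {"+": 10, "*": 20, "!": 40, "[": 50}

def parse(tokens, pos=0, precedence=0):
    left = tokens[pos]  # atoms only; no prefix operators in this sketch
    pos += 1
    while pos < len(tokens) and PREC.get(tokens[pos], 0) > precedence:
        op = tokens[pos]
        if op == "!":
            # postfix: nothing to parse on the right
            left = f"({left}!)"
            pos += 1
        elif op == "[":
            # indexing: parse the index expression, then expect "]"
            index, pos = parse(tokens, pos + 1, 0)
            assert tokens[pos] == "]"
            left = f"({left}[{index}])"
            pos += 1
        else:
            right, pos = parse(tokens, pos + 1, PREC[op])
            left = f"({left} {op} {right})"
    return left, pos

print(parse("a [ i + 1 ] * 3 !".split())[0])  # ((a[(i + 1)]) * (3!))
```

Ternary operators and function calls work the same way: consume the operator token, parse whatever the operator's grammar demands (a then-branch and an else-branch, an argument list), and fold it into left.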