When developing a hand-written recursive descent parser it's not immediately obvious how to handle expressions. Well, unless the language you are parsing only has prefix operators (like LISP). But how do you handle infix operators without left recursion (which results in an infinite loop)? Turns out, it's ridiculously easy to do!
If you've read a book on compiler development, like the Dragon Book, they tend to spend a lot of time on parsing theory. For example, contrasting the capabilities of LL grammars and LR grammars and how parsers for the former cannot handle left recursion, which is needed to parse infix operations with the correct precedence. You can do post-processing on the parse tree to correct the precedence after the fact, but we'd like to parse it the right way to begin with.
The above, however, is theory. In practice, most modern compilers have hand-written recursive descent parsers. Recursive descent parsers, being plain old code written in a Turing-complete language, are not bound by the constraints of formal grammars.
Here's how you handle expressions in a recursive descent parser. You need four functions:

1. A function to parse prefix and atomic expressions, we'll call it parse_prefix.
2. A function to parse infix expressions, we'll call it parse_infix.
3. A function that returns a token's precedence, we'll call it get_precedence.
4. A function to put it all together, we'll call it parse_expression.
I’ll write these in Python, since it’s one of the most popular languages in the world and therefore statistically very likely to be understood by you, dear reader, but porting the code to any language of your choice should be trivial.
The first function, parse_prefix, is a simple recursive descent parsing function:
def parse_prefix(tokens):
    match tokens.peek().tag:
        case "INTEGER":
            return parse_integer(tokens)
        case "(":
            tokens.consume("(")
            expr = parse_expression(tokens)
            tokens.consume(")")
            return expr
        case "-":
            op = tokens.consume("-")
            right = parse_prefix(tokens)
            return UnaryNode(op, right)
        case "+":
            ...
This should be self-explanatory, I hope. We check the current token and handle it as necessary. If you’re writing a recursive descent parser, most of your code should already look like this.
For the unary minus (-) case it's better to call parse_prefix than parse_expression because you want to apply the unary minus very tightly to its right-hand side, not to a giant arbitrary expression. I.e., we want to avoid this:
- 2 + 3
Getting parsed as this:
- (2 + 3)
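To see the difference concretely, here's a tiny standalone sketch (the token-list API and names are my own simplification, not the post's code) that builds parenthesized strings instead of AST nodes. The tight flag switches the unary-minus operand between a parse_prefix call and a parse_expression call:

```python
PREC = {"+": 10}

def parse_prefix(tokens, pos, tight=True):
    if tokens[pos] == "-":
        if tight:
            # bind tightly: only the next prefix/atomic expression
            right, pos = parse_prefix(tokens, pos + 1, tight)
        else:
            # bind loosely: swallow the whole rest of the expression
            right, pos = parse_expression(tokens, pos + 1, 0, tight)
        return f"(-{right})", pos
    return tokens[pos], pos + 1

def parse_expression(tokens, pos=0, precedence=0, tight=True):
    left, pos = parse_prefix(tokens, pos, tight)
    while pos < len(tokens) and PREC.get(tokens[pos], 0) > precedence:
        op = tokens[pos]
        right, pos = parse_expression(tokens, pos + 1, PREC[op], tight)
        left = f"({left} {op} {right})"
    return left, pos

print(parse_expression("- 2 + 3".split())[0])               # ((-2) + 3)
print(parse_expression("- 2 + 3".split(), tight=False)[0])  # (-(2 + 3))
```

The tight version gives the grouping we want; the loose version produces exactly the bad parse shown above.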
Next comes parse_infix¹:
def parse_infix(tokens, left):
    match tokens.peek().tag:
        case "+" | "-" | "*" | "/":
            op = tokens.consume()  # consume whichever operator we peeked
            right = parse_expression(tokens, get_precedence(op.tag))
            return BinaryNode(left, op, right)
        case "**":
            op = tokens.consume("**")
            right = parse_expression(tokens, get_precedence(op.tag) - 1)
            return BinaryNode(left, op, right)
        case "(":
            tokens.consume("(")
            args = parse_arguments(tokens)
            tokens.consume(")")
            return FunctionCallNode(left, args)
        ...
There are a few important things here. The first is that, with the exception of the left parameter that the function expects to receive already parsed, parse_infix is also just a typical recursive descent parsing function. The parse_expression function that we'll see in a bit is responsible for filling in that left parameter.

The other major difference is that parse_infix calls parse_expression recursively, passing it a precedence value.
Precedence (and associativity) is used to resolve parsing ambiguities in expressions with infix operators.
How should the following expression, 2 + 3 * 5, be parsed?
1. → (2 + 3) * 5
2. → 2 + (3 * 5)
Multiplication has a higher precedence than addition, so the right answer is 2.
To get a right-associative operator like power, which binds like this: 2 ** (3 ** 4), it suffices to “weaken” its precedence in the recursive call, hence the -1 in the power operator case above.
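Here's a minimal standalone sketch of the precedence trick (a token list instead of a real lexer; the names are illustrative, not the post's code). It builds parenthesized strings so the grouping is visible, and the -1 makes ** right-associative while + and * stay left-associative:

```python
PREC = {"+": 10, "*": 20, "**": 30}
RIGHT_ASSOC = {"**"}

def parse(tokens, pos=0, precedence=0):
    left = tokens[pos]  # atoms only; this sketch has no prefix operators
    pos += 1
    while pos < len(tokens) and PREC.get(tokens[pos], 0) > precedence:
        op = tokens[pos]
        # a right-associative operator recurses with its precedence - 1,
        # so another "**" of equal precedence still binds on the right
        sub = PREC[op] - 1 if op in RIGHT_ASSOC else PREC[op]
        right, pos = parse(tokens, pos + 1, sub)
        left = f"({left} {op} {right})"
    return left, pos

print(parse("2 + 3 * 5".split())[0])    # (2 + (3 * 5))
print(parse("2 ** 3 ** 4".split())[0])  # (2 ** (3 ** 4))
print(parse("2 + 3 + 4".split())[0])    # ((2 + 3) + 4)
```

Note the last line: without the -1, an operator of equal precedence fails the > test and the loop returns, which is exactly what makes + group to the left.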
The job of the get_precedence function is just to map a token to its corresponding precedence. The numbers are completely arbitrary; just make sure to give a higher number to multiplication than addition, and so on:
def get_precedence(tag):
    match tag:
        ...
        case "+" | "-":
            return 10
        case "*" | "/":
            return 20
        case "**":
            return 30
        ...
You can use the precedence tables of other programming languages for inspiration.
Finally we get to the function where the magic happens, parse_expression:
def parse_expression(tokens, precedence=0):
    left = parse_prefix(tokens)
    while get_precedence(tokens.peek().tag) > precedence:
        left = parse_infix(tokens, left)
    return left
That's it... that's the whole thing.
This four-line function handles any silly expression-parsing challenge a recursive descent parser might face.
You don't need to “fix” the AST after the fact, you don't need an LR parser generator, you don't need to implement an improved Packrat parser, none of that.
Just these four beautiful, truly magical lines.
The beautiful parsing algorithm above is a Pratt parser, named after Vaughan Pratt who came up with it all the way back in 1973. You can read his paper, Top Down Operator Precedence, for free. I must also thank Bob Nystrom for introducing me to the algorithm. Technically the OG post on the magic of Pratt parsers came courtesy of Douglas Crockford, but I found out about them from Bob.
Unfortunately, both the original paper and Douglas Crockford's post obscure the simplicity of the technique with awkward naming choices and needlessly dynamic constructs. Bob also decided to overcomplicate things with Java-isms. I get why he did it, but I don't like it. This algorithm is so beautiful and simple that there's even a survey of articles about it.
Anyway, no harm in adding one more to the pile (hopefully mine will be added at some point). Here's a tiny but fully working Python script showing off a basic calculator, using Python's own lexer to get the tokens:
from tokenize import generate_tokens, TokenInfo, NUMBER
from io import StringIO

class TokenStream:
    def __init__(self, source: str):
        self.tokens = generate_tokens(StringIO(source).readline)
        self.current = next(self.tokens)

    def consume(self) -> TokenInfo:
        current = self.current
        self.current = next(self.tokens)
        return current

def get_precedence(op: str) -> int:
    match op:
        case "+" | "-":
            return 10
        case "*" | "/":
            return 20
        case "**":
            return 30
        case _:
            return 0  # anything else (e.g. NEWLINE) stops the loop

def parse_prefix(tokens: TokenStream):
    if tokens.current.type == NUMBER:
        return int(tokens.consume().string)
    match tokens.current.string:
        case "+":
            tokens.consume()
            return parse_prefix(tokens)
        case "-":
            tokens.consume()
            return -parse_prefix(tokens)
        case _:
            raise Exception(tokens.current.string)

def parse_infix(tokens: TokenStream, left):
    match tokens.current.string:
        case "+":
            op = tokens.consume()
            return left + parse_expression(tokens, get_precedence(op.string))
        case "-":
            op = tokens.consume()
            return left - parse_expression(tokens, get_precedence(op.string))
        case "*":
            op = tokens.consume()
            return left * parse_expression(tokens, get_precedence(op.string))
        case "/":
            op = tokens.consume()
            return left / parse_expression(tokens, get_precedence(op.string))
        case "**":
            op = tokens.consume()
            # right-associative: recurse with slightly lower precedence
            precedence = get_precedence(op.string) - 1
            return left ** parse_expression(tokens, precedence)
        case _:
            raise Exception(tokens.current.string)

def parse_expression(tokens: TokenStream, precedence=0):
    left = parse_prefix(tokens)
    while get_precedence(tokens.current.string) > precedence:
        left = parse_infix(tokens, left)
    return left

print(parse_expression(TokenStream("2 + -3 * 4 - 5 / 6 ** 7 ** 2")))
Hope it helps!
¹ parse_infix also handles postfix operators, ternary operators, function calls, indexing, etc.: anything that would call parse_expression left-recursively in its corresponding grammar rule.
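As a concrete illustration of that point, here's a tiny standalone sketch (a token list instead of a lexer, with made-up operators and precedences, not the post's code) where the infix arm also handles a postfix factorial and indexing. The only change is that some branches parse nothing, or something other than a plain right-hand expression, after the operator:

```python
PREC = {"+": 10, "*": 20, "!": 40, "[": 50}

def parse(tokens, pos=0, precedence=0):
    left = tokens[pos]  # atoms only; no prefix operators in this sketch
    pos += 1
    while pos < len(tokens) and PREC.get(tokens[pos], 0) > precedence:
        op = tokens[pos]
        if op == "!":
            # postfix: nothing to parse on the right
            left = f"({left}!)"
            pos += 1
        elif op == "[":
            # indexing: parse the index expression, then expect "]"
            index, pos = parse(tokens, pos + 1, 0)
            assert tokens[pos] == "]"
            left = f"({left}[{index}])"
            pos += 1
        else:
            right, pos = parse(tokens, pos + 1, PREC[op])
            left = f"({left} {op} {right})"
    return left, pos

print(parse("a [ i + 1 ] * 3 !".split())[0])  # ((a[(i + 1)]) * (3!))
```

Ternary operators and function calls work the same way: consume the operator token, parse whatever the operator's grammar demands (a then-branch and an else-branch, an argument list), and fold it into left.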