I have a tendency to run around in circles a lot when working on code that doesn’t have an obvious “right way” to do things. I hit exactly this issue while implementing the semantic analysis phase (i.e., the type checker) of my toy compiler:
Where do you store the types?
Some CS professor is probably already screaming “in the symbol table dummy!”, but that’s not what I’m talking about. The job of the symbol table is to map identifiers in the various scopes of the program to their types or other relevant information, but it’s usually an ephemeral thing. Once the semantic analysis phase is over, the symbol table is gone. Parts of it don’t even last more than the scope where they are needed.
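As a quick refresher, a minimal sketch of such an ephemeral, scope-stacked symbol table might look like this (the names are illustrative, not from my compiler):
class SymbolTable:
    # one dict of name -> type per active scope
    def __init__(self):
        self.scopes = [{}]  # start with the global scope

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        self.scopes.pop()  # this scope's bindings vanish right here

    def define(self, name, t):
        self.scopes[-1][name] = t

    def lookup(self, name):
        for scope in reversed(self.scopes):  # innermost scope wins
            if name in scope:
                return scope[name]
        return None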
But what if you need to know the types of various expressions and such in the later phases? For example, what if you need to go from implicit casts to explicit casts? What if you have type inference and need to remember what the actual computed type was? What if you need to know which overloaded function got selected? How do you store this information? Well, there are four ways I can think of:
The Mutable AST
The Typed AST
The Generic AST
The Relational AST
The Mutable AST
This one is the easiest, but mutation can get awkward depending on the language you are using to implement the compiler. I’ll use Python here to keep things simple. I’m also not going to show a full-blown AST, just little bits to give the idea:
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Expr:
    span: Span
    e_type: Optional[Type] = field(default=None, init=False)

@dataclass
class Binary(Expr):
    left: Expr
    op: Token
    right: Expr
Let’s say we’re parsing an expression like 3 + 4.0. The parser has no idea about types, so it just does the following¹:
def parse_binary(self, left: Expr) -> Expr:
    op = self.parse_token()
    right = self.parse_expression()
    span = left.span + right.span
    return Binary(span, left, op, right)
The span represents the area of the source code that the expression covers, which is important for showing some red squiggles later 😉. Then, in the typechecker, we fill in the e_type:
def check_binary(self, b: Binary):
    self.check_expr(b.left)
    self.check_expr(b.right)
    if b.op.id in ("+", "-", "*", "/"):
        self.check_if_numeric(b.left)
        self.check_if_numeric(b.right)
        b.e_type = self.common_supertype(b.left, b.right)
    elif b.op.id in ("==", "!="):
        ...

def check_if_numeric(self, e: Expr):
    assert e.e_type is not None
    if not isinstance(e.e_type, (IntType, FloatType)):
        raise TypeCheckError(
            f"expected a numeric type, got {e.e_type}",
            e.span)
Something like this. We fill in e_type as we recursively walk the AST in the typechecker, checking along the way that the e_types of the sub-expressions are the ones we want.
Note that you want your checking function to return “nothing”: this forces the caller to grab the e_type off the node itself, which in turn makes sure you don’t forget to set it.
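For instance, here’s one possible common_supertype (a sketch; the real promotion rules depend on your language). It reads e_type straight off the sub-expressions, so a forgotten assignment trips the assert immediately:
def common_supertype(self, left: Expr, right: Expr) -> Type:
    # reading e_type here is exactly what catches a checker that forgot to set it
    assert left.e_type is not None and right.e_type is not None
    if isinstance(left.e_type, FloatType) or isinstance(right.e_type, FloatType):
        return FloatType()  # e.g. in 3 + 4.0 the int side promotes to float
    return left.e_type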
The Good: As simple as it gets. You reuse the same AST you already created during parsing, so there are no extra allocations and no duplication in the source code.
The Bad: Depending on the language you are using, that Optional[T] can get annoying. In Rust it would mean a lot of .unwrap() calls, for example. Additionally, the structure of the AST classes/structs that makes the most sense for parsing may be awkward to work with in later stages².
The Ugly: Your AST classes/structs reference “types”, which technically come from a later stage in the compilation. Not a major issue by any stretch of the imagination, but it is a bit awkward from a code-organization standpoint.
The Typed AST
This one is a lot more work. Here your type checker produces an entirely new structure from the AST; let’s call it a Typed AST, or TAST for short. For example:
# AST
@dataclass
class Expr:
    span: Span

@dataclass
class Binary(Expr):
    left: Expr
    op: Token
    right: Expr

# TAST
@dataclass
class TExpr:
    e_type: Type

@dataclass
class TBinary(TExpr):
    left: TExpr
    op: BinaryOp
    right: TExpr
Your typechecking functions would take ASTs as input and spit out TASTs as output.
def check_binary(self, b: Binary) -> TBinary:
    left = self.check_expr(b.left)
    right = self.check_expr(b.right)
    ...  # check the operand types, compute e_type, translate b.op into a BinaryOp
    return TBinary(e_type, left, op, right)
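If you’re curious what goes in the elided middle, here’s one possible completion (a sketch: check_numeric, common_supertype, and to_binary_op are assumed helpers, not fixed API):
def check_binary(self, b: Binary) -> TBinary:
    left = self.check_expr(b.left)
    right = self.check_expr(b.right)
    if b.op.id not in ("+", "-", "*", "/"):
        raise NotImplementedError("comparisons etc. elided from this sketch")
    # errors are reported against the original AST nodes' spans
    self.check_numeric(left.e_type, b.left.span)
    self.check_numeric(right.e_type, b.right.span)
    e_type = self.common_supertype(left.e_type, right.e_type)  # type-level variant of the earlier helper
    op = self.to_binary_op(b.op)  # assumed Token -> BinaryOp translation
    return TBinary(e_type, left, op, right)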
The Good: This approach has the most freedom. You can have the AST be as convenient as possible for parsing/typechecking and the TAST be as convenient as possible for typechecking/code-generation. You can’t forget to set anything, the act of building the TAST forces you to fill in the types. No Optional in sight.
The Bad: You need a whole separate class/struct hierarchy! That’s quite a bit of extra work compared to reusing the same AST. You need to maintain both as you add or change features. This approach also requires more memory allocations.
The Ugly: “TExpr”, “TBinary”, etc. 🤢. I ended up calling the AST ones “ExprNode” and the TAST ones “Expr” instead.
The Generic AST
This one is sort of a middle ground between the two approaches above. Instead of storing an Optional type and mutating the AST, we store a generic metadata field:
from typing import Generic, TypeVar

T = TypeVar("T")

# AST
@dataclass
class Expr(Generic[T]):
    metadata: T

@dataclass
class Binary(Expr[T]):
    left: Expr[T]
    op: Token
    right: Expr[T]
The parser outputs Expr[Span] while the type checker outputs Expr[Type], for example. You can make the metadata as rich as you’d like between stages.
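The checker then becomes a translation from one metadata type to another. A sketch, where Typed is an illustrative bundle of span plus type (not a fixed design) and common_supertype is the same assumed helper as before:
@dataclass
class Typed:
    span: Span
    e_type: Type

def check_binary(self, b: Binary[Span]) -> Binary[Typed]:
    left = self.check_expr(b.left)
    right = self.check_expr(b.right)
    e_type = self.common_supertype(left.metadata.e_type, right.metadata.e_type)
    return Binary(Typed(b.metadata, e_type), left, b.op, right)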
The Good: Like the mutable AST, there is only one class/struct hierarchy, which is easy to maintain. Like the typed AST, there are no annoying Optionals relating to the type information.
The Bad: Like the mutable AST, the structure may not be the best fit for every compilation stage. Like the typed AST, this needs extra memory allocations.
The Ugly: The moment you need more than one generic parameter, you’ll regret ever picking this option.
The Relational AST
The last option is to not store the extra type information in the AST at all, but instead to do things the relational way: each AST node has a unique ID that serves as the primary key for tables of whatever extra data you need.
from typing import Dict

@dataclass
class CompilerDB:
    expr_types: Dict[int, Type]
    expr_spans: Dict[int, Span]
    ...

@dataclass
class Expr:
    id: int
You can improve type-safety by creating distinct “ID” types for different things, instead of using int everywhere. You can also take the relational model further³:
@dataclass
class CompilerDB:
    expressions: List[Expr]
    expr_types: Dict[Id[Expr], Type]
    expr_spans: Dict[Id[Expr], Span]
    ...

@dataclass
class Expr:
    id: Id[Expr]

@dataclass
class Binary(Expr):
    left: Id[Expr]
    op: Token
    right: Id[Expr]
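In Python, that Id[Expr] could be a tiny generic wrapper; a minimal sketch (frozen, so it’s hashable and usable as a dictionary key):
from dataclasses import dataclass
from typing import Generic, TypeVar

T = TypeVar("T")

@dataclass(frozen=True)
class Id(Generic[T]):
    value: int  # the raw key; T exists only to keep the type checker honest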
In Python this approach doesn’t make a lot of sense, but in a more high-performance language like C++, Rust, or even C# it can lead to some nice performance gains. Not only can the IDs be smaller than a pointer, saving memory, but if your parser builds the list of expressions in the CompilerDB bottom-up, then in the typechecker you can navigate the list linearly instead of using recursion⁴.
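Here’s a sketch of that linear pass, assuming the parser appended nodes to db.expressions bottom-up (children strictly before their parents):
def check_all(self):
    # no recursion: children were appended before their parents, so the
    # types of sub-expressions are already in db.expr_types when we get here
    for expr in self.db.expressions:
        if isinstance(expr, Binary):
            self.check_binary(expr)
        else:
            ...  # literals, calls, etc.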
So what’s the catch? Unfortunately mainstream programming languages don’t support relational programming in any meaningful way, so you need to do everything in a rather cumbersome manner:
def check_binary(self, b: Binary):
    b_span = self.db.expr_spans[b.id]
    l_type = self.db.expr_types[b.left]
    l_span = self.db.expr_spans[b.left]
    r_type = self.db.expr_types[b.right]
    r_span = self.db.expr_spans[b.right]
    self.check_numeric(l_type, l_span)
    self.check_numeric(r_type, r_span)
    e_type = self.compute_super_type(l_type, r_type, b_span)
    self.db.expr_types[b.id] = e_type
The Good: Very efficient in both speed and memory usage. Extremely adaptable. Basically all the benefits of the relational model.
The Bad: The CompilerDB needs to be dragged around everywhere. You can’t implement __str__() in the AST classes themselves, because the function needs access to the CompilerDB (see the sketch below). Boilerplate-heavy. Easy to make mistakes.
The Ugly: It reminded me of how cool the relational model is, and how mainstream programming languages don’t even know it exists beyond lousy ORM libraries for SQL databases. I don’t like SQL and I don’t like ORMs.
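Concretely, pretty-printing ends up as a free function over the database rather than a method; a minimal sketch (expr_to_str is an illustrative name):
def expr_to_str(db: CompilerDB, e: Expr) -> str:
    # the node alone knows nothing; every lookup goes through the db
    e_type = db.expr_types[e.id]
    span = db.expr_spans[e.id]
    return f"expr at {span} with type {e_type}"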
Conclusion
Ok, four options, so which one should you use? If you’re building a small, simple compiler that doesn’t need a lot of extra processing, use the Mutable AST. If your compiler is single-pass then it’s a no-brainer, since the “annoying Optional” problem doesn’t even exist: just fill the type in right there and then.
If you’re building a more complicated compiler, you may want to use either the Typed AST or the Relational AST, depending on your needs.
Most computer-science papers use some variant of the Typed AST approach, usually translating between multiple different intermediate languages. It’s the nicer approach if you want to do lots of analyses and you want to make sure you got them right.
If you want the best performance then the Relational AST is the way to go. You can avoid recursion, you save memory by using IDs instead of pointers, you can move data to different tables depending on usage patterns. It can be very data-oriented.
The Generic AST is not worth it; the other approaches are just better. Any other approach I don’t know about? Let me know in the comments below.
1. Check out my article on how to parse expressions, it’s really easy :)
2. An example would be the class representing variable declarations. If you have type inference then the declared_type field may not get filled in during parsing, and that means an Optional getting dragged around to the later stages.
3. If every node has a type, you don’t need a dictionary; you can use an array of the same size as the main node array instead. Come to think of it, if even half the nodes need the info, the array version is the way to go. Don’t use Optionals though: make a special “Error” type for the nodes that don’t have the info, since accessing those entries is a bug in your code.
4. The construction of the list by the parser guarantees that child nodes in the AST are visited before their parents, so in the parent you just grab the already-computed type of the child node from the relevant table.