IN SPACE

Today I want to dive deep into some interesting processes and lessons I've learned during my work on Reggie (or Regulus), the Gleam to WASM compiler.

The first section dives into what I think is most interesting: language semantics and representing code as trees. The second digs into what I've been working on recently, and finally I'll talk about a feature that gave me a great appreciation for Gleam's official compiler. @lpil.uk is really brilliant.¹

Lessons from an LSP

One of my bigger projects from the past couple of years has been to build an LSP for Python called Beacon. Up to the point where Reggie constructs the intermediate representation (IR) that leads to executable Wasm, much of its structure and implementation follows patterns I learned while writing Beacon. I'll dive into those patterns and then the first divergence at the IR phase.

To construct an abstract syntax tree, Beacon uses tree-sitter-python in place of a handwritten implementation. In Reggie's case, delegating ownership of the low-level lexical & syntactic structure to tree-sitter-gleam allows the compiler to quickly load the Gleam grammar & parse the source into a concrete syntax tree. This provides the compiler with enough information to report parse diagnostics by rejecting trees that contain error or missing nodes. I made this decision as a shortcut so I wouldn't spend a lot of time defining the grammar and parser implementation, and it helped me hit the ground running.
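Here's a rough Python sketch of that diagnostic pass. It assumes a simplified node type rather than the real tree-sitter bindings, so `Node`, `parse_diagnostics`, and the byte offsets are all made up for illustration. Tree-sitter flags spans it couldn't parse as error nodes and tokens it inserted during recovery as missing nodes, so the walk only has to look for those two flags. Python is used for brevity and says nothing about Reggie's own implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a concrete syntax tree node."""
    kind: str
    start: int  # byte offset where the node begins
    end: int    # byte offset where the node ends
    is_error: bool = False    # parser could not make sense of this span
    is_missing: bool = False  # parser inserted this node to recover
    children: list["Node"] = field(default_factory=list)


def parse_diagnostics(root: Node) -> list[str]:
    """Walk the tree and report every error or missing node."""
    found = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.is_error:
            found.append(f"syntax error at bytes {node.start}..{node.end}")
        elif node.is_missing:
            found.append(f"missing {node.kind} at byte {node.start}")
        # Reverse so children are visited in source order.
        stack.extend(reversed(node.children))
    return found


# Hand-built tree for `pub fn main( {`, where the parser recovered by
# inserting the `)` that the source left out.
tree = Node("source_file", 0, 14, children=[
    Node("function", 0, 14, children=[
        Node("identifier", 7, 11),
        Node(")", 12, 12, is_missing=True),
    ]),
])
diagnostics = parse_diagnostics(tree)
```

An empty `diagnostics` list would mean the tree is safe to hand to the next phase.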

AST

From the concrete syntax tree provided by tree-sitter, the system produces a compiler-owned abstract syntax tree. This translates the data into what essentially amounts to a domain model to be used by the rest of the compilation pipeline. This doesn't provide any meaning relevant at runtime for the parsed program but does keep useful information like:

source spans
declarations, expressions, patterns, and imports
source order

At this stage Reggie doesn't handle type inference or import resolution, but it builds a structure that allows those processes to occur.

One of the most important things the AST carries over from tree-sitter is unsupported-but-parseable syntax. The AST still represents it, so a later phase of the pipeline can handle it and provide more specific diagnostic information.
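To make that concrete, here's a small Python sketch of a compiler-owned AST that keeps spans and carries an explicit "unsupported" node. Everything in it is hypothetical: the class names, the `to_ast` and `check` helpers, and the node kinds `integer` and `bit_array` were picked for the example and aren't taken from Reggie or tree-sitter-gleam.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int
    end: int


@dataclass(frozen=True)
class IntLit:
    value: int
    span: Span


@dataclass(frozen=True)
class Unsupported:
    """Syntax that parsed fine but that the compiler can't handle yet."""
    kind: str
    span: Span


def to_ast(kind: str, text: str, start: int):
    """Convert one (hypothetical) CST node into a compiler-owned AST node."""
    span = Span(start, start + len(text))
    if kind == "integer":
        return IntLit(int(text), span)
    # Keep the node instead of failing so a later phase can report on it.
    return Unsupported(kind, span)


def check(node) -> list[str]:
    """A later phase turns Unsupported nodes into specific diagnostics."""
    if isinstance(node, Unsupported):
        return [f"{node.kind} is not supported yet "
                f"({node.span.start}..{node.span.end})"]
    return []


literal = to_ast("integer", "42", 0)
bits = to_ast("bit_array", "<<1, 2>>", 10)
```

The point of the design is that `to_ast` never throws information away: the span survives, so the eventual diagnostic can point at the exact source range.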

Name Resolution & Type Checking

After AST construction, the resolver phase maps textual names & imports, i.e. a record of what the source literally says, to the targets they refer to. Then the type checker "solves" for types², assigns them to symbols, and checks them. The now-typed modules carry expression/type metadata used by the lowering³ phase.
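A minimal Python sketch of the name resolution half, assuming a much simpler scope model than a real resolver needs. `Target`, `resolve`, and the `app` module name are invented for the example, and type checking is left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    module: str  # module that owns the definition
    name: str    # name of the definition inside that module


def resolve(name, local_defs, imports, current_module):
    """Map a textual name to its target, or None if it is unbound."""
    if "." in name:
        # Qualified reference such as `list.map`: look up the import alias.
        alias, member = name.split(".", 1)
        module = imports.get(alias)
        return Target(module, member) if module else None
    if name in local_defs:
        return Target(current_module, name)
    return None


# What the source literally says: two imports and two local definitions.
imports = {"list": "gleam/list", "io": "gleam/io"}
local_defs = {"main", "add"}

qualified = resolve("list.map", local_defs, imports, "app")
local = resolve("add", local_defs, imports, "app")
unbound = resolve("subtract", local_defs, imports, "app")
```

An unbound name comes back as `None` here; a real resolver would attach a diagnostic with the span of the offending name instead.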

IR

As I mentioned earlier, this is where Reggie moves beyond the patterns in Beacon. An intermediate representation gives the compiler's backend the information it needs to answer questions about what executable code it should produce. The span information and syntax are translated into symbols: functions with parameters, locals, return types, & application binary interfaces (ABIs), stable IDs for local variables, and a categorization of function calls.

So beyond syntax and source code, the IR creates the primitives for the explicit set of instructions needed in emitted code. More specifically, managed values (collections like strings, lists, & tuples) are made explicit, and failure paths from control flow & pattern matching are lowered into panics, todos, and failures. Runtime memory operations are also made explicit.

For example, source code may look like this:

pub fn main() {
  let x = add(1, 2)
  x + 3
}

While the AST stays close to the source, telling the compiler there's a function, a block, a let, a call, and an operator expression, the IR should make things more concrete by saying:

allocate a local slot for x
call the lowered function symbol for add
store the result in local x
read local x
emit integer addition
return the result
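Those steps can be written down as a flat instruction list. The Python sketch below is my own toy encoding, not Reggie's actual IR: the opcode names, the `Instr` type, and the little stack interpreter are all hypothetical, and I'm assuming `add` simply adds its two arguments since the snippet above doesn't define it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    op: str
    arg: object = None


def lower_main() -> list[Instr]:
    """Hand-lowered IR for the `main` function above."""
    x = 0  # stable ID for the local slot that holds `x`
    return [
        Instr("alloc_local", x),
        Instr("const_int", 1),
        Instr("const_int", 2),
        Instr("call", "add"),
        Instr("local_set", x),
        Instr("local_get", x),
        Instr("const_int", 3),
        Instr("int_add"),
        Instr("return"),
    ]


def run(code, functions):
    """Tiny stack interpreter, only here to show the IR is executable."""
    stack, local_slots = [], {}
    for instr in code:
        if instr.op == "alloc_local":
            local_slots[instr.arg] = None
        elif instr.op == "const_int":
            stack.append(instr.arg)
        elif instr.op == "call":
            # Every call in this sketch takes exactly two arguments.
            right, left = stack.pop(), stack.pop()
            stack.append(functions[instr.arg](left, right))
        elif instr.op == "local_set":
            local_slots[instr.arg] = stack.pop()
        elif instr.op == "local_get":
            stack.append(local_slots[instr.arg])
        elif instr.op == "int_add":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif instr.op == "return":
            return stack.pop()


result = run(lower_main(), {"add": lambda a, b: a + b})
```

Nothing in the list refers back to source text or spans anymore; each entry is something a backend can translate more or less one-to-one.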

Handling the Gleam Standard Library

Gleam's standard library⁴ is a Hex package that you can install as a dependency in your Gleam project. Currently my big project is reversing a transitional registry of the standard library's code (basically a re-implementation & translation that I thought was a good idea).

IO is a host boundary, not pure library code, so gleam/io.print and gleam/io.println are not treated as normal library functions implemented in Wasm. Instead, they map to host imports through an ABI table.
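A Python sketch of what such a table could look like. The import module name `host`, the field names `io_print` and `io_println`, and the `classify_call` helper are placeholders I made up; only the two Gleam function names come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostImport:
    module: str  # Wasm import module name (hypothetical)
    field: str   # Wasm import field name (hypothetical)


# Hypothetical ABI table: (Gleam module, function) -> host import.
ABI_TABLE = {
    ("gleam/io", "print"): HostImport("host", "io_print"),
    ("gleam/io", "println"): HostImport("host", "io_println"),
}


def classify_call(module: str, name: str) -> str:
    """Decide whether a call crosses the host boundary."""
    host = ABI_TABLE.get((module, name))
    if host is not None:
        return f"import {host.module}.{host.field}"
    # Anything not in the table is compiled like ordinary code.
    return f"call {module}.{name}"


io_call = classify_call("gleam/io", "println")
list_call = classify_call("gleam/list", "map")
```

Keeping the table as plain data means the set of host-boundary functions can be checked in one place.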

The registry was a mistake because the current setup mixes three responsibilities that should be separate, in ways that will become harder to maintain: library interfaces, behavior, & runtime/ABI primitives.

For example, gleam/list.map is not a compiler primitive. It is normal Gleam code using functions, lists, recursion, and closures. Implementing it at the compiler/runtime level duplicated upstream behavior and basically created a second standard library.

However, string allocation, closure calls, dynamic value layout, and host println are compiler/runtime/ABI concerns. Those belong in Regulus.

So I've been spending most of the weekend refactoring, which I feel is the right decision because it gives Reggie a cleaner ownership model:

gleam_stdlib source owns: list/map/result/option/function/string API behavior where expressible
Reggie's runtime owns: memory layout, primitive representation, allocation, closure ABI
Reggie's ABI owns: host imports, JS package asset validation, boundary type rules
The dependency loader owns: finding/loading package source and assets
The resolver & type checker own: interfaces from real package source/metadata

Linking Dependencies

At a high level, dependency linkage happens like this:

load project/dependencies
-> build module/interface map
-> parse/resolve/type-check all selected modules
-> lower each typed module to IR
-> link all lowered IR modules into one backend module
-> emit Wasm

Let's break this down starting with the interface map. The compiler loads module interfaces so project code can resolve and type-check imports. For source-backed dependencies, selected dependency modules are loaded as normal Gleam source modules and compiled through the same pipeline as project modules.

Because of the interface map, the compiler can understand the ownership of a dependency, preventing the linker from guessing ownership from names later. Thus, the compiler knows whether a call is same-project, dependency, stdlib, host external, etc.
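As a rough illustration, an interface map that records ownership up front might look like the Python sketch below. The `Owner` categories mirror the list in the paragraph above, but the map's shape, the module names `app`, `wibble`, and `app/ffi`, and the `owner_of` helper are assumptions for the example.

```python
from enum import Enum


class Owner(Enum):
    PROJECT = "project"
    DEPENDENCY = "dependency"
    STDLIB = "stdlib"
    HOST = "host"


# Hypothetical interface map: module name -> (owner, exported names).
INTERFACES = {
    "app": (Owner.PROJECT, {"main"}),
    "wibble": (Owner.DEPENDENCY, {"wobble"}),
    "gleam/list": (Owner.STDLIB, {"map", "filter"}),
    "app/ffi": (Owner.HOST, {"now"}),
}


def owner_of(module: str, name: str) -> Owner:
    """Look up who owns a call target, failing loudly on unknown names."""
    owner, exports = INTERFACES[module]
    if name not in exports:
        raise KeyError(f"{module} does not export {name}")
    return owner


map_owner = owner_of("gleam/list", "map")
```

Because ownership is looked up rather than inferred, a module named like a stdlib module but loaded from somewhere else can't be misclassified later.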

After type checking, every typed module lowers to its own IR module. This lets the lowerer classify calls and then mark imports as external, so as not to export external dependencies in the executable code.

Then the linker renames modules. For every function, constant, constructor, helper, lifted anonymous function, and import wrapper, the linker creates a deterministic backend name based on metadata recorded in previous phases.

Once rewritten, these renamed pieces are concatenated into one backend module that's ready to be translated to Wasm code.
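Here's a small Python sketch of the rename-then-concatenate idea. The naming scheme (`module__kind__name` with `/` swapped for `$`) and the alphabetical module order are invented for illustration and are not how Reggie names things; the property that matters is that the same input always produces the same names.

```python
def backend_name(module: str, kind: str, name: str) -> str:
    """Build a deterministic backend symbol from recorded metadata."""
    return f"{module.replace('/', '$')}__{kind}__{name}"


def link(modules: dict[str, list[tuple[str, str]]]) -> list[str]:
    """Rename every item, then concatenate modules in a stable order."""
    linked = []
    for module in sorted(modules):
        for kind, name in modules[module]:
            linked.append(backend_name(module, kind, name))
    return linked


# Hypothetical lowered modules: module name -> [(kind, item name)].
lowered = {
    "gleam/list": [("fn", "map")],
    "app": [("fn", "main"), ("const", "limit")],
}
symbols = link(lowered)
```

Including the module and the kind in each name keeps an `app.map` and a `gleam/list.map` from colliding once everything lives in one module.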

I hope this was informative! Thanks for reading.

If you want to learn more you can check out Reggie's docs and codebase.

Regulus: Docs for Regulus (aka Reggie), the experimental Gleam to WebAssembly compiler. https://reggie.desertthunder.dev

GitHub - desertthunder/regulus: a gleam to wasm compiler. https://github.com/desertthunder/regulus

¹ You can trace through this whole process in Gleam's codebase here: https://github.com/gleam-lang/gleam/blob/main/compiler-core/src/dependency.rs
² This is with a Hindley-Milner style type checker. See https://stormlightlabs.github.io/beacon/learn/hm.html to learn more.
³ In this post (https://desertthunder.leaflet.pub/3mnxtbca26k2v) I give a brief overview of lowering.
⁴ https://gleam-stdlib.hexdocs.pm/