Recently I thought it would be good to start compiling the small functional language mimsa I’ve been messing around with for the last few years to LLVM. There’s a few really helpful resources - Joseph Morag’s Micro-C series, and Mapping High Level Concepts to LLVM IR in particular, but neither go into converting functional programming paradigms such as pattern matching, lambdas and ADTs in much depth.
So, here we’re going to give it a go. The plan is that we start with an entirely working but tiny language, get it all working, and then as we add features to the language, we’ll also add a typechecker and introduce more LLVM concepts. The source code for the whole working calculator, along with a basic repl, can be found at llvm-calc.
Our language
The first iteration of our language is going to be the most basic of calculators. Here is a valid program in our language:
1 + 2 * (1 + 100) - 100
Here are the datatypes for our language. We will use this to represent our programs internally.
data Expr ann
= -- | An integer value
EPrim ann Int
| -- | Two values combined with an operator
EInfix ann Op (Expr ann) (Expr ann)
deriving stock (Eq, Ord, Show, Functor, Foldable, Traversable)
-- | Operators for combining integers
data Op = OpAdd | OpMultiply | OpSubtract
deriving stock (Eq, Ord, Show)
Here’s some example values:
-- | the number `1`
one :: Expr ()
= EPrim () 1
one
-- | the expression `2 + 3`
twoPlusTwo :: Expr ()
= EInfix () OpAdd (EPrim () 2) (EPrim () 3) twoPlusTwo
We call datatypes like Expr
Abstract Syntax Trees (or ASTs).
You may notice the ann
type variable. We’ll use this to “decorate” items in
our AST. For now, we’ll just use ()
.
Parsing text into AST terms
Although we could provide our compiler as a Haskell library and ask our users
to manually create Expr
values for us, it’d be much nicer to be able to read
input from a user.
This process is called parsing, and looks something like this:
parseInput :: Text -> Either ParseError (Expr ())
Given some user input, parsing returns either a valid Expr
or a (hopefully)
helpful error.
The whole parser can be seen here. We’ll not go into too much depth as there are already lots of great references on parsing, but a few details are worth noting:
Which library?
We’ll be using a library called megaparsec. It’s fast, it’s got a great tutorial, and it generates errors that we can render nicely with the diagnose library.
Dealing with whitespace
Although some languages make whitespace meaningful, we won’t be bothering with
any of that for now. However we would like to allow our users some flexibility
with how they lay out their code. 1+1
and 1 + 1
should mean the same thing. That means we either need a lot of code like this:
parseInfix :: Parser Int
= do
parseInfix <- space0 -- this accepts 0 or more spaces
_ <- parseInt
a <- space0 -- space between number and operator
_ <- parseOp
op <- <- space0 -- space between operator and second number
<- parseInt
b pure (Infix op a b)
…or we need a neat way of automating this. megaparsec
has a handy thing
called a lexeme
. This is a sort of rule that says
“when eating this value, also eat all the whitespace before it”. When you see
myLexeme
in the code, this means we’re wrapping a parser with
whitespace-eating powers. This is
covered in much better detail in the
megaparsec tutorial.
Parsing infixes
One problem with parser combinator libraries is that it’s easy to get into an
infinite loop. megaparsec
has a bunch of useful helpers for this, in
particular, support for Operator
.
-- | to make infix stuff like easier, `megaparsec` provides
-- the `Operator` type
table :: [[Operator Parser (Expr Annotation)]]
=
table "*" (EInfix mempty OpMultiply)],
[ [binary "+" (EInfix mempty OpAdd),
[ binary "-" (EInfix mempty OpSubtract)
binary
] ]
A nice part of the the above code is that by putting *
in a group before +
and -
, we have defined operator precidence and so 6 * 1 + 1
equals 7
rather
than 12
(which we’d get from 6 * (1 + 1)
).
Source code location
You may have noticed the ann
type variable in our Expr
type. This allows us
to “decorate” our AST nodes. In parsing, it’s useful to decorate each node with
it’s location in the code.
We use the following datatype for this:
data Annotation = Location Int Int
deriving stock (Eq, Ord, Show)
The two Int
values are “character number we start at” and “length of the
string”. Therefore parsing ” 100” should return:
parseResult :: Expr Annotation
= EPrim (Location 1 3) 100 parseResult
We won’t be using these values today as our language is so limited that it’s pretty difficult to break, but as we add multiple types (and thus, the possibility of type errors) we’ll use them to show the user where they did a boo boo.
A simple interpreter
Before we get stuck into LLVM
, it’s good to be able to evaluate our language
internally. Although it’s pretty clear how this calculator should work, it
will be useful to compare our own simple interpreter with the LLVM output as
things get more complicated.
You can view the full code here. As you can see, there isn’t very much of it:
interpret ::
Expr ann ->
Expr ann
EPrim ann p) =
interpret (-- don't do anything with these
EPrim ann p
EInfix ann op a b) =
interpret (-- attempt to simplify these
interpretInfix ann op a b
interpretInfix ::
->
ann Op ->
Expr ann ->
Expr ann ->
Expr ann
OpAdd (EPrim _ a) (EPrim _ b) =
interpretInfix ann -- add the int values
EPrim ann (a + b)
OpSubtract (EPrim _ a) (EPrim _ b) =
interpretInfix ann -- subtract the int values
EPrim ann (a - b)
OpMultiply (EPrim _ a) (EPrim _ b) =
interpretInfix ann -- multiply the int values
EPrim ann (a * b)
=
interpretInfix ann op a b -- one of our arguments isn't yet a number
-- so simplify the leaves and try again
interpretInfix ann op (interpret a) (interpret b)
When we see an EInfix
value, we look for EPrim
values inside and add
/ subtract
/ multiply
them. If they’re
not EPrim
values yet, then they must be nested expressions, so we interpret
them
and try again.
Enough nonsense, let’s do some compiling
Hopefully we understand what our small language is now, so let’s get down to the business of turning it into real life native code.
What is LLVM
LLVM stands for Low Level Virtual Machine. The idea is that higher level languages compile into LLVM, and then LLVM is turned into whatever local version of assembly is required. This means that by using LLVM, your programs will work on lots of architectures without you needing to understand a tremendous amount about them. As somebody who understands pretty much nothing about any processor architecture, this is very appealing indeed. If you need further persuading, Rust
, Swift
and GHC
can all compile to
LLVM.
LLVM is part of the clang
C compiler, and as a result, it’s very C shaped.
It’s got a bunch of number types, structs, and arrays, functions, and enough pointers to have a
very bad time.
Our “runtime”
In order to do things our language will need a runtime, This is a bunch of helper code written in something that be compiled to the target
language. Because we’ll be compiling using clang
, we’ll use C
. I don’t know much about C
, but what I can tell you is that this file contains a single function called printint
, that takes an int
and prints
it to stdout
.
#include <stdio.h>
#include <stdlib.h>
void printint(int i) {
("%d", i);
printf}
This is the only C
you’ll need to see today. I’m sorry about that. Onwards!
Our first LLVM module
LLVM is organised in modules.
To start us off, here is the IR (Intermediate Representation) for printing the number 42
to stdout
.
; ModuleID = 'calc'
declare external ccc i32 @printint(i32)
define external ccc i32 @main() {
%1 = call ccc i32 @printint(i32 42)
ret i32 0
}
Let’s take this apart line by line:
; ModuleID = 'calc'
This is a comment. See, LLVM is easy.
declare external ccc i32 @printint(i32)
This declares an external function we’d like to use. In this case it’s the
printint
function from our standard library defined above. It
takes a single argument, an i32
, and returns an i32
.
define external ccc i32 @main() {
This defines a function called main
that takes no arguments and returns a
single i32
.
The ccc
part is the “calling convention”. This defines the manner in which
LLVM will generate the function code. There are absolutely loads of these, many
designs to optimise specific languages. We’ll keep to ccc
for now, but may
want to change our minds when we start doing tail calls.
%1 = call ccc i32 @printint(i32 42)
This line of defines a variable %1
, and sets it the result of calling the
printint
function we imported above. We pass the printint
function a single
argument, an i32
value of 42
. This will print the number 42
to stdout.
ret i32 0
Functions in LLVM
must return something (although that thing can be null
).
This line returns an i32
value of 0
. Because this is the main
function of
our program, this is what will be returned to the operating system as our exit
code. 0
means “great job, everything went fine”.
}
This is a closing curly brace. LLVM
is like C
, we have to specifically
end our functions. No great hardship though.
Compiling and running our hand-baked module
To check that the code above does what we say it does, and to check we’ve got
everything installed that we need, let’s write the LLVM to a file called
module.ll
, the C code to runtime.c
, and then compile it using clang
.
clang -Wno-override-module -lm module.ll runtime.c -o a.out
This should output an executable called a.out
. If we run it, we should see
this:
./a.out
42
Hooray!
Compiling our programming language to LLVM
OK. So now we’ve had a taste of the raw power available to us, let’s get down to business. Although we could just create raw LLVM IR by hand, instead we will use the following libraries:
llvm-hs-pure - a set of types for LLVM IR, along with some helpful building functions.
llvm-hs-pretty - a
prettyprinter for llvm-hs-pure
.
We’ll use the datatypes in llvm-hs-pure
to create modules, then use llvm-hs-pretty
to render these to files and then compile by hand. This means that when we
invariably generate LLVM errors, we’ll at least we able to look at the error
and reference the code we’ve given it to try and work out what went wrong.
Shut up and show me some Haskell code
The whole module can be seen here, but let’s look at the highlights:
primToLLVM :: Int -> LLVM.Operand
= LLVM.int32 (fromIntegral i) primToLLVM i
This function creates integer literals from our EPrim
constructors, using
the int32 function.
This returns an Operand
(an LLVM value, broadly) that we can pass to other LLVM functions. We’ll
expand this function to include more types as we need them.
exprToLLVM ::
LLVM.MonadIRBuilder m,
( LLVM.MonadModuleBuilder m
=>
) Expr ann ->
LLVM.Operand
m EPrim _ prim) =
exprToLLVM (pure $ primToLLVM prim
EInfix _ OpAdd a b) = do
exprToLLVM (<- exprToLLVM a
lhs <- exprToLLVM b
rhs
LLVM.add lhs rhsEInfix _ OpSubtract a b) = do
exprToLLVM (<- exprToLLVM a
lhs <- exprToLLVM b
rhs
LLVM.sub lhs rhsEInfix _ OpMultiply a b) = do
exprToLLVM (<- exprToLLVM a
lhs <- exprToLLVM b
rhs LLVM.mul lhs rhs
Here we compile our Expr
into and take care of adding /
subtracting / multiplying integers.
Because we’re using the MonadIRBuilder
and MonadModuleBuilder
, it’s almost as if we’re writing code to do the
interpreting by hand, as all the actual code output is plumbed away in the
monad. If you squint, it looks very similar to the interpreter we wrote earlier, so if you
understand that, you pretty much understand this.
-- | given our `Expr` type, turn it into an LLVM module
toLLVM :: Expr ann -> LLVM.Module
=
toLLVM expr "example" $ do
LLVM.buildModule -- import `printint` from our standard library
-- it takes an `i32` and returns an `i32`
<- LLVM.extern "printint" [LLVM.i32] LLVM.i32
printInt
-- create a function called `main` that will be the entry point to our
-- program
"main" [] LLVM.i32 $ \_ -> do
LLVM.function -- build the LLVM AST for our expression
<- exprToLLVM expr
ourExpression
-- print our result to stdout
<- LLVM.call printInt [(ourExpression, [])]
_
-- return success exit code of `0`
0) LLVM.ret (LLVM.int32
Finally, we wrap up our compiled Expr
and turn it into a
module. We are going to import printint
from our standard library, output the
response of our computation to stdout, and then return exit code 0
.
Let’s look at the generated LLVM for 6 * 8 - 3
:
; ModuleID = 'example'
declare external ccc i32 @printint(i32)
define external ccc i32 @main() {
%1 = mul i32 6, 8
%2 = sub i32 %1, 3
%3 = call ccc i32 @printint(i32 %2)
ret i32 0
}
Hopefully, nothing too surprising.
Running our code
As with earlier, our little compiler works by pretty printing the LLVM module
we created, then running clang
to compile it into an executable. I pretty
much stole the compiling code from
Micro-C (thanks/sorry, Joseph!).
You can see our slightly tattered version
here if for some reason you don’t believe any of this actually works.
Well that’s that
Congratulations, you are all low-level compiler experts now. Hopefully that was somewhat helpful. Next time we’ll be adding the equality operator and some basic control flow. Exciting!
Make sense? If not, get in touch!
Further reading: