Over the last year and a half I haven’t really written much stuff, because I have been working on a small content-addressed programming language called mimsa. I’m not really sure why I am doing this, other than that I watched a neat tutorial about how typecheckers worked, and then got somewhat carried away.
I has sat down to write about my recent implementation of property testing in said lang, then realised that before I can do that I should probably provide some context.
So, what IS a content-addressed language?
According to Unison, the language that pioneered this concept, this means “code is immutable and identified by its content”.
Let’s take that apart a bit:
Say I have a function for adding two numbers:
add :: Int -> Int -> Int
= a + b add a b
Instead of just dumping this in a text file, I would “save” this function. In
the mimsa
repl, I would do this:
:> :bind add = \a -> \b -> a + b
Bound add.
This has saved the expression \a -> \b -> a + b
in the project, and pointed add
to refer to said expression.
Let’s use our new function:
:> add 1 3
4 :: Int
:> add 2 2
4 :: Int
Hopefully, no surprises there.
Let’s make a new function that uses add
:
:> bind add3 = add 3
This has saved the expression add 3
in the DB, and pointed add3
at it. More
importantly, it has also remembered exactly which add
function it used (more
on this shortly).
Like
haskell
orelm
,mimsa
functions only take one argument, so applying3
toadd
(which has typeInt -> Int -> Int
) returns a new function with typeInt -> Int
.
Again, let’s test it and check for surprises:
:> add3 2
5 :: Int
:> add3 0
3 :: Int
All seems well.
Now what if we decide to make add
evil? We are allowed to rebind it, so let’s
do that:
:> bind add = \a -> \b -> a + b + 1
Updated binding of add.
Now when we use add
we’ll get wonky answers as expected.
:> add 2 2
5 :: Int
:> add 0 0
1 :: Int
However, look what happens when we use add3
:
:> add3 1
4 :: Int
We still get the original correct result.
So, what has happened here?
The answer is in how mimsa
expressions are stored. When we stored add
, we
stored not only the expression \a -> \b -> a + b
, but the fact that
add
has no dependencies. Internally, it would look a bit like this:
-- | Distinct type for our hashes, a wrapper around Text
newtype ExpressionHash
= ExpressionHash Text
-- | Simplified version of our StoreExpression type
data StoreExpression =
StoreExpression {
-- | the code for this expression
expression :: Text,
-- | a map from the names of dependencies we've used to their hashes
dependencies :: Map Name ExpressionHash
}
-- how our `add` expression is stored internally
addExpression :: StoreExpression
=
addExpression StoreExpression
= "\\a -> \\b -> a + b",
{ expression = mempty
dependencies }
Once we’ve made the addExpression
, we can create a hash of it, which
we’ll pretend looks like "abc123"
in hexadecimal. Then when we bind
the
name add
to it, we can store it in the project:
-- | Simplified version of our Project type
data Project
= Project
-- | map from names like `add` to hashes
{ bindings :: Map Name ExpressionHash,
-- | map from hashes to expressions
store :: Map ExpressionHash StoreExpression
}
projectWithAdd :: Project
= Project
projectWithAdd : fromList [("add", "abc123")],
{ bindings: fromList [("abc123", addExpression)]
store }
Now when we create add3
, the following happens:
add3Expression :: StoreExpression
=
add3Expression StoreExpression
= "add 3",
{ expression = fromList [("add", "abc123")]
dependencies }
When evaluating add 3
, mimsa
has looked in the project for something called
add
, found it, and then saved that it is needed in this expression. Because
we refer to add
by a hash of it’s content, this is what makes it “content
addressed”.
Assuming add3
has a hash of "def456"
, our project would look like this now:
projectWithAddAndAdd3 :: Project
= Project
projectWithAddAndAdd3 : fromList [
{ bindings"add", "abc123"),
("add3", "def456")
(
],: fromList [
store"abc123", addExpression),
("def456", add3Expression)
(
] }
Now, when we come to update add
and make it evil, we create the following:
evilAddExpression :: StoreExpression
=
evilAddExpression StoreExpression
= "\\a -> \\b -> a + b + 1",
{ expression = mempty
dependencies }
Assuming it has a hash of "ghi789"
, then the project looks like this:
projectWithAddAndAdd3 :: Project
= Project
projectWithAddAndAdd3 : fromList [
{ bindings"add", "ghi789"),
("add3", "def456")
(
],: fromList [
store"abc123", addExpression),
("def456", add3Expression),
("ghi789", evilAddExpression)
(
] }
The project has been updated, but note we haven’t deleted any items from the
store
. All that has been updated in the project is the binding for add
.
This means any new uses of add
will use the new broken
function, but add3
is completely unaffected. If you did want add3
to adopt
the new terrible behaviour, it would be a case of binding it again, where it
would use the broken add
function from the project.
Why would you do this?
Maximum cachability
Because any of the expressions we can created don’t change once created, they
don’t need reparsing or typechecking (or transpiling to JS, etc) over and over. add
will have the type
Int -> Int -> Int
forever, meaning that the idea of a slow build is kinda
removed.
No namespacing issues
In many languages conflicting names of packages or imports can cause issues. We might want to use function A from package version 1.1 but function B from package version 1.3. Most package managers won’t let you do this (or they let it sort of happens sometimes, but weird stuff happens with globals interacting, I’m looking at you React). Expressions in a content-addressed language refer to each other by hashes rather than names, so function A and function B will have no idea about one another unless they depend on one another somehow, and everything works great.
Granular code sharing
Because of the above, the idea of the package as the unit of shared functionality is somewhat obselete. Unison Share demonstrates what it could be like if we shared smaller units instead.
Tests don’t need running over and over
If we test that add 1 3 == 4
, and I know that add
is not going to change,
then I can keep this test result around and don’t need to run it again. When
add
is rebound, we can make a copy of the test and run it on the new
implementation to see if the same test still passes. Property tests, it turns
out, aren’t quite so simple, but I’ll come to that in a later post.
Doesn’t this mean that most developer tools that are built assuming programming means text files and diffs to said files are all unusable here?
Ahem.
That’s all the words
That’s some context of what content-addressed languages are, at least, my weird
understanding of them anyway. The actual implementation in mimsa
is more
complicated than above (for instance, we store the raw AST rather than the
text syntax so we can change the syntax without breaking the expressions, and
store the history of name bindings rather than just the newest one) but
hopefully it gives you a clue about what is going on under the hood.
Make sense? If not, get in touch!
Further reading: