Thoughts on Crafting Interpreters - Part 3

The unexpected part!

I liked hacking on Lox in Zig a lot, so I decided it would be great to make some changes to the language. It should be good for a better understanding of the book’s material, and probably will be a lot of fun!

Minor changes from the book

I looked back on what we’ve done in the book and decided to do some tweaking.

First, I made print to be a function and not a statement. Making it a statement was understandable, as there was no way to call anything in the language yet, and in order to see some evaluation results print statement was introduced. Changing it to be a function has some benefits.

First, we can pass more than one argument, and automatically concatenate those with a space. It’s not the same as printf in C, and with other functions that accept format strings in various languages, but still handy:

print(1, 2, 3) // prints: 1 2 3\n

Secondly, we can pass it as a callback to other functions, which can’t be done with statements. Lox isn’t a real-world language, so this isn’t something that people will probably want to be able to do, but still, it’s an advantage over print being a statement.

Speaking of passing functions - why are there no lambdas in the book? IIRC, the first half had a “Challenge”, for implementing a support of anonymous functions, and in the second half, there was none. It’s not like it’s hard either. I added two possible ways to define an anonymous function.

The first one has exactly the same syntax as a named function, except you don’t specify the name:

fun (x, y, z) { return x + y + z; }

You can assign this expression to a variable, or call it immediately:

ziglox> var foo = fun (x, y, z) { return x + y + z; };
nil
ziglox> return foo(1, 2, 3);
6
ziglox> return (fun (x, y, z) { return x + y + z; })(4, 5, 6);
15

But it’s too verbose, so I added another syntax using a backslash:

ziglox> var bar = \x,y,z -> x + y + z;
nil
ziglox> return bar(1,2,3);
6
ziglox> return (\x,y,z->x+y+z)(4,5,6);
15

It’s a bit terse, and I’m not sure if I like it or not, but it works.

Speaking of backslashes - the book doesn’t deal with string escaping at all. I understand that dealing with it is kinda tricky - you need to escape quotes, escape the already escaped quotes and escape characters, you need to correctly unescape everything when printing, etc. So it being omitted from the book is OK, I guess, but I decided to implement this in the parser and in the tostring function that I added as one of the book’s “Challenges”:

ziglox> return "\"foo\""; // nested strings can be escaped
"\"foo\""
ziglox> print("\"foo\""); // printing correctly unescapes everything
"foo"

This will play nice with changes I’ll describe a bit later.

As a part of another “Challenge”, I changed the virtual machine in such a way that it avoids consecutive popping and pushing in most of the instructions where possible. This was supposed to make the VM slightly faster, however, it did not have any effect on speed, at least how I measured it.

Finally, I decided to add the type function, so the user will be able to tell what kind of an object the function received. This will play nicely with what I will talk about a bit later.

Objects gone

As you may remember, in my previous post on the topic of this book I said that I don’t want to implement classes, and instead would prefer more general hash tables. And, I also mentioned that the lack of arrays is a strange decision, that limits language capabilities quite heavily. So I decided to implement data structures!

Arrays

The first data structure I added was arrays. Adding arrays was easy enough so I still think it is weird that the book doesn’t mention them at all.

First, I needed a few new instructions:

pub const OPCode = enum(Code) {
    // ...
    Array,
    SetIndex,
    GetIndex,
    InvokeFromArray,
};

I chose to implement the syntax for indexing arrays as in most languages, i.e. array[index]. Similarly to how we did with invoking instances, I added other instructions for calling array indexes directly: array[index](args). It’s a minor optimization, but I wanted to do it to better grasp it. This meant that in the scanner I had to implement two new tokens:

--- a/src/zig/lox/scanner.zig
+++ b/src/zig/lox/scanner.zig
pub const TokenType = enum {
    // ...
    LEFT_BRACKET,
    RIGHT_BRACKET,
    // ...
};

// in the scanner's switch statement:
        switch (c) {
            '[' => return self.makeToken(.LEFT_BRACKET),
            ']' => return self.makeToken(.RIGHT_BRACKET),
            // ...
        }

In the compiler, these tokens receive a prefix and infix parse rules:

fn makeRules() []ParseRule {
    // ...
    r[@intFromEnum(TokenType.LEFT_BRACKET)] = .{ .prefix = Parser.array, .infix = Parser.arrayIndex, .precedence = .CALL };
    r[@intFromEnum(TokenType.RIGHT_BRACKET)] = .{ .prefix = null, .infix = null, .precedence = .NONE };
    // ...
}

Rules themselves are simple:

pub const Parser = struct {
    // ...

    fn arrayIndex(self: *Parser, canAssign: bool) LoxError!void {
        try self.expression();
        self.consume(.RIGHT_BRACKET, "Expect ']' after an index.");
        if (canAssign and self.match(.EQUAL)) {
            try self.expression();
            try self.emitByte(@intFromEnum(OPCode.SetIndex));
        } else if (self.match(.LEFT_PAREN)) {
            const argCount = try self.argumentList();
            try self.emitByte(@intFromEnum(OPCode.InvokeFromArray));
            try self.emitByte(argCount);
        } else {
            try self.emitByte(@intFromEnum(OPCode.GetIndex));
        }
    }

    fn array(self: *Parser, _: bool) LoxError!void {
        var elemCount: Code = 0;
        if (!self.check(.RIGHT_BRACKET) and !self.check(.EOF)) {
            var do = true;
            while (do or self.match(.COMMA)) {
                do = false;
                try self.expression();
                elemCount += 1;
                if (elemCount == 255)
                    self.error_("Can't have more than 255 array elements.", .{});
            }
        }
        self.consume(.RIGHT_BRACKET, "Expect ']' after expressions.");
        const arr = (try Array.new(self.vm, elemCount)).obj.asValue();
        try self.emitBytes(@intFromEnum(OPCode.Array), try self.makeConstant(arr));
        try self.emitByte(elemCount);
    }

    // ...
}

Now, onto the object itself.

First, I added the new struct, called Array:

pub const Array = struct {
    const Self = @This();
    obj: Obj,
    values: ArrayList(Value),

    pub fn new(vm: *VM, capacity: u8) LoxError!*Self {
        const obj = try Obj.allocate(vm, Self);
        const out = obj.as(Self);
        out.* = Self{
            .obj = obj.*,
            .values = ArrayList(Value).initCapacity(vm.allocator, @max(4, @as(usize, @intCast(capacity)))) catch
                return LoxError.OutOfMemoryError,
        };
        return out;
    }

    pub fn destroy(self: *Self, vm: *VM) void {
        self.values.deinit();
        vm.allocator.destroy(self);
    }

    pub fn markValues(self: *Self, vm: *VM) !void {
        for (self.values.items) |value| {
            try value.mark(vm);
        }
    }

    pub fn equal(self: *Self, other: *Self) bool {
        if (self.values.items.len != other.values.items.len) return false;
        for (self.values.items, 0..) |item, i| {
            if (!item.equal(other.values.items[i])) return false;
        }
        return true;
    }
};

As you can see, it’s basically a thin wrapper around the ArrayList from Zig’s standard library.

I also implemented the equality semantics. If array lengths are equal and all elements of both arrays are equal then the arrays are equal. We don’t do the deep comparison if arrays are the same objects. Pretty standard stuff.

In the VM, creating the array is straightforward:

const op: OPCode = @enumFromInt(frame.readByte());
switch (op) {
    // ...
    .Array => {
        const arr = frame.readConstant().as(*Array);
        const valueCount = frame.readByte();
        for (0..valueCount) |i| {
            arr.values.append((self.stackTop - (valueCount - i))[0]) catch
                return self.runtimeError("OOME: can't append to array", .{});
        }
        self.stackTop -= valueCount;
        self.push(arr.obj.asValue());
    },
    // ...
};

Maybe a bit sub-optimal, but we had to do it now, as expressions are not always constant.

Tables

A.K.A. dictionaries, hash maps, key-value storage, etc.

I have a firm belief, that an object is just a special case of a hash map, and not vice-versa. With closures, anonymous functions, and hash maps we can implement objects and most of their properties without resorting to making it a part of the language. Well, if we want this to be really efficient, it is probably better to make it part of the language, but I don’t care about OOP code efficiency, or OOP at all for that matter. So I removed all of the code that was related to instances, classes, and objects, and replaced it with a third of the code needed to implement hash tables.

Same as arrays, hash tables required new syntax. I decided to do it his way:

#{"foo" => "bar", "baz" => ["qux", "wax"]}

Additionally, string keys can be written as .key:

#{.foo => "bar", .baz => ["qux", "wax"]}

This matches the key access dict.key.

Implementing this syntax required adding another character to the scanner:

@@ -253,6 +242,7 @@ pub const Scanner = struct {
+            '#' => return self.makeToken(.HASH),
         }

And a new parse rule in the compiler:

fn makeRules() []ParseRule {
    // ...
    r[@intFromEnum(TokenType.HASH)] = .{ .prefix = Parser.hashTable, .infix = null, .precedence = .NONE };
    // ...
 }

The rule is again quite simple:

    fn hashTable(self: *Parser, _: bool) LoxError!void {
        self.consume(.LEFT_BRACE, "Expect '{' before table body.");

        var kvCount: u32 = 0;
        while (!self.check(.RIGHT_BRACE) and !self.check(.EOF)) : (kvCount += 2) {
            if (self.match(.DOT)) {
                self.consume(.IDENTIFIER, "Expect property name after '.'.");
                const s = try String.copy(self.vm, self.previous.name);
                try self.emitConstant(s.obj.asValue());
            } else {
                try self.expression();
            }
            self.consume(.FAT_ARROW, "Expect '=>' after key expression.");
            try self.expression();
            if (!self.check(.RIGHT_BRACE) and !self.check(.EOF))
                self.consume(.COMMA, "Expect ',' after value expression.");
        }

        self.consume(.RIGHT_BRACE, "Expect '}' after table body.");
        if (kvCount % 2 != 0) self.error_("Missing value expression", .{});
        if (kvCount > 255) self.error_("too many key-value expressions", .{});
        const table = (try HashTable.new(self.vm)).obj.asValue();
        try self.emitBytes(@intFromEnum(OPCode.Table), try self.makeConstant(table));
        try self.emitByte(@intCast(kvCount));
    }

In the VM it’s again, pretty straightforward:

const op: OPCode = @enumFromInt(frame.readByte());
switch (op) {
    // ...
    .Table => {
        const table = frame.readConstant().as(*HashTable);
        const valueCount = frame.readByte();
        var i = valueCount;
        while (i > 0) : (i -= 2) {
            const slot = (self.stackTop - i);
            _ = table.fields.set(slot[0], slot[1]) catch
                return self.runtimeError("OOME: can't append to array", .{});
        }
        self.stackTop -= valueCount;
        self.push(table.obj.asValue());
    },
    // ...
};

Array indexing also replaced the get and set functions for tables:

var tbl = #{"foo bar" => "baz"};
var key = "foo bar";

// before:
get(tbl, key); // "baz"
set(tbl, key, "qux");

// now:
tbl[key]; // "baz"
tbl[key] = "qux";

Perhaps a bit unconventional to be able to use array indexing on tables, but I find this OK.

In addition to hash tables, I added hash sets:

var set = #["foo", "bar", "baz"];

I thought about using #{} syntax as in Clojure, but it meant that I needed to choose what data structure to use once I met any of the keys and decide what was a syntactic error and what wasn’t. For example, is it an incorrect hash set or an incorrect hash map?

#{"foo", "baz" => "qux"}

Perhaps, it was meant to be a hash set, but was confused and added => between the elements, because it has the same syntax. Then the error something like "unexpected token '=>' in the hash set literal" would be the correct one. Or, if the programmer wanted a hash map and just forgot to add a value for the first key, then the error would be confusing. So I chose #[] instead.

Multiple value returns

That was a feature I wanted to add, but couldn’t find a suitable way of doing it.

One thing I don’t like about Lua is how it implemented multiple value returns. It has a lot of gotchas, and when something in the language has a gotcha it’s a bad thing in my opinion. You can read this article for a lot of interesting examples, so I’ll keep it brief here.

For example, here’s a function that returns several values:

function foo ()
   return 1, 2, 3
end

We can call a function that accepts multiple values, and pass it the result of calling this function, for example, print is one such function:

print(foo()) -- prints 1	2	3

We can prepend more arguments before the call to foo:

print(0, foo()) -- prints 0	1	2	3

But we can’t do the same after the call to foo:

print(foo(), 4) -- prints 1	4

That’s a gotcha, and it often hits people when they use table.insert in combination with some function that can return multiple values. For example, string.gsub:

local words = {"hello", "world"}
local capitalized_words = {}

for _, word in pairs(words) do
   table.insert(
      capitalized_words,
      string.gsub(word, "^%l", string.upper)
   )
end

If we execute this code, we’ll get the following error:

bad argument #2 to 'insert' (number expected, got string)

Why? Because string.gsub returns the string and then the index where the replacement occurred. And table insert accepts a table, index, and a new value: table.insert(t, 3, "foo"). You can omit the index, and just supply the value like this table.insert(t, "foo") and the value will be inserted as the last in the table. In other words, it’s the same as table.insert(t, #t, "foo"). When we pass string.gsub as the expression that should produce a value, it actually returns two values, and table.insert thinks that we use it’s three-argument arity and fails.

There are many more gotchas to multiple values than that, and the most annoying one perhaps is that you can’t store them in a variable to create a closure. You have to create a table with your variadic arguments if you want to use them in some other inner function. In other words, this code is invalid:

function a(...)
   return function()
      print(...)
   end
end

We get this error:

cannot use '...' outside a vararg function near '...'

Instead, we need to do this:

function a(...)
   -- or use table.pack(...) in lua 5.2 and above
   local va = {n=select("#", ...) ...}
   return function()
      -- unpack was moved to table.unpack in Lua 5.2
      print((table.unpack or unpack)(va, 1, va.n))
   end
end

This has a performance penalty, it is inconvenient, and if you have to mix variadic args from multiple functions you have to copy them to a single table, and then unpack them.

Sorry for this long tangent, but when I was thinking about how I would implement multiple values in Lox. I’m OK with the syntax of Lua here, e.g. return a, b, c and local a, b, c = foo(). However, this was hard to do properly.

I had to change how the return statement works. It’s not hard, we can read values until we find the ; token. We also count the values, and emit the count with the instruction:

fn returnStatement(self: *Parser) LoxError!void {
    if (self.match(.SEMICOLON)) {
        try self.emitReturn();
    } else {
        var valueCount: u32 = 0;
        var do = true; // do-while when?
        while (do or self.match(.COMMA)) {
            do = false;
            try self.expression();
            if (valueCount == 255)
                self.error_("Can't return more than 255 values.", .{});
            valueCount += 1;
        }
        self.consume(.SEMICOLON, "Expect ';' after return value.");
        try self.emitBytes(@intFromEnum(OPCode.Ret), @intCast(valueCount));
    }
}

OK, we can emit the necessary code, but what do we do with it?

One thing we can do is create a slice from the stack and return it as a single value. Then, when we use it as a function parameter, or in an expression, we do the appropriate thing. However, this isn’t particularly great, because the stack is mutable, and we manipulate it extensibly.

ziglox> var x = (\-> return 1, 2, 3)();
nil
ziglox> print(x)
1 2 3
ziglox> return x + 10;
20

Oops, we’ve pushed to the beginning of the stack, and x became 10. Well, perhaps we can keep the stack slice, and just not decrement it, but we need to decide when to decrement it, i.e. once multiple values are no longer used. Which is hard to track.

The reason I wanted to be able to store multiple values in a single variable is because I want them to behave more like tuples, but that are always on the stack, no matter what. So they remain fast, without additional memory allocations. And to be able to make a closure over the multiple values, so we don’t have to fiddle with them like in Lua. This, however, had problems all over the place.

I also needed to implement a syntax for assigning multiple values to multiple variables, but I felt that at this point I was kinda burned out with the project. I’m satisfied with the data structures that I added, and other stuff that is not available in the book. All in all, I feel that I had a good grasp of what this book tried to cover.

Comment via email