On Scheme's Dots

This is the first public post that I wrote because I’ve finally thought that I’m clever enough to do so. Yet I must mention that I have never actually studied Programming Language Theory (PLT), so I’m not a specialist on the topic I gonna talk about here.

Year ago I decided that I should expand my knowledge in programming languages, and software development in general. I’ve already programmed for several years in C for bare metal, and in C++, as well as used some scripting languages, like Awk and Perl. But all these languages are mostly imperative styles, so I wanted something different.

I had a friend talk about Lisp, and he showed me this post by Paul Graham and I was inspired by the things that it describes. I’ve become interested in Lisp, so I got my digital copy of SICP and started reading it.

That was a year ago.

I’ve just finished chapter one.

Not that I’m stupid or something, but some exercises there are brutal. I remember that I set 3 days on one task where you have to complete two missing lines. It may not be that hard for others but certainly was very hard for me. So it took me a full year to go through the first chapter, but that only thing already changed how I look at the code today.

Additionally to SICP, I’ve studied Rust by reading TRPL. Because I wanted to learn both Rust and Scheme, I’ve decided that it will be a great exercise to write my own Scheme in Rust! Except, I’ve only finished the first chapter of SICP, and I don’t know how language works. And, actually, I don’t know anything about language design.

So how do you write a language in Rust, when you never studied PLT, and only programmed in C, barely know Rust, only toyed with different interpreted languages, finished the first chapter of SICP, and wrote some very basic scripts in Scheme during it?

Well, that’s a good question, I guess. You may think that I decided to read about Scheme and Lisp, something like this, but no, actually I’ve decided that I’ll just throw some code to the Scheme REPL and see how it behaves.

And that’s where the fun started.

Reverse-Engineering Scheme

I’m not a qualified reverse engineer, and this doesn’t actually have that much with real reverse engineering. I called it this way because the process of defining rules by behavior observation is somewhat related to reverse-engineering process, but does not go as deep as real reverse engineering does. I’ve taken two languages, one was GNU Guile, which implements R5RS standard, and the other was Racket, which is not strictly a Scheme, but a Programming-Language Programming Language, and started throwing different code in theirs REPLs. The observed behavior of such a concept as the dot is what motivated me to write this post.

The good thing about Scheme, and actually one of the main reasons why it is so powerful, is that it is very homoiconic, and code directly represents the underlying data structure, which is a list. Lisp itself stands for List Processor, and I never gave this enough thought until I started working on an implementation of a language. You take most of the information for granted (or at least I do) when it is about something you do not interact with too often.

So what is a list in Scheme? Well, anything that is enclosed inside parentheses. And we know that most everything is enclosed with parentheses in Lisp, so what’s the point of this information? Why everything should be a list anyway?

Because lists are easy to manipulate.

We need to know that Scheme has two main data structures: list and atom. Lists consist of atoms or other lists and can be destructed into individual atoms, and atoms can be combined into lists. For example, digits like 1, 2, 3, are atoms. The list of these digits is represented as '(1 2 3).

In Scheme, there are three primitive procedures that work with lists: cons, car, and cdr. Names may be not very representative, but that are historical names (and Lisp has a very long history) so just accept it for now. What do these procedures do? cons stands for construction, and it is used to construct lists. Here’s what you will see if you start Racket REPL and input this code:

> (cons 1 '(2 3))
'(1 2 3)

Note: I’ll be using Racket’s REPL format to represent things in this post because Racket’s default REPL is minimal and only consists of > character for prompt and output is typically on the next line, compared to Guile’s REPL which is more busy looking with all of its additional info. The Guile was taken as reference language though. But in terms of this text, Guile and Racket behave the same.

car retrieves the first element of the list and cdr returns all elements of the list after the first one:

> (car '(1 2 3))
1
> (cdr '(1 2 3))
'(2 3)

Combined with recursion and a functional style of programming we can traverse the list by taking out the first element, modifying it, and creating new lists, leaving the rest of the list for the next iteration. That’s may sound like a basic thing, but it is a really powerful concept because everything in Scheme is a list.

But I’ve actually lied to you. cons, car, and cdr doesn’t work with lists. Those work with pairs.

Like, what? What is a pair? Previously I’ve said everything is a list or atom, and now I say that there’s something other than that? Well, yes, pair is a pair of atoms, it is written as a list, but values are separated by the dot: (1 . 2). Pair is not a list, but a more fundamental thing - a cons cell.

> (cons 1 2)
'(1 . 2)

And lists actually consist of pairs.

If you have ever written a linked list in C or some language that has pointers, you know that any list consists of nodes. Node stores information and a pointer to the next node. If there’s no node after the node it stores a pointer to a special location. It may point to itself, or to void or somewhere else, so you could distinguish the end of the list and stop traversing it. How many nodes does a linked list have to have to be called a list? The answer is one, but without a pointer, to the next special node you can’t use it, so actually, a minimal list consists of two elements: the element with the data and the empty node or void. So that thing also can be called a pair.

Well, at least this can be thought that way. And this actually aligns with what happens in Scheme! Look:

> (cons 1 '())
'(1)

Because lists are essentially nested pairs, the last pair in the list has '() (empty list) as its second atom. We can nest cons to produce longer lists:

> (cons 1 (cons 2 (cons 3 '())))
'(1 2 3)

Why does this work? Because, given that cons produces a pair, if we evaluate the code ourselves, we will get this:

'(1 . (2 . (3 . ())))

Let’s explain why '(1 . (2 . (3 . ()))) evaluates to '(1 2 3). My observations lead me to think that dot has two rules. The rules are as follows:

Dot must be the second to last element in the list,
Dot can be removed if after the dot comes an open parenthesis, alongside with open parenthesis and matching closing parenthesis.

Because (cons 3 '()) produces a pair of dot separated values, and the first one is 3, and the other is '() we get the '(3 . ()) result (because of how quote works we remove ' from '() and get () in the resulting pair). Now let’s apply the dot’s second rule to this code. First let’s remove dot in the most nested example: '(3 . ()) becomes '(3), so now we have '(1 . (2 . (3))). We can repeat this to (2 . (3), producing (2 3). Repeating this the last dot produces the simple list: '(1 2 3).

Note: The rules above may not be an actual thing though, because the dot is just syntax on top of the internal data structure, and it’s not stored in the list, but we’ll talk about it later.

So why does car and cdr work on lists, if those are actually working on pairs? Because car removes the element before the first dot, and cdr returns everything after it!

> (car '(1 . (2 . (3 . ()))))
1 ;; 1 is before the `. (2 . (3 . ()))'
> (cdr '(1 . (2 . (3 . ()))))
'(2 . (3 . ())) ; or '(2 3) for short

So every time we see something like (a b c), the interpreter actually sees (a . (b . (c . ()))). That’s why car and cdr work both with pairs and lists. Because lists consist of pairs! And that’s really beautiful.

And then I saw pure beauty

Yet, before I fully understood what I saw, I thought that my beautiful observation is not correct at all. But first, we need to know how to define procedures (functions) in Scheme:

> (define (sum a b) (+ a b))

This constructs a procedure sum that accepts two arguments a and b and produces the sum of those as a result with the following body: (+ a b). We can call this procedure this:

> (sum 1 3)
4

Because I was figuring out the rules behind the dot, which I’ve stated before, I was wondering what would happen if I use a dot in this syntax, like this:

(define (foo . bar) bar)

Not only this did not produce any errors, but this also works in a strange way:

> (foo 22)
'(22)

And at first glance, this looks like nonsense (well at least to me, remember, I didn’t finish SICP yet) and I expected it to fail because dot seem to be in the wrong place but it worked just fine.

“What the hell?” - I’ve thought. I’ve tried to pass two arguments to the function and it still worked:

> (foo 1 2)
'(1 2)

At this point, I was thinking that everything I thought I understand is wrong, but after some long thinking process, I understood this and was shocked at how beautiful this thing is.

So what’s the problem here? I’ve shown you the syntax for defining sum procedure: (define (sum a b) (+ a b), but what’s actually going underneath is that anonymous procedure with arguments (a b) is being bound to the name sum like so:

(define sum (lambda (a b) (+a b))

We can write all our procedures this way, but it is tedious to do. So previous form with parentheses around sum a b is a shorthand for this form.

So what’s happening when we put the dot into the function definition? To understand this we have to understand why everything is a list, and why we need primitive operations on lists.

Note: The following part of this post mostly relies on what I’ve already done with my own implementation of the scheme-like language and how it works, but I think that the real scheme works somewhat similarly. I will write more about my language later in this post and in the following post that is yet to come.

To convert procedure from form (define (foo bar) bar) to (define foo (lambda (bar) bar)) we need to know two things: procedure’s name, to put after define and it’s arguments to put into lambda. So we car the list (foo bar) and receive foo. Now we need the argument list and we cdr the (foo bar), and get (bar). Now we can put foo after define and (bar) after lambda: (define foo (lambda (bar) bar).

What will happen if we cdr the list which has a dot?

> (cdr '(foo . bar))
'bar

This means that we getting bar without parentheses around it. So doing all previous steps for (define (foo . bar) bar) expression gives us (define foo (lambda bar bar)) where lambda has no parentheses around it’s argument. Let’s try to use such a form of lambda directly:

> ((lambda x x) 1)
'(1)
> ((lambda x x) 1 2)
'(1 2)

This is a procedure with variable argument count! Why? Let’s car and cdr this code!

First we evaluate (lambda x x), which provides #<procedure> in Racket:

(#procedure 1)

Now we can car what we will apply, and cdr it’s arguments:

(car (#procedure 1))
#procedure
(cdr (#procedure 1))
'(1)

So the arguments of the function are '(1). Let’s recap what our lambda does. It was defined as (lambda x x), which means that it takes 1 thing and returns it immediately. The thing it takes is this '(1), that’s why we receive '(1) if we call this lambda with 1.

Doing the same for ((lambda x x) 1 2) provides '(1 2) as applied argument.

Why this doesn’t work for lambdas which have their argument in the parentheses? My guess is that it is because we provide a list of a certain length, and since we’re passing a list, the items in one list get bound to the items in another, in the lambda body.

OK, let’s recap what we know about Scheme so far:

~~Everything is a list;~~
Scratch that, everything is a pair, and every list is a sequence of nested pairs, ending with an empty list;
pair is a cons cell, which uses a dot to separate elements;
car and cdr use dot to decide what to take out of the cons cell.

Applying the Concepts Above In the Interpreter.

When I started writing this post I had no idea that I still will be writing it after I’ll have a working interpreter prototype. It is located here, and it uses the concepts developed above to evaluate the code. It’s not very good though.

The whole language is implemented around these base functions:

parse - parses text into GRScheme data structure,
eval - which is recursively traverses the tree,
apply - which applies the procedure to the tree,
get_first_subexpression - which takes first subtree from current tree,
get_rest_subexpression - which takes the rest subtrees as a tree.

Other functions in evaluator source code provide builtin facilities like car, +, lambda, define etc.

The first thing I have to say is that GRScheme doesn’t have such a thing as a dot. It is emulated to behave the same way as in a real Scheme. I suppose that regular Scheme does not have a dot either but in another sense.

After parsing the code we receive a tree structure that has nodes with data, and the tree directly represents the original code. This allows me to think about the Scheme code in a very native way, because it is already a tree, and it actually simplifies parsing a lot. I store parentheses as tree items, atoms are also tree items, but typically only parentheses have child nodes. This allows later traversing algorithms to check if the current node is a procedure call by just checking if the current node is an open parenthesis.

The get_first_subexpression and get_rest_subexpression functions are not exactly car and cdr as it may seem, but a more general ones that operate directly on the program AST. But because the dot is emulated in my implementation those currently also check for the . in the tree. If it is presented, get_rest_subexpression returns the subtrees without it.

This allows us to destruct pairs and lists with these two functions, but we can’t use those directly. Instead car and cdr are implemented in terms of these functions, also applying quoting on the result, but that’s a topic for a different post.

I’ve mentioned that dot is emulated in my implementation of the interpreter. What this means is that I store the dot directly inside the tree as a node. Let’s have a look at how '(a . b) looks in my language’s data structure:

Figure 1: '(a . b) Pair representation in GRScheme — Figure 1: `'(a . b)` Pair representation in GRScheme

Note: because I write quoted lists in the short form, there is a possibility of misunderstanding why the tree in the figures doesn’t directly match the input, e.g. has more parentheses. For instance '(a . b) is actually (quote (a . b)) and it is represented in the tree this way.

We have a tree, with a root node holding ( value, which has two nodes: quote and open parenthesis. This second ( has three nodes, which are a, ., and b.

In a real Scheme, this isn’t stored this way. Because pair is really a pair in Scheme, the cons cell actually stores two pointers to the data like this:

Figure 2: '(a . b) Pair in Scheme — Figure 2: `'(a . b)` Pair in Scheme

The lists are a bit different. In my implementation, the structure to which I parse the source code is a tree with any amount of child nodes per node, as seen in the pair example above That’s how '(1 2 3) list is stored in my implementation:

Figure 3: '(1 2 3) List in GRScheme — Figure 3: `'(1 2 3)` List in GRScheme

This is a tree, which is similar to the pair one, except instead of the dot we have 2. Let’s write list as a sequence of pairs '(1 . (2 . (3 . ()))) and see how it will be stored in my implementation:

We can see, that each subtree has exactly three nodes in it. However in a real scheme, it’s two nodes per tree, and the very same sequence of nested pairs is stored like this:

Figure 5: '(1 2 3) List in Scheme — Figure 5: `'(1 2 3)` List in Scheme

Because any list can be written as a sequence of pairs, we can see that any list in a real scheme is actually a binary tree. Not in the case of my language though.

The parser that I’ve written for GRScheme knows how to deal with dots, thanks to the dot rules I’ve described above, and these two constructions are parsed to exactly the same form:

And lists that don’t end with empty list, like '(1 2 . 3) and '(1 . (2 . 3)) are parsed like this:

Figure 7: List with pair at the end in GRScheme

Which, in a real Scheme, will look like this:

Figure 8: List with pair at the end in Scheme

So in the case of car in a real Scheme, we get 1, and in the case of cdr, we get the subtree '(2 . 3) when we apply it to this list and 3 when we apply it again to the previous result. This works the same way in my implementation, because we analyze the tree when we do cdr to see if there’s a dot in the second position, which may not be optimal, but works.

Dot in GRScheme

I’ve thought about dots during this post, and I think it is a very beautiful syntax solution to describe binary trees that do not end with a null pointer. However, this also means that dots are actually needed only for that case. Though we can emulate dots by storing those inside a tree directly, I don’t think that this is a good idea to do so.

Because the underlying structure in GRScheme is a graph, and not a binary tree, I think pairs are also meaningless, because they describe a particular tree case in Schemes, and in GRScheme we do not have this case. So one major difference from the real Scheme is that in GRScheme there are no cons cells, and some algorithms may depend on extracting the second element from the cons cell directly.

I’ve thought that it may be good to have a special procedure for that case, named along with car and cdr - Contents of Cons cell Representation. ccr should return the second element directly if it is the last element in the list, and return the rest of the list otherwise. This way it will work mostly the same as cdr but cdr always returns the list. I’m not sure how good this solution is, because this may be ambiguous when to use such a procedure, so there’s another procedure that may be used instead, which is cadr, which always takes the second element from the list.

I’m planning to write a second post about GRScheme diving more deeply into the details behind the language model I’ve developed. The current implementation is slightly wrong and has many problems, so I have to fix those first.

Comment via email