GRScheme Design Part 1 - Parsing

At the end of my previous post, I mentioned that I will write more about GRScheme - the language that I’m working on. As of this moment, I’m rewriting it to be more robust and to fix some design flaws that make it impossible to do such things as tail call elimination. This series of posts will describe various things I’ve done to build a working interpreter and build a language around it.

But first, let’s define what GRScheme is, and what I expect it to be when I finish development:

GRScheme is an interpreted functional language with scheme-like syntax, which uses graph reduction as a basis for evaluation.
GRScheme is not a practical programming language built for wide use and is only a self-educational project.

Note: There are far more comprehensive and reliable lisps out there, so if you are interested in one for some practical use, you have plenty to choose from. E.g. Arc, Clojure, Common Lisp, GNU Guile, Racket, e.t.c., and GRScheme are not part of this list. Please keep this in mind.

The basic process of program execution should be:

Here, Reader is a special unit that reads user input from the REPL or file contents character by character and produces GRScheme-data, a special data structure that we will be used for evaluation. The evaluator consumes this data structure and produces the effect by using graph reduction to provide the final result.

There’s also dashed block, that contains Compiler, and sub-blocks Macro-expander, and Optimizer. Currently, I already have a working Evaluator and Reader, and thinking about a possible way to produce an intermediate result that will speed up evaluation. There definitively will be macro-expander at some point, though it doesn’t require real compilation to be applied. But this is gonna wait till the next posts, as I have to figure this out first.

In this post, I would like to talk about the data structure I use to represent programs because it is an interesting topic in the language, that I’m using for writing language core which is the Rust Programming Language.

Let’s look at an example program in GRScheme:

(+ 1 (* 2 3 4) (/ 9 (+ 1 2)))

Code Snippet 1: arithmetic expression

This is a mathematical expression that can be written as this formula:

$1+2\cdot3\cdot4+\frac{9}{1+2}$

Which produces such tree:

However, this tree is not presented like this in GRScheme-data format.

First, this is not a binary tree, and even if it was, we still would need to implement some data structure in Rust that would represent this tree. So what’s the definition of such data structure would be?

Creating the `Tree` in Rust

If we will look at most tree definitions we will see that usually, we define some amount of sub-nodes, typically 2, for binary trees, but in our case, that would be a variable amount of siblings, and a field for data. For example in Python such Tree can be defined like this:

class Tree:
    def __init__(self, data = None, siblings = None):
        self.siblings = siblings
        self.data = data

Note: I’m not a Python programmer, and I don’t know Python, so this may be not the best way to achieve what I’m describing, but this works, and similar to most tree representations I’ve seen.

And we can set siblings to new Tree’s like this:

root = Tree(data = 0)
root.siblings = [Tree(data = 1),
                 Tree(data = 2)]

There are many ways to declare trees in Rust, each with its own weaknesses and strong points, such as arena tree, or boxed tree, but I’ve decided to go for more or less classic C-like reference counted tree, because arena trees suffer from an expensive remove of items in the tree, and I have not figured out how to manage parents with Box. So if we try to translate the Python tree directly to Rust, we will get this:

struct Tree<T> {
    data: T,
    siblings: Vec<Tree<T>>,
}

impl <T> Tree<T> {
    pub fn new(data: T) -> Tree<T> {
        Tree {
            data,
            siblings: vec![],
        }
    }
}

fn main() {
    let mut root = Tree::new(0);
    root.siblings = vec![
        Tree::new(1),
        Tree::new(2)
    ];
}

This is a generic data structure that can hold any type as T. We can create a Tree of u32, f64, String or any other data just by calling new with arguments of different types. For example, calling Tree::new(0) and Tree::new("0") will create trees of types Tree<i32>, and Tree<&str> respectively. Here we create the tree that holds 0 as its root value and has two sub-nodes that hold 1 and 2 values. So this is pretty much a tree already!

Though, this is not very usable, because we can traverse the tree only to the leaves, and can’t go back to the root or previous node. Let’s add the parent field to the tree. In Python it would be like this:

class Tree:
    def __init__(self, data = None, siblings = None, parent = None):
        self.siblings = siblings
        self.data = data
        self.parent = parent
    def __str__(self):
        return f'[data: {self.data}, parent: {self.parent}, siblings: {self.siblings}]'
    __repr__ = __str__

root = Tree(data = 0)
root.siblings = [Tree(data = 1, parent = root),
                 Tree(data = 2, parent = root)]

print(root)

Note: I’ve also defined a __str__(self) method that will be useful for printing the whole data structure in a recursive fashion. This is not required but helps investigate how things are working.

Running the code above will print this (formatted for more readability):

[ data: 0,
  parent: None,
  siblings: [ [ data: 1,
                parent: [ data: 0,
                          parent: None,
                          siblings: [...] ],
                siblings: None ],
              [ data: 2,
                parent: [ data: 0,
                          parent: None,
                          siblings: [...] ],
                siblings: None ] ] ]

We can see that 0 has no parent, two sub-nodes 1 and 2 each of which has a parent 0 that has [...] sub-nodes. This is a recursive data structure, but we only descend it for sake of printing, yet we’re able to ascend it from any sub-node to root as well.

So what should we add in rust in order to achieve the same result? One could use a reference, but this will fill our code with a lot of lifetimes. The other way is to use Box indirection, or reference counting. If we try to use plain Box we would fail to specify the correct parent:

// Deriving something in Rust tells the compiler to generate methods for
// our data structure.  In this case, we derive the `Clone` trait
// implementation.
#[derive(Clone)]
struct Tree<T> {
    data: T,
    parent: Option<Box<Tree<T>>>,
    siblings: Vec<Tree<T>>,
}

impl <T> Tree<T> {
    pub fn new(data: T, parent: Option<Box<Tree<T>>>) -> Tree<T> {
        Tree {
            data,
            parent,
            siblings: vec![],
        }
    }
}

fn main() {
    let mut root = Tree::new(0, None);
    root.siblings = vec![
        Tree::new(1, Some(Box::from(root.clone()))),
        Tree::new(2, Some(Box::from(root.clone())))
    ];
}

Since we have to specify a type for our parent, and we actually can have no parent (root case), I use Option type that can either hold a None or Some(T). Inside Option we have a Box, which is a pointer to the heap allocated data, and inside the Box there’s the Tree<T> itself.

The problem in this code is that we do clone of the root and create our Box pointer out of it. So if we go to the parent of the node which holds 1 we’ll arrive at the copy of the root node which has no sub-nodes. This problem can be solved if we use the reference counted pointer to the tree instead of the Box:

use std::rc::Rc;

#[derive(Clone)]
struct Tree<T> {
    data: T,
    parent: Option<Rc<Tree<T>>>,
    siblings: Vec<Rc<Tree<T>>>,
}

// -- snip --

fn main() {
    let mut root = Tree::new(0, None);
    *root.siblings = vec![
        Tree::new(1, Some(root.clone())),
        Tree::new(2, Some(root.clone())),
    ];
}

This looks like it, but there’s still a problem. If we try to run this code, we’ll get this error:

error[E0277]: the size for values of type `[std::rc::Rc<Tree<{integer}>>]` cannot be known at compilation time
  --> src/main.rs:22:5
   |
22 |     *root.siblings = &[
   |     ^^^^^^^^^^^^^^ doesn't have a size known at compile-time
   |
   = help: the trait `std::marker::Sized` is not implemented for `[std::rc::Rc<Tree<{integer}>>]`
   = note: to learn more, visit <https://doc.rust-lang.org/book/ch19-04-advanced-types.html#dynamically-sized-types-and-the-sized-trait>
   = note: the left-hand-side of an assignment must have a statically known size

This is an interesting error, and documentation suggests a solution for this, but we can fix this by using a reference. Using a plain reference in the structure definition will force us to specify lifetime everywhere, and I’m not yet good enough with lifetimes, since it’s a pretty new concept for me. Instead, let’s use RefCell:

#[derive(Clone)]
struct Tree<T> {
    data: T,
    parent: Option<Rc<RefCell<Tree<T>>>>,
    siblings: Vec<Rc<RefCell<Tree<T>>>>,
}

Oh, this compiles! Let’s print our tree! To do so, we can derive the Debug trait implementation for our structure and print it with {:?} placeholder in the println! macro. It works pretty much the same as in Python code, but is generated automatically:

use std::cell::RefCell;
use std::rc::Rc;

#[derive(Clone, Debug)]
struct Tree<T> {
    data: T,
    parent: Option<Rc<RefCell<Tree<T>>>>,
    siblings: Vec<Rc<RefCell<Tree<T>>>>,
}

// -- snip --

fn main() {
    let root = Tree::new(0, None);
    root.borrow_mut().siblings = vec![
        Tree::new(1, Some(root.clone())),
        Tree::new(2, Some(root.clone())),
    ];
    println!("{:?}", root);
}

Let’s run it:

Tree { data: 0, parent: None, siblings: [RefCell { value: Tree { data: 1, parent: Some(RefCell { value: Tree { data: 0, parent: None, siblings: [RefCell {value: Tree { data: 1, parent: ...

!!! gazillion lines !!!

... Some(RefCell { value
thread 'main' has overflowed its stack
fatal runtime error: stack overflow
Aborted (core dumped)

Whoa… This produces a stack overflow on printing. Because compiler generated Debug trait implementation is recursive. But our tree only has two sub-nodes, how could recursion blow the stack? And why we only see data: 1 then data: 0, then data: 1 again. Where’s data: 2 node?

So the problem here is that our recursive code prints the data for the root node, then go into the next recursion step on the 1 node, which contains the reference to the root, so recursion goes to the root first, and loops infinitely inside a cycle of two nodes. These are called cyclic references and those are bad for many reasons. One of those is that cyclic references will never be freed because their reference counter will never reach zero. Luckily for us, Rust provides a solution!

Instead of using Rc for parent, let’s use Weak:

use std::cell::RefCell;
use std::rc::{Rc, Weak};

#[derive(Clone, Debug)]
struct Tree<T> {
    data: T,
    parent: Option<Weak<RefCell<Tree<T>>>>,
    siblings: Vec<Rc<RefCell<Tree<T>>>>,
}

impl<T> Tree<T> {
    pub fn new(data: T, parent: Option<Weak<RefCell<Tree<T>>>>) -> Rc<RefCell<Tree<T>>> {
        Rc::from(RefCell::from(Tree {
            data,
            parent,
            siblings: vec![],
        }))
    }
}

fn main() {
    let root = Tree::new(0, None);
    root.borrow_mut().siblings = vec![
        Tree::new(1, Some(Rc::downgrade(&root))),
        Tree::new(2, Some(Rc::downgrade(&root))),
    ];
    println!("{:?}", root);
}

Code Snippet 2: Tree in Rust using Rc and Weak

Now if we run this code we get the proper tree representation:

RefCell {
    value: Tree {
        data: 0,
        parent: None,
        siblings: [
            RefCell { value:
                Tree { data: 1,
                       parent: Some((Weak)),
                       siblings: [] } },
            RefCell { value:
                Tree { data: 2,
                       parent: Some((Weak)),
                       siblings: [] } }] } }

And it looks almost exactly how we declare it in the main function, and certainly looks like what we’ve got in the Python example! However, those types are hard to type every time. We can solve this by using type:

use std::cell::RefCell;
use std::rc::{Rc, Weak};

type NodePtr<T> = Rc<RefCell<Tree<T>>>;
type WeakNodePtr<T> = Weak<RefCell<Tree<T>>>;

#[derive(Clone, Debug)]
struct Tree<T> {
    data: T,
    parent: Option<WeakNodePtr<T>>,
    siblings: Vec<NodePtr<T>>,
}

impl<T> Tree<T> {
    pub fn new(data: T, parent: Option<WeakNodePtr<T>>) -> NodePtr<T> {
        Rc::from(RefCell::from(Tree {
            data,
            parent,
            siblings: vec![],
        }))
    }
}

And here’s a simple recursive traversal for our tree, that prints it as an S-expression:

impl<T> ToString for Tree<T>
where
    T: ToString,
{
    fn to_string(&self) -> String
    where
        T: ToString,
    {
        fn build_string<T>(node: &Rc<RefCell<Tree<T>>>, string: &mut String)
        where
            T: ToString,
        {
            if !node.borrow().siblings.is_empty() {
                string.push_str("(");
            }
            string.push_str(&format!("{} ", node.borrow().data.to_string()));
            for n in node.borrow().siblings.iter() {
                build_string(n, string);
            }
            *string = string.trim().to_owned();
            if !node.borrow().siblings.is_empty() {
                string.push_str(")");
            }
            string.push_str(" ");
        };

        let mut string = format!("({} ", self.data.to_string());

        for c in self.siblings.iter() {
            build_string(c, &mut string);
        }

        string = string.trim().to_owned();
        string.push_str(")");

        string
    }
}

Now let’s try it. Remember that formula I’ve shown before? Let’s convert it to the tree:

fn main() {
    let plus = Tree::new("+", None);
    plus.borrow_mut().siblings = vec![
        Tree::new("1", Some(Rc::downgrade(&plus))),
        Tree::new("*", Some(Rc::downgrade(&plus))),
        Tree::new("/", Some(Rc::downgrade(&plus))),
    ];

    let mul = plus.borrow().siblings[1].clone();
    mul.borrow_mut().siblings = vec![
        Tree::new("2", Some(Rc::downgrade(&mul))),
        Tree::new("3", Some(Rc::downgrade(&mul))),
        Tree::new("4", Some(Rc::downgrade(&mul))),
    ];

    let div = plus.borrow().siblings[2].clone();
    div.borrow_mut().siblings = vec![
        Tree::new("9", Some(Rc::downgrade(&div))),
        Tree::new("+", Some(Rc::downgrade(&div))),
    ];

    let plus1 = div.borrow().siblings[1].clone();
    plus1.borrow_mut().siblings = vec![
        Tree::new("1", Some(Rc::downgrade(&plus1))),
        Tree::new("2", Some(Rc::downgrade(&plus1))),
    ];

    println!("{}", plus.borrow().to_string());
}

If we run this code, we’ll see this:

(+ 1 (* 2 3 4) (/ 9 (+ 1 2)))

Which directly represents our example from the beginning. This gets us to the point where we can use this data structure to represent our tree. However, making trees in such a way is impractical and tedious. I’ve implemented some more methods for the tree, such as add_child, parent, adopt_tree, replace_tree, and so forth. You can check those in the GRScheme source code repository if you want.

But in order to automate this process so we could specify an expression and get its tree representation we have to write the parser, or so-called reader, which will read input character by character, analyze it and produce a tree as the result. Reader is a crucial part of our interpreter since it serves as a translation unit between the user and the interpreter itself.

Writing `reader`

Since our tree is generic we need to define a proper data structure to hold both the data from input and additional information we may need to build the tree correctly. For that purpose let’s create another structure in reader module:

struct GRData {
    data: String,
}

Not as big as one might expect, and we could use String directly in Tree<String> fashion, but it’ll get bigger, I promise. We’ll need some extra data later on, but for now, let’s limit ourselves to String holding field named data.

So, how will we parse text? First, we need to write a function that will do the text processing:

fn parse(expression: &str) -> Result<NodePtr<GRData>, ReadError> {}

There’s a lot going on already. First, we create a function named parse, that takes one parameter expression of type &str and returns the Result which can either be NodePtr<GRData> for our tree, or ReadError. Result is a special type for functions that might fail, so we can store some error information in the Err(T) field. If the function succeeds we store the result in Ok(T). So in our case we’ll be storing error messages as ReadError, and use our GRData for Ok value. This is what I like about Rust because with Result and Option there’s no need for exceptions. We can carry our error through functions that expect that error and know how to deal with it. Now let’s think about what we’ll be parsing.

The Scheme’s syntax is quite simple - mostly everything is enclosed with parentheses, but there are some special cases as well.

The first one is comments. Comments are line-wise and written using the ; symbol, so If we’re finding that symbol in our input we can skip parsing those. With some exceptions.

The other special case is double quote ". If we see it in our input it means we’re inside the string, and we should put everything as a single piece of information until we meet another ". This also includes comments, because those can be part of a string. However, even with quotes, there’s a special case.

Escaping quotes inside strings can be done with the \ symbol like this: "this \"string\" contains a nested string". So we should keep an eye on this symbol and what’s going after it.

Now on to regular things in Scheme’s syntax.

In my implementation, all parentheses will represent the same thing, e.g. the list boundaries. However parentheses must match each other, so if the list was started with the ( it must end with ), and if it was started with [, or { it must be ended with ], or } respectively.

Quoting of items is possible with ' and `, right before the list. Unquoting can be done with ,, and ,@ before the list as well. These characters can not be used inside words except for strings.

I’ve mentioned “word” in the previous paragraph. Word is a sequence of non-reserved non-white-space characters. In our case everything that is not a space, tab, newline, semicolon, quote, double-quote, `, @, ,, escape character, all kinds of parentheses, and #, is going to be a valid word. For the purpose of a more general mechanism, strings are words as well. These are valid words in terms of our parser:

a
word
dashed-word
arrowed->word
"strings are treated as words"
slashed/word
1
123
-=.+_*^&%$!~@

These are invalid words:

paren(inside
colon,inside
unquote,@splicing-indide
semicolon;inside
quotes'inside`

Doing such parsing is nothing really special in Rust. It’s a bunch of logic around iterating over characters. The interesting part is tree construction. If you’re interested in parse source, you can look at it here, though it can be implemented better.

So the basic idea is that we parse character stream into words or items, and decide what to do with those. For example, if we meet ( we should treat it as a new list, and if we meet a as a name. For that purpose, I’ve written a function that analyses words and produces tokens. And this enum contains all tokens that we will need:

enum Token {
    Value,
    Eval,
    Apply,
    Quote { kind: &'static str },
    Symbol,
    None,
}

Value is a token for anything that is essentially a value, e.g. "4", 42, but since we’re parsing raw character data, names are treated as Values too. Don’t worry, we will not need this information when we will do the tree evaluation, so it’s safe to mix these two.

Eval is used for (, {, and [, and basically means that we will start new sub-tree here. Apply is a matching counterpart, which will say that we must go up to a parent.

Next is Quote and it is a bit more interesting because it features this kind: &'static str thing. What this essentially means is that we will hold an additional static string with quote kind in this enum field so we could later distinguish quote from quasiquote and friends. What’s also interesting is that we store unwrapped information in the tree, so we convert 'a into (quote a), and ,a to (unquote a) and that’s where kind comes in handy.

Symbol is used to ascend from special quote variants, I’ll talk about it a bit later.

And the last is None and it is used as an initial value and never used in the parser anymore.

Now to the tokenizer itself:

fn tokenize(&mut self, word: &str) -> Result<Token, ReadError> {
    let last_token = &self.last_token;

    let token = match word {
        "(" | "[" | "{" => Token::Eval,
        ")" | "]" | "}" => match last_token {
            Token::Quote { .. } => {
                return Err(ReadError::UnexpectedExpressionEnd {
                    line: self.line_num,
                    column: self.column_num,
                })
            }
            _ => Token::Apply,
        },
        "'" | "," | ",@" | "`" => Token::Quote {
            kind: match word {
                "'" => "quote",
                "," => "unquote",
                "`" => "quasiquote",
                ",@" => "unquote-splicing",
                _ => {
                    return Err(ReadError::InvalidSyntax {
                        message: format!(
                            "unexpected quote type \"{}\". line: {}, column: {}",
                            word, self.line_num, self.column_num
                        ),
                    })
                }
            },
        },
        &_ => match last_token {
            Token::Quote { .. } => Token::Symbol,
            _ => Token::Value,
        },
    };

    self.last_token = token.clone();
    Ok(token)
}

It’s pretty simple. If we receive either of the open parentheses, it’s always Eval. If we receive any of the close ones, it’s always Apply, unless the last one was Quote, e.g. '), then that’s an error. If we receive any of the quotes then it’s Quote, the error there would never happen, but Rust complains that other patterns are not handled, so it is there. And everything else is either Value or Symbol if the last one was Quote. This may not seem enough, but we can actually build a valid program tree out of this information.

So how do we manage tree structure with our Tree<T> type? First, we need to declare some methods to add things to trees in various ways. The first and the most obvious is add_child:

impl<T> Tree<T>
where
    T: Clone,
{
    pub fn new(data: T) -> NodePtr<T> {
        Rc::from(RefCell::from(Tree {
            data,
            parent: None,
            siblings: vec![],
        }))
    }

    pub fn add_child(node: &NodePtr<T>, data: T) -> NodePtr<T> {
        let new_node = Rc::from(RefCell::from(Tree {
            data,
            parent: Some(Rc::downgrade(node)),
            siblings: vec![],
        }));
        node.borrow_mut().siblings.push(new_node.clone());
        new_node
    }
}

We can see that something is going on here. This function takes two arguments - node, for a node to which we will add a child node, and data, which holds the data for a new child. Inside we create new_node to hold reference counted pointer to our new Tree, that has Some parent, no siblings and our data. After that we mutually borrow original node and push our new_node onto its list of sub-nodes. And in the end, we return the pointer to our new_node out of our function. This allows us to build trees in this fashion:

fn main() {
    let root = Tree::new(0);
    Tree::add_child(&root, 1);
    Tree::add_child(&root, 2);

    println!("{:?}", root);
}

Which is essentially the same tree as in previous example but far more compact. If we run it produces the same output (again, formatted for readability):

RefCell {
    value: Tree {
        data: 0,
        parent: None,
        siblings: [
            RefCell {
                value: Tree {
                    data: 1,
                    parent: Some((Weak)),
                    siblings: [] } },
            RefCell {
                value: Tree {
                    data: 2,
                    parent: Some((Weak)),
                    siblings: [] } }] } }

This also allows us to select to which node we want to add a child:

fn main() {
    let root = Tree::new(0);
    let one = Tree::add_child(&root, 1);
    Tree::add_child(&one, 2);

    println!("{:?}", root);
}

Which yields:

RefCell {
    value: Tree {
        data: 0,
        parent: None,
        siblings: [
            RefCell {
                value: Tree {
                    data: 1,
                    parent: Some((Weak)),
                    siblings: [
                        RefCell {
                            value: Tree {
                            data: 2,
                            parent: Some((Weak)),
                            siblings: [] } }] } }] } }

This means that as long as we keep track of the current node, we can build any tree, and if we can go up when needed, we can build complex trees without recursion in an iterative process.

So how do we add items that we’ve parsed and tokenized to the tree? First, we need to create the tree, and for that purpose, I would like to use progn as it is called in Emacs Lisp, or begin as it is called in Scheme:

fn parse(&mut self, expression: &str) -> Result<NodePtr<GRData>, ReadError> {
    let mut comment = false;
    let mut inside_word = false;
    let mut inside_string = false;
    let mut unquote = false;
    let mut paren_stack = vec![];

    let mut item = String::new();

    // we create the root of the tree which holds `(progn
    let mut tree = Tree::new(GRData::from_str("("));
    Tree::add_child(&tree, GRData::from_str("progn"));

    // here we save the tree root for later use
    let root = tree.clone();

    self.line_num = 1;
    self.column_num = 0;

    for c in expression.chars() {
        // parsing characters, producing tokens, constructing the tree
    }

    Ok(root) // after we've parsed we return the pointer to the root
}

Here we can see, that here we are defining some variables to hold parser state, such as that we’re inside a comment or string, or that we’re currently inside a word, and creating item for holding our words, and paren_stack for checking the balance of different parentheses.

After that, we create our tree, that holds (, and immediately add a child that holds progn. Then we clone tree pointer to root and do the parsing in for loop. After that, we constructed our tree and return it to the caller.

Let’s look at the part of the for loop code:

if !comment && !inside_string {
    match c {
        '(' | '[' | '{' => {
            paren_stack.push(c);
            item.push(c);
            inside_word = false;
        }
        // …  see the source for full code if you're interested
        '"' => {
            inside_string = true;
            inside_word = true;
        }
        ';' => {
            comment = true;
            continue;
        }
        ' ' | '\t' | '\n' => inside_word = false,
        _ => inside_word = true,
    }
    if inside_word {
        item.push(c);
    } else if !item.is_empty() {
        tree = self.add_to_tree(&tree, &item)?;
        item.clear();
        unquote = false;
    }
} else if inside_string {
    item.push(c);
    match c {
        '\\' => escaped = true,
        '"' => {
            if !escaped {
                inside_string = false;
                tree = self.add_to_tree(&tree, &item)?;
                item.clear();
            }
            escaped = false;
        }
        '\n' => {
            escaped = false;
            self.line_num += 1;
            self.column_num = 0;
        }
        _ => escaped = false,
    }
} else if comment && c == '\n' {
    self.line_num += 1;
    self.column_num = 0;
    comment = false;
}

If we’re not in a comment and not inside a string, we do parsing as normal. We push characters inside item to form a word until inside_word becomes false, which happens when we meet some word delimiter, such as white-space. Once we’re no longer inside word and item is not empty, we do tree = self.add_to_tree(&tree, &item)?, clear item and go further. If we’re inside string we’re basically waiting for the closing double quote, unless it is escaped with \. And if we’re inside comment, we simply wait for a newline.

So what’s this self.add_to_tree(&tree, &item)? does? Let’s see:

fn add_to_tree(&mut self, node: &NodePtr<GRData>, item: &str) -> Result<NodePtr<GRData>, ReadError> {
    match self.tokenize(item)? {
        Token::Quote { kind } => {
            let eval = Tree::add_child(node, GRData::from("(", true));
            Tree::add_child(&eval, GRData::from_str(&kind));
            Ok(eval)
        }
        Token::Eval => Ok(Tree::add_child(node, GRData::from_str("("))),
        Token::Symbol => {
            Tree::add_child(node, GRData::from_str(item));
            self.parent(node)
        }
        Token::Apply => self.parent(node),
        _ => {
            Tree::add_child(node, GRData::from_str(item));
            Ok(node.clone())
        }
    }
}

First, we get a token for the current item, then, depending on the token, we add a sub-tree, or get the parent of a node. and return the pointer to the tree for later use. This means, that if the current token was Eval, we add ( to the tree, and return the pointer to it. If the current token was Apply we return the pointer to the parent of the node.

Quote and Symbol are bit interesting. You see, since we unwrap the tree to full form, e.g. 'name is being stored as (quote name), we need to define a way of going up without parenthesis.

The problem here is that when we see this input (quote name) we have all information needed to build a tree - ( marks the root, quote is the first child, name is second, and ) marks that we have to go up. In the case of 'name, we see ' which we expand to (quote, but how do we actually go up without )? We can’t easily wrap the whole expression without implementing a look-ahead parser. That’s when Quote, and Symbol tokens are becoming useful.

When we see ' we mark that our current token is quote, and generate the (quote-kind tree. So for ' it will be (quote. for ` - (quasiquote, and so forth. Next, when we meet =symbol, it means that we have some name, that must be wrapped with a parenthesis, and we should take one extra up.

That’s why I’ve mentioned that we’ll need more info in our GRData than simply a String for our value. GRData also encodes that current node should take extra up:

struct GRData {
    data: String,
    extra_up: bool,
}

And that GRData::from("(", true) method’s second true parameter is written to this extra_up field. You can see that when we finish our work with Symbol we call self.parent(node), which actually checks if we need to take that extra one up, which is stored in its parent:

fn parent(&mut self, node: &NodePtr<GRData>) -> Result<NodePtr<GRData>, ReadError> {
    let mut current = node.clone();
    while let Some(parent) = Tree::parent(&current) {
        if parent.borrow().data.extra_up {
            current = parent;
            current.borrow_mut().data.extra_up = false;
        } else {
            return Ok(parent);
        }
    }
    Err(ReadError::UnexpectedExpressionEnd {
        line: self.line_num,
        column: self.column_num,
    })
}

This way it works both for lists and symbols because the quote’s parenthesis always has this extra up. Because of this we do not actually need a look-ahead parser, and can correctly wrap all different cases of quote short-hands with appropriate parentheses. E.g. `(a ,b 'c) transforms into (quasiquote (a (unquote b) (quote c))), thus making shorthand for various quotes just a syntax sugar.

To sum this up, let’s see how the interpreter sees this expression (over-complicated for observation reasons):

(define list (lambda x `(,@x)))

Assuming that Current is a pointer to current Node, and Tree is holding (, and that we always operating on Current, when getting parent, or adding new nodes, we can view what’s happening as this table:

Input	Token	Action To Perform
`(`	`Apply`	add `Tree`, `Current` ← `Tree`
`define`	`Value`	add `define` to `Current`
`list`	`Value`	add `list` to `Current`
`(`	`Apply`	add `Tree`, `Current` ← `Tree`
`lambda`	`Value`	add `lambda` to `Current`
`x`	`Value`	add `x` to `Current`
`	`Quote`	add `Tree`, `Current` ← `Tree`, add `quasiquote`
`(`	`Apply`	add `Tree`, `Current` ← `Tree`
`,@`	`Quote`	add `Tree`, `Current` ← `Tree`, add `unquote-splicing`
`x`	`Value`	add `x` to `Current`
`)`	`Apply`	get `Parent`, `Current` ← `Parent`
`)`	`Apply`	get `Parent`, `Current` ← `Parent`
`)`	`Apply`	get `Parent`, `Current` ← `Parent`
END	END	END

You can notice that we do not have the final ) here, because the first ( was not the part of the tree. Yes, since the parse function wraps expressions with progn we could close it in the very end, but actually, we don’t have to. Because we do not store ) in a tree at all, those only serve as a marker that we should get the parent of the current expression, and only ( is part of the tree. To illustrate:

The expression above can be simplified down to (define list (lambda x x)), but I wanted to show how we deal with different quotes, while also having an expression that works as builtin list.

This conducts how reader is done in GRScheme, though I did not go into much depth and details, noting only important parts. For instance, our GRData will grow further when we will be discussing evaluation. All code can be found in GRScheme source repository.

Pairs in GRScheme

Although I want to point out some thoughts about cons-cells and this data structure, we’ve been talking about. Because the resulting underlying data structure of GRScheme is a tree (which hopefully will evolve into a graph), and not a binary tree there is only one possibility of having pairs, or cons-cells, which is emulation through syntax.

The first version of GRScheme supported this via the dot syntax, which is a usual syntax for pairs in Scheme: '(1 . 2). This was fully covered in previous post in detail, so here I would only say, that I’ve decided to drop pairs in favor of direct equality of code and underlying data structures because that’s what we love about Lisp (or at least I do).

This also means that there’s no real need for cons, because the underlying data structure is a tree of a vector of trees, and not a singly-linked list, which means that we can push trees to any side of the vector, and iterate over those without car and cdr. These procedures are still very usable, but their behavior can be altered to a certain extent to match the data structure a bit better.

For example cons will work as in Clojure, concatenating leaf with the tree, producing a new tree:

> (cons '(1) '(2 3))
'((1) 2 3)

However if cons is used to produce a pair, an error is signaled:

> (cons 1 2)
; error

In complement to cons an append tree may be added:

> (append '(1) '(2 3))
'(1 (2 3))
> (append '(1) 2)
'(1 2)
> (append 1 2)
; error

car and cdr will remain, but can be described through another procedures, such as get and slice:

> (car '(1 2 3 4))
1
> (cdr '(1 2 3 4))
'(2 3 4)
> (get 2 '(1 2 3 4))
3
> (slice 1 2 '(1 2 3 4))
'(2 3)

Where car may be a shorthand for (get 0 list) and cdr is (slice 1 -1 list) or something like that (I’m not sure about -1 part).

In the next post, I will write more about procedures, and their application to this data structure, which will touch on such topics, as scope, closures, lambdas, and iterative tree traversal.

Comment via email

GRScheme Design Part 1 - Parsing

Creating the Tree in Rust

Writing reader

Pairs in GRScheme

Creating the `Tree` in Rust

Writing `reader`