Andrey Listopadov

Clojure on Fennel part four: Parsing (again)

Other than that, the parsing is complete, and we can look at the compiler part of the ClojureFnl project. But that’s gonna be in the next post.

Sike!

Wasn’t my intention to fake out like this again, but while working on the compiler, I had an important realization that led me to redesign the parser completely. And it’s a bit of a shame, because I already had a decent part of the compiler working, being able to run the REPL and do various cool things with it. But, better sooner than later, I guess.

I had a good amount of the post about the compiler already written too, but now I’ll have to discard all of that. So instead, I decided to talk about the new parser by explaining why the old one failed me. And to do so, I’ll give you a brief look on how the compiler currently works with parsed code.

Old parser and compiler

So, in the previous post, we’ve reached a point where the parser can read code into a tagged tree. I spent a fair amount of time discussing how I wanted this specifically, and now it feels like I’m backpedalling, but tagged tree is not the way forward. Here’s the idea.

Currently, we’re reading this expression (+ 1 2) into:

[:code
 [:list
  [:symbol "+"]
  [:whitespace " "]
  [:number "1"]
  [:whitespace " "]
  [:number "2"]]]

The compiler then walks this tree, taking the first value of each node to decide what to do. First, it sees :code - that’s just an entry point, containing multiple top-level expressions. Next, it sees :list and dispatches to a function dedicated to compiling lists.

This function checks the first element of the list, sees that it is a :symbol, and checks whether this symbol is somehow special. While + doesn’t look that special, it is to the compiler, because Clojure’s + and Fennel’s + are not the same thing. In Clojure, + is a function, in Fennel, it is a special.

So the compiler replaces + with clojure_core_ns.add, and then proceeds over the rest of the forms in the list, recursively calling itself over each. In the end, we get this, written into a string builder:

(clojure_core_ns.add 1 2)

This is a somewhat simplified explanation, because I’m omitting scope resolution, symbol shadowing, and other such things that the compiler currently tracks. For actual specials, like Clojure’s let*, the story is a bit different too. The compiler has a dedicated compile-special function for all Clojure specials provided by cljlib as Fennel macros. And that’s where things started to go haywire.

Look at this code:

(def ^:private foo 42)

It’s read like this:

[:list
 [:symbol "def"]
 [:whitespace " "]
 [:metadata
  [:metadata-entry [:keyword ":private"]]
  [:whitespace " "]
  [:symbol "foo"]]
 [:whitespace " "]
 [:number "42"]]

Here’s a problem - the parser reads metadata ^:private, and metadata in Clojure is attached to symbols, so the grammar I use reads the next value after the metadata and wraps it into a single node:

[:metadata
 [:metadata-entry [:keyword ":private"]]
 [:whitespace " "]
 [:symbol "foo"]]

The compiler, while compiling this list, sees def as the first symbol and enters compile-special.def. But def expects a symbol to bind the value to, while here we have a different node type, called :metadata. So my compiler had to account for all cases where metadata can appear - and it wasn’t easy to do.

I tried to side-step this by always expecting metadata, because it can appear almost anywhere, and discarding it, because Fennel’s concept of metadata is a bit different. In which I mostly succeeded. But this wasn’t the only problematic thing to support.

Here’s another example, now from Edamame:

(defn- read-token
  "Read in a single logical token from the reader"
  ^String [#?(:clj rdr :cljs ^not-native rdr :cljr rdr) _kind initch]
  (loop [sb #?(:clj (StringBuilder.)
               :cljs (StringBuffer.)
               :cljr (StringBuilder.))
         ch initch]
    (if (or (whitespace? ch)
            (macro-terminating? ch)
            (nil? ch))
      (do (when ch
            (r/unread rdr ch))
          (str sb))
      (recur #?(:clj (.append sb ch) :cljs (.append sb ch) :cljr (.Append sb (str ch))) (r/read-char rdr)))))

It reads into this tagged tree:

[:list
 [:symbol "defn-"]
 [:whitespace " "]
 [:symbol "read-token"]
 [:whitespace "\n  "]
 [:string "\"Read in a single logical token from the reader\""]
 [:whitespace "\n  "]
 [:metadata
  [:metadata-entry [:symbol "String"]]
  [:whitespace " "]
  [:vector
   [:conditional
    [:list
     [:keyword ":clj"]
     [:whitespace " "]
     [:symbol "rdr"]
     [:whitespace " "]
     [:keyword ":cljs"]
     [:whitespace " "]
     [:metadata
      [:metadata-entry [:symbol "not-native"]]
      [:whitespace " "]
      [:symbol "rdr"]]
     [:whitespace " "]
     [:keyword ":cljr"]
     [:whitespace " "]
     [:symbol "rdr"]]]
   [:whitespace " "]
   [:symbol "_kind"]
   [:whitespace " "]
   [:symbol "initch"]]]
 [:whitespace "\n  "]
 [:list
  [:symbol "loop"]
  [:whitespace " "]
  [:vector
   [:symbol "sb"]
   [:whitespace " "]
   [:conditional
    [:list
     [:keyword ":clj"]
     [:whitespace " "]
     [:list [:symbol "StringBuilder."]]
     [:whitespace "\n               "]
     [:keyword ":cljs"]
     [:whitespace " "]
     [:list [:symbol "StringBuffer."]]
     [:whitespace "\n               "]
     [:keyword ":cljr"]
     [:whitespace " "]
     [:list [:symbol "StringBuilder."]]]]
   [:whitespace "\n         "]
   [:symbol "ch"]
   [:whitespace " "]
   [:symbol "initch"]]
  [:whitespace "\n    "]
  [:list
   [:symbol "if"]
   [:whitespace " "]
   [:list
    [:symbol "or"]
    [:whitespace " "]
    [:list [:symbol "whitespace?"] [:whitespace " "] [:symbol "ch"]]
    [:whitespace "\n            "]
    [:list
     [:symbol "macro-terminating?"]
     [:whitespace " "]
     [:symbol "ch"]]
    [:whitespace "\n            "]
    [:list [:symbol "nil?"] [:whitespace " "] [:symbol "ch"]]]
   [:whitespace "\n      "]
   [:list
    [:symbol "do"]
    [:whitespace " "]
    [:list
     [:symbol "when"]
     [:whitespace " "]
     [:symbol "ch"]
     [:whitespace "\n            "]
     [:list
      [:symbol "r/unread"]
      [:whitespace " "]
      [:symbol "rdr"]
      [:whitespace " "]
      [:symbol "ch"]]]
    [:whitespace "\n          "]
    [:list [:symbol "str"] [:whitespace " "] [:symbol "sb"]]]
   [:whitespace "\n      "]
   [:list
    [:symbol "recur"]
    [:whitespace " "]
    [:conditional
     [:list
      [:keyword ":clj"]
      [:whitespace " "]
      [:list
       [:symbol ".append"]
       [:whitespace " "]
       [:symbol "sb"]
       [:whitespace " "]
       [:symbol "ch"]]
      [:whitespace " "]
      [:keyword ":cljs"]
      [:whitespace " "]
      [:list
       [:symbol ".append"]
       [:whitespace " "]
       [:symbol "sb"]
       [:whitespace " "]
       [:symbol "ch"]]
      [:whitespace " "]
      [:keyword ":cljr"]
      [:whitespace " "]
      [:list
       [:symbol ".Append"]
       [:whitespace " "]
       [:symbol "sb"]
       [:whitespace " "]
       [:list [:symbol "str"] [:whitespace " "] [:symbol "ch"]]]]]
    [:whitespace " "]
    [:list [:symbol "r/read-char"] [:whitespace " "] [:symbol "rdr"]]]]]]

Yes, it’s abysmal, but bear with me. I had to work with this, after all, thinking that this is a blessing.

The compiler sees defn- and enters the compile-special.defn- function. Defn expects a function name, then a vector for its arguments. Here, instead of the vector, we have metadata node again. This is fine, since I just mentioned above that I managed to sidestep this problem.

Since this is an arglist, we need to compile it in a special way, adding its symbols to the function scope, etc. However, instead of arguments, we see the conditional node:

[:vector
 [:conditional
  [:list
   [:keyword ":clj"] [:whitespace " "] [:symbol "rdr"] [:whitespace " "]
   [:keyword ":cljs"] [:whitespace " "] [:metadata [:metadata-entry [:symbol "not-native"]] [:whitespace " "] [:symbol "rdr"]] [:whitespace " "]
   [:keyword ":cljr"] [:whitespace " "] [:symbol "rdr"]]]
 [:whitespace " "]
 [:symbol "_kind"]
 [:whitespace " "]
 [:symbol "initch"]]

I didn’t think it was allowed in Clojure to use conditional reading inside forms like this. But apparently it’s OK, and my compiler failed to deal with it.

So now, any node can not only be a metadata node, but also a conditional node. This complicates things, but at this point I’m still thinking that I can persevere.

So I did. I added support for almost all specials provided by cljlib. The main thing that was left to do were macros. And then it hit me:

I can’t do macros like that!

Why? Why, of course, because macros don’t emit tagged trees that my compiler understands. They emit code!

Here’s a simple macro:

(defmacro unless [test & body]
  `(when (not ~test)
     ~@body))

We can parse it:

[:list
 [:symbol "defmacro"] [:whitespace " "] [:symbol "unless"] [:whitespace " "]
 [:vector
  [:symbol "test"] [:whitespace " "] [:symbol "&"] [:whitespace " "] [:symbol "body"]]
 [:whitespace " "]
 [:backtick
  [:list
   [:symbol "when"] [:whitespace " "]
   [:list [:symbol "not"] [:whitespace " "] [:unquote [:symbol "test"]]]
   [:whitespace " "] [:unquote-splicing [:symbol "body"]]]]]

We can probably even compile it to something that we could then call during the compiler step. However, what will (unless true (println 42)) thing return?

(when (not true) (println 42)), of course. It’s a Lisp macro, what else did you expect?

But what does the compiler expect?

[:list
 [:symbol "when"]
 [:whitespace " "]
 [:list [:symbol "not"] [:whitespace " "] [:symbol "true"]]
 [:whitespace " "]
 [:list [:symbol "println"] [:whitespace " "] [:number "42"]]]

Oh, it wants this.

I have to convert macro’s output into a string, parse it, and feed it into the compiler if I want macros to work at all. And that’s BAD. If only I realized this sooner!

This was the final nail in the coffin of the tagged tree approach for my project. I knew I had to rewrite the parser to emit actual data structures that I’ll be able to emit from macros as well, and compile them.

New parser reader

So I decided that I need a proper lisp reader that will produce data structures:

;; Welcome to Fennel Proto REPL 0.6.4-dev
;; Fennel version: 1.7.0-dev
;; Lua version: PUC Lua 5.5
;; Work directory: ~/Projects/fennel/ClojureFnl/
>> (local reader (require :impl.reader))
nil
>> (reader.read-string "(def ^:private foo 42)")
(def foo 42)
>> (local {: meta : second} (require :clojure.core))
nil
>> (meta (second (reader.read-string "(def ^:private foo 42)")))
{:private true}

As can be seen, the new reader module provides the read-string function that produces data structures. Notably, there are no longer any metadata nodes - metadata is assigned to the symbol foo in this case.

Same goes for the reader conditionals:

>> (reader.read-string "[0 #?(:clj 1 :cljfnl 2) #?@(:clj [2 3] :cljfnl [3 4]) 5]")
[0 2 3 4 5]

These are now correctly spliced at read time.

This also solved macro problems for the most part:

>> (reader.read-string "`(+ ~x ~@y ~z)")
(clojure.core/seq
 (clojure.core/concat
  (clojure.core/list (quote clojure.core/+))
  (clojure.core/list x)
  y
  (clojure.core/list z)))

And it matches what Clojure itself does:

Clojure 1.12.5
user=> (read-string "`(+ ~x ~@y ~z)")
(clojure.core/seq
 (clojure.core/concat
  (clojure.core/list (quote clojure.core/+))
  (clojure.core/list x)
  y
  (clojure.core/list z)))

Due to this change, the compiler doesn’t have to know about syntax quote (`) at all, and thus macros will correctly return these kinds of lists.

So, now we can read Clojure code into data structures. Previously, the parser returned [:code A B ...] kind of result, where A, B, and the rest are all top-level forms. New reader also does this, but returns them as a list:

>> (reader.read-string-all "1 \"str\" :keyword ::namespaced #:map{:key :val}")
(1 "str" :keyword :user/namespaced {:map/key :val})

However, there’s a problem.

Consider the ::namespaced keyword. Both my reader and Clojure read it as :user/namespaced, but this is because the default namespace in the REPL is user. For my parser, it’s mostly an arbitrary choice because it doesn’t know anything about runtime.

Clojure 1.12.5
user=> (read-string "::x")
:user/x
user=> (ns foo)
nil
foo=> (read-string "::x")
:foo/x

And when compiling a file, the file might have an ns declaration, or even multiple of them:

(ns foo)

(println ::x)

(ns bar)

(println ::x)

If we were to run this code, we would see :foo/x then :bar/x.

My parser currently reads all input, meaning it transforms all top-level forms from text to data. However, this also means that it doesn’t understand anything about namespaces.

Previously, this was fine because the compiler handled this - the tagged tree didn’t resolve anything. Currently, however, we need to construct namespaced keys, and we need to know the namespace. But it’s not possible unless we parse the input expression by expression, instead of all at once.

Hence, I had to update the grammar so it could parse expression by expression. This way, I could read source code form by form, and compile each form separately. And if the compiler encounters an ns declaration, it could update some state, so the reader would know how to read namespaced keywords, and other things that may include namespaces.

But we still need to pass this state to the reader. And my reader is based on a PEG grammar.

Luckily for me, the lpeg library I use has a Carg function that can pass additional arguments into the PEG parser. This way I can write my transformation function T:

(fn T [tag patt]
  (/ (* (Ct (* (Cc tag) patt)) (Carg 1))
     (fn [node state]
       (case node
         ;; ----8<----
         [:macro-keyword bo data]
         (read-macro-keyword data bo state)
         [:conditional data]
         (read-conditional data state)
         [:conditional-splicing data]
         (conditional-splicing data state)
         ;; ----8<----
         _ _))))

This state can be passed anew after reading each expression, or mutated in place - either way, the reader now knows how to access state. The reader itself is still stateless.

The reader now supports all Clojure syntax and produces data structures:

1                                                   ;; 1
0.5                                                 ;; 0.5
1e10                                                ;; 10000000000.0
1/2                                                 ;; 1/2
1N                                                  ;; 1N
1.33333333333333333333333333333333M                 ;; 1.33333333333333333333333333333333M
0xFF                                                ;; 255
017                                                 ;; 15
16rFF                                               ;; 255
\c                                                  ;; "c"
"string"                                            ;; "string"
\u0041                                              ;; "A"
\o101                                               ;; "A"
:keyword                                            ;; :keyword
:namespaced/keyword                                 ;; :namespaced/keyword
::auto-keyword                                      ;; :user/auto-keyword
symbol                                              ;; symbol
namespaced/symbol                                   ;; namespaced/symbol
#'var                                               ;; (var var)
'quoted-symbol                                      ;; (quote quoted-symbol)
(list)                                              ;; (list)
[vector]                                            ;; [vector]
{hash map}                                          ;; {hash map}
#:namespaced{:map 42}                               ;; {:namespaced/map 42}
#::{:auto :namespaced}                              ;; {:user/auto :namespaced}
#{hash set}                                         ;; #{set hash}
#"regexp"                                           ;; "regexp"
@dereferencing                                      ;; (clojure.core/deref dereferencing)
#(+ %1 %2)                                          ;; (fn* [p__0__ p__1__] (+ p__0__ p__1__))
^:meta data                                         ;; data
#_discard                                           ;;
[#?(:cljfnl "reader") #?@(:cljfnl [:conditionals])] ;; ["reader" :conditionals]
nil                                                 ;; nil
true                                                ;; true
false                                               ;; false
##NaN                                               ;; .nan
##Inf                                               ;; .inf
#inst "2022"                                        ;; #<inst: 2022-01-01T00:00:00.000-00:00>
#uuid "c6e8050a-789b-4305-b518-8f0f7c31da7a"        ;; "c6e8050a-789b-4305-b518-8f0f7c31da7a"

Note about data structures: even though we’re working in Fennel, the produced maps, vectors and lists are custom data structures that are implemented in the cljlib library. These are implementations of persistent data structures I’ve talked about in part 2.

Further work

With that, I can work on the compiler, now for real. It’s a shame that I basically have to do all the work on the compiler again from scratch, because the underlying data has changed so much, but this simplifies a lot of things, so I’m OK with that.

One thing that the reader still doesn’t support yet is read-time evaluation with #=expr. This requires a working compiler, and I’ll have to first implement it, then integrate it back into the reader.

Another thing I was thinking about was to drop the LPEG parser altogether. Maybe, when I have the compiler working and have support for all Clojure runtime semantics in place, I’ll make a fork of the Edamame parser and replace the current reader with it, adding support for ClojureFnl via reader conditionals. This will remove a C library dependency, which is a good thing for distribution. I still have another C library for arbitrary-precision numbers, but it can be worked around.

But, lesson learned - when implementing a lisp, even if it is just a transpiler, don’t try to cut corners, and make a proper reader.

Next post, for sure, will be about the compiler!