Andrey Listopadov

Clojure on Fennel part three: parsing

The two previous posts were not related to the compiler itself, but were kicked off by the start of the compiler development. I’d say this project was the reason that I made proper immutable data structures for Fennel and Lua. But now we’re finally going to talk about the compiler itself! And we’ll start with the parsing stage - transforming Clojure code into something that can be operated on by the compiler.

Manual single-pass parser and compiler

When I started this project, as I usually do, I underestimated the complexity at hand. My first attempt was a single-pass compiler that accepted a stream of Clojure source characters and compiled it to Fennel on the fly.

My idea was to write a simple recursive-descent parser that, upon reading a character, would decide how to translate it to Fennel, assuming my fennel-cljlib library handled most of the semantics. So, for example, when encountering [ it would translate it to (cljlib.vector, and thus [1 2 3] would descend into compile-vector and expand to (cljlib.vector 1 2 3).
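To illustrate the single-pass idea, here's a deliberately minimal sketch in Python (the actual compiler is written in Fennel; the `translate` function and its structure are my own invention for this illustration). The point is the shape: dispatch on the current character and emit output immediately, without building a tree:

```python
# Illustrative sketch of a single-pass, character-driven translator.
# It dispatches on the first character of a form and emits target
# code right away, without building an intermediate tree.
def translate(src: str) -> str:
    pos = 0
    out = []

    def emit_form():
        nonlocal pos
        if src[pos] == '[':          # vector literal -> constructor call
            pos += 1
            out.append('(cljlib.vector')
            while src[pos] != ']':
                if src[pos].isspace():
                    pos += 1
                else:
                    out.append(' ')
                    emit_form()
            pos += 1                 # consume ']'
            out.append(')')
        else:                        # atoms pass through unchanged
            start = pos
            while pos < len(src) and not src[pos].isspace() and src[pos] != ']':
                pos += 1
            out.append(src[start:pos])

    emit_form()
    return ''.join(out)

print(translate("[1 [2] 3]"))  # -> (cljlib.vector 1 (cljlib.vector 2) 3)
```

Extending this with more bracket types and atom kinds is straightforward; extending it with reader macros is where it gets painful, as described next.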

The idea is simple, and at first I thought it would suffice, but then I remembered that there are a lot of other things in Clojure I have to consider. For example, #_ - the ignore form syntax, and ^foo - the metadata syntax. This complicated things a lot, so I added initial support for them and moved on - because I decided that this would be a bootstrap compiler, used only to compile a proper Clojure parser.

Clojure parsers

After some time I had a compiler that could compile most of the arbitrary code I threw at it. Of course, it was not exactly to the Clojure spec, but it worked and I was going to replace it anyway. So I started looking at available Clojure parsers written in Clojure.

Edamame

I started with Edamame. The choice was based on the fact that the Squint project uses it and my project is similar to Squint, or so I thought. This parser is also used in SCI - a Clojure interpreter, so it seemed like a good fit.

The first thing I did was make sure that Edamame at least compiles with my bootstrap compiler. The fact that it compiled didn't mean that it worked - it was just translated into Fennel, with bits of cljlib here and there. So the next step was to make it work.

But then I had a realization: I didn't need to do that at all - I could just parse Edamame with Edamame, and write a compiler against its output! Meaning, I could use Edamame's parser directly until my compiler was complete enough to compile Edamame itself. And since my idea was to replace the bootstrap compiler in the end anyway, I decided that it didn't make much sense to polish it enough to fully compile Edamame into runnable Fennel code. So I used Edamame to parse Edamame, and got back…

… the same Edamame source.

Imagine my face when I realized that Edamame doesn't produce an abstract syntax tree with all of the information, but instead just produces Clojure data structures, such as lists, vectors, and so on. All of the information about the source code sits inside the metadata of these data structures, not in the output itself. Yes, it does transform some Clojure forms into lists - @foo, for instance, becomes (deref foo) - but that still doesn't help me, because Fennel has no list data type, and its support for metadata is nowhere near compatible with Clojure's.

I get it - I should have read the readme more carefully and not made assumptions based on wild guesses, so it's fine. Still, I wasn't ready to drop the idea of combining an existing Clojure parser with the bootstrap compiler. I just needed a different parser.

Rewrite-clj

The next one I looked at was rewrite-clj.

This library does return a tree of objects. It’s still a list, but at least it contains nodes that can be translated into Fennel objects. However, it still had a few problems.

First, its source code is much bigger than Edamame's. Its API is more complicated, and it expects the use of zippers, which I don't yet have in cljlib. I didn't want such a big parser for my compiler, given that its job is quite simple - produce an AST that I will always walk in full.

Second, it’s still producing some Clojure data structures that I can’t cleanly map to Fennel, so it would mean that the compiler itself would depend on cljlib. And I didn’t want that.

Luckily for me, their user guide mentioned another parser.

Parcera

Finally, parcera.

Looking at the readme, I finally saw what I wanted to see:

(ns example.core
  (:require [parcera.core :as parcera]))

;;parse clojure code from a string
(parcera/ast (str '(ns parcera.core
                     (:require [clojure.data :as data]
                               [clojure.string :as str]))))

;; => returns a data structure with the result from the parser
(:code
 (:list
  (:symbol "ns")
  (:whitespace " ")
  (:symbol "parcera.core")
  (:whitespace " ")
  (:list
   (:keyword ":require")
   (:whitespace " ")
   (:vector (:symbol "clojure.data")
            (:whitespace " ")
            (:keyword ":as")
            (:whitespace " ")
            (:symbol "data"))
   (:whitespace " ")
   (:vector (:symbol "clojure.string")
            (:whitespace " ")
            (:keyword ":as")
            (:whitespace " ")
            (:symbol "str")))))

A tagged tree. Yes, it's still a list, but at least it can be easily mapped to Fennel's sequential tables. So I opened the code and…

…there’s nothing there! Turns out, this project uses an ANTLR grammar for parsing Clojure.

At this point I was quite frustrated with the fact that there’s no single Clojure parser that could provide something similar to what parcera provides, but written in pure Clojure. Or at least in Java, which I could hand-translate as a last resort. Yes, there’s a Clojure parser inside the Clojure project itself, written in Java, but my goal was to recompile an existing parser written in Clojure.

Grammar-based parser

After a bit of consideration, I decided that a grammar-based approach is not that bad of an idea after all. I couldn’t use the ANTLR grammar directly, but at least I had it, and it seemed complete enough.

In Lua there’s a great library called LPeg that can generate parsers at runtime from a PEG grammar, and I had wanted to try it out for quite some time. But to do that I needed a PEG grammar for Clojure. ANTLR and PEG are quite different, but it’s not impossible to translate one grammar into the other.

So that’s what I did:

# Clojure PEG Grammar
#
# Adapted from https://github.com/carocad/parcera (ANTLR4 grammar v0.11.6)
#
# Key differences:
#
# - Unlike ANTLR, PEG choices are ordered (/).
#   More specific alternatives need to come first.
# - No separate lexer phase.
#   All rules operate on characters directly.
# - ANTLR fragment rules are prefixed with '_'.
# - ANTLR's SENTINEL (catch-all for invalid tokens) has no direct PEG equivalent.
#   The parser will silently fail on invalid input.

# Parser rules

code        <- input* !.

input       <- ignore / form

ignore      <- whitespace / comment / discard

form        <- literal / collection / reader_macro

# sets and namespaced maps are under dispatch (they start with #)
collection  <- list / vector / map

list        <- '(' input* ')'

vector      <- '[' input* ']'

map         <- '{' input* '}'

# macro_keyword before keyword ('::' is longer than ':')
literal     <- macro_keyword / keyword / string / number / character / symbol

keyword       <- KEYWORD
macro_keyword <- MACRO_KEYWORD
string        <- STRING

# ordering: most-specific prefix first; LONG last (least specific)
number        <- HEXADECIMAL / OCTAL / RADIX / DOUBLE / RATIO / LONG

# NAMED_CHAR first (longest); UNICODE before UNICODE_CHAR (more specific)
character     <- NAMED_CHAR / OCTAL_CHAR / UNICODE / UNICODE_CHAR

symbol        <- SYMBOL

# unquote_splicing before unquote ('~@' vs '~')
reader_macro <- unquote_splicing
              / unquote
              / metadata
              / backtick
              / quote
              / dispatch
              / deref

metadata <- ((metadata_entry / deprecated_metadata_entry) ignore*)+
            ( symbol
            / collection
            / set
            / namespaced_map
            / tag
            / fn
            / unquote_splicing
            / unquote
            / conditional_splicing
            / conditional
            / deref
            / quote
            / backtick
            / var_quote
            )

metadata_entry            <- '^' ignore* (map / symbol / string / keyword / macro_keyword / conditional)

# #^ is deprecated syntax for metadata
deprecated_metadata_entry <- '#^' ignore* (map / symbol / string / keyword / macro_keyword / conditional)

backtick          <- '`' ignore* form

quote             <- "'" ignore* form

# negative lookahead prevents '~' from matching when '~@' follows
unquote           <- '~' !'@' ignore* form

unquote_splicing  <- '~@' ignore* form

deref             <- '@' ignore* form

# conditional_splicing before conditional ('#?@' vs '#?')
# tag last (most general: '#' + symbol)
dispatch <- conditional_splicing
          / conditional
          / set
          / namespaced_map
          / fn
          / regex
          / var_quote
          / symbolic
          / eval
          / tag

# no whitespace allowed between '#' and the delimiter
fn                    <- '#' list

regex                 <- '#' STRING

set                   <- '#{' input* '}'

namespaced_map        <- '#' (macro_keyword / keyword / auto_resolve) ignore* map

auto_resolve          <- '::'

var_quote             <- "#'" ignore* form

discard               <- '#_' ignore* form

tag                   <- '#' symbol ignore* form

conditional           <- '#?' whitespace* list

conditional_splicing  <- '#?@' whitespace* list

# ##Inf, ##-Inf, ##NaN  (allows arbitrary symbolic values)
symbolic              <- '##' ignore* SYMBOL

eval                  <- '#=' ignore* (symbol / list / conditional)

whitespace <- WHITESPACE

comment    <- COMMENT


# Lexical rules (terminals)

HEXADECIMAL <- _SIGN? '0' [xX] [0-9A-Fa-f]+ _BIG_INT?

OCTAL       <- _SIGN? '0' [0-7]+ _BIG_INT?

# base 2-36, then r/R, then digits in that base;
# two-digit bases must come first - PEG's ordered choice commits to
# [2-9] once it matches and won't retry when [rR] then fails
RADIX       <- _SIGN? ([12] [0-9] / '3' [0-6] / [2-9]) [rR] [0-9a-zA-Z]+

RATIO       <- _SIGN? _DIGIT+ '/' _DIGIT+

LONG        <- _SIGN? _DECIMAL _BIG_INT?

# FRACTION EXPONENT before FRACTION alone (longer match first)
DOUBLE      <- _SIGN? _DECIMAL+ ((_FRACTION _EXPONENT / _FRACTION / _EXPONENT) 'M'? / 'M')

_BIG_INT  <- 'N'
_FRACTION <- '.' _DIGIT*
_EXPONENT <- [eE] _SIGN? _DIGIT+
_DECIMAL  <- '0' / [1-9] _DIGIT*


STRING <- '"' ([^"\\] / '\\' .)* '"'


# \p{White_Space} approximation
WHITESPACE <- [ \t\n\r\f,]+

COMMENT <- (';' / '#!') [^\r\n]*


# Named characters must not be followed by name-chars (word boundary)
NAMED_CHAR   <- '\\' ('newline' / 'return' / 'space' / 'tab' / 'formfeed' / 'backspace')
                !(_ALLOWED_NAME_CHARACTER / _DIGIT)

UNICODE      <- '\\' 'u' [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F]
                !(_ALLOWED_NAME_CHARACTER / _DIGIT)

# octal char: 0-377; longest alternative first
OCTAL_CHAR   <- '\\' 'o' ([0-3] [0-7] [0-7] / [0-7] [0-7] / [0-7])
                !(_ALLOWED_NAME_CHARACTER / _DIGIT)

# any single non-combining-mark character after backslash
UNICODE_CHAR <- '\\' ![\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF] .
                !(_ALLOWED_NAME_CHARACTER / _DIGIT)


# ::/ is NOT a valid macro keyword
MACRO_KEYWORD <- '::' (_SIMPLE_KEYWORD '/')? _SIMPLE_KEYWORD

KEYWORD       <- ':' (_SIMPLE_KEYWORD '/')? (_SIMPLE_KEYWORD / '/')

_SIMPLE_KEYWORD <- _KEYWORD_HEAD _KEYWORD_BODY+
                 / _KEYWORD_HEAD

_KEYWORD_BODY <- _KEYWORD_HEAD / ':'

_KEYWORD_HEAD <- _ALLOWED_NAME_CHARACTER / _DIGIT / [#'] / _SIGN


SYMBOL <- (_SIMPLE_SYMBOL '/')? (_SIMPLE_SYMBOL / '/')

# longer alternatives first for correct PEG matching
_SIMPLE_SYMBOL <- _SIGN _SYMBOL_HEAD _SYMBOL_BODY*
                / _ALLOWED_NAME_CHARACTER (_SYMBOL_HEAD / _DIGIT / ':') _SYMBOL_BODY*
                / _ALLOWED_NAME_CHARACTER
                / _SIGN

_SYMBOL_BODY <- _SYMBOL_HEAD / _DIGIT / ':'

_SYMBOL_HEAD <- _ALLOWED_NAME_CHARACTER / [#'] / _SIGN


# Characters allowed in names: everything EXCEPT reserved characters.
# Equivalent to ANTLR ~[\p{White_Space},()[\]{}"@~^;`\\:#'/0-9+-]
# Note: \p{White_Space} approximated with common whitespace chars here.
_ALLOWED_NAME_CHARACTER <- ![ \t\n\r\f,()\[\]{}"@~^;`\\:#'/0-9+\-] .

_SIGN  <- [+\-]

_DIGIT <- [0-9]

But of course, you can’t just load a PEG grammar into LPeg. Instead, you need to build the grammar programmatically, using LPeg’s API. I’ll spare you the details, but here’s the full source code of the parser.
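For readers unfamiliar with the pattern-combinator style of LPeg's API, here's a rough Python analogy of what "building a grammar programmatically" means. The combinator names (`lit`, `seq`, `choice`) are invented for the illustration and are not LPeg's API; each parser returns the new position on success, or None on failure:

```python
# Rough analogy of composing a grammar with an API instead of a
# grammar string, the way LPeg composes patterns. These combinators
# are invented for illustration: each takes (source, position) and
# returns the new position on success, or None on failure.
def lit(s):
    return lambda src, i: i + len(s) if src.startswith(s, i) else None

def seq(*ps):
    def parse(src, i):
        for p in ps:
            i = p(src, i)
            if i is None:
                return None
        return i
    return parse

def choice(*ps):                      # ordered choice: first match wins
    def parse(src, i):
        for p in ps:
            j = p(src, i)
            if j is not None:
                return j
        return None
    return parse

# '~@' must be tried before '~', mirroring the ordered choices in the
# grammar above ('form' stands in for a real form parser here).
unquote = choice(seq(lit('~@'), lit('form')),
                 seq(lit('~'), lit('form')))

print(unquote('~@form', 0))  # -> 6
print(unquote('~form', 0))   # -> 5
```

The real LPeg API additionally handles captures, character sets, and grammar recursion, but the compositional idea is the same.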

After the conversion work, I finally had a parser that could produce an AST:

>> (local {: parse} (require :impl.parser))
nil
>> (parse "(ns parcera.core
             (:require [clojure.data :as data]
                       [clojure.string :as str]))")
[:code
 [:list
  [:symbol "ns"]
  [:whitespace " "]
  [:symbol "parcera.core"]
  [:whitespace "\n             "]
  [:list
   [:keyword ":require"]
   [:whitespace " "]
   [:vector
    [:symbol "clojure.data"]
    [:whitespace " "]
    [:keyword ":as"]
    [:whitespace " "]
    [:symbol "data"]]
   [:whitespace "\n                       "]
   [:vector
    [:symbol "clojure.string"]
    [:whitespace " "]
    [:keyword ":as"]
    [:whitespace " "]
    [:symbol "str"]]]]]

Or not:

>> (parse "(ns parcera.core
             (:require [clojure.data :as data]
                       [clojure.string :as str])") ;; <- missing ')'
nil

Turns out, LPeg doesn’t have any built-in way to report errors when the input string doesn’t match. It simply returns nil.

Luckily for me, there’s an additional library called lpeglabel that can help with that. It implements an extension to LPeg that allows defining labeled failures:

>> (parse "(ns parcera.core
             (:require [clojure.data :as data]
                       [clojure.string :as str])")
unknown:3:49: Parse error: EOF while reading: Expected an ')'
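The line:column part of a message like that comes from the position where the labeled failure fired. Converting an offset into a line and column is simple - here's a sketch of the idea (my own illustration, not lpeglabel's actual code):

```python
# Convert a character offset into a 1-based (line, column) pair, the
# way parse-error positions like "3:49" are computed from a failure
# offset. Illustration only, not lpeglabel's implementation.
def line_col(src: str, pos: int) -> tuple[int, int]:
    line = src.count('\n', 0, pos) + 1     # newlines before pos
    last_nl = src.rfind('\n', 0, pos)      # -1 when pos is on line 1
    col = pos - last_nl                    # distance from last newline
    return line, col

src = "(foo\n  (bar"
print(line_col(src, len(src)))  # -> (2, 7): failure at end of input
```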

Now I have a working Clojure parser with a usable output.

The parser

I mentioned before that Clojure metadata and special characters introduced problems for my original parser. Well, with this grammar, that’s no longer an issue:

>> (each [_ s (ipairs ["@foo" "^Foo bar" "^{:baz true} qux" "#_1 2"])]
     (pp (parse s)))
[:code [:deref [:symbol "foo"]]]
[:code [:metadata
        [:metadata-entry [:symbol "Foo"]]
        [:whitespace " "]
        [:symbol "bar"]]]
[:code [:metadata
        [:metadata-entry
         [:map [:keyword ":baz"] [:whitespace " "] [:symbol "true"]]]
        [:whitespace " "]
        [:symbol "qux"]]]
[:code [:discard [:number "1"]] [:whitespace " "] [:number "2"]]

Yes, the parser doesn’t transform @foo into (deref foo), but that’s OK, as I can now easily handle it in the compiler. Metadata is fully supported by the parser and is stripped in the compiler. And discard (#_) actually works in all of its supported ways, i.e. stacked #_#_ 1 2 works fine.
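To give a feel for what handling these nodes looks like, here's a deliberately trivial sketch in Python (the real compiler is written in Fennel; the tags follow the parser output shown earlier, and the "compilation" just rebuilds an s-expression string):

```python
# Sketch of a recursive descent over the tagged tree, skipping nodes
# that produce no code (whitespace, comments, discarded forms). Tags
# match the parser output shown above; the emit logic is intentionally
# trivial - it just rebuilds an s-expression string.
SKIP = {":whitespace", ":comment", ":discard"}

def compile_node(node):
    tag, *children = node
    if tag in SKIP:
        return None                       # produces no output
    if tag == ":code":
        return " ".join(filter(None, (compile_node(c) for c in children)))
    if tag == ":list":
        parts = filter(None, (compile_node(c) for c in children))
        return "(" + " ".join(parts) + ")"
    if tag in (":symbol", ":number", ":keyword"):
        return children[0]                # leaf: the source text itself
    raise ValueError(f"unhandled node: {tag}")

ast = [":code",
       [":discard", [":number", "1"]],
       [":whitespace", " "],
       [":list", [":symbol", "+"], [":whitespace", " "], [":number", "2"]]]
print(compile_node(ast))  # -> (+ 2)
```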

Each parser result starts with :code, the entry point for the compiler. And the compiler itself is just a simple recursive descent over all of the nodes in a loop. So, that’s good, right?

Well, yes and no. There are some cons to this approach.

First, and foremost - this parser doesn’t support streaming, meaning I can’t feed it chunks of code and compile them as they arrive. It’s not impossible in principle, but currently it works on whole strings only. It would be possible to write a second parser that splits the text into top-level expressions, but for now that’s an optimization waiting to happen. This means that files are parsed in full, which results in a larger memory footprint.

Second, and arguably more important - this is a C library dependency. Both LPeg and lpeglabel are C libraries, meaning that the compiler now only works on Lua runtimes that can load native modules. And the set of systems where this compiler can run is limited to those that can compile these libraries. Not that big of a deal, if you ask me, since C runs almost everywhere, but it’ll make the compiler difficult to ship as a dependency-free binary. It’s possible to compile a Fennel script into a binary and include all of its dependencies, but that would mean producing binaries for many different operating systems and architectures. Or, alternatively, requiring people to install LPeg and lpeglabel through luarocks.

For now, I’m willing to accept this. There’s a pure-Lua port of LPeg, but it’s quite slow and doesn’t support lpeglabel’s features. There’s an open request for that, but it’s been sitting open for quite some time.

Other than that, the parsing is complete and we can look at the compiler part of the ClojureFnl project. But that’s gonna be in the next post.