Overcolorization

Tree-sitter has become more widespread, and Emacs took notice: a number of <lang>-ts-mode major modes were added to the core as alternatives to the corresponding <lang>-mode. This is good news and a welcome change, but I have some concerns about the approach.

When I first saw the Tree-sitter talk by Max Brunsfeld I was concerned that the language highlighting “fix” they’re talking about is too much. Here are the screenshots from the talk before and after Tree-sitter:

Figure 1: In some languages, variables have different colors depending on context, types have different colors, struct fields don’t have their own color, etc.

Here’s an important quote from that talk:

It might seem like a subtle improvement to you but I really think it actually achieves the goal of making it so that you can kind of get the structure of the code just from glancing at the colors.

Structure, from colors. I don’t know, I have never tried to infer structure from colors in the code. I write mostly in lisps today, and to infer structure there I, personally, use semantic indentation, not parentheses, contrary to popular belief. There are people who use colored parentheses to make it easier to see where an expression starts and ends, but in my practice there’s no need for that, and colors can be used for other things.

Before lisps, I was writing mostly in ALGOL-based languages, and again, there aren’t many places where I could infer structure from colors. I don’t think I can do that with the provided screenshot of the code in the talk either. I can see the semantics better, yes, since types and field accesses are uniformly highlighted, but this tells me nothing about code structure - the indentation kinda does.

But I’m being picky here. The word structure is not what I have a problem with in particular. And I can get what they mean - you know that the type color is blue, so when you switch languages, and the new language puts types in different positions, you can still spot them easily. It can be seen when they show C and Go examples on the slides with Tree-sitter highlighting. While all of that is cool, I think that coloring everything is the wrong approach.

Before Emacs, I used NeoVim and Kakoune, and I was too obsessed with colors in my editor. NeoVim by default highlights many things, and I’ve even created additional rules to highlight struct fields in C code. Kakoune also allowed me to define my own extra rules and I used it plenty of times.

But when I moved to Emacs, I noticed that, unlike other editors, syntax highlighting is much more reserved. For example, Emacs usually highlights function definitions, but not function calls. At first, I was upset, and searched for ways to enable more syntax highlighting, but quickly learned that the underlying system is slow, and the more you highlight the choppier Emacs gets.

But after years with Emacs, I began to like the more reserved highlighting. By highlighting important parts, Emacs helps me focus on important parts of the code. A function call is not that important, you call them all the time - the definition, however, is. The macro call, on the other hand, is important, as it transforms the code inside its body, so it is better to know that something that looks like an ordinary call isn’t actually an ordinary call. Rust solved this by forcing the use of ! in macro names, but Emacs had a different solution to the same problem in lisps for a long time by using syntax highlighting.

The following screenshot demonstrates how Emacs highlights macro calls. The calls to defun, defmacro, and when are calls to macros. They aren’t special language constructs, just macros, but they do affect the language, as they transform the code given to them into something else, and that’s why it’s important to see that you’re calling a macro. Calls to ordinary functions like apply or log-level-satisfies aren’t highlighted because there is not much significance to them - they return some values and that’s it.

The macros above are both standard to and known by Emacs, but users can define their own macros, and we need to somehow highlight these macros as well as the already known ones. Tree-sitter parsers can do that only in a very limited way - if the macro comes from the same file, it can be detected easily. If not, you have to analyze all of the files imported by the current file, which becomes harder when the files are in external dependencies, possibly in compressed archives, and so on.

That’s where the dynamic highlighting comes naturally. Emacs controls everything related to Emacs Lisp so it can do that without much effort, but that can’t be said about every other language.
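The idea of asking the running Emacs, rather than a parser, whether a name is a macro can be sketched in a few lines. This is only an illustration of the principle, not the code font-lock actually runs; my/unless-zero and my/macro-like-p are made-up names, while macrop and special-form-p are standard Emacs Lisp predicates.

```elisp
;;; -*- lexical-binding: t -*-

;; A user-defined macro (hypothetical, for illustration only).
(defmacro my/unless-zero (n &rest body)
  "Evaluate BODY unless N is zero."
  (declare (indent 1))
  `(unless (zerop ,n) ,@body))

(defun my/macro-like-p (symbol)
  "Return non-nil if SYMBOL currently names a macro or a special form.
This asks the running Emacs, so it only knows about definitions
that have already been evaluated or loaded."
  (and (symbolp symbol)
       (or (macrop symbol)
           (special-form-p symbol))))

(my/macro-like-p 'my/unless-zero) ; non-nil: defined above
(my/macro-like-p 'when)           ; non-nil: a standard macro
(my/macro-like-p 'apply)          ; nil: an ordinary function
```

A highlighter built on a check like this needs no knowledge of which file a macro came from, which is exactly what a purely syntactic parser lacks.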

This is how Emacs highlights the Clojure code for me:

Emacs can’t know anything about Clojure, and thus we can’t apply the same logic when it comes to highlighting user-defined macros and other runtime stuff. But, the CIDER package provides a facility to highlight things dynamically, based on the state of the program. In this particular case it helps to see that some names came from the global scope:

Figure 4: Dynamic highlighting shows a global var

Similarly, CIDER can highlight user-defined macros:

Figure 5: Dynamic highlighting is enabled, but the macro wasn’t made known to the REPL process

Same as in Emacs Lisp, Clojure macros are small compilers, so it’s better to indicate that certain calls are special:

Figure 6: After sending the macro to the REPL, dynamic highlighting shows the user-defined macro

Another use for dynamic highlighting can be found in languages with dynamic scoping, which today are mostly only lisps. It is important to know if the variable you’re using is a dynamically scoped one. Especially so, when let can both introduce locals and re-bind dynamically scoped variables. For example, again, in Emacs Lisp:

The issue here is subtle, but if foo uses dynamic-var internally, it will see the value changed to 20. Dynamic highlighting helps to indicate that dynamic-var is special and important:

Figure 8: Dynamic highlighting shows a global var
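For readers who can’t see the screenshots, the situation can be reconstructed roughly like this. The names follow the ones used in the text, but the listing itself is my sketch under that assumption, not the original code; bar and the initial value 10 are invented for the example.

```elisp
;;; -*- lexical-binding: t -*-

(defvar dynamic-var 10
  "A dynamically scoped (special) variable.")

(defun foo ()
  "Return the value of `dynamic-var' visible at call time."
  dynamic-var)

(defun bar ()
  "Bind a true local and re-bind `dynamic-var' in the same `let'."
  (let ((local-var 1)      ; lexical: invisible outside this form
        (dynamic-var 20))  ; special: visible to everything called below
    (+ local-var (foo))))

(bar) ; => 21, because `foo' sees 20 instead of 10
(foo) ; => 10, the re-binding ended together with the `let'
```

Both bindings look identical in the let form; only the earlier defvar makes the second one behave differently.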

I’m using descriptive names like local-var and dynamic-var, but in actual code it’s not as obvious when something is dynamic. A few months ago, I encountered a weird bug in one of the projects I maintain, and it was related to a dynamic var being used as a local one:

Figure 9: compile-command is a dynamic var

Here, the variable named compile-command is just a string, used to store a command to compile the source code. The name is innocent and semantically appropriate. However, it is a dynamic var introduced in the compile.el package, and changing compile-command can cause problems for any function called from the body of the let block. In my case, some other package advised one of the calls inside this let block and overrode compile-command for it, breaking the package I maintain.

Sadly, by default, Emacs doesn’t highlight such variables. And, as I said, Tree-sitter can’t really help with any of the above cases without parsing all of the code. But it often can be achieved with an external package that consults with the runtime, like CIDER for Clojure and highlight-defined.el¹ for Emacs Lisp. Luckily the fix was easy, but if Emacs highlighted dynamic vars by default, this would not be an issue at all.
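What “consulting the runtime” means for dynamic vars can be shown with the standard special-variable-p predicate. The helper below is a hypothetical sketch (my/special-bindings, my-build-dir, and run-build are made-up names), and it is not how highlight-defined.el is implemented; it only reports which symbols bound by a quoted let form are special in the running Emacs.

```elisp
;;; -*- lexical-binding: t -*-

(require 'compile) ; make sure `compile-command' is defined

(defun my/special-bindings (let-form)
  "Return the symbols bound by LET-FORM that are special variables.
LET-FORM is an unevaluated `let' or `let*' form."
  (let (found)
    (dolist (binding (cadr let-form) (nreverse found))
      ;; A binding is either (SYMBOL VALUE) or a bare SYMBOL.
      (let ((sym (if (consp binding) (car binding) binding)))
        (when (special-variable-p sym)
          (push sym found))))))

(my/special-bindings
 '(let ((my-build-dir "out")
        (compile-command "make -k"))
    (run-build)))
;; => (compile-command)
```

The answer depends on what has been loaded so far, which is the same limitation the CIDER screenshots above show for macros that haven’t been sent to the REPL.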

So my point is - highlight the important stuff, not just stuff.

Of course, for lisps all these nuances existed even before dynamic highlighting, so lisp hackers had to deal with them somehow. Thus, dynamically scoped vars use different naming conventions (e.g. *dynamic-var*), macros use different code indentation rules, and so on. These rules are still widely used today but highlighting is an additional visual aid, making important stuff easier to notice.

But let’s go back to the main topic. I’ve called this post “Overcolorization” for a reason, after all. I wanted to show you the importance of selective highlighting first, and I hope it makes it clearer why I think there is a problem.

One of the reasons Emacs highlighting is reserved for the important parts of the code is that the traditional highlighting engine is slow. Tree-sitter makes everything much faster, and thus Emacs should be able to highlight everything now without slowdown. That fact worried me, as it would mean that code could suddenly become a mess of colors when the <lang>-ts-mode becomes the standard. I held hopes that, because Emacs is used by generally more conservative people, the old approach to syntax highlighting would remain, and that the only change would be that finding the important parts of the code becomes easier, because the parser gives us all of the info.

Recently, I was toying with the Elixir language, and decided to use the now built-in elixir-ts-mode, because the non-tree-sitter mode has issues with automatic indentation and with jumping across delimiters. Tree-sitter should naturally fix this, and it indeed does, but what am I looking at?

Figure 10: elixir-mode on the left, elixir-ts-mode on the right

Everything is purple. I’ve included the old mode on the left, so you can see what it looked like before I switched to the new one, which is actually suggested by the developers of the original elixir-mode. I mean, it’s pretty, and the colors do look nice in my personal opinion, but remember the quote I mentioned?

It actually achieves the goal of making it so that you can kind of get the structure of the code just from glancing at the colors.

The colors clearly do not help here. You could object that the colors should simply be used consistently, and I agree with you. The main issue here is that font-lock-keyword-face is used for keywords, method calls, and parentheses alike. The font-lock-function-name-face is used for function definitions, parameter names, and calls. But not for all calls: some use font-lock-keyword-face, and I’m not talking about method calls here - see raise. The font-lock-type-face is used both for types and :symbols. The operators aren’t highlighted only because the font-lock-operator-face is not set in my theme.

The highlighting is all over the place. Thanks to the speed and power of Tree-sitter, the code was made to look pretty, not informative. The same happened for a few other languages, but Elixir is probably the worst offender in my opinion.

And I can’t fix that! Say I want to remove the highlighting of method calls - I can’t, because for that I need to unset the face used by them, which removes the highlighting of the language keywords too. Same with the argument colors - they use the function name face, and I do want to see function definitions highlighted. I could override the font-lock rules that the Elixir mode builds with treesit-font-lock-rules, but elixir-ts--font-lock-settings is a private var: it can silently be removed, renamed, or changed, breaking everything I’ve tried to fix.
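For completeness, a coarser supported knob does exist: Tree-sitter modes group their rules into features spread over decoration levels, and the level can be lowered per buffer. The sketch below assumes Emacs 29 or later; my/elixir-ts-calmer-font-lock is a made-up name, and which highlighting disappears at level 2 is decided entirely by the mode, so this may or may not touch the cases described above. It does nothing about the shared faces themselves.

```elisp
;;; -*- lexical-binding: t -*-

(defun my/elixir-ts-calmer-font-lock ()
  "Lower the Tree-sitter decoration level in the current buffer."
  ;; Emacs documents 3 as the default level; 2 is an arbitrary choice here.
  (setq-local treesit-font-lock-level 2)
  ;; The mode has already computed its active features by the time
  ;; the hook runs, so they have to be recomputed for the new level.
  (when (fboundp 'treesit-font-lock-recompute-features)
    (treesit-font-lock-recompute-features)))

(add-hook 'elixir-ts-mode-hook #'my/elixir-ts-calmer-font-lock)
```

This switches whole groups of rules on or off, so it is a blunt instrument compared to having each kind of token on its own face.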

I don’t want this post to be a bashing of the elixir-ts-mode developers. I’m glad that the new mode was made, as it is better in many ways, but the problem with syntax highlighting exists, and other languages are in trouble too. In Lua, the lua-ts-mode does highlight table accesses but doesn’t highlight method calls, which differ from table accesses by one symbol. Are table accesses more important than method calls? I think they both are equally unimportant.

The real problem here is that Emacs isn’t ready for semantic highlighting. It does now have Tree-sitter, which is a common interface between most text editors, so the problem isn’t with Emacs speed anymore. What it doesn’t have is enough standardized faces to use across all languages.

Or does it?

font-lock-comment-face - face used to highlight comments.
font-lock-misc-punctuation-face - face used to highlight miscellaneous punctuation.
font-lock-delimiter-face - face used to highlight delimiters.
font-lock-bracket-face - face used to highlight brackets, braces, and parens.
font-lock-punctuation-face - face used to highlight punctuation characters.
font-lock-property-use-face - face used to highlight property references.
font-lock-property-name-face - face used to highlight properties of an object.
font-lock-operator-face - face used to highlight operators.
font-lock-number-face - face used to highlight numbers.
font-lock-escape-face - face used to highlight escape sequences in strings.
font-lock-regexp-grouping-construct - face used to highlight grouping constructs in Lisp regexps.
font-lock-regexp-grouping-backslash - face for backslashes in Lisp regexp grouping constructs.
font-lock-regexp-face - face used to highlight regexp literals.
font-lock-preprocessor-face - face used to highlight preprocessor directives.
font-lock-negation-char-face - face used to highlight easy to overlook negation.
font-lock-warning-face - face used to highlight warnings.
font-lock-constant-face - face used to highlight constants and labels.
font-lock-type-face - face used to highlight type and class names.
font-lock-variable-use-face - face used to highlight variable references.
font-lock-variable-name-face - face used to highlight variable names.
font-lock-function-call-face - face used to highlight function calls.
font-lock-function-name-face - face used to highlight function names.
font-lock-builtin-face - face used to highlight builtins.
font-lock-keyword-face - face used to highlight keywords.
font-lock-doc-markup-face - face used to highlight embedded documentation mark-up.
font-lock-doc-face - face used to highlight documentation embedded in program code.
font-lock-string-face - face used to highlight strings.
font-lock-comment-delimiter-face - face used to highlight comment delimiters.

These are all of the faces defined in the font-lock.el package. There are separate faces for function definitions and function calls - good stuff. I don’t see a face that can be used for method calls specifically, but are they really that different from ordinary function calls? There is even a face for variable usage, not just definition.
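With usage faces separated from definition faces, opting out of call highlighting becomes a small piece of configuration, provided a mode actually uses the dedicated faces. A minimal sketch under that assumption follows; my/quiet-usage-faces is a made-up name, and the facep guard simply skips faces that an older Emacs doesn’t define.

```elisp
;;; -*- lexical-binding: t -*-

(defun my/quiet-usage-faces ()
  "Make call and reference faces look like ordinary text.
Faces missing from this Emacs version are skipped."
  (dolist (face '(font-lock-function-call-face
                  font-lock-variable-use-face
                  font-lock-property-use-face))
    (when (facep face)
      ;; Point the face at `default' instead of the matching
      ;; definition face it would otherwise inherit from.
      (set-face-attribute face nil :inherit 'default))))

(my/quiet-usage-faces)
```

Definitions keep their colors because font-lock-function-name-face and friends are untouched. A theme that sets explicit colors on the usage faces would need those attributes reset as well, and a mode that paints calls with font-lock-keyword-face is not affected at all.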

Thing is, many of these faces are new, and did not exist prior to Emacs 29.1. It’s good that they were added specifically to be used with Tree-sitter-based modes. You can say that I should submit a bug to Emacs about the elixir-ts-mode not using faces properly, and I did, but the problem isn’t specific to elixir-ts-mode.

The Elixir mode is just a convenient way to illustrate the problem: lifting a technical limitation can create problems of its own. A famous quote says “Limitations breed creativity”, and I feel like that’s exactly what was going on with the old method of syntax highlighting in Emacs. It was slow, and couldn’t handle too much highlighting or too complex rules, so it was kept to the minimum, highlighting what was absolutely necessary. But with Tree-sitter parsers being able to process files with 20k lines in under 30ms, performance is no longer the limiting factor. Well, mostly: you still need to process the abstract syntax trees generated by these parsers and draw everything on the screen.

I want to end this post with a message to fellow major mode maintainers.

Please, use syntax highlighting to highlight, not to color.

Syntax highlighting is a tool, and highlighting important parts of the code makes programming less convoluted. Pretty colors are nice to have, but if the point is only to make the code look fancy, there’s no real value for what we’re trying to do in the end - write better code. Let’s help each other at least at that.

¹ BTW, highlight-defined.el by default tries to highlight everything defined, so I limited it to highlighting only known dynamic vars.
