Clojure on Fennel part two: immutable.fnl optimizations
So the next post will hopefully be about the compiler itself.
Unless I get distracted again.
Sike!
While I did some work on the compiler, I’m not feeling ready to talk about it yet. I want the post about it to be, well, more in-depth, so for now, let’s go back to immutable data structures. I got some comments on their performance, which can be summarized to “bad”.
And I agree! The performance is not great, especially when compared to plain Lua tables. However, my benchmarking process was a bit flawed, and I’ve improved it since.
This post won’t be long, but I wanted to make it anyway, to illustrate what I’m dealing with. Again, this is not the final version, I’m still working on possible optimizations as I write this, but some things have improved already, though not by much.
There were some low-hanging fruits that I could tackle:
- Vector had a local
fragmentfunction with pure arithmetic. I already have the samefragmentfunction in my bitops module that uses native rshift+band. The modulo operation is expensive. - A similar case of using bit operations instead of modulo was in HashMap’s
indexfunction. - Increased the popcount table to 256 entries instead of 16. Halves the loop iterations for typical bitmaps.
- Probably premature, but replaced most instances of
faccumulate,fcollect, etc. with plainforloops. These loops are quite hot, and Fennel compiles them to have additional assignment at each iteration. Should be free, but actually had a small impact. - More pre-allocations when creating Lua tables.
Using
[nil nil nil nil]tells the Lua compiler to pre-allocate four slots when creating the table, meaning it won’t resize once we insert elements. Tables in Lua resize by a factor of 2 and copy everything on resize, which is not great for performance. - Vector iterator is now stateful. And also fast.
- Internal fields for data structures used long, elaborate keys, which meant string comparison in certain cases of table lookup would be unnecessarily long. Again, changing them to shorter keys should not matter that much, but it did have some impact.
- HashMap branching factor was increased to 32. After some tests, it proved to be slightly better, despite what I said in my previous post.
- Hashes are now cached in a weak table. Helps avoid re-hashing the same values at the expense of memory. Not sure how well it will scale.
- Optimized array copying for sizes less than 4 by doing direct table construction. Avoids loop overhead.
With these changes in place (and some more I omitted), here are some of the benchmarks:
These images show per-commit changes to the performance. Really helped me track what was worth keeping, and what wasn’t.
Here’s the LuaJIT version of the benchmark:
Note I know, the results for some operations are worse than before all optimizations that took place. This is still a work in progress. Just shows how important relative measuring is.
The results are a mixed bag. Most operations improved by 1 or 2 milliseconds overall, meaning the per-operation cost is still quite high, compared to Lua tables. Yes, some are faster now, so I’m keeping these changes, but I can also see that for some operations, they actually got slower than they were before any optimization work. The reason for that may be that I fixed a few bugs along the way. Anyway, I tried to prioritize transient collection variants, as they will benefit from optimizations much more because of how they’re used.
Some bars are not meaningful for certain operations. E.g., I had changes that had nothing to do with Persistent Vector, but affected it anyway, so this benchmark is still a bit flawed. At least those differences are within the deviation range.
Anyhow, just wanted to show that optimization is hard, and not all changes yield the results you expect. In the case of this library, I don’t think there are more low-hanging fruits to optimize, or at least I don’t see any. For example, there are a lot of table allocations that are hard to get rid of in the current implementation. A deeper analysis is needed, probably by people with more Lua knowledge.
I’m still satisfied with some improvements I got, but we’ll see if more are on the way.