Andrey Listopadov

Clojure on Fennel part two: immutable.fnl optimizations

So the next post will hopefully be about the compiler itself.

Unless I get distracted again.

Sike!

While I did some work on the compiler, I’m not feeling ready to talk about it yet. I want the post about it to be, well, more in-depth, so for now, let’s go back to immutable data structures. I got some comments on their performance, which can be summarized to “bad”.

And I agree! The performance is not great, especially when compared to plain Lua tables. However, my benchmarking process was a bit flawed, and I’ve improved it since.

This post won’t be long, but I wanted to make it anyway, to illustrate what I’m dealing with. Again, this is not the final version, I’m still working on possible optimizations as I write this, but some things have improved already, though not by much.

There were some low-hanging fruits that I could tackle:

  • Vector had a local fragment function with pure arithmetic. I already have the same fragment function in my bitops module that uses native rshift+band. The modulo operation is expensive.
  • A similar case of using bit operations instead of modulo was in HashMap’s index function.
  • Increased the popcount table to 256 entries instead of 16. Halves the loop iterations for typical bitmaps.
  • Probably premature, but replaced most instances of faccumulate, fcollect, etc. with plain for loops. These loops are quite hot, and Fennel compiles them to have additional assignment at each iteration. Should be free, but actually had a small impact.
  • More pre-allocations when creating Lua tables. Using [nil nil nil nil] tells the Lua compiler to pre-allocate four slots when creating the table, meaning it won’t resize once we insert elements. Tables in Lua resize by a factor of 2 and copy everything on resize, which is not great for performance.
  • Vector iterator is now stateful. And also fast.
  • Internal fields for data structures used long, elaborate keys, which meant string comparison in certain cases of table lookup would be unnecessarily long. Again, changing them to shorter keys should not matter that much, but it did have some impact.
  • HashMap branching factor was increased to 32. After some tests, it proved to be slightly better, despite what I said in my previous post. More on that later.
  • Hashes are now cached in a weak table. Helps avoid re-hashing the same values at the expense of memory. Not sure how well it will scale.
  • Optimized array copying for sizes less than 4 by doing direct table construction. Avoids loop overhead.

With these changes in place (and some more I omitted), here are some of the benchmarks:

These images show per-commit changes to the performance. Really helped me track what was worth keeping, and what wasn’t.

Here’s the LuaJIT version of the benchmark:

Here’s another graph:

This is me testing how hash map branching factor affects operations. As can be seen, switching from 16 to 32 makes inserts slower on PUC Lua 5.5 but makes indexing faster. On LuaJIT both operations are faster.

Looking at the charts, the 7989c32 commit with that change (the second orange bar) — insertions are slower than the previous commit, and indexing is faster than the previous commit. Unfortunately, LuaJIT charts are noisy and I can’t confirm the speedup there. Something to do with instabilities in the LuaJIT tracer, I guess.

Note I know, the results for some operations are worse than before all optimizations that took place. This is still a work in progress. Just shows how important relative measuring is.

The results are a mixed bag. Most operations that improved did so by only 1 or 2 milliseconds overall, meaning the per-operation cost is still quite high, compared to Lua tables. Yes, some became faster, so I’m keeping these changes, but I can also see that for some operations, they actually got slower than they were before any optimization work. The reason for that may be that I fixed a few bugs along the way. Anyway, I tried to prioritize transient collection variants, as they benefit most from optimizations given how they’re used, and LuaJIT performance where possible.

Some bars are not meaningful for certain operations. E.g., I had changes that had nothing to do with Persistent Vector, but affected it anyway, so this benchmark is still a bit flawed. At least those differences are within the deviation range.

Anyhow, just wanted to show that optimization is hard, and not all changes yield the results you expect. In the case of this library, I don’t think there are more low-hanging fruits to optimize, or at least I don’t see any. For example, there are a lot of table allocations that are hard to get rid of in the current implementation. A deeper analysis is needed, probably by people with more Lua knowledge.

I’m still satisfied with some improvements I got, but we’ll see if more are on the way.

Update

After a lot more benchmarks, flamegraphs, and throwing stuff at the wall, I think I can’t do much more. I decided to update this post instead of writing another one (and really should have waited before publishing the original) because at that point, I thought there was nothing more to do. Now I think there really isn’t. So the results are in:

Figure 1: PUC Lua 5.5 (green - before, orange - after)

Figure 1: PUC Lua 5.5 (green - before, orange - after)

Figure 2: LuaJIT (green - before, orange - after)

Figure 2: LuaJIT (green - before, orange - after)

As can be seen, the changes are more dramatic on Lua than on LuaJIT. Mostly because it’s hard to optimize for LuaJIT without sacrificing performance on Lua, so I re-prioritized Lua, since LuaJIT is already much faster.

The final benchmark tables are below.

PUC Lua 5.5

  • PersistentVector
    Operation Lua table Persistent ratio per op ratio (old) per op (old)
    build 50000 0.18 ms 18.88 ms 102x 0.378 us 119x 0.440 us
    lookup random 0.48 ms 11.44 ms 24x 0.229 us 28x 0.266 us
    assoc random 0.31 ms 42.72 ms 139x 0.854 us 222x 1.4 us
    pop all 0.14 ms 14.56 ms 106x 0.291 us 150x 0.415 us
    iterate 0.60 ms 2.74 ms 5x 0.055 us 17x 0.196 us
  • TransientVector
    Operation Lua table Transient ratio per op ratio (old) per op (old)
    build 50000 0.19 ms 6.55 ms 35x 0.131 us 41x 0.154 us
    assoc random 0.31 ms 17.64 ms 57x 0.353 us 67x 0.408 us
    pop all 0.14 ms 5.74 ms 42x 0.115 us 51x 0.142 us
  • PersistentHashMap
    Operation Lua table Persistent ratio per op ratio (old) per op (old)
    insert 50000 2.01 ms 81.44 ms 41x 1.6 us 82x 3.2 us
    lookup random 0.92 ms 34.03 ms 37x 0.681 us 81x 1.7 us
    contains random 1.82 ms 33.08 ms 18x 0.662 us 49x 1.7 us
    dissoc all 0.90 ms 76.97 ms 86x 1.5 us 187x 3.3 us
    dissoc 10% 0.18 ms 9.81 ms 56x 2.0 us 112x 3.7 us
    iterate 1.91 ms 7.22 ms 4x 0.144 us 4x 0.146 us
    lookup structural keys 0.14 ms 2.86 ms 21x 0.572 us 146x 4.1 us
    insert collision-heavy 0.13 ms 6.50 ms 51x 1.3 us 134x 3.5 us
    mixed 80/20 1.92 ms 53.67 ms 28x 1.1 us 62x 2.3 us
  • TransientHashMap
    Operation Lua table Transient ratio per op ratio (old) per op (old)
    insert 50000 2.02 ms 38.91 ms 19x 0.778 us 44x 1.7 us
    lookup random 1.14 ms 35.56 ms 31x 0.711 us 86x 1.7 us
    dissoc all 0.89 ms 44.52 ms 50x 0.890 us 112x 2.0 us
    mixed 80/20 4.39 ms 44.87 ms 10x 0.897 us 22x 1.9 us

LuaJIT

  • PersistentVector
    Operation Lua table Persistent ratio per op ratio (old) per op (old)
    build 50000 0.11 ms 8.12 ms 71x 0.162 us 84x 0.177 us
    lookup random 0.06 ms 0.67 ms 11x 0.013 us 11x 0.013 us
    assoc random 0.04 ms 22.91 ms 559x 0.458 us 731x 0.585 us
    pop all 0.02 ms 10.77 ms 718x 0.215 us 657x 0.210 us
    iterate 0.02 ms 0.23 ms 12x 0.005 us 26x 0.011 us
  • TransientVector
    Operation Lua table Transient ratio per op ratio (old) per op (old)
    build 50000 0.05 ms 0.79 ms 16x 0.016 us 12x 0.012 us
    assoc random 0.04 ms 2.22 ms 55x 0.044 us 28x 0.022 us
    pop all 0.02 ms 0.35 ms 23x 0.007 us 50x 0.016 us
  • PersistentHashMap
    Operation Lua table Persistent ratio per op ratio (old) per op (old)
    insert 50000 0.79 ms 32.32 ms 41x 0.646 us 50x 0.989 us
    lookup random 0.34 ms 5.78 ms 17x 0.116 us 43x 0.248 us
    contains random 0.30 ms 5.91 ms 20x 0.118 us 40x 0.257 us
    dissoc all 0.23 ms 27.02 ms 119x 0.540 us 191x 0.894 us
    dissoc 10% 0.05 ms 5.89 ms 118x 1.2 us 121x 1.4 us
    iterate 0.07 ms 1.07 ms 16x 0.021 us 30x 0.040 us
    lookup structural keys 0.02 ms 2.41 ms 121x 0.483 us 119x 0.498 us
    insert collision-heavy 0.28 ms 5.20 ms 19x 1.0 us 36x 1.2 us
    mixed 80/20 0.50 ms 46.59 ms 94x 0.932 us 40x 0.513 us
  • TransientHashMap
    Operation Lua table Transient ratio per op ratio (old) per op (old)
    insert 50000 0.91 ms 30.92 ms 34x 0.618 us 11x 0.242 us
    lookup random 0.32 ms 34.90 ms 110x 0.698 us 44x 0.246 us
    dissoc all 0.23 ms 36.35 ms 162x 0.727 us 57x 0.271 us
    mixed 80/20 1.29 ms 40.71 ms 32x 0.814 us 11x 0.309 us

Conclusion

Yes, the benchmark shows that some operations slowed down, some sped up. Ideally, I’d like to have no slowdown, but that’s that. Still, I wouldn’t trust my LuaJIT benchmark, though - graphs show that for some weird reason unrelated changes affect performance in the modules not touched by those changes at all. So these changes are now merged into the main branch.

Main bottleneck now is table allocation and array copying - things I already spent optimizing a lot of time with mixed results. So I think I can’t do better than that. At least without major algorithmic restructuring.

With that, the next post should be about the compiler.