Zach Tellman delivered a really informative and practical unsession at Clojure Conj 2013 entitled “Predictably Fast Clojure.” It was described as:
An exploration of some of the underlying mechanisms in Clojure, and how to build an intuition for how fast your code should run. Time permitting, we’ll also explore how to work around these mechanisms, and exploit the full power of the JVM.
I’d like to share a few interesting things that I learned from this talk and that I subsequently verified and explored.
How to benchmark
It turns out that benchmarking is hard and benchmarking on the JVM is even harder. Fortunately, the folks at the Elliptic Group have thought long and hard about how to do it right and have written a couple of great articles on the matter. Hugo Duncan’s Criterium library makes it super easy to use these robust techniques.
All the benchmarks in this post were run on my dual-core 2.6 GHz Intel Core i5 laptop. The JVM was started with `lein with-profile production repl`, which enables more aggressive JIT compilation at the cost of slower start times. If you try to use Criterium without this, you'll get warnings spewed for every benchmark you run.
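For anyone who hasn't used it, Criterium boils down to a couple of macros — `bench` and `quick-bench` — that handle the hard parts for you:

```clojure
(require '[criterium.core :refer [quick-bench]])

;; Runs the body repeatedly, first warming up the JIT, then samples
;; execution times and reports the mean, std-dev, and outlier effects.
(quick-bench (first [1 2 3]))
```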
Surprising operations on lists, vectors, and tuples
The first thing that he discussed was the relatively poor performance of `first` on vectors.
For the tests, I made some simple collections.
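Concretely, something along these lines, with `tuple` coming from the clj-tuple library:

```clojure
(require '[clj-tuple :refer [tuple]])

;; A sketch of the setup: one list, one vector, one tuple.
(def l '(1 2 3))      ; list
(def v [1 2 3])       ; vector
(def t (tuple 1 2 3)) ; clj-tuple tuple
```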
And then I timed each of them with `first` and with `(nth coll 0)`.
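The timings themselves are just Criterium runs like these (the comments summarize the relative results discussed below):

```clojure
(quick-bench (first l)) ; list: fast
(quick-bench (first v)) ; vector: noticeably slower
(quick-bench (nth v 0)) ; vector via nth: faster than first
(quick-bench (nth l 0)) ; list via nth: abysmal
```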
The documentation says that `first` "calls seq on its argument." This is effectively true, but if you look at the source you'll see that if the collection implements `ISeq`, `seq` doesn't need to be called. As a result, the performance of `first` on lists, which do implement `ISeq`, is much better than on vectors, which don't. Zach took advantage of this observation in his clj-tuple library and made sure that tuples implement `ISeq`.
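You can see the distinction directly at the REPL:

```clojure
(instance? clojure.lang.ISeq '(1 2 3)) ;=> true, so first skips the seq call
(instance? clojure.lang.ISeq [1 2 3])  ;=> false, so first has to call seq
```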
What’s really interesting is that you can use `(nth coll 0)` to get the first element of a vector faster than you can with `first`. Unfortunately, this only does well with vectors. The performance is abysmal when applied to lists, so you should stick to `first` if you don’t know the data structure you are operating on.
The apparent slowness of `seq` on a vector made me wonder about the `empty?` function, which uses `seq` under the hood.
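Here's the definition from clojure.core:

```clojure
(defn empty?
  "Returns true if coll has no items - same as (not (seq coll)).
  Please use the idiom (seq x) rather than (not (empty? x))"
  {:added "1.0"
   :static true}
  [coll] (not (seq coll)))
```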
If using `seq` is so slow, perhaps we can get better performance by just getting the count of elements and testing if it's zero.
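A count-based version might look like this (`empty2?` is just an illustrative name):

```clojure
(defn empty2?
  "Like empty?, but tests the count instead of calling seq."
  [coll]
  (zero? (count coll)))

(quick-bench (empty? v))  ; goes through seq
(quick-bench (empty2? v)) ; counted collections answer count in O(1)
```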
Of course, this is a bad idea for lazy sequences and should probably be avoided, as we’ll incur a cost that is linear in the size of the sequence just to get the count.
I don’t think this will affect my day-to-day code, but it certainly is interesting and surfaced a bit more about how things actually work in Clojure.
Inconsistent protocol timings
This was a surprising one that also peeled back a layer on Clojure’s implementation. In Fogus’s “Why Clojure might not need invokedynamic, but it might be nice,” he explained:
Clojure’s protocols are polymorphic on the type of the first argument. The protocol functions are call-site cached (with no per-call lookup cost if the target class remains stable). In other words, the implementation of Clojure’s protocols are built on polymorphic inline caches.
The consequence of this is that we will see worse performance if the type of the first argument to a protocol’s method keeps changing. I made a simple test to see how significant this is.
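The test was along these lines (the protocol and the extended types here are illustrative):

```clojure
(defprotocol Foo
  (f [this]))

;; Extend the protocol to two unrelated types.
(extend-protocol Foo
  String
  (f [this] :string)
  clojure.lang.Keyword
  (f [this] :keyword))

(defn g [a b]
  (f a)
  (f b))
```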
`g` calls `f` on both its arguments, and we expect `f` to perform best when it's consistently called on a single type.
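Benchmarking `g` with arguments of the same type versus mixed types then looks like:

```clojure
(quick-bench (g "hello" "world")) ; same type: the most-recent-entry cache stays hot
(quick-bench (g "hello" :world))  ; mixed types: the most-recent entry is stale on every call
```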
The expectation was correct. There was some subsequent talk about whether the penalty of this cache miss was predictable. Theoretically, the cost could be unbounded if you extend the protocol with enough types and have horrible luck with the hash codes of those types colliding, but my understanding of the caching logic is that it will usually be the small constant that we observed here.
You can see why by taking a look at how the cache works in MethodImplCache.java. The hash code of the class is shifted and masked by values that form a simple perfect hash, which is determined by the `maybe-min-hash` function. The use of a perfect hash means that we should see consistent lookup times for even moderately large caches.
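In spirit, the lookup amounts to something like this sketch (the real code is Java, and the table stores class/fn pairs; `cache-lookup` is just an illustrative name):

```clojure
;; Sketch: maybe-min-hash picks shift and mask so that
;; ((hashCode c) >> shift) & mask is unique for every class in the cache.
(defn cache-lookup [table shift mask ^Class c]
  (nth table (bit-and (bit-shift-right (.hashCode c) shift) mask)))
```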
In the rare case that a perfect hash can't be found by `maybe-min-hash`, the cache falls back to using a `PersistentArrayMap`, which can have slightly worse performance. In any case, I don't think there's much to worry about here.
One neat thing I discovered while testing all of this is that you don't suffer this cache-miss penalty if you declare that you support a protocol in your `deftype` or if you `reify`, but you do if you use `extend-protocol`.
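A sketch of the three cases (again, the names are illustrative):

```clojure
(defprotocol Bar
  (h [this]))

(deftype Direct []
  Bar
  (h [this] :direct))     ; interface implemented directly by the class

(def reified
  (reify Bar
    (h [this] :reified))) ; reify also implements the interface

(deftype Extended [])
(extend-protocol Bar
  Extended
  (h [this] :extended))   ; dispatch goes through the MethodImplCache
```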
My understanding is that declaring a protocol results in the creation of function objects and a corresponding interface. When the function is called, the first thing it does when dispatching is check whether the first argument implements the interface of the protocol that declared the function. If it does, the corresponding method on the object is called. If it doesn't implement the interface, the function falls back to the MethodImplCache and has the potential to suffer from the cache miss. What's great is that if the object does implement the interface, the most recent entry in the cache is unaffected.
We can verify that the reified object and the instance of the type that declared the protocol in its `deftype` both implement the interface, and that the extended type doesn't.
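Assuming everything above lives in the `user` namespace, where `defprotocol` generates a `user.Bar` interface:

```clojure
(instance? user.Bar (Direct.))   ;=> true
(instance? user.Bar reified)     ;=> true
(instance? user.Bar (Extended.)) ;=> false, only extended via extend-protocol
```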
Determining if your type hints worked
Often when we want to squeeze every last bit of performance, we use type hints to avoid reflection and to force the use of primitives. Zach demonstrated how to use Gary Trakhman’s no.disassemble to inspect the byte code of a function directly from the REPL.
I haven’t gotten to play with it yet, but the ability to quickly compare the byte code between two implementations in the REPL looked amazing.
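From the project's README, it looks as simple as this (it requires the lein-nodisassemble plugin so the agent gets loaded):

```clojure
(require '[no.disassemble :refer [disassemble]])

;; Compare the bytecode of an unhinted and a type-hinted function.
(println (disassemble (fn [x] (inc x))))
(println (disassemble (fn [^long x] (inc x))))
```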
Thanks
Thanks to Zach Tellman for the informative presentation that motivated this and to David Greenberg for help investigating the protocol performance issues.
If there’s anything I got wrong, please let me know in the comments… thanks!