Benchmarking ND4J and Neanderthal

ND4J and Neanderthal are both libraries for fast matrix math on the JVM. ND4J targets Java users, while Neanderthal is aimed at Clojure users. Thanks to Clojure’s excellent Java interop, it is quite easy to use ND4J from Clojure as well, even though it doesn’t provide an idiomatic Clojure API out of the box.

Dragan Djuric, the creator of Neanderthal, recently conducted a micro-benchmark of both ND4J and Neanderthal. The operation under test is matrix multiplication, specifically a call to GEMM from Intel’s MKL library. The results were quite unexpected, since neither library should be doing much at that point: they basically pass the call on to MKL.

When the results showed that Neanderthal was 24 times faster with the smallest input of a 4x4 matrix, and still 20% faster at 4096x4096, I became curious about what was going on, especially since his ND4J code is based on my original benchmarks.

When I originally compared ND4J and Neanderthal matrix multiplication speeds, the results left me puzzled, since ND4J was slower at small sizes, yet faster at larger sizes. For this reason I never actually published any numbers. I originally based my comparison on Dragan’s benchmark code, but I didn’t notice that doubles were used there instead of floats. His new benchmark has cleared up this confusion, and I’m glad that Dragan has shared both code and results.

In this post I try to validate Dragan’s results, show the detail that changes the numbers considerably, and rerun the benchmark after some additional optimizations have been added to ND4J.

Apples-to-apples comparison: Changing some code

In Dragan’s benchmark, Neanderthal wins by a large margin. So let’s take a look at the code to see if there is anything we can do to improve ND4J’s performance. Dragan uses this code to run the benchmark for ND4J:

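;; Namespaces used by the benchmark functions
(require '[criterium.core :refer [quick-bench]])
(import '(org.nd4j.linalg.factory Nd4j)
        '(org.nd4j.linalg.api.ndarray INDArray))

(defn bench-nd4j-mmuli-float [^long m ^long k ^long n]
  (let [m1 (Nd4j/rand m k)
        m2 (Nd4j/rand k n)
        ;; no ordering given, so the result array uses ND4J's default C order
        result (Nd4j/createUninitialized m n)]
    (quick-bench
     (do (.mmuli ^INDArray m1 ^INDArray m2 ^INDArray result)
         true))))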

And while this looks correct, it actually has an issue. In ND4J, arrays are C-ordered by default, i.e. their memory layout is the same as if you had allocated the array in C (row-major). GEMM, however, returns its result in F-order, i.e. with the memory layout you would get if you allocated the array in Fortran (column-major). The difference is whether your two-dimensional array is organized as [rows][columns] or [columns][rows]. If you pass a C-ordered array to receive the result, ND4J will notice this, create a new array in F-order, and then copy the results into the original result array. All of this takes time, and especially in a micro-benchmark, where this is called thousands of times per second, memory allocation can become quite the bottleneck.
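To make the layout difference concrete, here is a minimal sketch in plain Java that is independent of ND4J. The class and method names (LayoutSketch, cOrderIndex, fOrderIndex, fToC) are made up for illustration and are not part of any library; the code only shows where element (row, col) of a rows x cols matrix ends up in a flat buffer under each ordering, and what a copy from F-order into C-order has to do.

```java
import java.util.Arrays;

class LayoutSketch {
    // C order (row-major): the elements of one row are adjacent in memory.
    static int cOrderIndex(int row, int col, int rows, int cols) {
        return row * cols + col;
    }

    // F order (column-major): the elements of one column are adjacent in memory.
    static int fOrderIndex(int row, int col, int rows, int cols) {
        return col * rows + row;
    }

    // Copies an F-ordered buffer into a newly allocated C-ordered buffer.
    // This is only an illustration of the kind of extra allocation and copy
    // that is avoided when the result array is F-ordered from the start.
    static float[] fToC(float[] fOrdered, int rows, int cols) {
        float[] cOrdered = new float[rows * cols];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                cOrdered[cOrderIndex(r, c, rows, cols)] = fOrdered[fOrderIndex(r, c, rows, cols)];
            }
        }
        return cOrdered;
    }

    public static void main(String[] args) {
        // The 2x3 matrix [[1 2 3] [4 5 6]] stored column by column (F order).
        float[] f = {1, 4, 2, 5, 3, 6};
        System.out.println(Arrays.toString(fToC(f, 2, 3)));
        // prints [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    }
}
```

The same six values appear in a different sequence in memory depending on the ordering, which is why a result produced in one layout cannot simply be handed back in the other without a copy.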

After changing the code to use F-ordered arrays it looked like this:

(defn bench-nd4j-mmuli-float [^long m ^long k ^long n]
  (let [m1 (Nd4j/rand m k)
        m2 (Nd4j/rand k n)
        ;; \f requests F (column-major) order for the result array
        result (Nd4j/createUninitialized (int-array [m n]) \f)]
    (quick-bench
     (do (.mmuli ^INDArray m1 ^INDArray m2 ^INDArray result)
         true))))

When I ran it with a tiny matrix, it was a lot faster (5 times faster), but it was still about 2 times slower than running the same code directly from Java in my JMH benchmark suite. I’m not sure what causes this. But since we want to compare apples to apples, I decided to change the call itself: INDArray.mmuli performs some additional checks in order to support use cases that are not actually matrix multiplication.

After checking Neanderthal’s source code to see if it would still be a fair comparison, I moved on to using Nd4j.gemm directly. It is the closest in actual functionality to Neanderthal’s mm! call. Both of them do some basic parameter checking before passing them on to MKL. In the case of ND4J it also enforces the ordering, as explained earlier. The following is the benchmark code that I ended up using:

(defn bench-nd4j-gemm-float [^long m ^long k ^long n]
  (let [m1 (Nd4j/rand m k)
        m2 (Nd4j/rand k n)
        result (Nd4j/createUninitialized (int-array [m n]) \f)]
    (quick-bench
     (do (Nd4j/gemm ^INDArray m1 ^INDArray m2 ^INDArray result false false 1.0 0.0)
         true))))

And it turns out that this way of calling GEMM appears to be exactly as fast when called from Criterium (the benchmarking library that provides the quick-bench macro) as when called from JMH.

I’ve also created a pull request for Neanderthal, so that the benchmark code there is closer to an apples-to-apples comparison.

First benchmarking results

Aside from this modification, I use Dragan’s original benchmark code, with Criterium as the benchmarking library, Neanderthal 0.19 and ND4J 1.0.0-beta. My computer has an Intel Core i7-6700K running at 4.6GHz and 32GB of RAM running at 2933MHz, and it uses Windows 10 as the operating system.

Since Windows likes doing Windows things in the background, I ran the benchmark 10 times for each matrix size, alternating between Neanderthal and ND4J, and averaged the numbers afterwards.

Library	Size	Time per Op (ns)	Diff vs Neanderthal
ND4J	2x2	595	166 %
Neanderthal	2x2	223
ND4J	4x4	598	163 %
Neanderthal	4x4	227
ND4J	8x8	612	156 %
Neanderthal	8x8	239
ND4J	16x16	715	134 %
Neanderthal	16x16	305
ND4J	32x32	1312	69 %
Neanderthal	32x32	774
ND4J	64x64	4519	40 %
Neanderthal	64x64	3208
ND4J	128x128	19288	18 %
Neanderthal	128x128	16285
ND4J	256x256	120588	1 %
Neanderthal	256x256	118917
ND4J	512x512	907426	3 %
Neanderthal	512x512	880935
ND4J	1024x1024	7119631	5 %
Neanderthal	1024x1024	6803776
ND4J	2048x2048	53491781	7 %
Neanderthal	2048x2048	49876333
ND4J	4096x4096	397762380	-8 %
Neanderthal	4096x4096	437036465
ND4J	8192x8192	3480873900	0 %
Neanderthal	8192x8192	3452838100

The table shows that from a matrix size of 256x256 upwards, the performance of both libraries is within the margin of error of each other. A positive difference means that ND4J needs more time per operation. With smaller matrices, it is apparent that Neanderthal indeed has a lower overhead. The difference isn’t as high as Dragan found, and in absolute terms about 350ns to 400ns may seem insignificant, yet we should still try to get it down to the bare minimum. This is even more true if you consider that, for those tiny matrices, this overhead is larger than the entire time that Neanderthal needs.
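As a quick check of how these numbers relate, the following sketch recomputes the absolute overhead and the relative difference for two rows of the table. The formula (ND4J time minus Neanderthal time, divided by Neanderthal time) is an assumption about how the "Diff vs Neanderthal" column was derived; for the rows of this table it lands within about one percentage point of the listed values. The class and method names are made up for illustration.

```java
import java.util.Locale;

class OverheadSketch {
    // Relative difference in percent; a positive value means ND4J needs more time.
    static double diffPercent(double nd4jNs, double neanderthalNs) {
        return (nd4jNs - neanderthalNs) / neanderthalNs * 100.0;
    }

    // Absolute extra time per operation in nanoseconds.
    static double overheadNs(double nd4jNs, double neanderthalNs) {
        return nd4jNs - neanderthalNs;
    }

    public static void main(String[] args) {
        // Times per operation taken from the 2x2 and 256x256 rows of the table.
        double[][] rows = {{595, 223}, {120588, 118917}};
        for (double[] row : rows) {
            System.out.println(String.format(Locale.ROOT,
                    "overhead: %.0f ns, difference: %.1f %%",
                    overheadNs(row[0], row[1]), diffPercent(row[0], row[1])));
        }
    }
}
```

For the 2x2 case the absolute overhead is 372 ns, which is more than Neanderthal’s entire 223 ns per operation, while at 256x256 a larger absolute gap is only about one percent of the total.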

Investigating the source of added overhead

In order to find out where some of that latency was hiding, I used an even lower-level way of calling GEMM from Java. Since JavaCPP provides the bindings to the lower-level libraries, and those bindings are public static methods, they can also be used directly. So, to find out whether the source of this additional latency is on the Java side or on the native side, I used that call directly. The result: 231 ± 4 ns per operation, which looks very much like it is within the margin of error of Neanderthal. The additional latency has to be on the Java side.

With those numbers in hand, @raver119 dug into the code and started investigating what might be causing it. He found one reason, and the fix has already landed on master and is therefore available in SNAPSHOT releases.

Repeating the Benchmark

With that change in place, I wanted to repeat the benchmark. Now something weird happened: using the Criterium-based benchmark code, both Neanderthal and ND4J were now 2 times slower than before. I changed back to the old version to make sure it wasn’t due to the change in ND4J, but the slowdown remained.

Interestingly, my own benchmarks with JMH didn’t suffer from this, so I set out to port the Clojure code to Java. This was the first time I used the interop in this direction, calling Clojure from Java. While calling Java from Clojure is quite a breeze, the other way around is pretty ugly as long as there is no specialized API around it. Anyway, I marched on and figured out how to do it (for more comments on this see Oddities).

Using the numbers that I originally collected, I validated that Neanderthal was still as fast as it had been in the Criterium-based benchmark. The following table shows the results using JMH as the benchmarking framework, Neanderthal 0.19 and ND4J 1.0.0-SNAPSHOT.

Library	Size	Time per Op (ns)	Diff vs Neanderthal
ND4J	2x2	309	32 %
Neanderthal	2x2	234
ND4J	4x4	319	31 %
Neanderthal	4x4	243
ND4J	8x8	322	29 %
Neanderthal	8x8	249
ND4J	16x16	420	31 %
Neanderthal	16x16	320
ND4J	32x32	1005	5 %
Neanderthal	32x32	958
ND4J	64x64	3786	29 %
Neanderthal	64x64	2925
ND4J	128x128	18816	19 %
Neanderthal	128x128	16683
ND4J	256x256	104342	-3 %
Neanderthal	256x256	108048
ND4J	512x512	775124	1 %
Neanderthal	512x512	765648
ND4J	1024x1024	6534687	8 %
Neanderthal	1024x1024	6031096
ND4J	2048x2048	44854846	6 %
Neanderthal	2048x2048	42211136
ND4J	4096x4096	317275196	-1 %
Neanderthal	4096x4096	319272117
ND4J	8192x8192	2783549850	8 %
Neanderthal	8192x8192	2571527782

We can see that, especially for very small matrices, the gap has closed a lot. Neanderthal still wins here, though: when the overhead dominates the actual calculation, ND4J still needs about 30% more time. So, we still have to look for ways to reduce our overall overhead some more.

For larger sizes, as could already be seen in the first benchmark, the difference isn’t that clear. While preparing this post, I’ve seen the numbers fluctuate by about 10% in either direction, so everything within 10% of each other is a draw for me at the moment.

Oddities

While preparing this blog post I ran into several odd behaviors.

One is the Criterium benchmark seemingly getting slower over time, even though I restarted the JVMs. Only after several reboots did it go back to normal behavior. I’m stumped as to why that may happen.

Then there were those 10% swings for both libraries, even when a benchmark was run for many iterations. I may run a benchmark for quite a few iterations and get a number about which JMH is rather certain, showing a pretty low standard deviation, but once I repeat it I get a swing in either direction, again with a low reported standard deviation.

I guess that increasing the iteration time could reduce those swings a lot. But, given that I don’t want to compromise on some of the other options (i.e. using at least 2 forks and at least 10 benchmark iterations), using 1 minute for each benchmark iteration would require over 8 hours of benchmarking. And since I’ve seen irregularities even with that on the larger sizes, I’d probably have to up that to 5 minutes, which would take almost two whole days to finish the benchmark for all sizes. Therefore, I’ll be content with saying that everything within 10% of each other is close enough to be considered a draw.
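These time estimates follow from simple arithmetic. The sketch below assumes 26 benchmarks (13 matrix sizes for each of the two libraries, as in the tables), 2 forks and 10 measurement iterations per fork, and it ignores warm-up iterations and JVM start-up time. The class and method names are made up for illustration.

```java
import java.time.Duration;

class BenchmarkBudget {
    // Total measurement time: every benchmark runs forks * iterations iterations.
    static Duration total(int benchmarks, int forks, int iterations, Duration perIteration) {
        return perIteration.multipliedBy((long) benchmarks * forks * iterations);
    }

    public static void main(String[] args) {
        int benchmarks = 13 * 2; // 13 matrix sizes, 2 libraries (assumed from the tables)
        System.out.println(total(benchmarks, 2, 10, Duration.ofMinutes(1))); // prints PT8H40M
        System.out.println(total(benchmarks, 2, 10, Duration.ofMinutes(5))); // prints PT43H20M
    }
}
```

Under these assumptions, one-minute iterations add up to 8 hours and 40 minutes, and five-minute iterations to a little over 43 hours, before any warm-up time is counted.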

And I’m not even sure I could run the whole benchmark for that long. Originally, I wanted to use at least 10 seconds for each benchmark iteration, but my computer crashes during the Neanderthal 64x64 benchmark if it runs for longer than 5 seconds. I guess it is due to the mild overclock, since it seems to cope better once I go back to stock speeds. However, I didn’t have any such issue with ND4J at that same size, nor at any other time during the 3 years that I’ve had this computer.

I found yet another oddity while trying to make Neanderthal work in JMH. I ran into the issue that Clojure couldn’t cast an AOT-compiled version of its pretty printer to itself, and the results that Google spat out didn’t really help with the issue. In the end, I simply removed the AOT-compiled version from the uberjar with a Maven configuration option, and that resolved the issue.

Repeating my benchmarks

I’ve updated my benchmarking_nd4j repository to contain everything that I’ve used for the second round of benchmarks. If you want to repeat them on your own machine, you can clone the repository:

git clone https://github.com/treo/benchmarking_nd4j.git

Build an Uberjar:

mvn clean package

And run it:

java -jar target/benchmarks.jar -f2 -i10 -wi 2 Neanderthal

This invocation will use 2 forks, 10 iterations per fork and 2 warm-up iterations. It uses the default iteration time of just one second. If you pass a name fragment, JMH will only run the benchmarks whose names match it. If you leave it out, it will run all benchmarks within the repository, which can take a considerable amount of time.

For more options you can run it as follows to print its help screen:

java -jar target/benchmarks.jar -h

Also, please note that since I’m using a SNAPSHOT version here, I’m not using the ND4J -platform artifact. For this reason, the jar will not work if you copy it to a machine with a different operating system or CPU architecture.

Conclusion

I’m very grateful to Dragan for conducting his benchmarks on both ND4J and Neanderthal. The investigation that they started has already borne fruit: we have found and fixed some issues, as you can see in the second benchmark.

And while the difference even in the first benchmark wasn’t as dramatic once the result array ordering is properly set, it has still shown that Neanderthal indeed has a very low overhead and that to get the full performance out of ND4J you should know what you are doing.

There are still some points where ND4J could lose some more overhead, and we are investigating them, so I’m looking forward to repeating these benchmarks as soon as we have them figured out as well.

June 26, 2018

Paul Dubs
