Provide SSE-optimized JNI functions

Previously, I did some testing on doing matrix calculations in native functions via SSE/AVX instructions: https://github.com/JOML-CI/JOML/issues/30

That turned out to be slower than doing the calculations with standard scalar arithmetic in Java. The cause was the approach of “batching” all operations: for batching to work, the operands of each method invocation, as well as the opcode of each operation, had to be stored in native memory for the native function to read, decode and execute. Storing and reading those opcodes and operands was the major bottleneck.

Now there is another, more promising approach: instead of batching the operations, call a JNI function directly for each operation and let it do the work with optimized SSE instructions. Initial testing showed major performance increases. See the JMH results below:

joml-array (using a float[16]):

Benchmark                             Mode  Cnt          Score         Error  Units
Matrix4fBenchmarks.testInvert        thrpt    3   24760865,260 ±  910609,284  ops/s
Matrix4fBenchmarks.testMul           thrpt    3   34555251,163 ±  183270,652  ops/s
Matrix4fBenchmarks.testMulAffine     thrpt    3   52189265,415 ±  622020,725  ops/s
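
For orientation, here is a minimal sketch of the kind of scalar multiplication a float[16] representation implies. It is not taken from the array branch: the class name ScalarMat4, the method signature and the column-major layout (element (row, col) at index col * 4 + row) are all assumptions made for illustration.

```java
/** Illustrative only: a scalar 4x4 multiply on float[16] arrays (hypothetical helper, not JOML code). */
final class ScalarMat4 {
    private ScalarMat4() {}

    /**
     * Computes dest = a * b.
     * Assumes column-major storage, i.e. element (row, col) lives at index col * 4 + row.
     * dest must not be the same array as a or b, because results are written while inputs are still read.
     */
    static void mul(float[] a, float[] b, float[] dest) {
        for (int col = 0; col < 4; col++) {
            for (int row = 0; row < 4; row++) {
                float sum = 0.0f;
                for (int k = 0; k < 4; k++) {
                    // a(row, k) * b(k, col)
                    sum += a[k * 4 + row] * b[col * 4 + k];
                }
                dest[col * 4 + row] = sum;
            }
        }
    }
}
```

Each output element needs four multiplies and three adds, so a full product is 64 multiplies and 48 adds. That is the work a SIMD kernel can spread across 4-wide float registers.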

joml-jni (using native memory and JNI functions):

Benchmark                             Mode  Cnt          Score          Error  Units
Matrix4fBenchmarks.testInvert        thrpt    3   36367075,182 ±   283999,981  ops/s
Matrix4fBenchmarks.testMul           thrpt    3   70239126,361 ±    27891,033  ops/s
Matrix4fBenchmarks.testMulAffine     thrpt    3   76090662,949 ±   342632,179  ops/s
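
To put the two tables side by side, the following sketch divides the joml-jni throughput by the joml-array throughput for each benchmark. The scores are copied from the tables above (with the decimal comma written as a decimal point); the class and method names are made up for this example.

```java
import java.util.Locale;

/** Illustrative only: relative throughput of joml-jni over joml-array, from the JMH scores above. */
final class JniSpeedup {
    private JniSpeedup() {}

    /** Ratio of two throughput scores (ops/s); values above 1 mean the candidate is faster. */
    static double speedup(double baselineOpsPerSec, double candidateOpsPerSec) {
        return candidateOpsPerSec / baselineOpsPerSec;
    }

    /** Converts a throughput score (ops/s) into average time per operation in nanoseconds. */
    static double nanosPerOp(double opsPerSec) {
        return 1e9 / opsPerSec;
    }

    public static void main(String[] args) {
        String[] names = {"testInvert", "testMul", "testMulAffine"};
        double[] array = {24760865.260, 34555251.163, 52189265.415}; // joml-array scores
        double[] jni   = {36367075.182, 70239126.361, 76090662.949}; // joml-jni scores
        for (int i = 0; i < names.length; i++) {
            System.out.printf(Locale.ROOT, "%-14s %.2fx (%.1f ns/op -> %.1f ns/op)%n",
                    names[i], speedup(array[i], jni[i]),
                    nanosPerOp(array[i]), nanosPerOp(jni[i]));
        }
    }
}
```

By this measure testMul roughly doubles in throughput, while testInvert and testMulAffine gain a little under 1.5x. With only three iterations per benchmark (Cnt 3), these ratios are best read as rough indications.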

Work on intrinsifying all heavy/important JOML methods has started in the jni branch based off the array branch.

httpdigest commented, May 10, 2016

Sure thing! Btw.: these are the only methods that actually benefit from native functions. For every other method, the JNI overhead of 19-22 clock cycles (on a 64-bit “server” HotSpot JVM, measured with the RDTSC instruction and an empty JNI function) is just too high, so the pure Java version ends up faster. If it weren’t for JNI’s overhead, hand-written SIMD native code would outrun every Java method. Waiting for Project Panama…
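
A rough way to see why cheap methods lose: convert the 19-22 cycle overhead into nanoseconds and add it to the native kernel's own cost. The sketch below does that under a stated assumption: the 3.0 GHz clock is a made-up value, since the comment does not give the CPU frequency. The class and its methods are likewise hypothetical.

```java
import java.util.Locale;

/** Illustrative only: back-of-the-envelope break-even check for calling into JNI. */
final class JniOverheadEstimate {
    private JniOverheadEstimate() {}

    /** Converts clock cycles to nanoseconds; 1 GHz corresponds to one cycle per nanosecond. */
    static double cyclesToNanos(double cycles, double clockGhz) {
        return cycles / clockGhz;
    }

    /** The native path only pays off if kernel time plus call overhead beats the pure Java time. */
    static boolean nativeWins(double javaNs, double nativeKernelNs, double jniOverheadNs) {
        return nativeKernelNs + jniOverheadNs < javaNs;
    }

    public static void main(String[] args) {
        double assumedClockGhz = 3.0; // ASSUMPTION: not stated in the original comment
        double low = cyclesToNanos(19, assumedClockGhz);
        double high = cyclesToNanos(22, assumedClockGhz);
        System.out.printf(Locale.ROOT, "JNI call overhead: %.2f to %.2f ns at %.1f GHz%n",
                low, high, assumedClockGhz);
    }
}
```

Under that assumed clock, the call overhead alone is several nanoseconds, so a method whose pure Java version already finishes in a handful of nanoseconds has little room left for a native kernel to win.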

httpdigest commented, May 9, 2016

Here are some benchmark results on an i7 with JDK 1.8.0_92:

Benchmark         Mode  Cnt   Score   Error  Units
CopyPojo          avgt    5   8,549 ± 0,525  ns/op
CopyUnsafe        avgt    5   7,222 ± 0,171  ns/op
IdentityPojo      avgt    5   6,111 ± 0,270  ns/op
IdentityUnsafe    avgt    5   5,427 ± 0,096  ns/op
InvertAVX         avgt    5  24,900 ± 5,570  ns/op
InvertAffinePojo  avgt    5  22,575 ± 0,304  ns/op
InvertPojo        avgt    5  38,973 ± 0,784  ns/op
InvertSSE         avgt    5  25,105 ± 0,645  ns/op
MulAVX            avgt    5  13,695 ± 0,727  ns/op
MulAffineAVX      avgt    5  11,378 ± 0,252  ns/op
MulAffinePojo     avgt    5  16,445 ± 0,138  ns/op
MulAffineSSE      avgt    5  13,002 ± 3,352  ns/op
MulPojo           avgt    5  25,888 ± 0,344  ns/op
MulSSE            avgt    5  15,251 ± 0,432  ns/op
ZeroPojo          avgt    5   6,186 ± 0,348  ns/op
ZeroUnsafe        avgt    5   5,060 ± 0,080  ns/op

Pojo means the normal Matrix4f Java version with its 16 primitive float fields. Unsafe means that sun.misc.Unsafe was used for faster copying. SSE means a JNI function using x86 SSE (128-bit), and AVX means a JNI function using x86 AVX1 (128-bit).