Provide SSE-optimized JNI functions
Previously, I did some testing on performing matrix calculations in native functions using SSE/AVX instructions: https://github.com/JOML-CI/JOML/issues/30
That turned out to be slower than the same calculations done with standard scalar arithmetic in Java. The cause was the approach of “batching” all operations. For batching to work, the operands of each method invocation, as well as the opcode of each operation, had to be stored in native memory so that the native function could read, decode, and execute them. Storing and reading those opcodes and operands was the major bottleneck.
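To illustrate where that overhead comes from, here is a minimal sketch of such a batch interpreter in C. The opcode set (OP_IDENTITY, OP_MUL, OP_END), the int32 command stream, and the indexing into a matrix pool are all hypothetical and not the format used in issue #30. The point is that every operation must first be encoded into the buffer and then decoded again by a `switch` before any arithmetic happens.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes for illustration only. */
enum { OP_END = 0, OP_IDENTITY = 1, OP_MUL = 2 };

/* Column-major 4x4 matrix: element (row r, column c) is stored at m[c*4 + r]. */
typedef struct { float m[16]; } mat4;

/* Scalar dest = a * b; uses a temporary so dest may alias a or b. */
static void mat4_mul(const mat4 *a, const mat4 *b, mat4 *dest) {
    mat4 tmp;
    for (int c = 0; c < 4; c++)
        for (int r = 0; r < 4; r++) {
            float s = 0.0f;
            for (int k = 0; k < 4; k++)
                s += a->m[k * 4 + r] * b->m[c * 4 + k];
            tmp.m[c * 4 + r] = s;
        }
    *dest = tmp;
}

/* Decodes and executes commands until OP_END (or an unknown opcode).
   Operands are indices into `pool`. Returns the number of executed commands. */
static size_t run_batch(const int32_t *code, mat4 *pool) {
    size_t executed = 0;
    size_t pc = 0;
    for (;;) {
        switch (code[pc++]) {
        case OP_END:
            return executed;
        case OP_IDENTITY: {
            mat4 *d = &pool[code[pc++]];
            for (int i = 0; i < 16; i++)
                d->m[i] = (i % 5 == 0) ? 1.0f : 0.0f; /* indices 0, 5, 10, 15 */
            break;
        }
        case OP_MUL: {
            const mat4 *a = &pool[code[pc]];
            const mat4 *b = &pool[code[pc + 1]];
            mat4 *d = &pool[code[pc + 2]];
            pc += 3;
            mat4_mul(a, b, d);
            break;
        }
        default:
            return executed; /* unknown opcode: stop decoding */
        }
        executed++;
    }
}
```

For example, the stream `{ OP_IDENTITY, 0, OP_MUL, 0, 1, 2, OP_END }` sets pool[0] to identity and stores pool[0] * pool[1] into pool[2]. The encoding and decoding steps are pure overhead compared to calling the arithmetic directly.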
Now there is another, more promising approach: instead of batching the operations, simply call a JNI function directly that does the job with optimized SSE instructions. Initial testing showed major throughput gains, from about 1.46x (testMulAffine) to about 2.03x (testMul). See the JMH results below:
joml-array (using a float[16]):
Benchmark                        Mode  Cnt         Score          Error  Units
Matrix4fBenchmarks.testInvert    thrpt    3  24760865.260 ± 910609.284  ops/s
Matrix4fBenchmarks.testMul       thrpt    3  34555251.163 ± 183270.652  ops/s
Matrix4fBenchmarks.testMulAffine thrpt    3  52189265.415 ± 622020.725  ops/s
joml-jni (using native memory and JNI functions):
Benchmark                        Mode  Cnt         Score          Error  Units
Matrix4fBenchmarks.testInvert    thrpt    3  36367075.182 ± 283999.981  ops/s
Matrix4fBenchmarks.testMul       thrpt    3  70239126.361 ±  27891.033  ops/s
Matrix4fBenchmarks.testMulAffine thrpt    3  76090662.949 ± 342632.179  ops/s
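For a sense of what such a directly called native function computes, here is a minimal sketch of an SSE 4x4 matrix multiplication in C. It assumes column-major storage (element at row r, column c stored at m[c*4 + r]) and uses unaligned loads; the name mat4_mul_sse is hypothetical and this is not the code from the jni branch. A JNI entry point would presumably receive the native addresses of the matrices (for example as jlong values) and cast them to float pointers before calling it; that wrapper is left out so the sketch compiles without jni.h.

```c
#include <xmmintrin.h>

/* dest = a * b for column-major 4x4 float matrices.
   Each result column j is a0*b[4j] + a1*b[4j+1] + a2*b[4j+2] + a3*b[4j+3],
   where a0..a3 are the columns of a held in SSE registers.
   All columns are computed before storing, so dest may alias a or b. */
static void mat4_mul_sse(const float *a, const float *b, float *dest) {
    const __m128 a0 = _mm_loadu_ps(a + 0);
    const __m128 a1 = _mm_loadu_ps(a + 4);
    const __m128 a2 = _mm_loadu_ps(a + 8);
    const __m128 a3 = _mm_loadu_ps(a + 12);
    __m128 cols[4];
    for (int j = 0; j < 4; j++) {
        const float *bj = b + 4 * j;
        /* Broadcast each entry of b's column j and accumulate the weighted columns of a. */
        __m128 r = _mm_mul_ps(a0, _mm_set1_ps(bj[0]));
        r = _mm_add_ps(r, _mm_mul_ps(a1, _mm_set1_ps(bj[1])));
        r = _mm_add_ps(r, _mm_mul_ps(a2, _mm_set1_ps(bj[2])));
        r = _mm_add_ps(r, _mm_mul_ps(a3, _mm_set1_ps(bj[3])));
        cols[j] = r;
    }
    for (int j = 0; j < 4; j++)
        _mm_storeu_ps(dest + 4 * j, cols[j]);
}
```

Because each result column is a weighted sum of a's columns, the formulation needs only broadcasts, multiplies, and adds, with no horizontal adds or shuffles of a.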
Work on intrinsifying all heavy/important JOML methods has started in the jni branch, which is based on the array branch.
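As an example of a specialized method that benefits from intrinsification, here is a hedged C sketch of an SSE affine multiplication in the spirit of mulAffine. It assumes both inputs are affine column-major matrices, i.e. their last row is (0, 0, 0, 1); the name mat4_mul_affine_sse is hypothetical. That assumption lets it skip one multiply per column compared to the general product.

```c
#include <xmmintrin.h>

/* dest = a * b, assuming a and b are affine column-major 4x4 matrices
   (last row (0, 0, 0, 1)). The w-lane of a's first three columns is 0 and
   of a's fourth column is 1, so the result is affine as well.
   All columns are computed before storing, so dest may alias a or b. */
static void mat4_mul_affine_sse(const float *a, const float *b, float *dest) {
    const __m128 a0 = _mm_loadu_ps(a + 0);
    const __m128 a1 = _mm_loadu_ps(a + 4);
    const __m128 a2 = _mm_loadu_ps(a + 8);
    const __m128 a3 = _mm_loadu_ps(a + 12);
    __m128 cols[4];
    for (int j = 0; j < 4; j++) {
        const float *bj = b + 4 * j;
        /* bj[3] is 0 for j < 3 and 1 for j == 3, so the a3 term is not multiplied here. */
        __m128 r = _mm_mul_ps(a0, _mm_set1_ps(bj[0]));
        r = _mm_add_ps(r, _mm_mul_ps(a1, _mm_set1_ps(bj[1])));
        r = _mm_add_ps(r, _mm_mul_ps(a2, _mm_set1_ps(bj[2])));
        cols[j] = r;
    }
    cols[3] = _mm_add_ps(cols[3], a3); /* translation column: + a3 * 1 */
    for (int j = 0; j < 4; j++)
        _mm_storeu_ps(dest + 4 * j, cols[j]);
}
```

If either input is not affine, the result is wrong, which is why such a method would be a separate entry point rather than a replacement for the general multiplication.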
Issue created 7 years ago.
Top GitHub Comments
Sure thing! By the way: these are the only methods that actually benefit from native functions. For every other method, the JNI overhead of 19-22 clock cycles (on a 64-bit “server” HotSpot JVM, measured with the RDTSC instruction and an empty JNI function) is simply too high, so the pure Java version ends up faster. If it weren't for JNI's overhead, hand-written SIMD native code would outrun every Java method. Waiting for Project Panama…
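The RDTSC measurement technique mentioned above can be sketched in C as follows. This is an assumption-laden illustration, not the original measurement harness: it times a plain, non-inlined native call (GCC/Clang, x86 only), so it gives a baseline for call cost rather than the JNI transition itself, which has to be measured from the Java side. The names empty_call and min_call_cycles are hypothetical. A more careful version would add serializing instructions such as `_mm_lfence()` around the reads, and on CPUs with an invariant TSC the counter ticks at a fixed reference rate rather than the current core clock.

```c
#include <stdint.h>
#include <x86intrin.h>

/* An empty function the compiler must not inline, so the call itself is timed. */
__attribute__((noinline)) static void empty_call(void) {
    __asm__ volatile("" ::: "memory");
}

/* Returns the smallest RDTSC delta observed around one call of fn.
   Taking the minimum over many runs filters out interrupts and cache misses. */
static uint64_t min_call_cycles(void (*fn)(void), int iterations) {
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < iterations; i++) {
        uint64_t start = __rdtsc();
        fn();
        uint64_t end = __rdtsc();
        if (end - start < best)
            best = end - start;
    }
    return best;
}
```

Calling `min_call_cycles(empty_call, 100000)` yields the baseline; the same min-of-many pattern applied to an empty JNI function would isolate the additional JNI cost.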
Here are some benchmark results on an i7 with JDK 1.8.0_92:
“Pojo” means the normal Matrix4f Java version with the 16 primitive float fields, “Unsafe” means that sun.misc.Unsafe was used for faster copying, “SSE” means a JNI function using 128-bit x86 SSE, and “AVX” means a JNI function using 128-bit x86 AVX1.