HashSpi allocates single byte arrays on the single byte update path
While benchmarking ACCP on code that is more or less:
MessageDigest digest = MessageDigest.getInstance("MD5");
for (<many iterations>)
{
...
digest.update(<single byte>);
}
I noticed that ACCP was generating significantly more garbage and was a lot slower than expected. I believe this is because TemplateHashSpi allocates a new single-byte array on the hot path of single-byte updates:
https://github.com/corretto/amazon-corretto-crypto-provider/blob/b204b018f6aa5d42b4fee0d0a94a93994bede081/template-src/com/amazon/corretto/crypto/provider/TemplateHashSpi.java#L119-L121
Since the Spi contract is inherently not thread-safe, performance on use cases such as the above could be improved significantly by caching a single-byte buffer, as TemplateHmacSpi does: https://github.com/corretto/amazon-corretto-crypto-provider/blob/b204b018f6aa5d42b4fee0d0a94a93994bede081/template-src/com/amazon/corretto/crypto/provider/TemplateHmacSpi.java#L299-L304
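For illustration, a minimal sketch of that pattern (the field and method shown here are illustrative, not the actual ACCP source):

    private final byte[] singleByteArray = new byte[1];

    @Override
    protected void engineUpdate(byte val)
    {
        // The SPI contract is single-threaded per instance, so reusing one
        // cached buffer avoids allocating a new byte[1] on every call.
        singleByteArray[0] = val;
        engineUpdate(singleByteArray, 0, 1);
    }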
An easy reproduction is to run the following with and without ACCP:
import java.security.MessageDigest;

public final class Test
{
    // Feeds the four big-endian bytes of val to the digest via single-byte updates.
    public static void updateWithInt(MessageDigest digest, int val)
    {
        digest.update((byte) ((val >>> 24) & 0xFF));
        digest.update((byte) ((val >>> 16) & 0xFF));
        digest.update((byte) ((val >>> 8) & 0xFF));
        digest.update((byte) ((val >>> 0) & 0xFF));
    }

    public static void main(String[] args) throws Exception
    {
        int numRounds = 100000000;
        if (args.length > 0) {
            numRounds = Integer.parseInt(args[0]);
        }

        System.out.println("Burn test of MD5");
        MessageDigest digest = MessageDigest.getInstance("MD5");
        System.out.println("Using Digest: " + digest.toString());

        long start = System.currentTimeMillis();
        for (int i = 0; i < numRounds; i++) {
            updateWithInt(digest, i);
        }
        long end = System.currentTimeMillis();

        System.out.println("Result: " + digest.digest());
        System.out.println("Time(ms): " + (end - start));
    }
}
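For the ACCP run below, the provider is registered via the -Djava.security.properties override file. A minimal override that puts ACCP first in the provider list might look like the following, per ACCP's installation instructions (illustrative; the actual bundled .security file may differ):

    security.provider.1=com.amazon.corretto.crypto.provider.AmazonCorrettoCryptoProvider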
time java -Djava.security.properties=/path/to/amazon-corretto-crypto-provider.security -cp AmazonCorrettoCryptoProvider-1.1.0-linux-x86_64.jar:. Test 100000000
Burn test of MD5
Using Digest: MD5 Message Digest from AmazonCorrettoCryptoProvider, <initialized>
Result: [B@6166e06f
Time(ms): 10358
java -cp AmazonCorrettoCryptoProvider-1.1.0-linux-x86_64.jar:. Test 100000000 12.91s user 0.25s system 117% cpu 11.227 total
vs
time java -cp . Test 100000000
Burn test of MD5
Using Digest: MD5 Message Digest from SUN, <initialized>
Result: [B@2a139a55
Time(ms): 3878
java -cp . Test 100000000 3.99s user 0.02s system 101% cpu 3.945 total
Also, using sjk (Swiss Java Knife) we can see that the Corretto version is allocating close to 900 MiB/s:
$ sjk ttop -o ALLOC -p $(pgrep -f Test)
2019-08-03T23:38:31.019-0700 Process summary
process cpu=107.39%
application cpu=101.04% (user=99.38% sys=1.65%)
other: cpu=6.35%
thread count: 12
GC time=0.23% (young=0.23%, old=0.00%)
heap allocation rate 842mb/s
safe point rate: 1.6 (events/s) avg. safe point pause: 1.65ms
safe point sync time: 0.01% processing time: 0.25% (wallclock time)
[000001] user=98.21% sys= 1.50% alloc= 842mb/s - main
[000016] user= 1.17% sys= 0.01% alloc= 324kb/s - RMI TCP Connection(1)-127.0.0.1
[000018] user= 0.00% sys= 0.13% alloc= 4461b/s - JMX server connection timeout 18
[000002] user= 0.00% sys= 0.00% alloc= 0b/s - Reference Handler
[000003] user= 0.00% sys= 0.00% alloc= 0b/s - Finalizer
[000004] user= 0.00% sys= 0.00% alloc= 0b/s - Signal Dispatcher
[000011] user= 0.00% sys= 0.00% alloc= 0b/s - ForkJoinPool.commonPool-worker-1
[000012] user= 0.00% sys= 0.00% alloc= 0b/s - ForkJoinPool.commonPool-worker-2
[000013] user= 0.00% sys= 0.01% alloc= 0b/s - Native reference cleanup thread
[000014] user= 0.00% sys= 0.00% alloc= 0b/s - Attach Listener
[000015] user= 0.00% sys= 0.00% alloc= 0b/s - RMI TCP Accept-0
[000017] user= 0.00% sys= 0.00% alloc= 0b/s - RMI Scheduler(0)
vs the JDK version that allocates basically nothing:
2019-08-03T23:39:41.936-0700 Process summary
process cpu=104.07%
application cpu=100.77% (user=100.42% sys=0.35%)
other: cpu=3.30%
thread count: 9
heap allocation rate 252kb/s
safe point rate: 0.8 (events/s) avg. safe point pause: 0.12ms
safe point sync time: 0.00% processing time: 0.01% (wallclock time)
[000013] user= 0.59% sys= 0.23% alloc= 248kb/s - RMI TCP Connection(1)-127.0.0.1
[000015] user= 0.00% sys= 0.04% alloc= 4257b/s - JMX server connection timeout 15
[000001] user=99.83% sys= 0.08% alloc= 0b/s - main
[000002] user= 0.00% sys= 0.00% alloc= 0b/s - Reference Handler
[000003] user= 0.00% sys= 0.00% alloc= 0b/s - Finalizer
[000004] user= 0.00% sys= 0.00% alloc= 0b/s - Signal Dispatcher
[000010] user= 0.00% sys= 0.00% alloc= 0b/s - Attach Listener
[000012] user= 0.00% sys= 0.00% alloc= 0b/s - RMI TCP Accept-0
[000014] user= 0.00% sys= 0.00% alloc= 0b/s - RMI Scheduler(0)
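As a rough sanity check on where that allocation is coming from (assuming roughly 24 bytes per byte[1] instance on a 64-bit HotSpot JVM, i.e. a 16-byte array header plus padding): 10^8 iterations x 4 single-byte updates x ~24 bytes is about 9.6 GB over the ~10.4 s run, on the order of 900 MB/s, which is consistent with the observed 842 MB/s allocation rate.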
JVM version information:
java -version
java version "1.8.0_201"
Java(TM) SE Runtime Environment (build 1.8.0_201-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.201-b09, mixed mode)
Top GitHub Comments
@SalusaSecondus thank you very much for the quick patch! I just rolled 1.1.1 out in our load-testing Cassandra clusters and am already seeing significant improvements. It appears we're reducing the on-CPU time of our digesting functions during quorum reads by up to 50% (going from 20% on-CPU time to 10% on-CPU time according to flamegraphs). I've also been able to enable AES-GCM without any noticeable increase in CPU load, which is an achievement in itself.

Version 1.1.1 released with this fix.