
Cache performance

See original GitHub issue

Guava’s cache performance has a lot of potential for improvement, at the cost of a higher initial memory footprint. The benchmark, with the raw numbers below, shows Guava’s performance on a 16-core Xeon E5-2698B v3 @ 2.00GHz (hyperthreading disabled) running Ubuntu 15.04 with JDK 1.8.0_45.

Of note is that Guava’s default concurrency level (4) results in performance similar to a synchronized LinkedHashMap. This is due to the cache’s overhead, in particular GC thrashing from the ConcurrentLinkedQueue and a hot read counter. It should also be noted that synchronization was slow in JDK 5 when the cache was developed, but improved significantly in JDK 6u22 and beyond. This and other JVM improvements help LinkedHashMap’s performance compared to what was observed when Guava’s cache was being designed.
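For reference, the synchronized LinkedHashMap baseline can be sketched in a few lines of stdlib Java. The class name and sizing here are illustrative, not taken from the benchmark harness:

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache: a LinkedHashMap in access order, capped via
// removeEldestEntry and guarded by a single global lock.
class LruCache<K, V> extends LinkedHashMap<K, V> {
  private final int maxSize;

  LruCache(int maxSize) {
    super(16, 0.75f, /* accessOrder */ true);
    this.maxSize = maxSize;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    return size() > maxSize;  // evict the least-recently-used entry
  }

  static <K, V> Map<K, V> synchronizedLru(int maxSize) {
    return Collections.synchronizedMap(new LruCache<>(maxSize));
  }
}
```

Every read and write takes the one lock, which is exactly why this baseline scales poorly under contention, and why matching it is an unflattering result for a concurrent cache.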

The easiest way to double Guava’s performance would be to replace the recencyQueue and readCount with a ring buffer, swapping an unbounded linked queue for a 16-element array. When the buffer is full, a clean-up is triggered and subsequent additions are skipped (no CAS) until slots become available. This could be adapted from this implementation.
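A minimal sketch of such a lossy read buffer, assuming a 16-slot array drained under the eviction lock. The names and details here are hypothetical, not Guava’s actual fields; losing read events is acceptable because LRU reordering only needs a sample of accesses:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReferenceArray;
import java.util.function.Consumer;

// Bounded, lossy buffer of read events. When full, offers are dropped
// without any CAS retry, avoiding the GC and contention cost of an
// unbounded ConcurrentLinkedQueue.
class ReadBuffer<E> {
  static final int SIZE = 16;  // power of two
  static final int MASK = SIZE - 1;

  final AtomicReferenceArray<E> buffer = new AtomicReferenceArray<>(SIZE);
  final AtomicLong writeCounter = new AtomicLong();
  long readCounter;  // only advanced under the eviction lock

  /** Returns true if recorded, false if dropped (buffer full or race lost). */
  boolean offer(E e) {
    long head = readCounter;
    long tail = writeCounter.get();
    if (tail - head >= SIZE) {
      return false;  // full: drop the event; caller schedules a drain
    }
    if (writeCounter.compareAndSet(tail, tail + 1)) {
      buffer.lazySet((int) (tail & MASK), e);
      return true;
    }
    return false;  // lost the race: drop rather than spin on the CAS
  }

  /** Drains pending events; must be called under the eviction lock. */
  void drainTo(Consumer<E> action) {
    long tail = writeCounter.get();
    while (readCounter < tail) {
      int slot = (int) (readCounter & MASK);
      E e = buffer.get(slot);
      if (e == null) {
        break;  // writer claimed the slot but has not published yet
      }
      buffer.lazySet(slot, null);
      action.accept(e);
      readCounter++;
    }
  }
}
```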

Given the increase in server memory, it may also be time to reconsider the default concurrency level. The early design erred toward a low memory footprint, since most caches were not highly contended and the value could be adjusted as needed. As core counts and system memory have grown, the default is worth revisiting.

Read (100%)

| Cache | ops/s (8 threads) | ops/s (16 threads) |
| --- | ---: | ---: |
| Unbounded | | |
| ConcurrentHashMap (v8) | 560,367,163 | 1,171,389,095 |
| ConcurrentHashMap (v7) | 301,331,240 | 542,304,172 |
| Bounded | | |
| Caffeine | 181,703,298 | 365,508,966 |
| ConcurrentLinkedHashMap | 154,771,582 | 313,892,223 |
| LinkedHashMap_Lru | 9,209,065 | 13,598,576 |
| Guava (default) | 12,434,655 | 10,647,238 |
| Guava (64) | 24,533,922 | 43,101,468 |
| Ehcache2_Lru | 11,252,172 | 20,750,543 |
| Ehcache3_Lru | 11,415,248 | 17,611,169 |
| Infinispan_Old_Lru | 29,073,439 | 49,719,833 |
| Infinispan_New_Lru | 4,888,027 | 4,749,506 |

Read (75%) / Write (25%)

| Cache | ops/s (8 threads) | ops/s (16 threads) |
| --- | ---: | ---: |
| Unbounded | | |
| ConcurrentHashMap (v8) | 441,965,711 | 790,602,730 |
| ConcurrentHashMap (v7) | 196,215,481 | 346,479,582 |
| Bounded | | |
| Caffeine | 112,622,075 | 235,178,775 |
| ConcurrentLinkedHashMap | 63,968,369 | 122,342,605 |
| LinkedHashMap_Lru | 8,668,785 | 12,779,625 |
| Guava (default) | 11,782,063 | 11,886,673 |
| Guava (64) | 22,782,431 | 37,332,090 |
| Ehcache2_Lru | 9,472,810 | 8,471,016 |
| Ehcache3_Lru | 10,958,697 | 17,302,523 |
| Infinispan_Old_Lru | 22,663,359 | 37,270,102 |
| Infinispan_New_Lru | 4,753,313 | 4,885,061 |

Write (100%)

| Cache | ops/s (8 threads) | ops/s (16 threads) |
| --- | ---: | ---: |
| Unbounded | | |
| ConcurrentHashMap (v8) | 60,477,550 | 50,591,346 |
| ConcurrentHashMap (v7) | 46,204,091 | 36,659,485 |
| Bounded | | |
| Caffeine | 55,281,751 | 47,482,019 |
| ConcurrentLinkedHashMap | 23,819,597 | 39,797,969 |
| LinkedHashMap_Lru | 10,179,891 | 10,859,549 |
| Guava (default) | 4,764,056 | 5,446,282 |
| Guava (64) | 8,128,024 | 7,483,986 |
| Ehcache2_Lru | 4,205,936 | 4,697,745 |
| Ehcache3_Lru | 10,051,020 | 13,939,317 |
| Infinispan_Old_Lru | 7,538,859 | 7,332,973 |
| Infinispan_New_Lru | 4,797,502 | 5,086,305 |

Issue Analytics

  • State: closed
  • Created: 8 years ago
  • Reactions: 3
  • Comments: 8 (6 by maintainers)

Top GitHub Comments

8 reactions
ben-manes commented, Jun 5, 2015

A quick and dirty hack to Guava resulted in up to 25x read performance gain at the default concurrency level (4).

| Benchmark | ops/s (8 threads) | ops/s (16 threads) |
| --- | ---: | ---: |
| Read (100%) | 121,808,437 | 260,593,210 |
| Read (75%) / Write (25%) | 34,770,671 | 48,538,555 |
| Write (100%) | 5,064,388 | 6,045,980 |

We can increase the concurrency level (64) for better write throughput, but unsurprisingly this decreases the read throughput because CPU caching effects work against us. ConcurrentHashMap reads are faster with fewer segments (576M ops/s), and Guava’s recencyQueue is associated with the segment instead of the processor. Each queue fills up more slowly, so it is more often contended and less often in the L1/L2 cache. We still see a substantial gain, though.

| Benchmark | ops/s (8 threads) | ops/s (16 threads) |
| --- | ---: | ---: |
| Read (100%) | 40,935,985 | 84,109,039 |
| Read (75%) / Write (25%) | 40,854,227 | 64,462,500 |
| Write (100%) | 6,757,178 | 6,570,233 |
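The cache-line effects above are why striping by thread, rather than by segment, tends to scale better: each core mostly touches its own slot. A hedged sketch of the idea using a striped counter, where the stripe is picked by a hash of the current thread (the spreader constant and stripe count are illustrative, not Guava’s or Caffeine’s actual values):

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Counter striped across slots chosen by thread identity, so concurrent
// threads rarely contend on (or bounce) the same cache line. The same
// trick applies to read buffers: stripe by caller, not by hash segment.
class StripedCounter {
  static final int STRIPES = 16;  // power of two

  final AtomicLongArray counts = new AtomicLongArray(STRIPES);

  static int stripeFor(Thread t) {
    int h = (int) t.getId() * 0x9E3779B9;  // cheap hash spreader (assumed)
    return (h ^ (h >>> 16)) & (STRIPES - 1);
  }

  void increment() {
    counts.incrementAndGet(stripeFor(Thread.currentThread()));
  }

  long sum() {  // aggregate lazily; readers tolerate a racy snapshot
    long total = 0;
    for (int i = 0; i < STRIPES; i++) {
      total += counts.get(i);
    }
    return total;
  }
}
```

This is the same design pressure that led to java.util.concurrent.atomic.LongAdder, which additionally pads its cells to avoid false sharing.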

Because we are no longer thrashing on readCounter, the compute performance increases substantially as well. Here we use 32 threads with Guava at the default concurrency level.

| Cache | sameKey ops/s | spread ops/s |
| --- | ---: | ---: |
| ConcurrentHashMap (v8) | 30,321,032 | 64,427,143 |
| Guava (current) | 23,149,556 | 71,423,966 |
| Guava (patched) | 329,220,794 | 166,448,754 |
| Caffeine | 1,558,302,420 | 533,181,707 |

I have been advocating this improvement since we first began, originally assuming that we’d get to it in a performance iteration. After leaving I didn’t have a good threaded benchmark to demonstrate the improvements, so my hand-wavy assertions were ignored. There are still significant gains in moving to the Java 8 rewrite, but these changes are platform compatible and non-invasive.

I don’t know why I even bother anymore…

7 reactions
ben-manes commented, Feb 10, 2021

Closing since it seems unlikely to be resolved. This would be a simple change and a large speed-up, but it requires an internal champion.
