Cache performance
Guava’s cache performance has a lot of potential for improvement, at the cost of a higher initial memory footprint. The benchmark, with the raw numbers below, shows Guava’s performance on a 16-core Xeon E5-2698B v3 @ 2.00GHz (hyperthreading disabled) running Ubuntu 15.04 with JDK 1.8.0_45.
Of note is that Guava’s default concurrency level (4) results in performance similar to a synchronized LinkedHashMap. This is due to the cache’s overhead, in particular the GC thrashing from ConcurrentLinkedQueue and a hot read counter. It should also be noted that synchronization was slow in JDK 5 at the time the cache was developed, but improved significantly in JDK 6u22 and beyond. This and other JDK improvements help LinkedHashMap’s performance compared to what was observed when Guava’s cache was in development.
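For reference, a synchronized, access-ordered LinkedHashMap makes a serviceable LRU cache. This is presumably what the benchmark’s LinkedHashMap_Lru baseline looks like; the harness’s actual code is not shown here, so this is only a plausible sketch:

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

final class LruCaches {
  /**
   * An LRU cache built from an access-ordered LinkedHashMap, wrapped with
   * Collections.synchronizedMap for thread safety. Every operation takes
   * the same lock, which is why it scales poorly under contention.
   */
  static <K, V> Map<K, V> lruCache(int maxSize) {
    return Collections.synchronizedMap(new LinkedHashMap<K, V>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Evict the least-recently-accessed entry once the bound is exceeded.
        return size() > maxSize;
      }
    });
  }
}
```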
The easiest way to double Guava’s performance would be to replace the recencyQueue and readCount with a ring buffer, substituting a 16-element array for an unbounded linked queue. When the buffer is full, a clean-up is triggered and subsequent additions are skipped (no CAS) until slots become available. This could be adapted from this implementation.
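A minimal sketch of such a lossy ring buffer, assuming an illustrative name (ReadBuffer) and a single consumer draining under the eviction lock. This is not Guava’s or the linked implementation’s actual code, and it glosses over memory-visibility details a production version would need:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReferenceArray;
import java.util.function.Consumer;

/**
 * A lossy, fixed-size buffer for recording cache reads. When full, new
 * reads are dropped without a CAS, so readers never block on bookkeeping.
 */
final class ReadBuffer<E> {
  static final int SIZE = 16;        // small power-of-two capacity
  static final int MASK = SIZE - 1;

  final AtomicReferenceArray<E> buffer = new AtomicReferenceArray<>(SIZE);
  final AtomicInteger writeCount = new AtomicInteger();
  final AtomicInteger readCount = new AtomicInteger();  // drained position

  /** Records a read; returns true if the buffer is full and should be drained. */
  boolean record(E e) {
    int write = writeCount.get();
    if (write - readCount.get() >= SIZE) {
      return true;                   // full: skip entirely, no CAS attempted
    }
    if (writeCount.compareAndSet(write, write + 1)) {
      buffer.lazySet(write & MASK, e);
    }
    return (write + 1 - readCount.get()) >= SIZE;
  }

  /** Drains pending reads; assumed to run under the caller's eviction lock. */
  void drainTo(Consumer<E> action) {
    int read = readCount.get();
    int write = writeCount.get();
    for (; read < write; read++) {
      E e = buffer.get(read & MASK);
      if (e != null) {
        buffer.lazySet(read & MASK, null);
        action.accept(e);
      }
    }
    readCount.lazySet(read);         // frees the slots for subsequent records
  }
}
```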
Given the increase in server memory, it may also be time to reconsider the default concurrency level. The original choice was driven by memory-overhead concerns: most caches are not highly contended, and the value can be adjusted as needed. As core counts and system memory have grown, the default deserves another look.
Read (100%)
| Unbounded | ops/s (8 threads) | ops/s (16 threads) |
|---|---|---|
| ConcurrentHashMap (v8) | 560,367,163 | 1,171,389,095 |
| ConcurrentHashMap (v7) | 301,331,240 | 542,304,172 |
| **Bounded** | | |
| Caffeine | 181,703,298 | 365,508,966 |
| ConcurrentLinkedHashMap | 154,771,582 | 313,892,223 |
| LinkedHashMap_Lru | 9,209,065 | 13,598,576 |
| Guava (default) | 12,434,655 | 10,647,238 |
| Guava (64) | 24,533,922 | 43,101,468 |
| Ehcache2_Lru | 11,252,172 | 20,750,543 |
| Ehcache3_Lru | 11,415,248 | 17,611,169 |
| Infinispan_Old_Lru | 29,073,439 | 49,719,833 |
| Infinispan_New_Lru | 4,888,027 | 4,749,506 |
Read (75%) / Write (25%)
| Unbounded | ops/s (8 threads) | ops/s (16 threads) |
|---|---|---|
| ConcurrentHashMap (v8) | 441,965,711 | 790,602,730 |
| ConcurrentHashMap (v7) | 196,215,481 | 346,479,582 |
| **Bounded** | | |
| Caffeine | 112,622,075 | 235,178,775 |
| ConcurrentLinkedHashMap | 63,968,369 | 122,342,605 |
| LinkedHashMap_Lru | 8,668,785 | 12,779,625 |
| Guava (default) | 11,782,063 | 11,886,673 |
| Guava (64) | 22,782,431 | 37,332,090 |
| Ehcache2_Lru | 9,472,810 | 8,471,016 |
| Ehcache3_Lru | 10,958,697 | 17,302,523 |
| Infinispan_Old_Lru | 22,663,359 | 37,270,102 |
| Infinispan_New_Lru | 4,753,313 | 4,885,061 |
Write (100%)
| Unbounded | ops/s (8 threads) | ops/s (16 threads) |
|---|---|---|
| ConcurrentHashMap (v8) | 60,477,550 | 50,591,346 |
| ConcurrentHashMap (v7) | 46,204,091 | 36,659,485 |
| **Bounded** | | |
| Caffeine | 55,281,751 | 47,482,019 |
| ConcurrentLinkedHashMap | 23,819,597 | 39,797,969 |
| LinkedHashMap_Lru | 10,179,891 | 10,859,549 |
| Guava (default) | 4,764,056 | 5,446,282 |
| Guava (64) | 8,128,024 | 7,483,986 |
| Ehcache2_Lru | 4,205,936 | 4,697,745 |
| Ehcache3_Lru | 10,051,020 | 13,939,317 |
| Infinispan_Old_Lru | 7,538,859 | 7,332,973 |
| Infinispan_New_Lru | 4,797,502 | 5,086,305 |
Issue Analytics
- Created 8 years ago
- Reactions: 3
- Comments: 8 (6 by maintainers)
Top GitHub Comments
A quick and dirty hack to Guava resulted in up to a 25x read performance gain at the default concurrency level (4).
We can increase the concurrency level (64) for better write throughput, but unsurprisingly this decreases read throughput, because CPU caching effects work against us. ConcurrentHashMap reads are faster with fewer segments (576M ops/s), and Guava’s recencyQueue is associated with the segment instead of the processor. Each queue fills more slowly, so it is more often contended and less often resident in the L1/L2 cache. We still see a substantial gain, though.
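The alternative alluded to here is to associate buffers with threads rather than segments. A sketch of that striping idea, with an illustrative name (StripedReadBuffers) that is not Guava’s code; a ConcurrentLinkedQueue stands in per stripe for brevity, though in practice each stripe would be the ring buffer described above:

```java
import java.util.concurrent.ConcurrentLinkedQueue;

/**
 * Stripes read-recording buffers by thread instead of by map segment, so a
 * given core mostly writes to its own stripe, keeping it warm in that
 * core's L1/L2 cache and spreading contention across stripes.
 */
final class StripedReadBuffers<E> {
  final ConcurrentLinkedQueue<E>[] stripes;

  @SuppressWarnings("unchecked")
  StripedReadBuffers(int stripeCount) {  // stripeCount must be a power of two
    stripes = new ConcurrentLinkedQueue[stripeCount];
    for (int i = 0; i < stripeCount; i++) {
      stripes[i] = new ConcurrentLinkedQueue<>();
    }
  }

  void record(E e) {
    // Pick a stripe by thread id rather than by the entry's segment.
    int index = (int) Thread.currentThread().getId() & (stripes.length - 1);
    stripes[index].offer(e);
  }
}
```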
Because we are no longer thrashing on readCounter, the compute performance increases substantially as well. Here we use 32 threads with Guava at the default concurrency level.
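The hot-counter pathology described above is the same one the JDK’s LongAdder (Java 8) was designed to fix. A small demonstration, with an illustrative class name, of the pattern rather than Guava’s internals: instead of every reader CASing one shared AtomicLong and thrashing its cache line, each thread updates its own cell, and the cells are summed only when the total is read.

```java
import java.util.concurrent.atomic.LongAdder;

public class HotCounterDemo {
  /** Runs the given number of reader threads, each incrementing perThread times. */
  static long countReads(int threads, int perThread) throws InterruptedException {
    LongAdder readCount = new LongAdder();
    Thread[] readers = new Thread[threads];
    for (int i = 0; i < threads; i++) {
      readers[i] = new Thread(() -> {
        for (int j = 0; j < perThread; j++) {
          readCount.increment();  // updates a per-thread cell, not a shared CAS loop
        }
      });
      readers[i].start();
    }
    for (Thread t : readers) {
      t.join();
    }
    return readCount.sum();       // cells are aggregated only on read
  }
}
```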
I have been advocating this improvement since we first began, and originally assumed that we’d get to it in a performance iteration. After leaving I didn’t have a good multi-threaded benchmark to demonstrate the improvements, so my hand-wavy assertions were ignored. There are still significant gains to be had by moving to the Java 8 rewrite, but these changes are platform-compatible and non-invasive.
I don’t know why I even bother anymore…
Closing since it seems unlikely to be resolved. The change would be simple and a large speed-up, but it requires an internal champion.