Cache performance
Guava’s cache performance has a lot of potential for improvement, at the cost of a higher initial memory footprint. The benchmark, with the raw numbers below, shows Guava’s performance on a 16-core Xeon E5-2698B v3 @ 2.00GHz (hyperthreading disabled) running Ubuntu 15.04 with JDK 1.8.0_45.
Of note is that Guava’s default concurrency level (4) results in performance similar to a synchronized LinkedHashMap. This is due to the cache’s overhead, in particular the GC thrashing from ConcurrentLinkedQueue and a hot read counter. It should also be noted that synchronization was slow in JDK 5 at the time the cache was developed, but improved significantly in JDK 6u22 and beyond. This and other JDK improvements help LinkedHashMap’s performance compared to what was observed when Guava’s cache was in development.
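For reference, a synchronized, access-ordered LinkedHashMap makes a serviceable LRU cache. This is presumably what the benchmark’s LinkedHashMap_Lru baseline looks like; the harness’s actual code is not shown here, so this is only a plausible sketch:

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

final class LruCaches {
  /**
   * An LRU cache built from an access-ordered LinkedHashMap, wrapped with
   * Collections.synchronizedMap for thread safety. Every operation takes
   * the same lock, which is why it scales poorly under contention.
   */
  static <K, V> Map<K, V> lruCache(int maxSize) {
    return Collections.synchronizedMap(new LinkedHashMap<K, V>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Evict the least-recently-accessed entry once the bound is exceeded.
        return size() > maxSize;
      }
    });
  }
}
```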
The easiest way to double Guava’s performance would be to replace the recencyQueue and readCount with a ring buffer, substituting a 16-element array for an unbounded linked queue. When the buffer is full, a clean-up is triggered and subsequent additions are skipped (no CAS) until slots become available. This could be adapted from this implementation.
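A minimal sketch of such a lossy ring buffer, assuming an illustrative name (ReadBuffer) and a single consumer draining under the eviction lock. This is not Guava’s or the linked implementation’s actual code, and it glosses over memory-visibility details a production version would need:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReferenceArray;
import java.util.function.Consumer;

/**
 * A lossy, fixed-size buffer for recording cache reads. When full, new
 * reads are dropped without a CAS, so readers never block on bookkeeping.
 */
final class ReadBuffer<E> {
  static final int SIZE = 16;        // small power-of-two capacity
  static final int MASK = SIZE - 1;

  final AtomicReferenceArray<E> buffer = new AtomicReferenceArray<>(SIZE);
  final AtomicInteger writeCount = new AtomicInteger();
  final AtomicInteger readCount = new AtomicInteger();  // drained position

  /** Records a read; returns true if the buffer is full and should be drained. */
  boolean record(E e) {
    int write = writeCount.get();
    if (write - readCount.get() >= SIZE) {
      return true;                   // full: skip entirely, no CAS attempted
    }
    if (writeCount.compareAndSet(write, write + 1)) {
      buffer.lazySet(write & MASK, e);
    }
    return (write + 1 - readCount.get()) >= SIZE;
  }

  /** Drains pending reads; assumed to run under the caller's eviction lock. */
  void drainTo(Consumer<E> action) {
    int read = readCount.get();
    int write = writeCount.get();
    for (; read < write; read++) {
      E e = buffer.get(read & MASK);
      if (e != null) {
        buffer.lazySet(read & MASK, null);
        action.accept(e);
      }
    }
    readCount.lazySet(read);         // frees the slots for subsequent records
  }
}
```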
Given the increase in server memory, it may also be time to reconsider the default concurrency level. The original choice was driven by memory-overhead concerns: most caches are not highly contended, and the value can be adjusted as needed. As core counts and system memory have grown, the default deserves another look.
Read (100%)
| Unbounded | ops/s (8 threads) | ops/s (16 threads) |
|---|---|---|
| ConcurrentHashMap (v8) | 560,367,163 | 1,171,389,095 |
| ConcurrentHashMap (v7) | 301,331,240 | 542,304,172 |
| **Bounded** | | |
| Caffeine | 181,703,298 | 365,508,966 |
| ConcurrentLinkedHashMap | 154,771,582 | 313,892,223 |
| LinkedHashMap_Lru | 9,209,065 | 13,598,576 |
| Guava (default) | 12,434,655 | 10,647,238 |
| Guava (64) | 24,533,922 | 43,101,468 |
| Ehcache2_Lru | 11,252,172 | 20,750,543 |
| Ehcache3_Lru | 11,415,248 | 17,611,169 |
| Infinispan_Old_Lru | 29,073,439 | 49,719,833 |
| Infinispan_New_Lru | 4,888,027 | 4,749,506 |
Read (75%) / Write (25%)
| Unbounded | ops/s (8 threads) | ops/s (16 threads) |
|---|---|---|
| ConcurrentHashMap (v8) | 441,965,711 | 790,602,730 |
| ConcurrentHashMap (v7) | 196,215,481 | 346,479,582 |
| **Bounded** | | |
| Caffeine | 112,622,075 | 235,178,775 |
| ConcurrentLinkedHashMap | 63,968,369 | 122,342,605 |
| LinkedHashMap_Lru | 8,668,785 | 12,779,625 |
| Guava (default) | 11,782,063 | 11,886,673 |
| Guava (64) | 22,782,431 | 37,332,090 |
| Ehcache2_Lru | 9,472,810 | 8,471,016 |
| Ehcache3_Lru | 10,958,697 | 17,302,523 |
| Infinispan_Old_Lru | 22,663,359 | 37,270,102 |
| Infinispan_New_Lru | 4,753,313 | 4,885,061 |
Write (100%)
| Unbounded | ops/s (8 threads) | ops/s (16 threads) |
|---|---|---|
| ConcurrentHashMap (v8) | 60,477,550 | 50,591,346 |
| ConcurrentHashMap (v7) | 46,204,091 | 36,659,485 |
| **Bounded** | | |
| Caffeine | 55,281,751 | 47,482,019 |
| ConcurrentLinkedHashMap | 23,819,597 | 39,797,969 |
| LinkedHashMap_Lru | 10,179,891 | 10,859,549 |
| Guava (default) | 4,764,056 | 5,446,282 |
| Guava (64) | 8,128,024 | 7,483,986 |
| Ehcache2_Lru | 4,205,936 | 4,697,745 |
| Ehcache3_Lru | 10,051,020 | 13,939,317 |
| Infinispan_Old_Lru | 7,538,859 | 7,332,973 |
| Infinispan_New_Lru | 4,797,502 | 5,086,305 |
Issue Analytics
- Created 8 years ago
- Reactions: 3
- Comments: 8 (6 by maintainers)
Top GitHub Comments
A quick and dirty hack to Guava resulted in up to a 25x read performance gain at the default concurrency level (4).
We can increase the concurrency level (64) for better write throughput, but unsurprisingly this decreases read throughput, because CPU caching effects work against us. ConcurrentHashMap reads are faster with fewer segments (576M ops/s), and Guava’s recencyQueue is associated with the segment instead of the processor. Each queue fills more slowly, so it is more often contended and less often resident in the L1/L2 cache. We still see a substantial gain, though.
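The alternative alluded to here is to associate buffers with threads rather than segments. A sketch of that striping idea, with an illustrative name (StripedReadBuffers) that is not Guava’s code; a ConcurrentLinkedQueue stands in per stripe for brevity, though in practice each stripe would be the ring buffer described above:

```java
import java.util.concurrent.ConcurrentLinkedQueue;

/**
 * Stripes read-recording buffers by thread instead of by map segment, so a
 * given core mostly writes to its own stripe, keeping it warm in that
 * core's L1/L2 cache and spreading contention across stripes.
 */
final class StripedReadBuffers<E> {
  final ConcurrentLinkedQueue<E>[] stripes;

  @SuppressWarnings("unchecked")
  StripedReadBuffers(int stripeCount) {  // stripeCount must be a power of two
    stripes = new ConcurrentLinkedQueue[stripeCount];
    for (int i = 0; i < stripeCount; i++) {
      stripes[i] = new ConcurrentLinkedQueue<>();
    }
  }

  void record(E e) {
    // Pick a stripe by thread id rather than by the entry's segment.
    int index = (int) Thread.currentThread().getId() & (stripes.length - 1);
    stripes[index].offer(e);
  }
}
```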
Because we are no longer thrashing on readCounter, the compute performance increases substantially as well. Here we use 32 threads with Guava at the default concurrency level.
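The hot-counter pathology described above is the same one the JDK’s LongAdder (Java 8) was designed to fix. A small demonstration, with an illustrative class name, of the pattern rather than Guava’s internals: instead of every reader CASing one shared AtomicLong and thrashing its cache line, each thread updates its own cell, and the cells are summed only when the total is read.

```java
import java.util.concurrent.atomic.LongAdder;

public class HotCounterDemo {
  /** Runs the given number of reader threads, each incrementing perThread times. */
  static long countReads(int threads, int perThread) throws InterruptedException {
    LongAdder readCount = new LongAdder();
    Thread[] readers = new Thread[threads];
    for (int i = 0; i < threads; i++) {
      readers[i] = new Thread(() -> {
        for (int j = 0; j < perThread; j++) {
          readCount.increment();  // updates a per-thread cell, not a shared CAS loop
        }
      });
      readers[i].start();
    }
    for (Thread t : readers) {
      t.join();
    }
    return readCount.sum();       // cells are aggregated only on read
  }
}
```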
I have been advocating this improvement since we first began, and originally assumed that we’d get to it in a performance iteration. After leaving I didn’t have a good multi-threaded benchmark to demonstrate the improvements, so my hand-wavy assertions were ignored. There are still significant gains to be had by moving to the Java 8 rewrite, but these changes are platform-compatible and non-invasive.
I don’t know why I even bother anymore…
Closing since it seems unlikely to be resolved. The change would be simple and a large speed-up, but it requires an internal champion.