LoadingCache stops loading forever after race condition

The following code demonstrates how a LoadingCache stops loading forever if invalidateAll() is called during a load operation.

import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import com.google.common.base.Preconditions;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import com.google.common.util.concurrent.ThreadFactoryBuilder;

public class CacheBug {
    private static final ExecutorService EXECUTOR = Executors.newSingleThreadExecutor(new ThreadFactoryBuilder().setDaemon(true).build());

    static final LoadingCache<String, Long> THE_CACHE = CacheBuilder.newBuilder()
            .refreshAfterWrite(1, TimeUnit.MILLISECONDS)
            .build(new CacheLoader<String, Long>() {
                @Override
                public Long load(String s) throws InterruptedException {
                    Thread.sleep(200);
                    return System.currentTimeMillis();
                }
            });

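    public static void main(String[] args) throws ExecutionException, InterruptedException {
        // Initial load: populates the entry for "".
        THE_CACHE.getUnchecked("");
        Thread.sleep(100);
        // The entry is already stale (refreshAfterWrite is 1 ms), so this get()
        // starts a 200 ms refresh on the executor thread.
        EXECUTOR.submit(() -> THE_CACHE.get(""));
        for (int i = 0; i < 10; i++) {
            Thread.sleep(100);
            // The first invalidateAll() lands while that refresh is still running.
            THE_CACHE.invalidateAll();
            System.err.println("Current time is: " + THE_CACHE.get(""));
        }
        // Fails: the cache keeps returning the same stale value.
        Preconditions.checkArgument(System.currentTimeMillis() - THE_CACHE.get("") < 500);
    }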
}

Output:

Current time is: 1508359149314
Current time is: 1508359149217
Current time is: 1508359149217
Current time is: 1508359149217
Current time is: 1508359149217
Current time is: 1508359149217
Current time is: 1508359149217
Current time is: 1508359149217
Current time is: 1508359149217
Current time is: 1508359149217
Exception in thread "main" java.lang.IllegalArgumentException
	at com.google.common.base.Preconditions.checkArgument(Preconditions.java:120)
	at CacheBug.main(CacheBug.java:34)

The expected behaviour is that the time should increase (by about 300ms) in every iteration.

Issue Analytics

  • State: open
  • Created 6 years ago
  • Comments: 11 (8 by maintainers)

Top GitHub Comments

3 reactions
fdesu commented, Nov 25, 2017

@lowasser Roughly speaking, this is what happens:

  Main            + Pool-1-Thread-1
+-----------------------------------+
                  |
  count == 0      |
                  |
  get new value   |
                  |
  load new value..|
                  |
  store new value |
                  |
  count <= 1      |
+-----------------------------------+
                  |
                  |  check for value
                  |
                  |  value isn't null
                  |
                  |  refresh value
                  |
                  |  load new value..
+-----------------------------------+
                  |
  invalidate all  |
                  |
  count <= 0      |
                  |
+-----------------------------------+
                  |
  set up new head |
  in the bucket   |
                  |
  load new value..|
                  |
+-----------------------------------+
                  |
                  |  store ready value
                  |
                  |  value was set up
                  |  previously by main
                  |  w/ isLoading=>true
                  |
                  |
                  |  value is null =>
                  |  publish COLLECTED
                  |
                  |  oldValue.isActive
                  |  => true => this.count
                  |  remains constant
                  |
                  |  store new value
                  |
                  |  count <= 0
                  |
+------------------------------------+
                  |
  try to store    |
  ready value     |
                  |
  don't store it  |
                  |
  publish REPLACED|
                  |
  count remains   |
  constant = 0    |
                  +

To summarize (if I haven't made a mistake): while pool-1-thread-1 is loading a value, main calls invalidateAll, which sets this.count = 0. Main then tries to load a value; because the underlying entry in the table is gone, main initializes a new one and then sleeps for about 200 ms in the loader. pool-1-thread-1 wakes up and stores its value. In doing so it checks whether a bucket and entry already exist (they do, courtesy of main), although the value in that entry is still loading. pool-1-thread-1 replaces that value with the one it loaded and leaves this.count at 0 (it actually does --newCount and stores the result). Main then wakes up and tries to store its own value, but fails this condition:

if (oldValueReference == valueReference
    || (entryValue == null && valueReference != UNSET)) {

and so it updates neither the value nor this.count, which still equals zero. Main therefore publishes REPLACED and exits.

All subsequent loads fail just as main's did, and every invalidateAll attempt fails because this.count == 0.
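The interleaving can be replayed on a single thread with a toy model. This is only a sketch of the sequence described above, not Guava's code: ToySegmentReplay, Ref and storeLoaded are hypothetical names, the table is reduced to one slot, and the UNSET check and the missing-entry path are left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy, single-threaded replay of the interleaving above. NOT Guava's real code. */
class ToySegmentReplay {
    /** Hypothetical stand-in for a value reference. */
    static final class Ref {
        final Long value;     // value visible through this reference; null for a fresh load in flight
        final boolean active; // true if the reference carries a previously stored value
        Ref(Long value, boolean active) { this.value = value; this.active = active; }
    }

    Ref entry;  // the single table slot; null means "no entry"
    int count;  // the segment's live-entry counter
    final List<String> notifications = new ArrayList<>();

    /** Simplified store step (assumption: the UNSET check and missing-entry path are omitted). */
    boolean storeLoaded(Ref loaderRef, long newValue) {
        int newCount = count + 1;
        Long entryValue = entry.value;
        if (loaderRef == entry || entryValue == null) {
            if (loaderRef.active) {
                notifications.add(entryValue == null ? "COLLECTED" : "REPLACED");
                newCount--; // an active old value is replaced, so no net growth
            }
            entry = new Ref(newValue, true);
            count = newCount;
            return true;
        }
        notifications.add("REPLACED"); // the loaded value was clobbered
        return false;
    }

    /** Replays the steps from the diagram in order and returns the final state. */
    static ToySegmentReplay replay() {
        ToySegmentReplay s = new ToySegmentReplay();
        Ref first = new Ref(null, false);
        s.entry = first;
        s.storeLoaded(first, 1L);            // main: initial load, count becomes 1
        Ref refresh = new Ref(1L, true);     // pool thread: refresh wraps the old value
        s.entry = refresh;
        s.entry = null;                      // main: invalidateAll drops the entry...
        s.count = 0;                         // ...and resets the counter
        Ref mainLoad = new Ref(null, false);
        s.entry = mainLoad;                  // main: sets up a new head, starts loading
        s.storeLoaded(refresh, 2L);          // pool thread stores: COLLECTED, count stays 0
        s.storeLoaded(mainLoad, 3L);         // main stores: rejected, REPLACED
        return s;
    }

    public static void main(String[] args) {
        ToySegmentReplay s = replay();
        System.out.println("count=" + s.count + " value=" + s.entry.value + " " + s.notifications);
    }
}
```

In this model the replay ends with an entry present in the slot while count is 0, which is the inconsistent state the analysis points at.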

In my opinion there are two ways to fix this issue:

The first is fairly safe and the quickest, though I don't think it is the best one: just delete the condition

if (count != 0) { // read-volatile

at LocalCache.Segment#clear:3357, which prevents invalidateAll from clearing a segment that is broken in this way.

The second is to update LocalCache.Segment#storeLoadedValue:3284 and add something like the following, to catch the case where oldValueReference.isLoading == false && newValueReference.isLoading == true and we still want to update the value and increment this.count:

} else if (!oldValueReference.isActive() && valueReference.isActive()) {
    setValue(e, key, newValue, now);
    this.count = newCount;
    evictEntries(e);
    enqueueNotification(key, hash, newValue, oldValueReference.getWeight(), RemovalCause.REPLACED);
    return true;
}

I must admit that I like the second option much more (not that exact implementation, but the approach), because in my opinion it fixes the root cause: we fail to update a value in a bucket. However, it breaks the test LocalCacheTest.testSegmentStoreComputedValue, which covers the package-private storeLoadedValue, in the clobbered case. The first option is quite safe (I'm not taking into account the performance cost of invalidateAll clearing the internal table when count should be 0 and the table really is empty).
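To make the effect of the first option concrete, here is a minimal sketch of a segment whose counter has drifted to 0 while its table still holds an entry. ToyClearGuard and both method names are hypothetical and only mirror the shape of the guard; the real Segment#clear does more work than this.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a segment whose counter has drifted from its table. NOT Guava's real code. */
class ToyClearGuard {
    final Map<String, Long> table = new HashMap<>();
    int count; // live-entry counter, assumed here to have drifted to 0

    /** Mirrors the guarded shape: skip all work when the counter says "empty". */
    void guardedClear() {
        if (count != 0) {
            table.clear();
            count = 0;
        }
    }

    /** Option 1 in toy form: clear regardless of the counter. */
    void unconditionalClear() {
        table.clear();
        count = 0;
    }

    public static void main(String[] args) {
        ToyClearGuard s = new ToyClearGuard();
        s.table.put("", 1508359149217L); // stale entry left behind while count is still 0
        s.guardedClear();
        System.out.println("entries after guarded clear: " + s.table.size());
        s.unconditionalClear();
        System.out.println("entries after unconditional clear: " + s.table.size());
    }
}
```

The guarded version leaves the stale entry in place because it trusts the counter; the unconditional version removes it, at the cost of touching the table even when it really is empty.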

So, @lowasser, what do you think about that? Should I do a deeper dive? Which option do you prefer?

0 reactions
yingqin678 commented, Dec 7, 2018

@fdesu Thanks very much for the reply. I'm more curious about the original reason why it happens; let me think about it some more.
