RedisCommandTimeoutException on AWS ElastiCache Cluster


I get random timeouts in a Java application that uses Spring Data Redis 2.2.4 (Lettuce 5.2.1). I use Redis as the cache layer of a RESTful API server, and sometimes a command times out. On the Redis side I have enabled the Redis slow log, but all queries complete in under 10 milliseconds. The AWS ElastiCache cluster consists of 3 shards with 2 replicas each, with Cluster Mode enabled (9 nodes in total, m5.large). On the application side, a Spring task periodically runs SCAN over the cache and requests the TTL and IDLETIME of some elements, because I have implemented a refresh-ahead algorithm that refreshes cache values in the background.

I tried increasing ioThreadPoolSize and computationThreadPoolSize to 16 instead of 3. The timeouts decreased but are still present.

This is the code of LettuceClientConfigurationBuilderCustomizer:


    @Value("${spring.redis.custom.cluster.enableAdaptiveRefresh:true}")
    private boolean enableAdaptiveRefresh;

    @Value("${spring.redis.custom.cluster.enableDynamicRefreshSources:true}")
    private boolean enableDynamicRefreshSources;

    @Value("${spring.redis.custom.cluster.enableSuspendReconnectOnProtocolFailure:false}")
    private boolean enableSuspendReconnectOnProtocolFailure;

    @Value("${spring.redis.custom.cluster.enableCancelCommandsOnReconnectFailure:true}")
    private boolean enableCancelCommandsOnReconnectFailure;

    @Value("${spring.redis.custom.cluster.ioThreadPoolSize:16}")
    private int ioThreadPoolSize;

    @Value("${spring.redis.custom.cluster.computationThreadPoolSize:16}")
    private int computationThreadPoolSize;

    public LettuceClientConfigurationBuilderCustomizer customizer() {

        // Cluster topology refresh: adaptive triggers and dynamic refresh sources are configurable.
        ClusterTopologyRefreshOptions.Builder clusterTopologyRefreshOptionsBuilder = ClusterTopologyRefreshOptions.builder();

        if (enableAdaptiveRefresh) {
            clusterTopologyRefreshOptionsBuilder.enableAllAdaptiveRefreshTriggers();
        }
        clusterTopologyRefreshOptionsBuilder.dynamicRefreshSources(enableDynamicRefreshSources);
        ClusterTopologyRefreshOptions clusterTopologyRefreshOptions = clusterTopologyRefreshOptionsBuilder.build();

        // Client options: reconnect behavior plus the topology refresh settings built above.
        ClusterClientOptions clusterClientOptions = ClusterClientOptions.builder()
                .suspendReconnectOnProtocolFailure(enableSuspendReconnectOnProtocolFailure)
                .cancelCommandsOnReconnectFailure(enableCancelCommandsOnReconnectFailure)
                .topologyRefreshOptions(clusterTopologyRefreshOptions).build();

        // Client resources: thread pool sizes and a JNDI-based (DirContext) DNS resolver.
        DefaultClientResources.Builder defaultClientResourcesBuilder = DefaultClientResources.builder()
                .ioThreadPoolSize(ioThreadPoolSize).computationThreadPoolSize(computationThreadPoolSize)
                .dnsResolver(new DirContextDnsResolver());

        ClientResources clientResources = defaultClientResourcesBuilder.build();

        // Apply the options and prefer replicas for reads.
        return p -> p.clientOptions(clusterClientOptions).clientResources(clientResources)
                .readFrom(ReadFrom.REPLICA_PREFERRED);
    }

In a thread dump I see 16 lettuce-epollEventLoop-- threads in RUNNABLE state and 3 lettuce-eventExecutorLoop-- threads in TIMED_WAITING, but I am not sure that I captured the dump at the right moment.
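
Because a single thread dump can easily miss the moment of a timeout, it may help to sample thread states repeatedly from inside the JVM. The following is a minimal sketch using only the JDK; the class name `LettuceThreadSummary` and the method `countByState` are hypothetical, and the name prefixes are the ones reported above. Keep in mind that the JVM typically reports a thread waiting in a native epoll call as RUNNABLE, so RUNNABLE alone does not prove the event loops are busy.

```java
import java.util.EnumMap;
import java.util.Map;

public class LettuceThreadSummary {

    /** Counts live threads whose name starts with the given prefix, grouped by thread state. */
    public static Map<Thread.State, Integer> countByState(String namePrefix) {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.getName().startsWith(namePrefix)) {
                counts.merge(t.getState(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prefixes taken from the thread names mentioned in the issue.
        System.out.println("epoll event loops: " + countByState("lettuce-epollEventLoop"));
        System.out.println("event executors:   " + countByState("lettuce-eventExecutorLoop"));
    }
}
```

Calling `countByState` from a scheduled task every few hundred milliseconds and logging the result next to the timeout log entries would make it easier to correlate thread states with the failures.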

Current Behavior

This is an example stack trace:

org.springframework.dao.QueryTimeoutException: Redis command timed out; nested exception is io.lettuce.core.RedisCommandTimeoutException: Command timed out after 1 second(s)
	at org.springframework.data.redis.connection.lettuce.LettuceExceptionConverter.convert(LettuceExceptionConverter.java:70)
	at org.springframework.data.redis.connection.lettuce.LettuceExceptionConverter.convert(LettuceExceptionConverter.java:41)
	at org.springframework.data.redis.PassThroughExceptionTranslationStrategy.translate(PassThroughExceptionTranslationStrategy.java:44)
	at org.springframework.data.redis.FallbackExceptionTranslationStrategy.translate(FallbackExceptionTranslationStrategy.java:42)
	at org.springframework.data.redis.connection.lettuce.LettuceConnection.convertLettuceAccessException(LettuceConnection.java:270)
	at org.springframework.data.redis.connection.lettuce.LettuceKeyCommands.convertLettuceAccessException(LettuceKeyCommands.java:809)
	at org.springframework.data.redis.connection.lettuce.LettuceKeyCommands.ttl(LettuceKeyCommands.java:541)
	at org.springframework.data.redis.connection.DefaultedRedisConnection.ttl(DefaultedRedisConnection.java:209)
	at com.application.cache.refresh.ahead.redis.service.RedisKeyRetriever.lambda$scan$1(RedisKeyRetriever.java:68)
	at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:174)
	at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
	at java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:290)
	at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
	at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
	at java.util.concurrent.ForkJoinPool.helpComplete(ForkJoinPool.java:1870)
	at java.util.concurrent.ForkJoinPool.externalHelpComplete(ForkJoinPool.java:2467)
	at java.util.concurrent.ForkJoinTask.externalAwaitDone(ForkJoinTask.java:324)
	at java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:405)
	at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734)
	at java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:159)
	at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:173)
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
	at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
	at com.application.cache.refresh.ahead.service.RefreshAheadService.reloadAheadCachValuesForStream(RefreshAheadService.java:66)
	at com.application.cache.refresh.ahead.service.RefreshAheadService.access$200(RefreshAheadService.java:18)
	at com.application.cache.refresh.ahead.service.RefreshAheadService$1.run(RefreshAheadService.java:55)
	at net.javacrumbs.shedlock.core.DefaultLockingTaskExecutor.executeWithLock(DefaultLockingTaskExecutor.java:64)
	at net.javacrumbs.shedlock.core.DefaultLockingTaskExecutor.executeWithLock(DefaultLockingTaskExecutor.java:43)
	at com.application.cache.refresh.ahead.service.RefreshAheadService.reloadAheadValuesOfCache(RefreshAheadService.java:51)
	at com.application.cache.refresh.ahead.task.SelectiveCacheRefreshAheadScheduler.lambda$null$0(SelectiveCacheRefreshAheadScheduler.java:40)
	at org.springframework.cloud.sleuth.instrument.async.TraceRunnable.run(TraceRunnable.java:67)
	at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: io.lettuce.core.RedisCommandTimeoutException: Command timed out after 1 second(s)
	at io.lettuce.core.ExceptionFactory.createTimeoutException(ExceptionFactory.java:51)
	at io.lettuce.core.LettuceFutures.awaitOrCancel(LettuceFutures.java:114)
	at io.lettuce.core.cluster.ClusterFutureSyncInvocationHandler.handleInvocation(ClusterFutureSyncInvocationHandler.java:123)
	at io.lettuce.core.internal.AbstractInvocationHandler.invoke(AbstractInvocationHandler.java:80)
	at com.sun.proxy.$Proxy201.ttl(Unknown Source)
	at org.springframework.data.redis.connection.lettuce.LettuceKeyCommands.ttl(LettuceKeyCommands.java:539)
	... 33 common frames omitted
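
The trace shows the blocking `ttl` call being issued from a parallel stream (`ForEachOps$ForEachOp.evaluateParallel`), which by default runs on the shared ForkJoinPool. As an illustration only, and not a fix confirmed anywhere in this thread, the sketch below runs the per-key lookups on a dedicated fixed-size pool and skips keys whose lookup fails instead of aborting the whole refresh run. `BoundedTtlScan`, `lookupTtls`, and the `ttlLookup` function are hypothetical names; in the application, `ttlLookup` would wrap the real Redis call.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.ToLongFunction;

public class BoundedTtlScan {

    /**
     * Looks up a TTL for every key on a dedicated pool of the given size.
     * Keys whose lookup throws (for example with a command timeout) are skipped.
     */
    public static Map<String, Long> lookupTtls(List<String> keys,
                                               ToLongFunction<String> ttlLookup,
                                               int parallelism) {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            // Submit all lookups; at most 'parallelism' of them run concurrently.
            List<CompletableFuture<Long>> futures = new ArrayList<>();
            for (String key : keys) {
                futures.add(CompletableFuture.supplyAsync(() -> ttlLookup.applyAsLong(key), pool));
            }
            // Collect results in key order, dropping failed lookups.
            Map<String, Long> ttls = new LinkedHashMap<>();
            for (int i = 0; i < keys.size(); i++) {
                try {
                    ttls.put(keys.get(i), futures.get(i).join());
                } catch (CompletionException e) {
                    // Lookup failed; this key is retried on the next scheduled run.
                }
            }
            return ttls;
        } finally {
            pool.shutdown();
        }
    }
}
```

Bounding the parallelism keeps the number of in-flight blocking commands predictable, which makes it easier to tell whether the timeouts come from client-side queuing or from somewhere else.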

Environment

  • Lettuce version(s): 5.2.1.RELEASE
  • Redis version: ElastiCache Cluster Mode Enabled - Redis 5.0.5

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

mp911de commented, Jun 24, 2020

Closing due to lack of requested feedback. If you would like us to look at this issue, please provide the requested information and we will re-open the issue.

mp911de commented, Jun 9, 2020

The threads look fine, meaning that none of the lettuce-epollEventLoop threads is blocked. However, the dump lists over 600 threads, which might have an effect on performance.

Note that a delay of one second (taken from Command timed out after 1 second(s)) might simply be the consequence of a GC run. You might also want to check for GC pauses and align your timeouts accordingly.
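
One rough way to follow up on the GC suggestion is to compare the collection time accumulated during a sampling window against the command timeout. The sketch below uses the JDK's `GarbageCollectorMXBean`; the class `GcPauseProbe` and its methods are hypothetical names, and the 1000 ms value is taken from the exception message. The reported collection time is an approximation that may include concurrent phases for some collectors, so GC logs remain the more precise source for actual pause durations.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcPauseProbe {

    /** Approximate accumulated collection time in milliseconds across all collectors. */
    public static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 when the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    /** True if the GC time spent within a window is at least as long as the command timeout. */
    public static boolean mayExplainTimeout(long gcMillisInWindow, long commandTimeoutMillis) {
        return gcMillisInWindow >= commandTimeoutMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        long commandTimeoutMillis = 1000; // from "Command timed out after 1 second(s)"
        long before = totalGcMillis();
        Thread.sleep(1000); // sampling window
        long spent = totalGcMillis() - before;
        System.out.println("GC time in window: " + spent + " ms, may explain timeout: "
                + mayExplainTimeout(spent, commandTimeoutMillis));
    }
}
```

If the GC time per window regularly approaches the timeout, raising the command timeout or tuning the collector would be the natural next step; if it stays far below, the cause is more likely elsewhere.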
