RedisCommandTimeoutException on AWS ElastiCache Cluster
I see random timeouts in a Java application using Spring Data Redis 2.2.4 (Lettuce 5.2.1). Redis is the cache layer of a RESTful API server, and requests occasionally time out. On the Redis side I enabled the slow log, but all queries complete in under 10 milliseconds. The AWS ElastiCache cluster has cluster mode enabled and consists of 3 shards with 2 replicas each (9 m5.large nodes in total). On the application side, a Spring scheduled task periodically SCANs cached keys and queries TTL and IDLETIME for some of them, because I implemented a refresh-ahead algorithm that reloads cache values in the background.
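For context, the refresh-ahead decision described above can be sketched as a pure function on the TTL values returned by Redis. This is a minimal sketch; the class name RefreshAheadPolicy and the threshold parameter are hypothetical illustrations, not taken from the original code:

```java
// Hedged sketch of a refresh-ahead decision, independent of Redis itself.
// A key is refreshed in the background once only a small fraction of its
// original TTL remains, so readers never see a cold cache entry.
final class RefreshAheadPolicy {

    private final double refreshThreshold; // e.g. 0.2 = refresh when <20% of TTL remains

    RefreshAheadPolicy(double refreshThreshold) {
        this.refreshThreshold = refreshThreshold;
    }

    /**
     * Decide whether a key should be refreshed.
     * ttlSeconds: remaining TTL reported by the Redis TTL command
     *             (-1 = no expiry, -2 = key does not exist);
     * originalTtlSeconds: the TTL the key was originally written with.
     */
    boolean shouldRefresh(long ttlSeconds, long originalTtlSeconds) {
        if (ttlSeconds < 0) {
            return false; // no expiry or missing key: nothing to refresh ahead of
        }
        return (double) ttlSeconds / originalTtlSeconds < refreshThreshold;
    }
}
```

The point of keeping the decision pure is that the scheduled task only needs one TTL round-trip per key; the IDLETIME check can be layered on the same way.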
I increased the ioThreadPoolSize and computationThreadPoolSize from 3 to 16 threads. Timeouts decreased but are still present.
This is the code of the LettuceClientConfigurationBuilderCustomizer:

@Value("${spring.redis.custom.cluster.enableAdaptiveRefresh:true}")
private boolean enableAdaptiveRefresh;

@Value("${spring.redis.custom.cluster.enableDynamicRefreshSources:true}")
private boolean enableDynamicRefreshSources;

@Value("${spring.redis.custom.cluster.enableSuspendReconnectOnProtocolFailure:false}")
private boolean enableSuspendReconnectOnProtocolFailure;

@Value("${spring.redis.custom.cluster.enableCancelCommandsOnReconnectFailure:true}")
private boolean enableCancelCommandsOnReconnectFailure;

@Value("${spring.redis.custom.cluster.ioThreadPoolSize:16}")
private int ioThreadPoolSize;

@Value("${spring.redis.custom.cluster.computationThreadPoolSize:16}")
private int computationThreadPoolSize;

public LettuceClientConfigurationBuilderCustomizer customizer() {
    // Cluster topology refresh: optionally react to MOVED/ASK redirects and
    // other adaptive triggers instead of refreshing only periodically.
    Builder clusterTopologyRefreshOptionsBuilder = ClusterTopologyRefreshOptions.builder();
    if (enableAdaptiveRefresh) {
        clusterTopologyRefreshOptionsBuilder.enableAllAdaptiveRefreshTriggers();
    }
    clusterTopologyRefreshOptionsBuilder.dynamicRefreshSources(enableDynamicRefreshSources);
    ClusterTopologyRefreshOptions clusterTopologyRefreshOptions = clusterTopologyRefreshOptionsBuilder.build();

    ClusterClientOptions clusterClientOptions = ClusterClientOptions.builder()
            .suspendReconnectOnProtocolFailure(enableSuspendReconnectOnProtocolFailure)
            .cancelCommandsOnReconnectFailure(enableCancelCommandsOnReconnectFailure)
            .topologyRefreshOptions(clusterTopologyRefreshOptions)
            .build();

    // Shared client resources: I/O and computation thread pools plus a
    // JNDI-based DNS resolver for ElastiCache endpoint changes.
    ClientResources clientResources = DefaultClientResources.builder()
            .ioThreadPoolSize(ioThreadPoolSize)
            .computationThreadPoolSize(computationThreadPoolSize)
            .dnsResolver(new DirContextDnsResolver())
            .build();

    return p -> p.clientOptions(clusterClientOptions)
            .clientResources(clientResources)
            .readFrom(ReadFrom.REPLICA_PREFERRED);
}
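Note that the 1-second limit in the exception below is the command timeout, which is not set anywhere in this customizer. As a hedged first check (assuming the 1s value comes from Spring Boot's spring.redis.timeout property, and with 3s as an arbitrary example), raising it can distinguish short latency spikes from real stalls:

```properties
# Assumption: the 1s command timeout is configured via spring.redis.timeout,
# which Spring Boot applies as Lettuce's command timeout.
# Raising it only masks the stall; it helps confirm whether the spikes are brief.
spring.redis.timeout=3s
```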
A thread dump shows 16 lettuce-epollEventLoop-- threads in RUNNABLE state and 3 lettuce-eventExecutorLoop-- threads in TIMED_WAITING, but I am not sure I captured the dump at the right moment.
Current Behavior
This is an example of stack-trace:
Stack trace
org.springframework.dao.QueryTimeoutException: Redis command timed out; nested exception is io.lettuce.core.RedisCommandTimeoutException: Command timed out after 1 second(s)
at org.springframework.data.redis.connection.lettuce.LettuceExceptionConverter.convert(LettuceExceptionConverter.java:70)
at org.springframework.data.redis.connection.lettuce.LettuceExceptionConverter.convert(LettuceExceptionConverter.java:41)
at org.springframework.data.redis.PassThroughExceptionTranslationStrategy.translate(PassThroughExceptionTranslationStrategy.java:44)
at org.springframework.data.redis.FallbackExceptionTranslationStrategy.translate(FallbackExceptionTranslationStrategy.java:42)
at org.springframework.data.redis.connection.lettuce.LettuceConnection.convertLettuceAccessException(LettuceConnection.java:270)
at org.springframework.data.redis.connection.lettuce.LettuceKeyCommands.convertLettuceAccessException(LettuceKeyCommands.java:809)
at org.springframework.data.redis.connection.lettuce.LettuceKeyCommands.ttl(LettuceKeyCommands.java:541)
at org.springframework.data.redis.connection.DefaultedRedisConnection.ttl(DefaultedRedisConnection.java:209)
at com.application.cache.refresh.ahead.redis.service.RedisKeyRetriever.lambda$scan$1(RedisKeyRetriever.java:68)
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:174)
at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
at java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:290)
at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinPool.helpComplete(ForkJoinPool.java:1870)
at java.util.concurrent.ForkJoinPool.externalHelpComplete(ForkJoinPool.java:2467)
at java.util.concurrent.ForkJoinTask.externalAwaitDone(ForkJoinTask.java:324)
at java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:405)
at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734)
at java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:159)
at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:173)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
at com.application.cache.refresh.ahead.service.RefreshAheadService.reloadAheadCachValuesForStream(RefreshAheadService.java:66)
at com.application.cache.refresh.ahead.service.RefreshAheadService.access$200(RefreshAheadService.java:18)
at com.application.cache.refresh.ahead.service.RefreshAheadService$1.run(RefreshAheadService.java:55)
at net.javacrumbs.shedlock.core.DefaultLockingTaskExecutor.executeWithLock(DefaultLockingTaskExecutor.java:64)
at net.javacrumbs.shedlock.core.DefaultLockingTaskExecutor.executeWithLock(DefaultLockingTaskExecutor.java:43)
at com.application.cache.refresh.ahead.service.RefreshAheadService.reloadAheadValuesOfCache(RefreshAheadService.java:51)
at com.application.cache.refresh.ahead.task.SelectiveCacheRefreshAheadScheduler.lambda$null$0(SelectiveCacheRefreshAheadScheduler.java:40)
at org.springframework.cloud.sleuth.instrument.async.TraceRunnable.run(TraceRunnable.java:67)
at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: io.lettuce.core.RedisCommandTimeoutException: Command timed out after 1 second(s)
at io.lettuce.core.ExceptionFactory.createTimeoutException(ExceptionFactory.java:51)
at io.lettuce.core.LettuceFutures.awaitOrCancel(LettuceFutures.java:114)
at io.lettuce.core.cluster.ClusterFutureSyncInvocationHandler.handleInvocation(ClusterFutureSyncInvocationHandler.java:123)
at io.lettuce.core.internal.AbstractInvocationHandler.invoke(AbstractInvocationHandler.java:80)
at com.sun.proxy.$Proxy201.ttl(Unknown Source)
at org.springframework.data.redis.connection.lettuce.LettuceKeyCommands.ttl(LettuceKeyCommands.java:539)
... 33 common frames omitted
Environment
- Lettuce version(s): 5.2.1.RELEASE
- Redis version: ElastiCache Cluster Mode Enabled - Redis 5.0.5
Issue Analytics
- Created 3 years ago
- Comments: 6 (4 by maintainers)
Top GitHub Comments
Closing due to lack of requested feedback. If you would like us to look at this issue, please provide the requested information and we will re-open the issue.
Threads look fine, meaning that none of the lettuce-epollEventLoop threads is blocked. However, the dump lists over 600 threads, which might have an effect on performance. Note that a one-second pause (taken from "Command timed out after 1 second(s)") might be a simple consequence of a GC run. You might also want to check for GC pauses and align your timeouts with them.
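To follow up on the GC suggestion: cumulative GC time can be read in-process from the standard management beans and correlated with the timestamps of the timeouts. This is a minimal stdlib-only sketch (the class name GcPauseProbe is illustrative); dedicated GC logging (e.g. JVM GC log flags) gives per-pause detail that this summary cannot:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hedged sketch: report cumulative GC time so it can be compared against
// the moments when RedisCommandTimeoutException occurs. Collector names
// and counts vary with the JVM and the GC algorithm in use.
public class GcPauseProbe {

    /** Total accumulated GC time across all collectors, in milliseconds. */
    static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if undefined for this collector
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Sampling this periodically (e.g. from the same scheduler as the
        // refresh-ahead task) shows whether GC time jumps around each timeout.
        System.out.println("Cumulative GC time: " + totalGcTimeMillis() + " ms");
    }
}
```

If GC time climbs by close to a second around each timeout, aligning the command timeout above the observed pause length (or tuning the heap) is the more direct fix than adding threads.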