DNS resolving issue for AWS ElastiCache nodes
Actual behavior
We use the Redisson client to work with several big ElastiCache clusters (up to 150 nodes). Under the hood, Redisson uses Netty for DNS resolving. We get a lot of errors like this:
io.netty.resolver.dns.DnsResolveContext$SearchDomainUnknownHostException: Failed to resolve 'test-elasticache-cluster-0003-002.test-eslaticache-cluster.nfbjaw.euw1.cache.amazonaws.com' and search domain query for configured domains failed as well: [eu-west-1.compute.internal]
at io.netty.resolver.dns.DnsResolveContext.finishResolve(DnsResolveContext.java:1047)
at io.netty.resolver.dns.DnsResolveContext.tryToFinishResolve(DnsResolveContext.java:1000)
at io.netty.resolver.dns.DnsResolveContext.query(DnsResolveContext.java:418)
at io.netty.resolver.dns.DnsResolveContext.access$600(DnsResolveContext.java:66)
at io.netty.resolver.dns.DnsResolveContext$2.operationComplete(DnsResolveContext.java:467)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)
at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:571)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:550)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491)
at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616)
at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609)
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
at io.netty.resolver.dns.DnsQueryContext.tryFailure(DnsQueryContext.java:240)
at io.netty.resolver.dns.DnsQueryContext$4.run(DnsQueryContext.java:192)
at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:503)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: io.netty.resolver.dns.DnsNameResolverTimeoutException: [/*.*.*.*:53] query via UDP timed out after 5000 milliseconds (no stack trace available)
The more nodes in a cluster, the more errors we get. At first we thought we had hit some internal AWS limits, but AWS support checked and did not confirm reaching any limits on their side. The problem may instead be in the Netty DNS resolving mechanism. I searched open issues in this repository with similar problems, and this one looks pretty similar: https://github.com/netty/netty/issues/11993
Expected behavior
DNS names of ElastiCache nodes should be resolved without errors, or please suggest workarounds or best practices for avoiding such errors.
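For reference, the resolver settings we are currently experimenting with as a possible workaround look roughly like this (a sketch only, not a confirmed fix; the longer timeout, the empty search-domain list and the TCP fallback are our own guesses, and the snippet reuses the classes imported in the test app below plus io.netty.channel.socket.nio.NioSocketChannel):

DnsNameResolverBuilder builder = new DnsNameResolverBuilder()
        .eventLoop(group.next())
        .channelType(NioDatagramChannel.class)
        // fall back to TCP when a UDP response is truncated
        .socketChannelType(NioSocketChannel.class)
        // give slow answers more time before failing the query
        .queryTimeoutMillis(10_000)
        // the ElastiCache hostnames are fully qualified, so skip search-domain expansion
        .searchDomains(java.util.Collections.emptyList())
        .ndots(1)
        .nameServerProvider(DnsServerAddressStreamProviders.platformDefault());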
Steps to reproduce
A colleague of mine investigated the hypothesis that the problem lies in Netty's DNS resolving process and created a simple test app based on Redisson's DNSMonitor. This class uses Netty for DNS resolving and allowed us to reproduce the problem.
package com.test;
import io.netty.channel.EventLoop;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioDatagramChannel;
import io.netty.handler.logging.LogLevel;
import io.netty.resolver.AddressResolver;
import io.netty.resolver.dns.DefaultDnsCache;
import io.netty.resolver.dns.DefaultDnsCnameCache;
import io.netty.resolver.dns.DnsAddressResolverGroup;
import io.netty.resolver.dns.DnsNameResolverBuilder;
import io.netty.resolver.dns.DnsServerAddressStreamProviders;
import io.netty.resolver.dns.LoggingDnsQueryLifeCycleObserverFactory;
import io.netty.util.concurrent.Future;
import io.netty.util.concurrent.FutureListener;
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Scanner;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import picocli.CommandLine;
public class TestDNS {
    private static final Logger log = LoggerFactory.getLogger(TestDNS.class);
    // possibility of timeout increases with high resolver count
    private static final int MAX_RESOLVER_COUNT = Runtime.getRuntime().availableProcessors() * 4;
    private static final int MAX_THREAD_COUNT = Runtime.getRuntime().availableProcessors() * 4;
    private static final int MAX_MONITOR_COUNT = Runtime.getRuntime().availableProcessors() * 4;
    private static final int MONITOR_INTERVAL_SECOND = 3;
    private static final int MIN_TTL = 15;
    private static final int MAX_TTL = 300;
    private static final int NEGATIVE_TTL = 15;
    public static void main(String[] args) {
        Parameters parameters = new Parameters();
        new CommandLine(parameters).parseArgs(args);
        NioEventLoopGroup defaultEventLoopGroup = new NioEventLoopGroup(MAX_THREAD_COUNT);
        NioEventLoopGroup nioGroup = new NioEventLoopGroup(MAX_THREAD_COUNT);
        AddressResolver<InetSocketAddress>[] arr = new AddressResolver[MAX_RESOLVER_COUNT];
        DefaultDnsCnameCache defaultDnsCnameCache = new DefaultDnsCnameCache(MIN_TTL, MAX_TTL);
        DefaultDnsCache defaultDnsCache = new DefaultDnsCache(MIN_TTL, MAX_TTL, NEGATIVE_TTL);
        for (int i = 0; i < MAX_RESOLVER_COUNT; i++) {
            arr[i] = createResolver(defaultEventLoopGroup, nioGroup, defaultDnsCache, defaultDnsCnameCache);
        }
        if (parameters.slave) {
            log.info("Slave dns check starting");
            Map<Host, InetSocketAddress> slaves = getSlaves();
            for (Map.Entry<Host, InetSocketAddress> entry : slaves.entrySet()) {
                Future<InetSocketAddress> resolveFuture = arr[0].resolve(InetSocketAddress.createUnresolved(entry.getKey().host(), entry.getKey().port()));
                resolveFuture.syncUninterruptibly();
                slaves.put(entry.getKey(), resolveFuture.getNow());
            }
            for (int i = 0; i < MAX_MONITOR_COUNT; i++) {
                // Pick different resolver
                scheduleMonitorSlaves(defaultEventLoopGroup, arr[(i + MAX_MONITOR_COUNT) % arr.length], slaves);
            }
        }
        if (parameters.master) {
            log.info("Master dns check starting");
            scheduleMonitorMaster(defaultEventLoopGroup, arr[0], new Host("clustercfg.production-game-objects.nfbjaw.euw1.cache.amazonaws.com", 6379));
        }
        System.out.println("Press Enter to exit");
        Scanner userInput = new Scanner(System.in);
        // block until the user presses Enter
        userInput.nextLine();
        System.out.println("Application will stop soon");
        // shut down the event loop groups so the JVM can exit
        nioGroup.shutdownGracefully();
        defaultEventLoopGroup.shutdownGracefully();
    }
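    // Each call builds its own DnsAddressResolverGroup; all resolvers share the same DNS and CNAME caches.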
    private static AddressResolver<InetSocketAddress> createResolver(NioEventLoopGroup defaultEventLoopGroup, NioEventLoopGroup group, DefaultDnsCache defaultDnsCache, DefaultDnsCnameCache defaultDnsCnameCache) {
        final EventLoop loop = group.next();
        DnsNameResolverBuilder builder = new DnsNameResolverBuilder()
                .eventLoop(loop)
                .channelType(NioDatagramChannel.class)
                .queryTimeoutMillis(5000)
                .dnsQueryLifecycleObserverFactory(new LoggingDnsQueryLifeCycleObserverFactory(LogLevel.INFO))
                .nameServerProvider(DnsServerAddressStreamProviders.platformDefault())
                .resolveCache(defaultDnsCache)
                .cnameCache(defaultDnsCnameCache);
        DnsAddressResolverGroup resolverGroup = new DnsAddressResolverGroup(builder);
        return resolverGroup.getResolver(defaultEventLoopGroup.next());
    }
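    // The two schedule* methods below re-schedule themselves after every pass, mimicking Redisson's DNSMonitor loop.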
    private static void scheduleMonitorSlaves(NioEventLoopGroup defaultEventLoopGroup, AddressResolver<InetSocketAddress> resolver, Map<Host, InetSocketAddress> slaves) {
        defaultEventLoopGroup.schedule(() -> {
            CompletableFuture<Void> future = monitorSlaves(resolver, slaves);
            future.whenComplete((r, e) -> scheduleMonitorSlaves(defaultEventLoopGroup, resolver, slaves));
        }, MONITOR_INTERVAL_SECOND, TimeUnit.SECONDS);
    }

    private static void scheduleMonitorMaster(NioEventLoopGroup defaultEventLoopGroup, AddressResolver<InetSocketAddress> resolver, Host master) {
        defaultEventLoopGroup.schedule(() -> {
            CompletableFuture<Void> future = monitorMaster(resolver, master);
            future.whenComplete((r, e) -> scheduleMonitorMaster(defaultEventLoopGroup, resolver, master));
        }, MONITOR_INTERVAL_SECOND, TimeUnit.SECONDS);
    }
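    // Resolves all A records behind the cluster configuration endpoint and logs how many nodes came back.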
    private static CompletableFuture<Void> monitorMaster(AddressResolver<InetSocketAddress> resolver, Host master) {
        log.info("Monitor master is working");
        CompletableFuture<Void> promise = new CompletableFuture<>();
        Future<List<InetSocketAddress>> allNodes = resolver.resolveAll(InetSocketAddress.createUnresolved(master.host(), master.port()));
        allNodes.addListener(new FutureListener<List<InetSocketAddress>>() {
            @Override
            public void operationComplete(Future<List<InetSocketAddress>> future) throws Exception {
                if (!future.isSuccess()) {
                    promise.complete(null);
                    log.error("Master monitor err=", future.cause());
                    return;
                }
                List<InetSocketAddress> nodes = new ArrayList<>();
                for (InetSocketAddress address : future.getNow()) {
                    nodes.add(address);
                }
                log.info("Resolve ALL node size is {}", nodes.size());
                promise.complete(null);
            }
        });
        return CompletableFuture.allOf(promise);
    }
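    // Re-resolves every slave host and logs a message whenever the returned IP differs from the cached one.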
    private static CompletableFuture<Void> monitorSlaves(AddressResolver<InetSocketAddress> resolver, Map<Host, InetSocketAddress> slaves) {
        log.info("Monitor slaves is working");
        List<CompletableFuture<Void>> futures = new ArrayList<>();
        for (Map.Entry<Host, InetSocketAddress> entry : slaves.entrySet()) {
            CompletableFuture<Void> promise = new CompletableFuture<>();
            futures.add(promise);
            log.debug("Request sent to resolve ip address for slave host: {}", entry.getKey().host());
            Future<InetSocketAddress> resolveFuture = resolver.resolve(InetSocketAddress.createUnresolved(entry.getKey().host(), entry.getKey().port()));
            resolveFuture.addListener((FutureListener<InetSocketAddress>) future -> {
                if (!future.isSuccess()) {
                    log.error("Unable to resolve " + entry.getKey().host(), future.cause());
                    promise.complete(null);
                    return;
                }
                log.debug("Resolved ip: {} for slave host: {}", future.getNow().getAddress(), entry.getKey().host());
                InetSocketAddress currentSlaveAddr = entry.getValue();
                InetSocketAddress newSlaveAddr = future.getNow();
                if (!newSlaveAddr.getAddress().equals(currentSlaveAddr.getAddress())) {
                    log.info("Detected DNS change. Slave {} has changed ip from {} to {}", entry.getKey().host(), currentSlaveAddr.getAddress().getHostAddress(), newSlaveAddr.getAddress().getHostAddress());
                }
                promise.complete(null);
            });
        }
        return CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]));
    }
    private static Map<Host, InetSocketAddress> getSlaves() {
        Map<Host, InetSocketAddress> slaves = new HashMap<>();
        // add the 44 nodes of the real ElastiCache cluster here
        return slaves;
    }
}
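The Host and Parameters types referenced above are not part of the snippet; a minimal sketch of what they could look like follows (the record layout and option names here are our own assumption, not the exact classes we use):

record Host(String host, int port) {}

class Parameters {
    @CommandLine.Option(names = "--slave", description = "monitor the data nodes")
    boolean slave;

    @CommandLine.Option(names = "--master", description = "monitor the cluster configuration endpoint")
    boolean master;
}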
Netty version
4.1.75
JVM version (e.g. java -version)
java-17-amazon-corretto-jdk_17.0.3.6-1_amd64
OS version (e.g. uname -a)
ubuntu-bionic-18.04-amd64
Issue Analytics
- Created a year ago
- Comments: 12 (5 by maintainers)

I will have a look
I sent out #13014 that adds a new construct called ConcurrencyLimit which abstracts how we should limit concurrent actions, with one simple implementation that's similar to @mrniko's AsyncSemaphore. Please let me know what you think. Once approved and merged, I'll send out a follow-up PR that uses ConcurrencyLimit for limiting DNS queries.
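In the meantime, the idea can be approximated on the caller side with a plain JDK Semaphore that caps the number of resolutions in flight; this is only an illustration of the pattern, not the ConcurrencyLimit API from #13014:

import java.net.InetSocketAddress;
import java.util.concurrent.Semaphore;

import io.netty.resolver.AddressResolver;
import io.netty.util.concurrent.Future;

class ThrottledResolver {
    // at most 8 DNS queries in flight at any moment (value picked arbitrarily)
    private final Semaphore permits = new Semaphore(8);

    Future<InetSocketAddress> resolve(AddressResolver<InetSocketAddress> resolver,
                                      InetSocketAddress unresolved) throws InterruptedException {
        permits.acquire(); // blocks the caller, so do not call this from an event loop thread
        Future<InetSocketAddress> future = resolver.resolve(unresolved);
        future.addListener(done -> permits.release()); // free the permit once the query completes
        return future;
    }
}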