
DNS resolution issue for AWS ElastiCache nodes


Actual behavior

We use the Redisson client to work with several large ElastiCache clusters (up to 150 nodes). Under the hood, Redisson uses Netty for DNS resolution. We get a lot of errors like this:

io.netty.resolver.dns.DnsResolveContext$SearchDomainUnknownHostException: Failed to resolve 'test-elasticache-cluster-0003-002.test-eslaticache-cluster.nfbjaw.euw1.cache.amazonaws.com' and search domain query for configured domains failed as well: [eu-west-1.compute.internal]
	at io.netty.resolver.dns.DnsResolveContext.finishResolve(DnsResolveContext.java:1047)
	at io.netty.resolver.dns.DnsResolveContext.tryToFinishResolve(DnsResolveContext.java:1000)
	at io.netty.resolver.dns.DnsResolveContext.query(DnsResolveContext.java:418)
	at io.netty.resolver.dns.DnsResolveContext.access$600(DnsResolveContext.java:66)
	at io.netty.resolver.dns.DnsResolveContext$2.operationComplete(DnsResolveContext.java:467)
	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)
	at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:571)
	at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:550)
	at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491)
	at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616)
	at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609)
	at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
	at io.netty.resolver.dns.DnsQueryContext.tryFailure(DnsQueryContext.java:240)
	at io.netty.resolver.dns.DnsQueryContext$4.run(DnsQueryContext.java:192)
	at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
	at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:503)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: io.netty.resolver.dns.DnsNameResolverTimeoutException: [/*.*.*.*:53] query via UDP timed out after 5000 milliseconds (no stack trace available)

The more nodes in a cluster, the more errors we get. At first we thought we had hit some internal AWS limit, but AWS support checked and did not confirm that any limit was being reached on their side. The problem may therefore be in Netty's DNS resolution mechanism. I searched the open issues in this repository for similar problems, and this one looks quite similar: https://github.com/netty/netty/issues/11993
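
For anyone experimenting with resolver settings, the Netty-side knobs that look relevant here are the query timeout, the per-resolve query budget, and TCP fallback for truncated responses. The following is only a minimal sketch with illustrative values, not a confirmed fix:

import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioDatagramChannel;
import io.netty.channel.socket.nio.NioSocketChannel;
import io.netty.resolver.dns.DnsNameResolver;
import io.netty.resolver.dns.DnsNameResolverBuilder;
import io.netty.resolver.dns.DnsServerAddressStreamProviders;

public class TunedResolver {
    // Sketch only: the values below are illustrative assumptions, not tested recommendations.
    public static DnsNameResolver build(NioEventLoopGroup group) {
        return new DnsNameResolverBuilder()
                .eventLoop(group.next())
                .channelType(NioDatagramChannel.class)
                // more headroom than the 5000 ms after which the queries above time out
                .queryTimeoutMillis(10_000)
                // budget for follow-up queries (CNAME chains, search domains)
                .maxQueriesPerResolve(16)
                // retry over TCP when a UDP response is truncated
                .socketChannelType(NioSocketChannel.class)
                .nameServerProvider(DnsServerAddressStreamProviders.platformDefault())
                .build();
    }
}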

Expected behavior

DNS names of ElastiCache nodes should be resolved without errors. Alternatively, please suggest workarounds or best practices for avoiding such errors.
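
One candidate workaround at the Redisson level is to cap how many DNS resolutions run concurrently: recent Redisson releases ship a SequentialDnsAddressResolverFactory for exactly this. A minimal sketch; the concurrency value is an illustrative assumption and the endpoint is hypothetical:

import org.redisson.Redisson;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;
import org.redisson.connection.SequentialDnsAddressResolverFactory;

public class RedissonDnsWorkaround {
    public static RedissonClient create() {
        Config config = new Config();
        // cap concurrent DNS resolutions so a 150-node cluster does not
        // flood the VPC resolver; 8 is an illustrative value, not a tuned one
        config.setAddressResolverGroupFactory(new SequentialDnsAddressResolverFactory(8));
        config.useClusterServers()
              // hypothetical endpoint; replace with the real configuration endpoint
              .addNodeAddress("redis://clustercfg.example.nfbjaw.euw1.cache.amazonaws.com:6379");
        return Redisson.create(config);
    }
}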

Steps to reproduce

A colleague of mine investigated the hypothesis that the problem lies in Netty's DNS resolution process and created a simple test app, based on Redisson's DNSMonitor, to reproduce it. This class uses Netty for DNS resolution and allowed us to reproduce the problem.

package com.test;

import io.netty.channel.EventLoop;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioDatagramChannel;
import io.netty.handler.logging.LogLevel;
import io.netty.resolver.AddressResolver;
import io.netty.resolver.dns.DefaultDnsCache;
import io.netty.resolver.dns.DefaultDnsCnameCache;
import io.netty.resolver.dns.DnsAddressResolverGroup;
import io.netty.resolver.dns.DnsNameResolverBuilder;
import io.netty.resolver.dns.DnsServerAddressStreamProviders;
import io.netty.resolver.dns.LoggingDnsQueryLifeCycleObserverFactory;
import io.netty.util.concurrent.Future;
import io.netty.util.concurrent.FutureListener;
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Scanner;
import java.util.concurrent.CompletableFuture;

import java.util.concurrent.TimeUnit;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import picocli.CommandLine;

public class TestDNS {
    private static final Logger log = LoggerFactory.getLogger(TestDNS.class);
    // possibility of timeout increases with high resolver count
    private static final int MAX_RESOLVER_COUNT =  Runtime.getRuntime().availableProcessors() * 4;

    private static final int MAX_THREAD_COUNT =  Runtime.getRuntime().availableProcessors() * 4;

    private static final int MAX_MONITOR_COUNT = Runtime.getRuntime().availableProcessors() * 4;
    private static final int MONITOR_INTERVAL_SECOND = 3;
    private static final int MIN_TTL = 15;
    private static final int MAX_TTL = 300;
    private static final int NEGATIVE_TTL = 15;


    public static void main(String[] args)  {
        Parameters parameters = new Parameters();
        new CommandLine(parameters).parseArgs(args);
        NioEventLoopGroup defaultEventLoopGroup = new NioEventLoopGroup(MAX_THREAD_COUNT);
        NioEventLoopGroup nioGroup = new NioEventLoopGroup(MAX_THREAD_COUNT);
        AddressResolver<InetSocketAddress>[] arr = new AddressResolver[MAX_RESOLVER_COUNT];
        DefaultDnsCnameCache defaultDnsCnameCache = new DefaultDnsCnameCache(MIN_TTL, MAX_TTL);
        DefaultDnsCache defaultDnsCache = new DefaultDnsCache(MIN_TTL, MAX_TTL, NEGATIVE_TTL);
        for (int i = 0; i < MAX_RESOLVER_COUNT; i++) {
            arr[i] = createResolver(defaultEventLoopGroup, nioGroup, defaultDnsCache, defaultDnsCnameCache);
        }


        if (parameters.slave) {
            log.info("Slave dns check starting");
            Map<Host, InetSocketAddress> slaves = getSlaves();
            for (Map.Entry<Host, InetSocketAddress> entry : slaves.entrySet()) {
                Future<InetSocketAddress> resolveFuture = arr[0].resolve(InetSocketAddress.createUnresolved(entry.getKey().host(), entry.getKey().port()));
                resolveFuture.syncUninterruptibly();
                entry.setValue(resolveFuture.getNow()); // update in place; safe while iterating
            }
            for (int i = 0; i < MAX_MONITOR_COUNT; i++) {
                // Pick different resolver
                scheduleMonitorSlaves(defaultEventLoopGroup, arr[(i+MAX_MONITOR_COUNT) % arr.length], slaves);
            }
        }
        if (parameters.master) {
            log.info("Master dns check starting");
            scheduleMonitorMaster(defaultEventLoopGroup, arr[0], new Host("clustercfg.production-game-objects.nfbjaw.euw1.cache.amazonaws.com", 6379));
        }

        System.out.println("Press any key for exit");
        Scanner userInput = new Scanner(System.in);
        if (!userInput.hasNext()) ;
        System.out.println("Application will stop soon");
    }

    private static AddressResolver<InetSocketAddress> createResolver(NioEventLoopGroup defaultEventLoopGroup, NioEventLoopGroup group, DefaultDnsCache defaultDnsCache, DefaultDnsCnameCache defaultDnsCnameCache) {
        final EventLoop loop = group.next();
        DnsNameResolverBuilder builder = new DnsNameResolverBuilder()
                .eventLoop(loop).channelType(NioDatagramChannel.class).queryTimeoutMillis(5000)
                .dnsQueryLifecycleObserverFactory(new LoggingDnsQueryLifeCycleObserverFactory(LogLevel.INFO))
                .nameServerProvider(DnsServerAddressStreamProviders.platformDefault()).resolveCache(defaultDnsCache).cnameCache(defaultDnsCnameCache);
        DnsAddressResolverGroup resolverGroup = new DnsAddressResolverGroup(builder);
        return resolverGroup.getResolver(defaultEventLoopGroup.next());
    }

    private static void scheduleMonitorSlaves(NioEventLoopGroup defaultEventLoopGroup, AddressResolver<InetSocketAddress> resolver, Map<Host, InetSocketAddress> slaves) {
        defaultEventLoopGroup.schedule(() -> {
            CompletableFuture<Void> future = monitorSlaves(resolver, slaves);
            future.whenComplete((r,e) -> scheduleMonitorSlaves(defaultEventLoopGroup, resolver, slaves));
        }, MONITOR_INTERVAL_SECOND, TimeUnit.SECONDS);
    }

    private static void scheduleMonitorMaster(NioEventLoopGroup defaultEventLoopGroup, AddressResolver<InetSocketAddress> resolver, Host master) {
        defaultEventLoopGroup.schedule(() -> {
            CompletableFuture<Void> future = monitorMaster(resolver, master);
            future.whenComplete((r,e) -> scheduleMonitorMaster(defaultEventLoopGroup, resolver, master));
        }, MONITOR_INTERVAL_SECOND, TimeUnit.SECONDS);
    }

    private static CompletableFuture<Void> monitorMaster(AddressResolver<InetSocketAddress> resolver, Host master) {
        log.info("Monitor master is working");
        CompletableFuture<Void> promise = new CompletableFuture<>();
        Future<List<InetSocketAddress>> allNodes = resolver.resolveAll(InetSocketAddress.createUnresolved(master.host(), master.port()));
        allNodes.addListener(new FutureListener<List<InetSocketAddress>>() {
            @Override
            public void operationComplete(Future<List<InetSocketAddress>> future) throws Exception {
                if (!future.isSuccess()) {
                    promise.complete(null);
                    log.error("Master monitor err=", future.cause());
                    return;
                }
                List<InetSocketAddress> nodes = new ArrayList<>();
                for (InetSocketAddress address : future.getNow()) {
                    nodes.add(address);
                }
                log.info("Resolve ALL node size is {}", nodes.size());
                promise.complete(null);
            }
        });
        return promise;
    }

    private static CompletableFuture<Void> monitorSlaves(AddressResolver<InetSocketAddress> resolver, Map<Host, InetSocketAddress> slaves) {
        log.info("Monitor slaves is working");
        List<CompletableFuture<Void>> futures = new ArrayList<>();
        for (Map.Entry<Host, InetSocketAddress> entry : slaves.entrySet()) {
            CompletableFuture<Void> promise = new CompletableFuture<>();
            futures.add(promise);

            log.debug("Request sent to resolve ip address for slave host: {}", entry.getKey().host());

            Future<InetSocketAddress> resolveFuture = resolver.resolve(InetSocketAddress.createUnresolved(entry.getKey().host(), entry.getKey().port()));
            resolveFuture.addListener((FutureListener<InetSocketAddress>) future -> {
                if (!future.isSuccess()) {
                    log.error("Unable to resolve " + entry.getKey().host(), future.cause());
                    promise.complete(null);
                    return;
                }

                log.debug("Resolved ip: {} for slave host: {}", future.getNow().getAddress(), entry.getKey().host());

                InetSocketAddress currentSlaveAddr = entry.getValue();
                InetSocketAddress newSlaveAddr = future.getNow();
                if (!newSlaveAddr.getAddress().equals(currentSlaveAddr.getAddress())) {
                    log.info("Detected DNS change. Slave {} has changed ip from {} to {}", entry.getKey().host(), currentSlaveAddr.getAddress().getHostAddress(), newSlaveAddr.getAddress().getHostAddress());
                }
                promise.complete(null);
            });
        }
        return CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]));
    }

    private static Map<Host, InetSocketAddress> getSlaves() {
        Map<Host, InetSocketAddress> slaves = new HashMap<>();
        // add the 44 nodes of the real ElastiCache cluster here (omitted)
        return slaves;
    }

    // Minimal definitions assumed here so the snippet compiles;
    // the originals were omitted from the report.
    record Host(String host, int port) {}

    static class Parameters {
        @CommandLine.Option(names = "--slave")
        boolean slave;

        @CommandLine.Option(names = "--master")
        boolean master;
    }
}

Netty version

4.1.75

JVM version (e.g. java -version)

java-17-amazon-corretto-jdk_17.0.3.6-1_amd64

OS version (e.g. uname -a)

ubuntu-bionic-18.04-amd64

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 12 (5 by maintainers)

Top GitHub Comments

2 reactions
normanmaurer commented, Sep 28, 2022

I will have a look

1 reaction
trustin commented, Nov 21, 2022

I sent out #13014 that adds a new construct called ConcurrencyLimit which abstracts how we should limit concurrent actions, with one simple implementation that’s similar to @mrniko’s AsyncSemaphore. Please let me know what you think. Once approved and merged, I’ll send out a follow-up PR that uses ConcurrencyLimit for limiting DNS queries.
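
For readers who want the gist of the approach before the PR lands, here is a standalone sketch of the semaphore-style limiting idea, built on a plain java.util.concurrent.Semaphore. This is illustrative only; it is not Netty's ConcurrencyLimit API from #13014:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Illustrative stand-in for the ConcurrencyLimit/AsyncSemaphore idea
// discussed above; not the actual API proposed in #13014.
public final class DnsQueryLimiter {
    private final Semaphore permits;

    public DnsQueryLimiter(int maxConcurrentQueries) {
        this.permits = new Semaphore(maxConcurrentQueries);
    }

    // Runs an async action (e.g. a DNS resolve) once a permit is free and
    // releases the permit when the action's future completes.
    public <T> CompletableFuture<T> submit(Supplier<CompletableFuture<T>> action) {
        // the blocking acquire is offloaded so the caller's thread never blocks
        return CompletableFuture.runAsync(permits::acquireUninterruptibly)
                .thenCompose(ignored -> action.get())
                .whenComplete((result, error) -> permits.release());
    }
}

Capping in-flight queries this way keeps a burst of resolutions for 150 nodes from hitting the upstream resolver all at once.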


Top Results From Across the Web

DNS names and underlying IP - Amazon ElastiCache
ElastiCache ensures that both the DNS name and the IP address of the cache node remain the same when cache nodes are recovered...

AWS ElastiCache Redis DNS error - Name or service not known
It can also be an issue if the OP is using some form of third-party DNS resolver, such as Microsoft AD Domain Controller...

Redisson Using AWS Elasticache / Problems Connecting #4131
io.netty.resolver.dns.DnsResolveContext$SearchDomainUnknownHostException: Search domain query failed.

AWS Elasticache name resolution issue on high traffic
I solved the problem installing a DNS cache with dnsmasq on the server. – Franklin G. Mendoza, Aug 27, 2018

AWS Route53 DNS management | Redis Documentation Center
If the cluster nameserver (node) asked is down, given the resolution process, the resolver tries to ask another name server (node) from the...
