
Redis check times out when running into stale connections


The Datadog agent’s current Redis check is implemented with redis-py and caches connections: https://github.com/DataDog/integrations-core/blob/master/redisdb/datadog_checks/redisdb/redisdb.py#L122

When the cached connection goes stale, the check times out and reports an error.
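
For context, here is a heavily simplified, hypothetical sketch of that caching pattern (illustrative only; the class and method names are made up and this is not the actual redisdb.py code):

import redis

class RedisCheckSketch(object):
    """Hypothetical, simplified illustration -- not the real redisdb check."""

    def __init__(self):
        # One cached client per (host, port), reused across check runs.
        self.connections = {}

    def _get_conn(self, host, port):
        key = (host, port)
        if key not in self.connections:
            self.connections[key] = redis.Redis(host=host, port=port, socket_timeout=5)
        return self.connections[key]

    def check(self, instance):
        conn = self._get_conn(instance['host'], instance['port'])
        # If the cached socket was silently dropped while idle, redis-py has to
        # reconnect here, and the check can hit the socket timeout -- the
        # ConnectionError seen in the traceback further down.
        return conn.info()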

Probable solutions (a hedged sketch of the relevant redis-py options follows this list):

  • Disable Redis connection pooling and caching, so each check run gets a fresh connection and never hits a stale one.
  • Allow passing socket_keepalive to the Redis constructor to avoid stale connections, as suggested by redis-py’s author: https://github.com/andymccurdy/redis-py/issues/722#issuecomment-201929330
  • Allow passing retry_on_timeout to retry the check when the connection fails (probably the least desirable option).
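
The second and third options map onto keyword arguments that redis-py’s client constructor already accepts. A minimal sketch of what passing them through might look like (the host, port, and values here are placeholders, not the check’s actual defaults):

import redis

# Illustrative only: redis-py constructor options corresponding to the
# proposed fixes; values are placeholders.
conn = redis.Redis(
    host='10.0.0.4',
    port=6379,
    socket_timeout=5,           # bound each command so a dead socket fails fast
    socket_connect_timeout=5,   # bound the TCP connect separately
    socket_keepalive=True,      # let the kernel probe and drop half-open connections
    retry_on_timeout=True,      # retry a command once if it times out
)
conn.info()  # the command the check runs; a stale cached socket would surface here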

Related issues:

Note that connection pooling is enabled by default with redis-py: https://github.com/andymccurdy/redis-py#connection-pools
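
To make that note concrete, a tiny hedged example (host and port are placeholders): when no connection_pool argument is given, redis-py builds a pool implicitly and reuses its connections across commands, which is where an idle socket can go stale between check runs.

import redis

# Hypothetical host/port, for illustration only.
r = redis.Redis(host='10.0.0.4', port=6379)

# No pool was passed in, yet one exists: redis-py created it implicitly.
print(r.connection_pool)  # e.g. ConnectionPool<Connection<host=10.0.0.4,port=6379,db=0>>

# Commands check a connection out of this pool and return it afterwards, so the
# same TCP socket is typically reused for the next command.
r.ping()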

Output of the info page

====================
Collector (v 5.24.0)
====================

  Status date: 2018-06-04 11:10:36 (10s ago)
  Pid: 46
  Platform: Linux-4.4.111+-x86_64-with-debian-9.4
  Python Version: 2.7.14, 64bit
  Logs: <stderr>, /var/log/datadog/collector.log

  Clocks
  ======
  
    NTP offset: 0.0028 s
    System UTC time: 2018-06-04 11:10:47.413308
  
  Paths
  =====
  
    conf.d: /etc/dd-agent/conf.d
    checks.d: Not found
  
  Hostnames
  =========
  
    socket-hostname: dd-agent-9mlrr
    hostname: <snip>.internal
    socket-fqdn: dd-agent-9mlrr
  
  Checks
  ======
  
    system_core (1.0.0)
    -------------------
      - instance #0 [OK]
      - Collected 1 metric, 0 events & 0 service checks
  
    network (1.5.0)
    ---------------
      - instance #0 [OK]
      - Collected 0 metrics, 0 events & 0 service checks
  
    kubernetes (1.5.0)
    ------------------
      - instance #0 [OK]
      - Collected 175 metrics, 0 events & 3 service checks
  
    redisdb (1.5.0)
    ---------------
      - instance #0 [OK]
      - Collected 32 metrics, 0 events & 1 service check
  
    ntp (1.2.0)
    -----------
      - instance #0 [OK]
      - Collected 1 metric, 0 events & 1 service check
  
    disk (1.2.0)
    ------------
      - instance #0 [OK]
      - Collected 32 metrics, 0 events & 0 service checks
  
    kube_proxy (Unknown Wheel)
    --------------------------
      - Collected 0 metrics, 0 events & 0 service checks
  
    docker_daemon (1.10.0)
    ----------------------
      - instance #0 [OK]
      - Collected 242 metrics, 2 events & 1 service check
  
    http_check (2.0.1)
    ------------------
      - instance #0 [OK]
      - instance #1 [OK]
      - Collected 0 metrics, 0 events & 0 service checks
  
  
  Emitters
  ========
  
    - http_emitter [OK]

====================
Dogstatsd (v 5.24.0)
====================

  Status date: 2018-06-04 11:10:39 (8s ago)
  Pid: 35
  Platform: Linux-4.4.111+-x86_64-with-debian-9.4
  Python Version: 2.7.14, 64bit
  Logs: <stderr>, /var/log/datadog/dogstatsd.log

  Flush count: 35054
  Packet Count: 1162402
  Packets per second: 2.1
  Metric count: 53
  Event count: 0
  Service check count: 0

====================
Forwarder (v 5.24.0)
====================

  Status date: 2018-06-04 11:10:43 (4s ago)
  Pid: 34
  Platform: Linux-4.4.111+-x86_64-with-debian-9.4
  Python Version: 2.7.14, 64bit
  Logs: <stderr>, /var/log/datadog/forwarder.log

  Queue Size: 0 bytes
  Queue Length: 0
  Flush Count: 114032
  Transactions received: 86677
  Transactions flushed: 86677
  Transactions rejected: 0
  API Key Status: API Key is valid
  

======================
Trace Agent (v 5.24.0)
======================

  Pid: 33
  Uptime: 351241 seconds
  Mem alloc: 3756120 bytes

  Hostname: <snip>.internal
  Receiver: 0.0.0.0:8126
  API Endpoint: https://trace.agent.datadoghq.com

  --- Receiver stats (1 min) ---

  From go 1.9.2 (gc-amd64-linux), client v0.5.0
    Traces received: 11 (4256 bytes)
    Spans received: 22
    Services received: 0 (0 bytes)


  --- Writer stats (1 min) ---

  Traces: 5 payloads, 6 traces, 2389 bytes
  Stats: 4 payloads, 4 stats buckets, 2319 bytes
  Services: 0 payloads, 0 services, 0 bytes

Additional environment details (Operating System, Cloud provider, etc):

  • Official docker image 12.6.5240
  • Google Kubernetes Engine 1.8.10.gke0 with Container Optimized OS

Steps to reproduce the issue:

  1. Install the Datadog agent on Kubernetes using the official Docker image 12.6.5240
  2. Check the logs

Describe the results you received:

2018-06-04 11:08:37 UTC | INFO | dd.collector | config(config.py:1249) | initialized checks.d checks: ['system_core', 'network', 'kubernetes', 'redisdb', 'ntp', 'disk', 'kube_proxy', 'docker_daemon', 'http_check']
2018-06-04 11:08:37 UTC | INFO | dd.collector | config(config.py:1250) | initialization failed checks.d checks: []
2018-06-04 11:08:37 UTC | INFO | dd.collector | collector(agent.py:166) | Check reload was successful. Running 10 checks.
2018-06-04 11:08:51 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
Traceback (most recent call last):
  File "/opt/datadog-agent/agent/checks/__init__.py", line 812, in run
    self.check(copy.deepcopy(instance))
  File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/redisdb/redisdb.py", line 377, in check
    self._check_db(instance, custom_tags)
  File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/redisdb/redisdb.py", line 173, in _check_db
    info = conn.info()
  File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/redis/client.py", line 665, in info
    return self.execute_command('INFO')
  File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/redis/client.py", line 578, in execute_command
    connection.send_command(*args)
  File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/redis/connection.py", line 563, in send_command
    self.send_packed_command(self.pack_command(*args))
  File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/redis/connection.py", line 538, in send_packed_command
    self.connect()
  File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/redis/connection.py", line 442, in connect
    raise ConnectionError(self._error_message(e))
ConnectionError: Error connecting to 10.0.0.4:6379. timed out.
2018-06-04 11:08:52 UTC | INFO | dd.collector | checks.http_check(network_checks.py:93) | Starting Thread Pool

This happens frequently, but not every time:

root@dd-agent-9mlrr:/# grep "ERROR.*checks.redisdb" /var/log/datadog/collector.log  | tail -40
2018-06-04 08:48:08 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 08:52:35 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 08:53:05 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 09:03:45 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 09:10:07 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 09:10:37 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 09:20:48 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 09:24:37 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 09:25:07 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 09:26:38 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 09:28:52 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 09:30:52 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 09:36:26 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 09:41:12 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 09:46:18 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 09:47:13 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 09:47:42 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 09:48:12 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 09:54:54 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 09:59:12 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 10:04:06 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 10:25:49 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 10:32:00 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 10:33:14 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 10:33:44 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 10:42:57 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 10:46:05 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 10:51:16 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 10:51:46 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 10:58:26 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 11:00:35 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 11:03:54 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 11:07:02 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 11:08:51 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 11:09:21 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 11:10:11 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 11:11:45 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 11:18:09 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 11:22:06 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 11:22:36 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed

Describe the results you expected:

No error.

Additional information you deem important (e.g. issue happens only occasionally):

Currently this is only observed on a high-load GKE cluster (12 nodes). It does not happen on a low-load GKE cluster (4 nodes) with a similar setup (both clusters are built with Terraform).

Running redis-cli in parallel never triggers timeouts:

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: redis-test
  namespace: monitoring
spec:
  template:
    metadata:
      labels:
        app: redis-test
      name: redis-test
    spec:
      containers:
      - name: redis-test
        image: redis:3.2.11
        imagePullPolicy: IfNotPresent
        command:
          - "sh"
          - "-c"
          - |
            while true ; do printf "$(hostname | tr -d '\n') $(date | tr -d '\n'): " ; timeout 5 redis-cli -h 10.0.0.4 INFO | grep uptime_in_seconds || echo Terminated ; sleep 1 ; done
        resources:
          requests:
            memory: "256Mi"
            cpu: "200m"
          limits:
            memory: "256Mi"
            cpu: "200m"

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

ofek commented, Jun 11, 2018 (1 reaction)

@pdecat You can use datadog/dev-dd-agent:beta.

masci commented, Jun 6, 2018 (1 reaction)

@pdecat a fix that might mitigate the issue was merged with #1668. We’re one week away from releasing 6.3/5.25, but in the meantime you can give the new check a spin by building the package yourself (https://github.com/DataDog/integrations-core/blob/master/docs/dev/README.md#building) and installing it with the pip embedded in the agent (I see you’re using the Docker Agent, so this might be tricky, but it’s still an option).
