Redis check times out when running into stale connections
The Datadog Agent's current Redis check is implemented with redis-py and caches connections: https://github.com/DataDog/integrations-core/blob/master/redisdb/datadog_checks/redisdb/redisdb.py#L122
When the cached connection goes stale, the check times out and reports an error.
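The failure mode can be illustrated with a minimal, self-contained sketch of the connection-caching pattern (all names below are hypothetical, not the check's actual code): the same client object is handed back on every run, so once the underlying TCP connection dies, every subsequent check run hits the dead socket.

```python
# Minimal sketch of a per-instance connection cache (hypothetical names,
# not the check's actual code). The cached client is reused on every run,
# so a stale TCP connection keeps failing until the process restarts.
_cache = {}

def get_client(host, port, factory):
    key = (host, port)
    if key not in _cache:
        _cache[key] = factory(host, port)  # created once on the first run...
    return _cache[key]                     # ...then reused on every later run

# The same object comes back on every call:
a = get_client("10.0.0.4", 6379, lambda h, p: object())
b = get_client("10.0.0.4", 6379, lambda h, p: object())
print(a is b)  # prints True
```

Nothing in this pattern ever validates or refreshes the cached connection, which is why a stale connection persists across check runs.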
Probable solutions:
- disable Redis connection pooling and caching so every check run uses a fresh connection.
- allow passing `socket_keepalive` to the Redis constructor to avoid running into stale connections (as suggested by redis-py's author: https://github.com/andymccurdy/redis-py/issues/722#issuecomment-201929330).
- allow passing `retry_on_timeout` to retry the check when the connection fails (probably the least desirable option).
Related issues:
- https://github.com/andymccurdy/redis-py/issues/306
- https://github.com/andymccurdy/redis-py/issues/722
Note that connection pooling is enabled by default with redis-py: https://github.com/andymccurdy/redis-py#connection-pools
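For reference, `socket_keepalive=True` essentially makes redis-py enable `SO_KEEPALIVE` on the TCP socket it opens, so the kernel probes idle connections and surfaces dead peers instead of leaving a silently stale socket for the next command to trip over. A stdlib-only sketch of what that option does (this is not redis-py's actual code, just the underlying socket option):

```python
import socket

# Enabling TCP keepalive on a raw socket: this is essentially what the
# socket_keepalive option asks redis-py to do on its own connection.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)  # prints True
sock.close()
```

In the check's configuration this would translate to forwarding `socket_keepalive=True` (and, for the retry option, `retry_on_timeout=True`) through to the `redis.Redis(...)` constructor.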
Output of the info page
====================
Collector (v 5.24.0)
====================
Status date: 2018-06-04 11:10:36 (10s ago)
Pid: 46
Platform: Linux-4.4.111+-x86_64-with-debian-9.4
Python Version: 2.7.14, 64bit
Logs: <stderr>, /var/log/datadog/collector.log
Clocks
======
NTP offset: 0.0028 s
System UTC time: 2018-06-04 11:10:47.413308
Paths
=====
conf.d: /etc/dd-agent/conf.d
checks.d: Not found
Hostnames
=========
socket-hostname: dd-agent-9mlrr
hostname: <snip>.internal
socket-fqdn: dd-agent-9mlrr
Checks
======
system_core (1.0.0)
-------------------
- instance #0 [OK]
- Collected 1 metric, 0 events & 0 service checks
network (1.5.0)
---------------
- instance #0 [OK]
- Collected 0 metrics, 0 events & 0 service checks
kubernetes (1.5.0)
------------------
- instance #0 [OK]
- Collected 175 metrics, 0 events & 3 service checks
redisdb (1.5.0)
---------------
- instance #0 [OK]
- Collected 32 metrics, 0 events & 1 service check
ntp (1.2.0)
-----------
- instance #0 [OK]
- Collected 1 metric, 0 events & 1 service check
disk (1.2.0)
------------
- instance #0 [OK]
- Collected 32 metrics, 0 events & 0 service checks
kube_proxy (Unknown Wheel)
--------------------------
- Collected 0 metrics, 0 events & 0 service checks
docker_daemon (1.10.0)
----------------------
- instance #0 [OK]
- Collected 242 metrics, 2 events & 1 service check
http_check (2.0.1)
------------------
- instance #0 [OK]
- instance #1 [OK]
- Collected 0 metrics, 0 events & 0 service checks
Emitters
========
- http_emitter [OK]
====================
Dogstatsd (v 5.24.0)
====================
Status date: 2018-06-04 11:10:39 (8s ago)
Pid: 35
Platform: Linux-4.4.111+-x86_64-with-debian-9.4
Python Version: 2.7.14, 64bit
Logs: <stderr>, /var/log/datadog/dogstatsd.log
Flush count: 35054
Packet Count: 1162402
Packets per second: 2.1
Metric count: 53
Event count: 0
Service check count: 0
====================
Forwarder (v 5.24.0)
====================
Status date: 2018-06-04 11:10:43 (4s ago)
Pid: 34
Platform: Linux-4.4.111+-x86_64-with-debian-9.4
Python Version: 2.7.14, 64bit
Logs: <stderr>, /var/log/datadog/forwarder.log
Queue Size: 0 bytes
Queue Length: 0
Flush Count: 114032
Transactions received: 86677
Transactions flushed: 86677
Transactions rejected: 0
API Key Status: API Key is valid
======================
Trace Agent (v 5.24.0)
======================
Pid: 33
Uptime: 351241 seconds
Mem alloc: 3756120 bytes
Hostname: <snip>.internal
Receiver: 0.0.0.0:8126
API Endpoint: https://trace.agent.datadoghq.com
--- Receiver stats (1 min) ---
From go 1.9.2 (gc-amd64-linux), client v0.5.0
Traces received: 11 (4256 bytes)
Spans received: 22
Services received: 0 (0 bytes)
--- Writer stats (1 min) ---
Traces: 5 payloads, 6 traces, 2389 bytes
Stats: 4 payloads, 4 stats buckets, 2319 bytes
Services: 0 payloads, 0 services, 0 bytes
Additional environment details (Operating System, Cloud provider, etc):
- Official docker image 12.6.5240
- Google Kubernetes Engine 1.8.10.gke0 with Container Optimized OS
Steps to reproduce the issue:
- install datadog agent on kubernetes using official docker image 12.6.5240
- check the logs
Describe the results you received:
2018-06-04 11:08:37 UTC | INFO | dd.collector | config(config.py:1249) | initialized checks.d checks: ['system_core', 'network', 'kubernetes', 'redisdb', 'ntp', 'disk', 'kube_proxy', 'docker_daemon', 'http_check']
2018-06-04 11:08:37 UTC | INFO | dd.collector | config(config.py:1250) | initialization failed checks.d checks: []
2018-06-04 11:08:37 UTC | INFO | dd.collector | collector(agent.py:166) | Check reload was successful. Running 10 checks.
2018-06-04 11:08:51 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
Traceback (most recent call last):
  File "/opt/datadog-agent/agent/checks/__init__.py", line 812, in run
    self.check(copy.deepcopy(instance))
  File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/redisdb/redisdb.py", line 377, in check
    self._check_db(instance, custom_tags)
  File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/redisdb/redisdb.py", line 173, in _check_db
    info = conn.info()
  File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/redis/client.py", line 665, in info
    return self.execute_command('INFO')
  File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/redis/client.py", line 578, in execute_command
    connection.send_command(*args)
  File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/redis/connection.py", line 563, in send_command
    self.send_packed_command(self.pack_command(*args))
  File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/redis/connection.py", line 538, in send_packed_command
    self.connect()
  File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/redis/connection.py", line 442, in connect
    raise ConnectionError(self._error_message(e))
ConnectionError: Error connecting to 10.0.0.4:6379. timed out.
2018-06-04 11:08:52 UTC | INFO | dd.collector | checks.http_check(network_checks.py:93) | Starting Thread Pool
This happens frequently, but not every time:
root@dd-agent-9mlrr:/# grep "ERROR.*checks.redisdb" /var/log/datadog/collector.log | tail -40
2018-06-04 08:48:08 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 08:52:35 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 08:53:05 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 09:03:45 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 09:10:07 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 09:10:37 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 09:20:48 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 09:24:37 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 09:25:07 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 09:26:38 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 09:28:52 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 09:30:52 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 09:36:26 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 09:41:12 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 09:46:18 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 09:47:13 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 09:47:42 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 09:48:12 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 09:54:54 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 09:59:12 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 10:04:06 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 10:25:49 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 10:32:00 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 10:33:14 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 10:33:44 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 10:42:57 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 10:46:05 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 10:51:16 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 10:51:46 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 10:58:26 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 11:00:35 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 11:03:54 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 11:07:02 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 11:08:51 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 11:09:21 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 11:10:11 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 11:11:45 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 11:18:09 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 11:22:06 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
2018-06-04 11:22:36 UTC | ERROR | dd.collector | checks.redisdb(__init__.py:829) | Check 'redisdb' instance #0 failed
Describe the results you expected:
No error.
Additional information you deem important (e.g. issue happens only occasionally):
Currently only observed on a high-load GKE cluster (12 nodes). It does not happen on a low-load GKE cluster (4 nodes) with a similar setup (both built with Terraform).
Running redis-cli in parallel never triggers timeouts:
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: redis-test
  namespace: monitoring
spec:
  template:
    metadata:
      labels:
        app: redis-test
        name: redis-test
    spec:
      containers:
      - name: redis-test
        image: redis:3.2.11
        imagePullPolicy: IfNotPresent
        command:
        - "sh"
        - "-c"
        - |
          while true ; do printf "$(hostname | tr -d '\n') $(date | tr -d '\n'): " ; timeout 5 redis-cli -h 10.0.0.4 INFO | grep uptime_in_seconds || echo Terminated ; sleep 1 ; done
        resources:
          requests:
            memory: "256Mi"
            cpu: "200m"
          limits:
            memory: "256Mi"
            cpu: "200m"
Issue Analytics
- Created 5 years ago
- Comments: 6 (6 by maintainers)
Top GitHub Comments
@pdecat You can use `datadog/dev-dd-agent:beta`.

@pdecat a fix that might mitigate the issue was merged with #1668. We're one week away from releasing 6.3/5.25, but in the meantime you can give the new check a spin if you like by building the package yourself (https://github.com/DataDog/integrations-core/blob/master/docs/dev/README.md#building) and installing it with the `pip` embedded in the agent (I see you're using the Docker Agent, so this might be tricky, but still an option).