[rabbitmq] aliveness check doesn't work when rabbitmq service is down
**Output of the info page**
====================
Collector (v 5.15.0)
====================
Status date: 2017-07-25 06:08:01 (9s ago)
Pid: 18239
Platform: Linux-4.4.41-36.55.amzn1.x86_64-x86_64-with-glibc2.3
Python Version: 2.7.13, 64bit
Logs: <stderr>, /var/log/datadog/collector.log, syslog:/dev/log
Clocks
======
NTP offset: -0.0021 s
System UTC time: 2017-07-25 10:08:10.868431
Paths
=====
conf.d: /etc/dd-agent/conf.d
checks.d: /opt/datadog-agent/agent/checks.d
Checks
======
process (5.15.0)
----------------
- instance #0 [OK]
- instance #1 [OK]
- instance #2 [WARNING]
Warning: No matching process was found
- instance #3 [OK]
- instance #4 [OK]
- instance #5 [OK]
- Collected 81 metrics, 0 events & 6 service checks
sl_network_tcp (5.15.0)
-----------------------
- instance #0 [OK]
- instance #1 [OK]
- instance #2 [OK]
- Collected 3 metrics, 0 events & 0 service checks
network (5.15.0)
----------------
- instance #0 [OK]
- Collected 19 metrics, 0 events & 0 service checks
ntp (5.15.0)
------------
- Collected 0 metrics, 0 events & 0 service checks
oom (5.15.0)
------------
- instance #0 [OK]
- Collected 0 metrics, 0 events & 1 service check
slrabbitmq (5.15.0)
-------------------
- instance #0 [OK]
- Collected 0 metrics, 0 events & 1 service check
disk (5.15.0)
-------------
- instance #0 [OK]
- Collected 32 metrics, 0 events & 0 service checks
Emitters
========
- http_emitter [OK]
====================
Dogstatsd (v 5.15.0)
====================
Status date: 2017-07-25 06:08:08 (3s ago)
Pid: 18232
Platform: Linux-4.4.41-36.55.amzn1.x86_64-x86_64-with-glibc2.3
Python Version: 2.7.13, 64bit
Logs: <stderr>, /var/log/datadog/dogstatsd.log, syslog:/dev/log
Flush count: 43939
Packet Count: 747665
Packets per second: 1.7
Metric count: 9
Event count: 0
Service check count: 0
====================
Forwarder (v 5.15.0)
====================
Status date: 2017-07-25 06:08:11 (1s ago)
Pid: 18231
Platform: Linux-4.4.41-36.55.amzn1.x86_64-x86_64-with-glibc2.3
Python Version: 2.7.13, 64bit
Logs: <stderr>, /var/log/datadog/forwarder.log, syslog:/dev/log
Queue Size: 0 bytes
Queue Length: 0
Flush Count: 153869
Transactions received: 84387
Transactions flushed: 84387
Transactions rejected: 0
API Key Status: API Key is valid
======================
Trace Agent (v 5.15.0)
======================
Pid: 18230
Uptime: 439737 seconds
Mem alloc: 886952 bytes
Hostname: mgmt1.XXXXXXXXXXXXXXXXXXXXXx
Receiver: localhost:8126
API Endpoint: https://trace.agent.datadoghq.com
Bytes received (1 min): 0
Traces received (1 min): 0
Spans received (1 min): 0
Bytes sent (1 min): 0
Traces sent (1 min): 0
Stats sent (1 min): 0
Additional environment details (Operating System, Cloud provider, etc):
- Cloud provider: AWS
Steps to reproduce the issue:
- Enable the rabbitmq integration for Datadog
- Stop the rabbitmq-server process to bring the rabbitmq service down
- The RabbitMQ aliveness check continues to report that the rabbitmq service is up
Describe the results you received: The built-in rabbitmq integration reports an incorrect state for the rabbitmq service. This happens because of the way the aliveness check is implemented. The latest version of Datadog has the following code for the aliveness check:
def _check_aliveness(self, instance, base_url, vhosts=None, auth=None, ssl_verify=True, skip_proxy=False):
    """
    Check the aliveness API against all or a subset of vhosts. The API
    will return {"status": "ok"} and a 200 response code in the case
    that the check passes.
    """
    if not vhosts:
        # Fetch a list of _all_ vhosts from the API.
        vhosts_url = urlparse.urljoin(base_url, 'vhosts')
        vhost_proxy = self.get_instance_proxy(instance, vhosts_url)
        vhosts_response = self._get_data(vhosts_url, auth=auth, ssl_verify=ssl_verify, proxies=vhost_proxy)
        vhosts = [v['name'] for v in vhosts_response]

    for vhost in vhosts:
        tags = ['vhost:%s' % vhost]
        # We need to urlencode the vhost because it can be '/'.
        path = u'aliveness-test/%s' % (urllib.quote_plus(vhost))
        aliveness_url = urlparse.urljoin(base_url, path)
        aliveness_proxy = self.get_instance_proxy(instance, aliveness_url)
        aliveness_response = self._get_data(aliveness_url, auth=auth, ssl_verify=ssl_verify, proxies=aliveness_proxy)
        message = u"Response from aliveness API: %s" % aliveness_response
        if aliveness_response.get('status') == 'ok':
            status = AgentCheck.OK
        else:
            status = AgentCheck.CRITICAL
        self.service_check('rabbitmq.aliveness', status, tags, message=message)
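As an aside, the quote_plus call in the code above matters because the default RabbitMQ vhost is named '/', which would otherwise be swallowed as a path separator in the URL. A quick illustration (using Python 3's urllib.parse; the agent code above runs the Python 2 urllib equivalent):

```python
from urllib.parse import quote_plus  # Python 3 equivalent of Python 2's urllib.quote_plus

# The default vhost is '/', which must be percent-encoded in the URL path.
path = u'aliveness-test/%s' % quote_plus('/')
print(path)  # aliveness-test/%2F
```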
As you can see, it uses the function self._get_data to fetch data from the RabbitMQ API. Below is the code of the _get_data function:
def _get_data(self, url, auth=None, ssl_verify=True, proxies={}):
    try:
        r = requests.get(url, auth=auth, proxies=proxies, timeout=self.default_integration_http_timeout, verify=ssl_verify)
        r.raise_for_status()
        return r.json()
    except RequestException as e:
        raise RabbitMQException('Cannot open RabbitMQ API url: {} {}'.format(url, str(e)))
    except ValueError as e:
        raise RabbitMQException('Cannot parse JSON response from API url: {} {}'.format(url, str(e)))
As you can see, in case of a connection issue it raises an exception, but _check_aliveness has no handling for that exception, so the check simply stops sending metrics when the rabbitmq service is down. Please correct this behavior.
Describe the results you expected:
The RabbitMQ aliveness check should properly report when the rabbitmq service is down.
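A minimal sketch of the expected behavior, with hypothetical helper names (the real fix would live inside _check_aliveness, catching the RabbitMQException raised by _get_data and mapping it to a CRITICAL service check instead of letting it abort the run):

```python
class RabbitMQException(Exception):
    """Raised by the (hypothetical) data fetcher when the API is unreachable."""

# Stand-ins for the AgentCheck.OK / AgentCheck.CRITICAL status codes.
OK, CRITICAL = 0, 2

def check_aliveness(get_data, vhosts):
    """Return {vhost: status}. get_data(vhost) raises RabbitMQException on failure."""
    results = {}
    for vhost in vhosts:
        try:
            response = get_data(vhost)
        except RabbitMQException:
            # Server down or unreachable: report CRITICAL rather than
            # silently dropping the service check.
            results[vhost] = CRITICAL
            continue
        results[vhost] = OK if response.get('status') == 'ok' else CRITICAL
    return results

def server_down(vhost):
    raise RabbitMQException('Cannot open RabbitMQ API url')

print(check_aliveness(server_down, ['/']))  # {'/': 2}
```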
Additional information you deem important (e.g. issue happens only occasionally): The Datadog status above was captured exactly while the rabbitmq service was down, and as you can see the integration showed the following result (note that for internal reasons the default integration was renamed from rabbitmq to slrabbitmq):
slrabbitmq (5.15.0)
-------------------
- instance #0 [OK]
- Collected 0 metrics, 0 events & 1 service check
and below is the Datadog status with the rabbitmq service started:
slrabbitmq (5.15.0)
-------------------
- instance #0 [WARNING]
Warning: Too many queues to fetch. You must choose the queues you are interested in by editing the rabbitmq.yaml configuration file or get in touch with Datadog Support
- Collected 1847 metrics, 0 events & 3 service checks
So, the check does work when the rabbitmq service is up.
Issue Analytics
- Created: 6 years ago
- Comments: 7 (4 by maintainers)
Top GitHub Comments
Hi @dpavlov-smartling,
Thanks for the precision. For the web interface, I added rabbitmq.status to the list of possible checks for the rabbitmq integration. It should be available in production soon (but I can't give you an exact timing). In the meantime, a workaround would indeed be to create a custom monitor. I'm sorry for the inconvenience.
For rabbitmq.aliveness: I'll write a fix to send CRITICAL if the rabbitmq server is down. It will be available with agent 5.17.0. Thanks for your patience.
@dpavlov-smartling: The change for rabbitmq.status in the integration section is live. Sorry for the delay.