[rabbitmq] aliveness check doesn't work when rabbitmq service is down
**Output of the info page**
====================
Collector (v 5.15.0)
====================
Status date: 2017-07-25 06:08:01 (9s ago)
Pid: 18239
Platform: Linux-4.4.41-36.55.amzn1.x86_64-x86_64-with-glibc2.3
Python Version: 2.7.13, 64bit
Logs: <stderr>, /var/log/datadog/collector.log, syslog:/dev/log
Clocks
======
NTP offset: -0.0021 s
System UTC time: 2017-07-25 10:08:10.868431
Paths
=====
conf.d: /etc/dd-agent/conf.d
checks.d: /opt/datadog-agent/agent/checks.d
Checks
======
process (5.15.0)
----------------
- instance #0 [OK]
- instance #1 [OK]
- instance #2 [WARNING]
Warning: No matching process was found
- instance #3 [OK]
- instance #4 [OK]
- instance #5 [OK]
- Collected 81 metrics, 0 events & 6 service checks
sl_network_tcp (5.15.0)
-----------------------
- instance #0 [OK]
- instance #1 [OK]
- instance #2 [OK]
- Collected 3 metrics, 0 events & 0 service checks
network (5.15.0)
----------------
- instance #0 [OK]
- Collected 19 metrics, 0 events & 0 service checks
ntp (5.15.0)
------------
- Collected 0 metrics, 0 events & 0 service checks
oom (5.15.0)
------------
- instance #0 [OK]
- Collected 0 metrics, 0 events & 1 service check
slrabbitmq (5.15.0)
-------------------
- instance #0 [OK]
- Collected 0 metrics, 0 events & 1 service check
disk (5.15.0)
-------------
- instance #0 [OK]
- Collected 32 metrics, 0 events & 0 service checks
Emitters
========
- http_emitter [OK]
====================
Dogstatsd (v 5.15.0)
====================
Status date: 2017-07-25 06:08:08 (3s ago)
Pid: 18232
Platform: Linux-4.4.41-36.55.amzn1.x86_64-x86_64-with-glibc2.3
Python Version: 2.7.13, 64bit
Logs: <stderr>, /var/log/datadog/dogstatsd.log, syslog:/dev/log
Flush count: 43939
Packet Count: 747665
Packets per second: 1.7
Metric count: 9
Event count: 0
Service check count: 0
====================
Forwarder (v 5.15.0)
====================
Status date: 2017-07-25 06:08:11 (1s ago)
Pid: 18231
Platform: Linux-4.4.41-36.55.amzn1.x86_64-x86_64-with-glibc2.3
Python Version: 2.7.13, 64bit
Logs: <stderr>, /var/log/datadog/forwarder.log, syslog:/dev/log
Queue Size: 0 bytes
Queue Length: 0
Flush Count: 153869
Transactions received: 84387
Transactions flushed: 84387
Transactions rejected: 0
API Key Status: API Key is valid
======================
Trace Agent (v 5.15.0)
======================
Pid: 18230
Uptime: 439737 seconds
Mem alloc: 886952 bytes
Hostname: mgmt1.XXXXXXXXXXXXXXXXXXXXXx
Receiver: localhost:8126
API Endpoint: https://trace.agent.datadoghq.com
Bytes received (1 min): 0
Traces received (1 min): 0
Spans received (1 min): 0
Bytes sent (1 min): 0
Traces sent (1 min): 0
Stats sent (1 min): 0
Additional environment details (Operating System, Cloud provider, etc):
- Cloud provider: AWS
Steps to reproduce the issue:
- Enable the rabbitmq integration for Datadog
- Stop the rabbitmq-server process to bring the rabbitmq service down
- The RabbitMQ aliveness check continues to report that the rabbitmq service is up
Describe the results you received: The built-in rabbitmq integration reports an incorrect state for the rabbitmq service. This happens because of the way the aliveness check is implemented. The latest version of Datadog has the following code for the aliveness check:
def _check_aliveness(self, instance, base_url, vhosts=None, auth=None, ssl_verify=True, skip_proxy=False):
    """
    Check the aliveness API against all or a subset of vhosts. The API
    will return {"status": "ok"} and a 200 response code in the case
    that the check passes.
    """
    if not vhosts:
        # Fetch a list of _all_ vhosts from the API.
        vhosts_url = urlparse.urljoin(base_url, 'vhosts')
        vhost_proxy = self.get_instance_proxy(instance, vhosts_url)
        vhosts_response = self._get_data(vhosts_url, auth=auth, ssl_verify=ssl_verify, proxies=vhost_proxy)
        vhosts = [v['name'] for v in vhosts_response]

    for vhost in vhosts:
        tags = ['vhost:%s' % vhost]
        # We need to urlencode the vhost because it can be '/'.
        path = u'aliveness-test/%s' % (urllib.quote_plus(vhost))
        aliveness_url = urlparse.urljoin(base_url, path)
        aliveness_proxy = self.get_instance_proxy(instance, aliveness_url)
        aliveness_response = self._get_data(aliveness_url, auth=auth, ssl_verify=ssl_verify, proxies=aliveness_proxy)
        message = u"Response from aliveness API: %s" % aliveness_response
        if aliveness_response.get('status') == 'ok':
            status = AgentCheck.OK
        else:
            status = AgentCheck.CRITICAL
        self.service_check('rabbitmq.aliveness', status, tags, message=message)
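As an aside, the quote_plus call in the code above matters because the default RabbitMQ vhost is named '/', which would otherwise be swallowed as a path separator in the URL. A quick illustration (using Python 3's urllib.parse; the agent code above runs the Python 2 urllib equivalent):

```python
from urllib.parse import quote_plus  # Python 3 equivalent of Python 2's urllib.quote_plus

# The default vhost is '/', which must be percent-encoded in the URL path.
path = u'aliveness-test/%s' % quote_plus('/')
print(path)  # aliveness-test/%2F
```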
As you can see, it uses the function self._get_data to fetch data from the RabbitMQ API. Below is the code of the _get_data function:
def _get_data(self, url, auth=None, ssl_verify=True, proxies={}):
    try:
        r = requests.get(url, auth=auth, proxies=proxies, timeout=self.default_integration_http_timeout, verify=ssl_verify)
        r.raise_for_status()
        return r.json()
    except RequestException as e:
        raise RabbitMQException('Cannot open RabbitMQ API url: {} {}'.format(url, str(e)))
    except ValueError as e:
        raise RabbitMQException('Cannot parse JSON response from API url: {} {}'.format(url, str(e)))
As you can see, in case of a connection issue it raises an exception, but _check_aliveness has no handling for that exception, so the check simply stops sending metrics when the rabbitmq service is down. Please correct this behavior.
Describe the results you expected:
The RabbitMQ aliveness check should properly report when the rabbitmq service is down.
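A minimal sketch of the expected behavior, with hypothetical helper names (the real fix would live inside _check_aliveness, catching the RabbitMQException raised by _get_data and mapping it to a CRITICAL service check instead of letting it abort the run):

```python
class RabbitMQException(Exception):
    """Raised by the (hypothetical) data fetcher when the API is unreachable."""

# Stand-ins for the AgentCheck.OK / AgentCheck.CRITICAL status codes.
OK, CRITICAL = 0, 2

def check_aliveness(get_data, vhosts):
    """Return {vhost: status}. get_data(vhost) raises RabbitMQException on failure."""
    results = {}
    for vhost in vhosts:
        try:
            response = get_data(vhost)
        except RabbitMQException:
            # Server down or unreachable: report CRITICAL rather than
            # silently dropping the service check.
            results[vhost] = CRITICAL
            continue
        results[vhost] = OK if response.get('status') == 'ok' else CRITICAL
    return results

def server_down(vhost):
    raise RabbitMQException('Cannot open RabbitMQ API url')

print(check_aliveness(server_down, ['/']))  # {'/': 2}
```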
Additional information you deem important (e.g. issue happens only occasionally): The Datadog status above was captured exactly while the rabbitmq service was down, and as you can see the integration showed the following result (note that for internal reasons the default integration was renamed from rabbitmq to slrabbitmq):
slrabbitmq (5.15.0)
-------------------
- instance #0 [OK]
- Collected 0 metrics, 0 events & 1 service check
and below is the Datadog status with the rabbitmq service started:
slrabbitmq (5.15.0)
-------------------
- instance #0 [WARNING]
Warning: Too many queues to fetch. You must choose the queues you are interested in by editing the rabbitmq.yaml configuration file or get in touch with Datadog Support
- Collected 1847 metrics, 0 events & 3 service checks
So, the check does work when the rabbitmq service is up.
Issue Analytics
- Created: 6 years ago
- Comments: 7 (4 by maintainers)
Top GitHub Comments
Hi @dpavlov-smartling,
Thanks for the precision. For the web interface, I added rabbitmq.status to the list of possible checks for the rabbitmq integration. It should be available in production soon (but I can't give you an exact timing). In the meantime, a workaround would indeed be to create a custom monitor. I'm sorry for the inconvenience.
For rabbitmq.aliveness: I'll write a fix to send CRITICAL if the rabbitmq server is down. It will be available with agent 5.17.0. Thanks for your patience.
@dpavlov-smartling: The change for rabbitmq.status in the integration section is live. Sorry for the delay.