
[rabbitmq] aliveness check doesn't work when rabbitmq service is down

See original GitHub issue

Output of the info page:

====================
Collector (v 5.15.0)
====================

  Status date: 2017-07-25 06:08:01 (9s ago)
  Pid: 18239
  Platform: Linux-4.4.41-36.55.amzn1.x86_64-x86_64-with-glibc2.3
  Python Version: 2.7.13, 64bit
  Logs: <stderr>, /var/log/datadog/collector.log, syslog:/dev/log

  Clocks
  ======
  
    NTP offset: -0.0021 s
    System UTC time: 2017-07-25 10:08:10.868431
  
  Paths
  =====
  
    conf.d: /etc/dd-agent/conf.d
    checks.d: /opt/datadog-agent/agent/checks.d
  

  
  Checks
  ======
  
    process (5.15.0)
    ----------------
      - instance #0 [OK]
      - instance #1 [OK]
      - instance #2 [WARNING]
          Warning: No matching process was found
      - instance #3 [OK]
      - instance #4 [OK]
      - instance #5 [OK]
      - Collected 81 metrics, 0 events & 6 service checks
  
    sl_network_tcp (5.15.0)
    -----------------------
      - instance #0 [OK]
      - instance #1 [OK]
      - instance #2 [OK]
      - Collected 3 metrics, 0 events & 0 service checks
  
    network (5.15.0)
    ----------------
      - instance #0 [OK]
      - Collected 19 metrics, 0 events & 0 service checks
  
    ntp (5.15.0)
    ------------
      - Collected 0 metrics, 0 events & 0 service checks
  
    oom (5.15.0)
    ------------
      - instance #0 [OK]
      - Collected 0 metrics, 0 events & 1 service check
  
    slrabbitmq (5.15.0)
    -------------------
      - instance #0 [OK]
      - Collected 0 metrics, 0 events & 1 service check
  
    disk (5.15.0)
    -------------
      - instance #0 [OK]
      - Collected 32 metrics, 0 events & 0 service checks
  
  
  Emitters
  ========
  
    - http_emitter [OK]

====================
Dogstatsd (v 5.15.0)
====================

  Status date: 2017-07-25 06:08:08 (3s ago)
  Pid: 18232
  Platform: Linux-4.4.41-36.55.amzn1.x86_64-x86_64-with-glibc2.3
  Python Version: 2.7.13, 64bit
  Logs: <stderr>, /var/log/datadog/dogstatsd.log, syslog:/dev/log

  Flush count: 43939
  Packet Count: 747665
  Packets per second: 1.7
  Metric count: 9
  Event count: 0
  Service check count: 0

====================
Forwarder (v 5.15.0)
====================

  Status date: 2017-07-25 06:08:11 (1s ago)
  Pid: 18231
  Platform: Linux-4.4.41-36.55.amzn1.x86_64-x86_64-with-glibc2.3
  Python Version: 2.7.13, 64bit
  Logs: <stderr>, /var/log/datadog/forwarder.log, syslog:/dev/log

  Queue Size: 0 bytes
  Queue Length: 0
  Flush Count: 153869
  Transactions received: 84387
  Transactions flushed: 84387
  Transactions rejected: 0
  API Key Status: API Key is valid
  

======================
Trace Agent (v 5.15.0)
======================

  Pid: 18230
  Uptime: 439737 seconds
  Mem alloc: 886952 bytes

  Hostname: mgmt1.XXXXXXXXXXXXXXXXXXXXXx
  Receiver: localhost:8126
  API Endpoint: https://trace.agent.datadoghq.com

  Bytes received (1 min): 0
  Traces received (1 min): 0
  Spans received (1 min): 0

  Bytes sent (1 min): 0
  Traces sent (1 min): 0
  Stats sent (1 min): 0

Additional environment details (Operating System, Cloud provider, etc):

  • Cloud provider: AWS

Steps to reproduce the issue:

  1. Enable the rabbitmq integration for the Datadog agent
  2. Stop the rabbitmq-server process to bring the RabbitMQ service down
  3. The RabbitMQ aliveness check keeps reporting that the RabbitMQ service is up

Describe the results you received: The built-in rabbitmq integration reports an incorrect state of the RabbitMQ service. This happens because of the way the aliveness check is implemented. The latest version of the Datadog agent has the following code for the aliveness check:

    def _check_aliveness(self, instance, base_url, vhosts=None, auth=None, ssl_verify=True, skip_proxy=False):
        """
        Check the aliveness API against all or a subset of vhosts. The API
        will return {"status": "ok"} and a 200 response code in the case
        that the check passes.
        """

        if not vhosts:
            # Fetch a list of _all_ vhosts from the API.
            vhosts_url = urlparse.urljoin(base_url, 'vhosts')
            vhost_proxy = self.get_instance_proxy(instance, vhosts_url)
            vhosts_response = self._get_data(vhosts_url, auth=auth, ssl_verify=ssl_verify, proxies=vhost_proxy)
            vhosts = [v['name'] for v in vhosts_response]

        for vhost in vhosts:
            tags = ['vhost:%s' % vhost]
            # We need to urlencode the vhost because it can be '/'.
            path = u'aliveness-test/%s' % (urllib.quote_plus(vhost))
            aliveness_url = urlparse.urljoin(base_url, path)
            aliveness_proxy = self.get_instance_proxy(instance, aliveness_url)
            aliveness_response = self._get_data(aliveness_url, auth=auth, ssl_verify=ssl_verify, proxies=aliveness_proxy)
            message = u"Response from aliveness API: %s" % aliveness_response

            if aliveness_response.get('status') == 'ok':
                status = AgentCheck.OK
            else:
                status = AgentCheck.CRITICAL

            self.service_check('rabbitmq.aliveness', status, tags, message=message)
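
For reference, the aliveness endpoint this loop builds URLs for can be probed directly. Below is a minimal sketch using requests; the host, port, vhost and guest/guest credentials are assumptions for a default local broker, not values taken from the issue:

    import urllib

    import requests

    base_url = 'http://localhost:15672/api/'    # default management port (assumption)
    vhost = urllib.quote_plus('/')               # '/' must be urlencoded, as in the check above
    url = base_url + 'aliveness-test/%s' % vhost

    try:
        r = requests.get(url, auth=('guest', 'guest'), timeout=5)
        r.raise_for_status()
        print(r.json())                          # a healthy vhost returns {u'status': u'ok'}
    except requests.exceptions.RequestException as e:
        # With rabbitmq-server stopped, requests raises a ConnectionError here,
        # so the probe never gets any JSON back.
        print('aliveness probe failed: %s' % e)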

As the _check_aliveness code above shows, it uses the self._get_data method to fetch data from the RabbitMQ API. Below is the code of the _get_data method:

    def _get_data(self, url, auth=None, ssl_verify=True, proxies={}):
        try:
            r = requests.get(url, auth=auth, proxies=proxies, timeout=self.default_integration_http_timeout, verify=ssl_verify)
            r.raise_for_status()
            return r.json()
        except RequestException as e:
            raise RabbitMQException('Cannot open RabbitMQ API url: {} {}'.format(url, str(e)))
        except ValueError as e:
            raise RabbitMQException('Cannot parse JSON response from API url: {} {}'.format(url, str(e)))

As you can see, in case of a connection problem it raises an exception, but the _check_aliveness function does not handle that exception, so it simply stops sending the service check when the RabbitMQ service is down. Please correct this behavior.
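
A minimal sketch of the kind of handling requested here, reusing the names from the snippets above (RabbitMQException, AgentCheck, self.service_check); this is only an illustration of the idea, not the actual Datadog patch:

    # Same per-vhost loop as in _check_aliveness, but the API call is wrapped
    # so that a connection failure still produces a CRITICAL service check
    # instead of aborting the whole check run.
    for vhost in vhosts:
        tags = ['vhost:%s' % vhost]
        path = u'aliveness-test/%s' % (urllib.quote_plus(vhost))
        aliveness_url = urlparse.urljoin(base_url, path)
        aliveness_proxy = self.get_instance_proxy(instance, aliveness_url)

        try:
            aliveness_response = self._get_data(
                aliveness_url, auth=auth, ssl_verify=ssl_verify, proxies=aliveness_proxy)
        except RabbitMQException as e:
            # The management API is unreachable (e.g. rabbitmq-server is down):
            # report CRITICAL rather than letting the exception kill the check.
            self.service_check('rabbitmq.aliveness', AgentCheck.CRITICAL,
                               tags, message=str(e))
            continue

        message = u"Response from aliveness API: %s" % aliveness_response
        status = AgentCheck.OK if aliveness_response.get('status') == 'ok' else AgentCheck.CRITICAL
        self.service_check('rabbitmq.aliveness', status, tags, message=message)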

Describe the results you expected:

Rabbitmq aliveness check should properly report when rabbitmq service is down.

Additional information you deem important (e.g. issue happens only occasionally): The Datadog status above was captured exactly while the RabbitMQ service was down, and as you can see the integration reported the following result (please note that for internal reasons the default integration was renamed from rabbitmq to slrabbitmq):

    slrabbitmq (5.15.0)
    -------------------
      - instance #0 [OK]
      - Collected 0 metrics, 0 events & 1 service check

and below is the Datadog status with the RabbitMQ service started:

    slrabbitmq (5.15.0)
    -------------------
      - instance #0 [WARNING]
          Warning: Too many queues to fetch. You must choose the queues you are interested in by editing the rabbitmq.yaml configuration file or get in touch with Datadog Support
      - Collected 1847 metrics, 0 events & 3 service checks

So, the check does work when the RabbitMQ service is up.

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

1 reaction
hush-hush commented on Jul 27, 2017

Hi @dpavlov-smartling,

Thanks for the details. For the web interface, I added rabbitmq.status to the list of possible checks for the rabbitmq integration. It should be available in production soon (but I can't give you an exact timing). In the meantime, a workaround would indeed be to create a custom monitor:

(screenshot from the original comment)
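
For anyone scripting the same workaround instead of clicking through the UI, below is a hedged sketch using the datadogpy client; the service-check query format and threshold options are assumptions that should be verified against the current Datadog monitor documentation:

    from datadog import initialize, api

    # Placeholder credentials for the Datadog account.
    initialize(api_key='<YOUR_API_KEY>', app_key='<YOUR_APP_KEY>')

    # Alert when the rabbitmq.status service check (the one mentioned above)
    # reports CRITICAL, i.e. the agent cannot reach the management API.
    api.Monitor.create(
        type='service check',
        query='"rabbitmq.status".over("*").by("host").last(2).count_by_status()',
        name='RabbitMQ status',
        message='The RabbitMQ management API is unreachable.',
        options={'thresholds': {'critical': 1}},
    )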

I’m sorry for the inconvenience.

For the rabbitmq.aliveness: I’ll write a fix to send CRITICAL if the rabbitmq server is down. It will be available with the agent 5.17.0.

Thanks for your patience.

0 reactions
hush-hush commented on Aug 8, 2017

@dpavlov-smartling: The change for rabbitmq.status in the integration section is live. Sorry for the delay.

Read more comments on GitHub >

Top Results From Across the Web

Monitoring - RabbitMQ
In this guide we define monitoring as a process of capturing the behaviour of a system via health checks and metrics over time....
Read more >
RabbitMQ HTTP API call to aliveness-test returns 404 but ...
I'm using the latest stable version (2.8.7) of RabbitMQ and obviously have the management plugin installed for the API to work with the...
Read more >
Chapter 10. Monitoring: Houston, we have a problem
Checking aliveness with the REST API. Testing that RabbitMQ is accepting new connections and able to build an AMQP channel is a good...
Read more >
aliveness-test is missing from version rabbitmq 3.4.1 Erlang ...
It was actually a problem with the queue itself, marked as durable=true. Once I removed it, it was recreated as durable=false and I...
Read more >
RabbitMQ management and troubleshooting - OutSystems 11 ...
You can use the CLI tools provided by RabbitMQ to check the service status. Do the following: Open a command-line console (run as...
Read more >
