Confusing / inconsistent errors from interaction between Pebble and Juju events and handlers

I’ve frequently seen charm authors who are new to the sidecar pattern write something along the lines of this in their charm:

from ops.charm import CharmBase


class MyCharm(CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        self.framework.observe(self.on.mycharm_pebble_ready, self._create_layer)
        self.framework.observe(self.on.config_changed, self._restart_service)

    def _create_layer(self, event):
        container = self.unit.get_container("mycharm")
        container.add_layer("mycharm", {"services": {"mycharm": {...}}})
        container.autostart()

    def _restart_service(self, event):
        container = self.unit.get_container("mycharm")
        # presumably introspect and / or update the service, then...
        container.stop("mycharm")
        container.start("mycharm")

This seems like a straightforward, reasonable first approach, but it can result in any one of five different outcomes:

1. It might work fine. Depending on timing and how the _restart_service handler does or doesn’t implement updating the service based on config changes, this might deploy and run fine, at least most of the time.
2. It might raise ops.pebble.ConnectionError. Depending on how quickly Pebble becomes ready to accept connections, it might fail trying to talk to Pebble at all. Worse, this could be an intermittent failure. Additionally, this won’t ever happen in unit tests because the _TestingPebbleClient is always ready immediately.
3. It might raise RuntimeError: 400 Bad Request: service "mycharm" does not exist. This is somewhat related to #514 but is slightly different and might be specific to the _TestingPebbleClient.
4. It might raise ops.pebble.APIError: 400 Bad Request: service "mycharm" does not exist. A charm that raised the previous RuntimeError during unit tests would most likely raise this during an actual deployment.
5. It might raise ops.model.ModelError: service 'mycharm' not found. If the _restart_service handler does an explicit container.get_service("mycharm"), then it will get this rather than either of the previous two errors, unless it calls container.add_layer(layer_name, layer_definition, combine=True) first.

It would at least be good to ensure that the last three cases all result in a single ops.pebble.UnknownServiceError or similar, but I’ve found that new charmers will still be confused as to why the service isn’t recognized despite them having defined it during the pebble-ready event. Maybe the additional message on the UnknownServiceError can include a hint such as "(typically due to referencing a service before add_layer is called)".
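As a rough illustration of that suggestion, here is a minimal sketch of a single error type carrying the hint. Everything in it is an assumption for illustration: UnknownServiceError is only a proposed name, StubContainer is a stand-in rather than the real ops.model.Container, and the "command" value is a placeholder. It also assumes the config-changed handler runs before the pebble-ready handler has added the layer.

```python
class UnknownServiceError(Exception):
    """Hypothetical single error for any reference to an undefined service."""

    def __init__(self, service):
        self.service = service
        super().__init__(
            f"service {service!r} not found "
            "(typically due to referencing a service before add_layer is called)"
        )


class StubContainer:
    """Stand-in for a workload container; not the real ops.model.Container."""

    def __init__(self):
        self._services = {}

    def add_layer(self, label, layer):
        # Record every service the layer defines.
        self._services.update(layer.get("services", {}))

    def stop(self, service):
        if service not in self._services:
            raise UnknownServiceError(service)


container = StubContainer()

# Assumed ordering: the config-changed handler runs before the layer exists.
try:
    container.stop("mycharm")
    outcome = "stopped"
except UnknownServiceError as err:
    outcome = str(err)

# The pebble-ready handler runs later and defines the service.
container.add_layer("mycharm", {"services": {"mycharm": {"command": "/bin/true"}}})
container.stop("mycharm")  # no error once the layer has been added
```

With one error type, a handler needs a single except clause, and the message itself points at the likely cause.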

It would also be good to make the testing harness always raise ConnectionError until the pebble-ready event is triggered, to force charm authors to consider that possibility.
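One way to picture that harness behaviour is a test double that refuses every call until it is told Pebble is ready. This is only a sketch under stated assumptions: StrictFakePebble, mark_ready, and PebbleConnectionError are hypothetical names, not part of the ops testing harness.

```python
class PebbleConnectionError(Exception):
    """Stand-in for ops.pebble.ConnectionError in this sketch."""


class StrictFakePebble:
    """Hypothetical test double that refuses calls until marked ready."""

    def __init__(self):
        self._ready = False
        self.started = []

    def mark_ready(self):
        # A harness would call this at the point where it emits pebble-ready.
        self._ready = True

    def start_services(self, services):
        if not self._ready:
            raise PebbleConnectionError("Pebble is not ready yet")
        self.started.extend(services)


client = StrictFakePebble()

# Any handler that talks to Pebble before pebble-ready now fails in tests too.
try:
    client.start_services(["mycharm"])
    early_call_failed = False
except PebbleConnectionError:
    early_call_failed = True

client.mark_ready()
client.start_services(["mycharm"])
```

A double like this would surface the intermittent deployment failure as a deterministic unit test failure.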

A container.restart(service) helper would be nice, as well.
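A helper along those lines could be as small as the sketch below. It is written as a free function against a recording stand-in, so the names RecordingContainer and restart are assumptions for illustration and say nothing about how the library itself would implement it.

```python
class RecordingContainer:
    """Stand-in that records calls; not the real ops.model.Container."""

    def __init__(self):
        self.calls = []

    def stop(self, *services):
        self.calls.append(("stop", services))

    def start(self, *services):
        self.calls.append(("start", services))


def restart(container, *services):
    """Hypothetical helper: stop the named services, then start them again."""
    if not services:
        raise TypeError("restart expected at least one service name")
    container.stop(*services)
    container.start(*services)


container = RecordingContainer()
restart(container, "mycharm")
```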

Top GitHub Comments

pengale commented, Jun 30, 2021

Agree that we should try to avoid exposing the exception to the operator framework, mainly because exceptions should be reserved for exceptional situations – this is something routine that we should handle in a more routine fashion.

The underlying design of Juju is to just let a hook fail with a non-zero exit code if there’s a timing issue or similar with running it. I like rbarry82’s design above, though I’d like to see the typical pattern look like:

def mycoolfunc():
    if not container.pebble.ready:
        return  # Or do something like 'raise WaitingForInfra'
    ...

I think that’s most in keeping with Python best practices, while handling the vagaries of an asynchronous system.
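To make the guard-clause idea concrete, here is a self-contained sketch covering both variants mentioned in the comment: returning early, or raising a dedicated exception. The readiness flag follows the spirit of the proposal above and is not an existing attribute. Pebble, Container, WaitingForInfra, and restart_service are all hypothetical stand-ins.

```python
class WaitingForInfra(Exception):
    """Hypothetical signal that a dependency is not available yet."""


class Pebble:
    """Stand-in holding the proposed readiness flag."""

    def __init__(self, ready=False):
        self.ready = ready


class Container:
    """Stand-in container; not the real ops.model.Container."""

    def __init__(self, ready=False):
        self.pebble = Pebble(ready)
        self.restarted = []


def restart_service(container, strict=False):
    # Guard clause: do nothing risky until Pebble can be reached.
    if not container.pebble.ready:
        if strict:
            raise WaitingForInfra("Pebble is not ready")
        return False  # skip quietly; a later event can retry
    container.restarted.append("mycharm")
    return True


not_ready = Container(ready=False)
ready = Container(ready=True)
skipped = restart_service(not_ready)
done = restart_service(ready)
```

Returning early keeps the hook from failing, while the strict variant leaves it to the caller to decide how to report the wait.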

johnsca commented, Jun 30, 2021

OK, so I also agree that these errors don’t belong in the Juju log, but more because I think it will be incumbent on the charm author to handle them. If we encourage authors to handle this with guard clauses, I think that will be fine in most cases, though there’s a small chance of a race condition. In either case, if they skip the check, something will fail unexpectedly and an error will end up in the log. Ideally, that error points them quickly to what they need to fix, and I think a stack trace with a PebbleNotReady error at the bottom makes it very clear what went wrong and how to fix it. In the end, the “I forgot to check it” case is probably going to end up with a PebbleNotReady or some other exception being thrown anyway from whatever unguarded code tries and fails to talk to Pebble.
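The race mentioned here can be sketched as a readiness check that passes followed by a call that still fails. The example below pairs the guard with an except clause so the handler copes with both cases. PebbleNotReady is the error name floated in this comment, and FlakyContainer and handle_config_changed are hypothetical stand-ins, not library code.

```python
class PebbleNotReady(Exception):
    """Hypothetical error named in the comment above."""


class FlakyContainer:
    """Stand-in whose connection drops after the readiness check."""

    def __init__(self):
        self.ready = True

    def restart(self, service):
        # Simulate Pebble going away between the check and the call.
        raise PebbleNotReady(f"cannot restart {service!r}: Pebble is not ready")


def handle_config_changed(container):
    if not container.ready:
        return "skipped: not ready"
    try:
        container.restart("mycharm")
    except PebbleNotReady as err:
        # The guard passed but the call still failed: the race window.
        return f"retry later: {err}"
    return "restarted"


result = handle_config_changed(FlakyContainer())
```

The guard handles the common case cheaply, and the except clause covers the narrow window the guard cannot close.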