Confusing / inconsistent errors from interaction between Pebble and Juju events and handlers
I’ve frequently seen charm authors who are new to the sidecar pattern write something along the lines of this in their charm:
class MyCharm(CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        self.framework.observe(self.on.mycharm_pebble_ready, self._create_layer)
        self.framework.observe(self.on.config_changed, self._restart_service)

    def _create_layer(self, event):
        container = self.unit.get_container("mycharm")
        container.add_layer("mycharm", {"services": {"mycharm": {...}}})
        container.autostart()

    def _restart_service(self, event):
        container = self.unit.get_container("mycharm")
        # presumably introspect and / or update the service, then...
        container.stop("mycharm")
        container.start("mycharm")
This seems like a straightforward and reasonable way for someone to start out approaching this, but can result in any one of 5 different outcomes:
- It might work fine. Depending on timing and on how the `_restart_service` handler does or doesn’t implement updating the service based on config changes, this might deploy and run fine, at least most of the time.
- It might raise `ops.pebble.ConnectionError`. Depending on how quickly Pebble becomes ready to accept connections, the charm might fail trying to talk to Pebble at all. Worse, this could be an intermittent failure. Additionally, this won’t ever happen in unit tests because the `_TestingPebbleClient` is always ready immediately.
- It might raise `RuntimeError: 400 Bad Request: service "mycharm" does not exist`. This is somewhat related to #514 but is slightly different and might be specific to the `_TestingPebbleClient`.
- It might raise `ops.pebble.APIError: 400 Bad Request: service "mycharm" does not exist`. A charm that raised the previous `RuntimeError` during unit tests would most likely raise this during an actual deployment.
- It might raise `ops.model.ModelError: service 'mycharm' not found`. If the `_restart_service` handler does an explicit `container.get_service("mycharm")`, then it will get this rather than either of the previous two errors, unless it calls `container.add_layer(layer_name, layer_definition, combine=True)` first.
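The "service does not exist" outcomes all come from a handler that assumes another handler has already defined the layer. A minimal sketch of that ordering problem follows, using only the standard library. `FakeContainer`, `LAYER`, `restart_naive`, and `restart_safe` are hypothetical names for illustration and are not the ops API; the builtin `LookupError` stands in for the various errors listed above.

```python
class FakeContainer:
    """Hypothetical stand-in for a workload container; not the real ops class."""

    def __init__(self):
        self.services = {}    # service name -> definition
        self.running = set()  # names of started services

    def add_layer(self, label, layer, combine=False):
        # Simplified: adding a layer just records the services it defines.
        # `label` and `combine` are accepted only to mirror the call in the issue.
        self.services.update(layer["services"])

    def _check(self, name):
        if name not in self.services:
            raise LookupError(f'service "{name}" does not exist')

    def stop(self, name):
        self._check(name)
        self.running.discard(name)

    def start(self, name):
        self._check(name)
        self.running.add(name)


# Assumed service definition, purely for the example.
LAYER = {"services": {"mycharm": {"command": "/bin/mycharm"}}}


def restart_naive(container):
    # Mirrors _restart_service from the issue: assumes the layer already exists.
    container.stop("mycharm")
    container.start("mycharm")


def restart_safe(container):
    # Define the layer first, so the handler no longer depends on
    # pebble-ready having been handled before config-changed.
    container.add_layer("mycharm", LAYER, combine=True)
    container.stop("mycharm")
    container.start("mycharm")
```

On a fresh `FakeContainer`, `restart_naive` fails because nothing has defined the service yet, while `restart_safe` succeeds regardless of which handler runs first.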
It would at least be good to ensure that the last three cases all result in a single `ops.pebble.UnknownServiceError` or something similar, but I’ve found that new charmers will still be confused as to why the service isn’t recognized despite them having defined it during the `pebble-ready` event. Maybe the additional message on the `UnknownServiceError` can include a hint such as `(typically due to referencing a service before add_layer is called)`.
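As a rough sketch of that proposal, assuming nothing about how ops would actually implement it: `UnknownServiceError` here is the hypothetical exception from the paragraph above, and `get_service` is an illustrative lookup over a plain dict rather than the real `container.get_service`.

```python
class UnknownServiceError(Exception):
    """Hypothetical exception proposed in the issue; not part of ops."""

    HINT = "typically due to referencing a service before add_layer is called"

    def __init__(self, service_name):
        self.service_name = service_name
        super().__init__(f"service {service_name!r} not found ({self.HINT})")


def get_service(services, name):
    """Look up a service in a plain dict, raising the single unified error."""
    try:
        return services[name]
    except KeyError:
        # Suppress the KeyError so the traceback ends with the hint.
        raise UnknownServiceError(name) from None
```

The point of the design is that every "unknown service" path ends in the same exception type, and the message itself tells a new charm author where to look.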
It would also be good to make the testing harness always raise `ConnectionError` until the `pebble-ready` event is triggered, to force charm authors to consider that possibility.
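The harness behaviour suggested here could be sketched along these lines. `FakePebbleClient`, `FakeHarness`, and `trigger_pebble_ready` are hypothetical names, not the real testing harness, and the builtin `ConnectionError` stands in for `ops.pebble.ConnectionError`.

```python
class FakePebbleClient:
    """Hypothetical test double that refuses calls until marked ready."""

    def __init__(self):
        self.ready = False
        self.layers = {}

    def add_layer(self, label, layer):
        if not self.ready:
            # Builtin ConnectionError stands in for ops.pebble.ConnectionError.
            raise ConnectionError("Pebble is not ready yet")
        self.layers[label] = layer


class FakeHarness:
    """Hypothetical harness: Pebble is reachable only after pebble-ready."""

    def __init__(self):
        self.pebble = FakePebbleClient()

    def trigger_pebble_ready(self):
        # From this point on, calls to the client succeed.
        self.pebble.ready = True
```

With a double like this, a unit test that exercises `config-changed` before `pebble-ready` fails the same way a slow deployment would, instead of passing by accident.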
A `container.restart(service)` helper would be nice, as well.
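Such a helper is little more than the stop/start pair from the snippet at the top. A minimal sketch, written as a free function so it makes no claims about the ops class itself; the name `restart` and its signature are assumptions for illustration:

```python
def restart(container, *service_names):
    """Stop and then start each named service, in the order given.

    `container` is any object with stop(name) and start(name) methods,
    as used in the issue's snippet. This is a hypothetical helper.
    """
    for name in service_names:
        container.stop(name)
        container.start(name)
```

A handler would then call `restart(container, "mycharm")` instead of repeating the two calls.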
Top GitHub Comments
Agree that we should try to avoid exposing the exception to the operator framework, mainly because exceptions should be reserved for exceptional situations – this is something routine that we should handle in a more routine fashion.
The underlying design of Juju is to just let a hook fail with a non-zero exit code if there’s a timing issue or similar with running it. I like rbarry82’s design above, though I’d like to see the typical pattern look like:
I think that’s most in keeping with Python best practices, while handling the vagaries of an asynchronous system.
Ok, so I also agree that these errors don’t belong in the Juju log, but more because I think it will be incumbent on the charm author to handle them. If we encourage them to handle it using guard clauses, I think that will be fine in most cases, though there’s a small chance of a race condition. However, in either case, if they don’t do the check, something will fail that they don’t expect and an error will end up in the log. Ideally, that error quickly and easily points them to what they need to do to fix it, and I think that a stack-trace with a `PebbleNotReady` error at the bottom makes it very clear what went wrong and how to fix it. And, in the end, I think the “I forgot to check it” case is probably going to end up with a `PebbleNotReady` or some other exception being thrown anyway from whatever unguarded code tries and fails to talk to Pebble.
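To make the guard-clause discussion concrete, here is a minimal sketch that combines a guard clause with an exception fallback for the race mentioned above. Everything in it is a stand-in: `PebbleNotReady` is the hypothetical exception named in this comment, and `FakeContainer`, `FakeEvent`, `is_ready`, and `on_config_changed` are illustrative names rather than ops APIs.

```python
class PebbleNotReady(Exception):
    """Hypothetical exception named in the comments; not part of ops."""


class FakeContainer:
    """Hypothetical stand-in container with a simple readiness flag."""

    def __init__(self, ready):
        self.ready = ready
        self.restarted = []

    def is_ready(self):
        return self.ready

    def restart(self, name):
        if not self.ready:
            raise PebbleNotReady(f"cannot restart {name!r}: Pebble is not ready")
        self.restarted.append(name)


class FakeEvent:
    """Hypothetical event that records whether it was deferred."""

    def __init__(self):
        self.deferred = False

    def defer(self):
        self.deferred = True


def on_config_changed(container, event):
    # Guard clause: the routine "not ready yet" case, handled without exceptions.
    if not container.is_ready():
        event.defer()
        return
    try:
        container.restart("mycharm")
    except PebbleNotReady:
        # Covers the small race where Pebble stops responding after the check.
        event.defer()
```

If the author forgets the guard clause, the unguarded `restart` call still raises `PebbleNotReady`, which is the self-explanatory stack trace this comment argues for.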