Improve sidecar mode for Kubernetes


The docs state that sidecar mode is Kubernetes friendly. The purpose of this ticket is to highlight and address some problems with the Reaper sidecar in Kubernetes.

I added initial support for Reaper sidecar mode in Cass Operator, a Kubernetes operator for Cassandra. The initial integration does not work when Cassandra authentication is enabled (see this issue for details). While working on the fix, I have encountered problems in the following areas:

  • health probe
  • startup initialization code
  • schema updates

In Kubernetes, Cassandra is deployed in a StatefulSet. A StatefulSet consists of one or more pods. A pod can have multiple application containers. When Cass Operator deploys Reaper, each Cassandra pod includes a cassandra container as well as a reaper container. Containers can have readiness and liveness probes. Kubernetes uses the readiness probe to determine when a container is ready to start serving requests. It uses the liveness probe to determine whether the container needs to be restarted (see here for more details about Kubernetes probes).

I am using Reaper’s healthcheck endpoint for both the liveness and readiness probes. This endpoint is handled by the class ReaperHealthCheck. ReaperHealthCheck first checks that it is connected to the backend storage and then queries it.
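The current behavior can be modeled roughly as follows. This is a minimal sketch, not Reaper's actual class: StorageBackedHealthCheck and its two suppliers are hypothetical stand-ins for ReaperHealthCheck's connection check and probe query.

```java
import java.util.function.BooleanSupplier;

// Hypothetical model of the current behavior: the health check only passes
// when the storage backend is both connected and answering queries, so it
// fails for as long as Cassandra is still starting up.
class StorageBackedHealthCheck {
    private final BooleanSupplier isConnected;   // stands in for the driver's connection state
    private final BooleanSupplier probeQueryOk;  // stands in for a lightweight query against storage

    StorageBackedHealthCheck(BooleanSupplier isConnected, BooleanSupplier probeQueryOk) {
        this.isConnected = isConnected;
        this.probeQueryOk = probeQueryOk;
    }

    boolean check() {
        // Both conditions must hold for the probe to succeed.
        return isConnected.getAsBoolean() && probeQueryOk.getAsBoolean();
    }
}
```

Because both probes hit this endpoint, a slow Cassandra startup translates directly into probe failures.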

Here is where things get interesting. The startup order of application containers in a pod is effectively non-deterministic. Cassandra can start before Reaper and vice versa. A pod is considered ready when all of its application containers report ready. For the Cassandra pod this means that both the cassandra and the reaper containers need to report ready in order for the pod to be considered ready.

Cass Operator creates roles in Cassandra, including one for Reaper, but they do not get created until the C* cluster is ready. The reaper container cannot reach the ready state because it does not have credentials with which to connect to Cassandra, and the pods cannot reach the ready state since the reaper containers cannot.

Initially I updated Cass Operator so that it does the following:

  • Deploy Cassandra without Reaper
  • Wait for Cassandra to be ready
  • Create the roles
  • Update StatefulSet definition to include Reaper

From a user perspective this looks like a cluster-wide rolling restart. In my testing thus far I found that this approach works, but it is suboptimal. It really slows down the overall deployment time. I want to deploy Reaper at the same time as Cassandra and avoid having to do a rolling restart.

There is another issue I have been struggling with around applying schema changes. The schema migration library that Reaper uses implements coordination via table-based locks. The locks are based on IP addresses. The readiness and/or liveness probes frequently fail before all schema changes are applied. Kubernetes restarts the container, and because it is assigned a different IP address, the reaper container cannot immediately reacquire the lock. This results in an ugly cycle of container restarts before things eventually stabilize.
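The restart cycle can be sketched with a toy model. IpKeyedLock below is hypothetical and not the migration library's actual code; it only illustrates the failure mode: the lock holder is identified by IP, so after a restart the container's stale lock entry blocks its own new incarnation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of an IP-based table lock: the holder is identified by
// its IP address, so a restarted container (which gets a new IP) looks like
// a different node and cannot reacquire the lock it previously held.
class IpKeyedLock {
    private final Map<String, String> holders = new HashMap<>(); // lock name -> holder IP

    boolean tryAcquire(String lockName, String myIp) {
        String holder = holders.get(lockName);
        if (holder == null || holder.equals(myIp)) {
            holders.put(lockName, myIp);
            return true;
        }
        // Held by someone else -- possibly our own previous incarnation,
        // whose entry lingers until the lock lease expires.
        return false;
    }

    void release(String lockName) {
        holders.remove(lockName);
    }
}
```

In Kubernetes the old entry eventually times out, which is why things stabilize after several restarts rather than deadlocking forever.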

I want to make the following changes to address these problems:

  • Change ReaperHealthCheck so that it only checks whether Reaper is running and does not check storage connectivity
  • Change startup/initialization code so that it does not block waiting to connect to Cassandra
  • Make it easier to apply schema changes

Note that the changes I am proposing are only needed for the sidecar scenario in Kubernetes. If Reaper is deployed in a different mode, these issues can be avoided.

Reaper will be reporting ready/healthy even before it is fully initialized. This will help avoid k8s liveness and readiness probe failures that can occur while Reaper waits for Cassandra to come up. In order for the /healthcheck endpoint to be available though, startup cannot block waiting to establish a connection to the storage backend. Some initialization will have to be done in a background thread.
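The proposed startup shape could look roughly like this. All names here are illustrative, not Reaper's actual classes: the point is that the health endpoint answers immediately while the storage connection retries in a background thread.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of the proposed startup: the probe handler no longer depends on
// storage, and connecting to Cassandra retries in the background.
class BackgroundInitializer {
    private final AtomicBoolean initialized = new AtomicBoolean(false);
    // Daemon thread so the background loop never blocks JVM shutdown.
    private final ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "storage-init");
        t.setDaemon(true);
        return t;
    });

    void start(Runnable connectToStorage) {
        executor.submit(() -> {
            while (!initialized.get()) {
                try {
                    connectToStorage.run();  // e.g. connect and verify the schema
                    initialized.set(true);
                } catch (RuntimeException e) {
                    try {
                        Thread.sleep(1000);  // back off, then retry
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            }
        });
    }

    boolean isFullyInitialized() { return initialized.get(); }

    // Reaper reports healthy as soon as the process is serving HTTP,
    // regardless of whether storage is reachable yet.
    boolean healthCheck() { return true; }
}
```

With this shape, the kubelet's probes pass from the moment the container starts, and the initialization state is tracked separately for request gating.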

If a client invokes an endpoint before Reaper has completed its initialization tasks, Reaper should fail gracefully. We can use a JAX-RS filter to intercept all endpoints and return an appropriate error in this situation.
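The gating behavior such a filter would implement can be sketched in plain Java (the real version would be a JAX-RS ContainerRequestFilter calling abortWith(...) with the same status; InitializationGate and its method shape are hypothetical):

```java
import java.util.function.BooleanSupplier;

// Sketch of the request gate a JAX-RS filter would implement: every request
// made before initialization completes gets 503, except the health endpoint,
// which must keep answering so probes do not fail.
class InitializationGate {
    static final int OK = 200;
    static final int SERVICE_UNAVAILABLE = 503;

    private final BooleanSupplier initialized;

    InitializationGate(BooleanSupplier initialized) {
        this.initialized = initialized;
    }

    // Returns the status a request to the given path should receive
    // before it reaches the actual resource.
    int filter(String path) {
        if ("/healthcheck".equals(path)) {
            return OK; // probes bypass the gate
        }
        return initialized.getAsBoolean() ? OK : SERVICE_UNAVAILABLE;
    }
}
```

Returning 503 Service Unavailable is a reasonable choice here because clients and load balancers treat it as a retryable, temporary condition.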

We need to make it easier to apply schema changes. To be honest, I don’t think that the coordination provided by the schema migration library is very useful. We have to first create the reaper_db keyspace before we can apply any schema changes. We need some form of coordination for this. In Kubernetes, I create the keyspace with a Job. There is guaranteed to be only a single instance of the Job running, which means no lock contention. I would like to reuse and extend that Job to also apply the schema changes. I am currently doing this with a modified version of Reaper that exits after applying the schema changes. This is a rather heavyweight solution though. I think we really just need to run CassandraStorage.initializeAndUpgradeSchema().
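A lightweight one-shot runner the Job could execute might look like the sketch below. Migration, apply(), and the runner itself are illustrative stand-ins; the actual entry point would simply invoke CassandraStorage.initializeAndUpgradeSchema() and exit.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a one-shot schema runner for the Kubernetes Job: migrations are
// applied strictly in order, with no locking, because the Job guarantees a
// single runner instance.
class OneShotSchemaRunner {
    interface Migration {
        String version();
        void apply();
    }

    private final List<String> applied = new ArrayList<>();

    // Applies every migration in sequence and returns how many ran.
    int run(List<Migration> migrations) {
        for (Migration m : migrations) {
            m.apply();
            applied.add(m.version());
        }
        return applied.size();
    }

    List<String> appliedVersions() {
        return applied;
    }
}
```

The process exits when run() returns, which is exactly the Job semantics Kubernetes expects, with no need to boot a full Reaper instance.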

I am proposing these changes within the context of the work I am doing for Cass Operator. There are several other Cassandra operator projects. It is worth pointing out that these changes are not specific to Cass Operator. They will benefit other operators that also want to run Reaper in sidecar mode.

Issue is synchronized with this Jira Task by Unito. Issue Number: K8SSAND-454

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 9 (8 by maintainers)

Top GitHub Comments

1 reaction
adejanovski commented, Mar 24, 2022

@jsanda, I assume we can close this issue now. Re-open it if you think there’s still something to address here.

1 reaction
adejanovski commented, Feb 22, 2021

yes, it’s an upgrade “feature” 😅 Truncate the running_reapers table and then you’ll be good to go.

