Support for draining node during pod termination
Please describe your use case / problem.
We have a legacy, very stateful application (Tomcat/Java) that needs sticky sessions. When we deploy a new version of the application, or a scale-down event happens, we need to stop sending new connections to a server while continuing to route already-bound sessions to it. Please note: this is not about in-flight requests; we need the active Tomcat sessions to expire, which normally takes about an hour.
We currently have this working in Kubernetes with HAProxy-Ingress by setting its `drain-support` flag, which drains a pod when it transitions to Terminating but keeps existing sessions attached to that pod. We then have a `preStop` hook which blocks the shutdown until users have finished their sessions.
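For reference, the HAProxy-Ingress setup described above can be sketched roughly as follows. This is an illustrative sketch, not our exact manifests: the ConfigMap key assumes the jcmoraisjr/haproxy-ingress controller, and the `/sessions-active` endpoint in the `preStop` hook is a hypothetical stand-in for whatever signal the application exposes for active sessions:

```yaml
# Sketch: HAProxy-Ingress drain support plus a session-aware preStop hook.
# The drain-support key assumes jcmoraisjr/haproxy-ingress; check your
# controller's docs for the exact option name.
apiVersion: v1
kind: ConfigMap
metadata:
  name: haproxy-ingress
  namespace: ingress-controller
data:
  drain-support: "true"   # keep Terminating pods in the backend for bound sessions
---
# Pod template fragment: block shutdown until sessions have expired.
# /sessions-active is a hypothetical app-specific endpoint returning the
# current session count.
spec:
  terminationGracePeriodSeconds: 3600
  containers:
    - name: tomcat
      lifecycle:
        preStop:
          exec:
            command:
              - /bin/sh
              - -c
              - |
                while curl -fsS localhost:8080/sessions-active | grep -qv '^0$'; do
                  sleep 30
                done
```

Note that `terminationGracePeriodSeconds` has to be at least as long as the worst-case drain time, or the kubelet will kill the pod before the hook finishes.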
When looking at Ambassador to fulfill this ingress role, we see the pod being removed from routing as soon as it transitions to Terminating. This was tested with a simple application returning the hostname of the pod:
- Getting two sessions against different pods - Pod A and Pod B
- Scale the deployment down to 1 - Pod A moved to Terminating, Pod B still active
- Session immediately switched over to the remaining pod - that is, Pod A session is dropped
Describe the solution you’d like
We’d love for Ambassador to drain a pod when it is Terminating: keep existing connections and sessions attached to that pod, whilst routing new connections to other active pods.
Additional context
Looking at the Kubernetes docs, it appears that pods are dropped from the Service endpoints as soon as the pod moves to Terminating. As per: https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods
3. Pod shows up as “Terminating” when listed in client commands
4. (simultaneous with 3) When the Kubelet sees that a Pod has been marked as terminating because the time in 2 has been set, it begins the pod shutdown process.
...
5. (simultaneous with 3) Pod is removed from endpoints list for service, and are no longer considered part of the set of running pods for replication controllers. Pods that shutdown slowly cannot continue to serve traffic as load balancers (like the service proxy) remove them from their rotations.
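One partial workaround sometimes discussed for this behaviour is the Service field `publishNotReadyAddresses`, which tells the endpoints controller to include pod IPs even when readiness checks fail. Whether it also helps for pods in Terminating depends on the Kubernetes version and the endpoint controller in use, so treat this as a sketch rather than a confirmed fix:

```yaml
# Sketch: keep not-ready pod addresses in the Endpoints object so an
# endpoint-routing resolver can still see them. Not verified to cover
# the Terminating case described in this issue.
apiVersion: v1
kind: Service
metadata:
  name: sticky
spec:
  publishNotReadyAddresses: true
  selector:
    app: sticky
  ports:
    - name: http
      port: 80
```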
Test Details
Ambassador version: 0.60.1
Ambassador Service Configuration
```yaml
getambassador.io/config: |
  ---
  apiVersion: ambassador/v1
  kind: KubernetesEndpointResolver
  name: my-resolver
  ---
  apiVersion: ambassador/v1
  kind: Module
  name: ambassador
  config:
    resolver: my-resolver
    load_balancer:
      policy: round_robin
```
Ambassador Target Service Configuration:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: sticky
  labels:
    app: sticky
  annotations:
    getambassador.io/config: |
      ---
      apiVersion: ambassador/v1
      kind: Mapping
      name: sticky_mapping
      prefix: /sticky/
      service: sticky
      resolver: my-resolver
      load_balancer:
        policy: maglev
        cookie:
          name: sticky-cookie
          ttl: 300s
spec:
  ports:
    - name: http
      port: 80
  selector:
    app: sticky
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sticky
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sticky
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
      labels:
        app: sticky
    spec:
      terminationGracePeriodSeconds: 120
      containers:
        - name: hello
          image: nginxdemos/hello
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/sleep
                  - '60'
          ports:
            - containerPort: 80
```
Issue Analytics
- State:
- Created: 4 years ago
- Reactions: 14
- Comments: 34 (14 by maintainers)
Top GitHub Comments
Hi, any updates if this is supported on the non-legacy versions?
If not, curious to understand if switching to LEGACY mode will have any consequences.
@rbtcollins Thanks for this fix! I tested it in one of our staging environments and it’s working (no 5XXs during pod restarts/termination) when AMBASSADOR_LEGACY_MODE=true is set.
Unfortunately, I was mistaken and we are not using AMBASSADOR_LEGACY_MODE in production today. Do you know what the downside would be of moving back to AMBASSADOR_LEGACY_MODE vs the default mode (we are using v1.14.2)?