Unable to undeploy app after deployment with failing healthcheck

Hello, I’m running SCDF on k8s with Helm and found an issue that looks like a bug to me.

k8s version - 1.16 (AWS EKS eks.1)
helm chart - https://hub.helm.sh/charts/stable/spring-cloud-data-flow/2.7.1

Steps to reproduce:

1. Create the simplest stream with time and log: “time | log”.
2. Change liveness-probe-path and readiness-probe-path for the time app to a value that intentionally breaks its healthcheck.
3. Update the stream.
4. The initial version of the time app is destroyed and a new one is started. After a couple of minutes the app’s state changes to ‘failed’ and the stream’s status to ‘partial’ (you can repeat this step a couple of times to generate more failing app versions).
5*. If you undeploy or destroy the stream at this stage, the deployments for both the time and log apps are deleted, which is expected. (Don’t remove the stream when reproducing the issue; I mention this step only to show that deleting the stream works at this stage and that SCDF can delete the failing app.)
5. For the initial stream that has existed since step 1, change liveness-probe-path and readiness-probe-path for the time app back to normal. A new version of time is started.
6. The previous version is still trying to start and runs into CrashLoopBackOff.
7. Now destroy or undeploy the stream. The deployments of healthy apps are deleted, but the failing ones are preserved. I also tried stream all destroy --force, with the same results.

If multiple updates are made while the app cannot start, all of these failing versions are preserved after the stream is destroyed and keep occupying resources in the k8s cluster. Only manual deletion with kubectl delete deployment appDeploymentName helps.
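
For anyone cleaning up by hand, here is a minimal Python sketch of how the leftover deployments could be listed before deleting them. It parses the output of kubectl get deployments -o json and keeps deployments whose name starts with a given prefix and that have no ready replicas. The function names, the sample deployment names, and the idea that the leftovers share a stream-based name prefix are assumptions for illustration; check the real names in your cluster first.

```python
import json


def leftover_deployments(kubectl_json: str, name_prefix: str) -> list[str]:
    """Return names of deployments that match the prefix and have no ready replicas."""
    items = json.loads(kubectl_json).get("items", [])
    leftovers = []
    for item in items:
        name = item["metadata"]["name"]
        # Kubernetes omits status.readyReplicas when no replica is ready.
        ready = item.get("status", {}).get("readyReplicas", 0)
        if name.startswith(name_prefix) and ready == 0:
            leftovers.append(name)
    return leftovers


def delete_commands(names: list[str]) -> list[str]:
    """Build the kubectl commands to review and run manually."""
    return [f"kubectl delete deployment {name}" for name in names]


# Hypothetical sample shaped like `kubectl get deployments -o json` output.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "testtimelogstream-time-v2"}, "status": {"replicas": 1}},
        {"metadata": {"name": "testtimelogstream-time-v3"},
         "status": {"replicas": 1, "readyReplicas": 1}},
        {"metadata": {"name": "unrelated-app"}, "status": {"replicas": 1}},
    ]
})

for command in delete_commands(leftover_deployments(sample, "testtimelogstream-")):
    print(command)
```

The sketch only prints the commands instead of running them, so the list can be checked before anything is removed.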

In the Skipper logs I can also see lines like these:

2020-07-13 13:40:57.974 INFO 1 --- [eTaskExecutor-2] o.s.c.s.s.d.s.HandleHealthCheckStep : Release testTimeLogStream-v2 has been DEPLOYED
2020-07-13 13:40:57.974 INFO 1 --- [eTaskExecutor-2] o.s.c.s.s.d.s.HandleHealthCheckStep : Apps in release testTimeLogStream-v2 are healthy.
2020-07-13 13:40:57.984 INFO 1 --- [eTaskExecutor-2] o.s.c.s.s.d.s.HandleHealthCheckStep : Deleting changed applications from existing release testTimeLogStream-v1
2020-07-13 13:40:57.995 WARN 1 --- [eTaskExecutor-2] o.s.c.s.s.d.strategies.DeleteStep : For Release name testTimeLogStream, did not undeploy existing app time as its status is not 'deployed'.

So it looks like this is the expected outcome of the current DeleteStep logic. However, to me it looks like a bug: resources are left behind in k8s without being cleaned up, while from SCDF’s perspective the stream’s health goes back to healthy.
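
To make the suspected behaviour concrete, here is a small Python model of a delete step that skips any app whose status is not 'deployed'. This is not Skipper's code. It is a hypothetical sketch based only on the WARN line above, and the App class and delete_step function are made-up names.

```python
from dataclasses import dataclass


@dataclass
class App:
    name: str
    version: int
    status: str  # e.g. "deployed" or "failed"


def delete_step(apps: list[App]) -> tuple[list[App], list[App]]:
    """Split apps into (undeployed, left_behind) the way the WARN line suggests."""
    undeployed, left_behind = [], []
    for app in apps:
        if app.status == "deployed":
            undeployed.append(app)
        else:
            # Assumed behaviour: the app is skipped because its status is not
            # 'deployed', so its Kubernetes deployment is never removed.
            left_behind.append(app)
    return undeployed, left_behind


apps = [
    App("time", 1, "failed"),    # version stuck on the broken healthcheck
    App("time", 2, "deployed"),  # healthy replacement
    App("log", 1, "deployed"),
]
undeployed, left_behind = delete_step(apps)
print("undeployed:", [(a.name, a.version) for a in undeployed])
print("left behind:", [(a.name, a.version) for a in left_behind])
```

Under this model the failing version is the only one that survives the delete step, which matches what I see in the cluster.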

Let me know if I need to provide more details, but the setup is pretty basic and the issue is easy to reproduce.