[bug] workflow resource leaking when no run resource found in KFP DB

What steps did you take

Connect to the kfp-standalone-1 cluster in the kfp-ci project.
Count the current workflows (1252 at the time of writing):
```
kubectl get workflow | wc -l
    1252
```
Check the ages of the current workflows:
```
kubectl get workflow | less
```

What happened:

I found many workflows older than 1d, which is our configured workflow GC time. Because these workflows are not cleaned up, too many Pods accumulate on each node, and this crashes the GKE metrics server.
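Scanning the `AGE` column by eye gets tedious with more than a thousand workflows. The following is a minimal sketch, not part of the original report, that prints only the workflows older than a given threshold. The function name `stale_workflows` is hypothetical, the 86400-second threshold stands in for the 1d GC time mentioned above, and the script assumes bash and GNU `date` (for `-d`).

```shell
#!/usr/bin/env bash
# Print the names of workflows older than a maximum age.
# Reads "NAME CREATION_TIMESTAMP" lines on stdin.
#   $1: maximum age in seconds (86400 = 1d, the GC time from this issue)
#   $2: current time in epoch seconds (optional, defaults to now)
# Assumption: GNU date, which parses the RFC 3339 timestamps Kubernetes emits.
stale_workflows() {
  local max_age="$1"
  local now="${2:-$(date +%s)}"
  local name created created_epoch
  while read -r name created; do
    # Skip blank lines.
    [ -z "$name" ] && continue
    # Skip lines whose timestamp cannot be parsed.
    created_epoch="$(date -u -d "$created" +%s)" || continue
    if [ $(( now - created_epoch )) -gt "$max_age" ]; then
      echo "$name"
    fi
  done
}

# Example against a live cluster (left commented out here):
# kubectl get workflow --no-headers \
#   -o custom-columns=NAME:.metadata.name,CREATED:.metadata.creationTimestamp \
#   | stale_workflows 86400
```

Piping the result through `wc -l` gives a count that can be compared with the total from the steps above.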

What did you expect to happen:

Workflows should be GCed after being persisted to KFP DB.

Environment:

How do you deploy Kubeflow Pipelines (KFP)? standalone

KFP version: 1.7.0-rc.2

Labels

/area backend

/area testing

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

Issue Analytics

State:
Created 2 years ago
Reactions: 3
Comments: 8 (8 by maintainers)

Top GitHub Comments

1 reaction

jli commented, Jun 27, 2022

This is happening more and more frequently for my team - multiple times a week now. It’s disruptive for us, because oncall needs to look up what workflow was lost and notify users that their run is never going to work.

Are there any workarounds we could try?

0 reactions

jli commented, Feb 18, 2022

It was just confusing, and caused some delay for me.

I launched a workflow, then came to check on it hours later, but found that it hadn’t run.

The run details page wouldn’t load (constant spinner over the DAG part of the page). I checked the experiment page, and my run was there but with a grey question mark status icon. When I looked for the workflow objects on k8s, I couldn’t find anything.