[bug] workflow resource leaking when no run resource found in KFP DB
What steps did you take:
- connect to kfp-standalone-1 cluster in kfp-ci project
- count current workflows (1252):
  kubectl get workflow | wc -l
  1252
- confirm current workflow ages:
  kubectl get workflow | less
What happened:
I found many workflows with an age greater than 1d, which is our configured workflow GC time. Because these workflows are never cleaned up, too many Pods accumulate on each node, which crashes the GKE metrics server.
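To make the age check above repeatable rather than paging through `less`, here is a minimal Python sketch that filters the JSON output of `kubectl get workflow -o json` for workflows older than a threshold. The function names `parse_k8s_time` and `stale_workflows` are hypothetical helpers, not part of KFP or Argo. The sketch measures age from `metadata.creationTimestamp` (the same basis as the kubectl AGE column), which is an assumption: the actual GC clock may start at a different point, such as workflow completion.

```python
from datetime import datetime, timedelta, timezone


def parse_k8s_time(ts):
    """Parse a Kubernetes RFC 3339 UTC timestamp such as '2021-07-01T12:00:00Z'."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def stale_workflows(workflow_list, max_age, now=None):
    """Return (name, age) pairs for workflows older than max_age, oldest first.

    workflow_list is the parsed JSON from `kubectl get workflow -o json`.
    Age is computed from metadata.creationTimestamp (an assumption; the
    GC timer in a given deployment may be based on finish time instead).
    """
    now = now or datetime.now(timezone.utc)
    stale = []
    for item in workflow_list.get("items", []):
        meta = item["metadata"]
        age = now - parse_k8s_time(meta["creationTimestamp"])
        if age > max_age:
            stale.append((meta["name"], age))
    # Largest age first, so the longest-leaked workflows appear at the top.
    return sorted(stale, key=lambda pair: pair[1], reverse=True)
```

One way to use it is to save the output of `kubectl get workflow -o json` to a file, load it with `json.load`, and call `stale_workflows(data, timedelta(days=1))` to list the candidates that outlived the 1d GC time.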
What did you expect to happen:
Workflows should be garbage-collected (GCed) once they have been persisted to the KFP DB.
Environment:
- How do you deploy Kubeflow Pipelines (KFP)? standalone
- KFP version: 1.7.0-rc.2
Labels
/area backend
/area testing
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.
Top GitHub Comments
This is happening more and more frequently for my team - multiple times a week now. It’s disruptive for us, because oncall needs to look up what workflow was lost and notify users that their run is never going to work.
Are there any workarounds we could try?
It was just confusing, and caused some delay for me.
I launched a workflow, then came to check on it hours later, but found that it hadn’t run.
The run details page wouldn’t load (constant spinner over the dag part of the page). I checked the experiment page, and my run was there but with a grey question mark status icon. When I looked at workflow objects on k8s, I couldn’t find anything.