[bug] With Python SDK, TFJob won't stop running (Issue #1183)
A TFJob would not stop running even after the training code exits.

tfjob_client.get_job_status returns 'Running'.
tfjob_client.wait_for_job times out.
With tfjob_client.get_logs broken (#1182), running kubectl -n ${NAMESPACE} logs ${JOB_NAME}-worker-0 -c tensorflow gives the training logs, from which it is clear that training has finished. But the job does not stop.
Running tfjob_client.is_job_succeeded gives False.
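For context, the polling that tfjob_client.wait_for_job performs can be approximated with a small helper. This is an illustrative sketch only; wait_until_done and its parameters are hypothetical names, not part of the Kubeflow SDK:

```python
import time

def wait_until_done(get_status, timeout_seconds=600, poll_interval=5):
    """Poll a status callable until it reports a terminal state or times out.

    get_status: a zero-argument callable returning a status string such as
    'Running', 'Succeeded', or 'Failed' (e.g. a wrapper around
    tfjob_client.get_job_status). All names here are illustrative.
    """
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        status = get_status()
        if status in ("Succeeded", "Failed"):
            return status
        time.sleep(poll_interval)
    raise TimeoutError("job did not reach a terminal state in time")
```

In the situation reported above, the status keeps coming back 'Running', so a loop like this can only hit its deadline and raise, which matches the observed wait_for_job timeout.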
Issue Analytics
- State:
- Created 3 years ago
- Comments: 7 (2 by maintainers)
Top GitHub Comments
@yashjakhotiya Thanks for the deep checking. How about the status from kubectl get tfjob ${JOB_NAME} -n ${NAMESPACE}? I think the Python API should be consistent with the backend CLI output; otherwise that's a bug.

Turned off Istio for the TFJob pods, and the TFJob stopped as expected. It turns out the running Istio sidecar caused this problem as well as the logs issue (#1182). Closing this issue now.
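The usual way to keep Istio from attaching a sidecar to a pod is the standard Istio annotation sidecar.istio.io/inject: "false" on the pod template. The helper below is an illustrative sketch of applying that annotation to a plain dict manifest; it is not part of the TFJob SDK, and disable_istio_injection is a hypothetical name:

```python
def disable_istio_injection(pod_template: dict) -> dict:
    """Add the standard Istio opt-out annotation to a pod template dict.

    sidecar.istio.io/inject is the stock Istio annotation that prevents
    sidecar injection; the dict shape here mirrors a Kubernetes pod
    template. This helper is illustrative, not part of the TFJob SDK.
    """
    metadata = pod_template.setdefault("metadata", {})
    annotations = metadata.setdefault("annotations", {})
    annotations["sidecar.istio.io/inject"] = "false"
    return pod_template

# Example: annotate a worker replica's pod template in a TFJob manifest.
worker_template = {"metadata": {"name": "worker"}, "spec": {"containers": []}}
disable_istio_injection(worker_template)
```

With the sidecar gone, the tensorflow container exiting lets the pod (and hence the TFJob) reach a terminal state, which is consistent with the resolution described in the comment above.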