[bug] With Python SDK, TFJob won't stop running


A TFJob would not stop running even after the training code exits.

tfjob_client.get_job_status returns 'Running', and tfjob_client.wait_for_job times out. With tfjob_client.get_logs broken (#1182), running kubectl -n ${NAMESPACE} logs ${JOB_NAME}-worker-0 -c tensorflow gives the training logs, which show that training is over. But the job does not stop.

Running tfjob_client.is_job_succeeded gives False
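
When the logs say training has finished but the job still reports 'Running', one thing worth checking is the per-container state of the worker pod. The following is a minimal sketch that assumes the pod has been dumped to JSON (for example with kubectl get pod ... -o json) and loaded into a dict. The helper name container_states and the example pod, including the container name "sidecar", are made up for illustration.

```python
def container_states(pod: dict) -> dict:
    """Map each container name in a pod to 'running', 'terminated', or 'waiting'."""
    states = {}
    for cs in pod.get("status", {}).get("containerStatuses", []):
        state = cs.get("state", {})
        # A container state normally carries exactly one of these keys.
        for key in ("running", "terminated", "waiting"):
            if key in state:
                states[cs["name"]] = key
                break
        else:
            states[cs["name"]] = "unknown"
    return states


# Hypothetical pod status, invented for this example.
example_pod = {
    "status": {
        "phase": "Running",
        "containerStatuses": [
            {"name": "tensorflow", "state": {"terminated": {"exitCode": 0}}},
            {"name": "sidecar", "state": {"running": {}}},
        ],
    }
}

print(container_states(example_pod))
```

If the training container shows as terminated while another container in the same pod is still running, the pod itself has not completed, which may explain a job that stays in 'Running'.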

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 7 (2 by maintainers)

Top GitHub Comments

jinchihe commented, Jul 13, 2020

@yashjakhotiya Thanks for the deep check. What status does kubectl get tfjob ${JOB_NAME} -n ${NAMESPACE} report? I think the Python API should be consistent with the backend CLI output; otherwise that's a bug.
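
To compare the two views, the status can also be derived by hand from the TFJob object that kubectl returns. This is a rough sketch, not the SDK's implementation: it assumes the resource has been loaded into a dict (for example from kubectl get tfjob ... -o json), that status.conditions is ordered oldest to newest, and that condition types look like 'Created', 'Running', 'Succeeded' and 'Failed'. The function names and the example object are hypothetical.

```python
def last_condition_type(tfjob: dict) -> str:
    """Return the type of the most recent condition, or '' if there is none."""
    conditions = tfjob.get("status", {}).get("conditions") or []
    if not conditions:
        return ""
    # Assumption: the last entry is the most recent condition.
    return conditions[-1].get("type", "")


def is_succeeded(tfjob: dict) -> bool:
    """True when the most recent condition is 'Succeeded'."""
    return last_condition_type(tfjob) == "Succeeded"


# Hypothetical TFJob status resembling the stuck job described above.
stuck_job = {
    "status": {
        "conditions": [
            {"type": "Created", "status": "True"},
            {"type": "Running", "status": "True"},
        ]
    }
}

print(last_condition_type(stuck_job), is_succeeded(stuck_job))
```

If this agrees with what the Python client reports, the inconsistency is unlikely to be in the SDK, and the job really has not been marked as finished by the backend.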

yashjakhotiya commented, Jul 24, 2020

Turned off Istio sidecar injection for the TFJob pods with:

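from kubernetes.client import V1ObjectMeta, V1PodSpec, V1PodTemplateSpec

# `container` is the V1Container for the training image, defined elsewhere.
template = V1PodTemplateSpec(
    metadata=V1ObjectMeta(
        # Disable Istio sidecar injection for this pod.
        annotations={'sidecar.istio.io/inject': 'false'}
    ),
    spec=V1PodSpec(
        containers=[container]
    )
)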

The TFJob stopped as expected. It turns out the running Istio sidecar caused this problem as well as the logs issue (#1182). Closing this issue now.
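
For jobs defined as plain manifests rather than SDK model objects, the same annotation can be applied to every replica type at once. The sketch below assumes the TFJob is held as a dict with pod templates under spec.tfReplicaSpecs.<ReplicaType>.template; the helper name disable_istio_injection and the example manifest are hypothetical.

```python
import copy

ISTIO_INJECT_ANNOTATION = "sidecar.istio.io/inject"


def disable_istio_injection(tfjob: dict) -> dict:
    """Return a copy of the manifest with sidecar injection disabled on every replica."""
    patched = copy.deepcopy(tfjob)  # leave the caller's manifest untouched
    replica_specs = patched.get("spec", {}).get("tfReplicaSpecs", {})
    for replica in replica_specs.values():
        metadata = replica.setdefault("template", {}).setdefault("metadata", {})
        # Annotation values must be strings, hence "false" rather than False.
        metadata.setdefault("annotations", {})[ISTIO_INJECT_ANNOTATION] = "false"
    return patched


# Hypothetical two-replica-type manifest, trimmed to the relevant fields.
manifest = {
    "spec": {
        "tfReplicaSpecs": {
            "Worker": {"replicas": 2, "template": {"spec": {"containers": []}}},
            "PS": {"replicas": 1, "template": {"metadata": {"annotations": {"a": "b"}}}},
        }
    }
}

patched = disable_istio_injection(manifest)
print(patched["spec"]["tfReplicaSpecs"]["Worker"]["template"]["metadata"])
```

Existing annotations are preserved, and the original manifest is not modified, so the patched copy can be compared against the input before it is submitted.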

