Doesn't change status tensorflow example (tfjob-example)


When I deploy tfjob-example.yaml from the v1alpha3 examples on OSX Vagrant MiniKF, the trial stays in Running status after deployment and never changes, so no result is produced.

$ kubectl get trials -n kubeflow

NAME                STATUS    AGE
tfjhsjhs-rlcczvwm   Running   15h
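
To spot trials that stay in Running, the table printed above can be checked with a short script. This is a minimal sketch, not part of the original report: the helper names `parse_trials` and `running_trials` are hypothetical, and it assumes the plain-text column layout shown above (a header row followed by whitespace-separated values).

```python
def parse_trials(table_text):
    """Parse plain-text `kubectl get trials` output into one dict per row."""
    lines = [line for line in table_text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    # Assumes no column value contains whitespace, as in the output above.
    return [dict(zip(header, line.split())) for line in lines[1:]]


def running_trials(table_text):
    """Return the names of trials whose STATUS column is still 'Running'."""
    return [
        row["NAME"]
        for row in parse_trials(table_text)
        if row.get("STATUS") == "Running"
    ]


sample = """NAME                STATUS    AGE
tfjhsjhs-rlcczvwm   Running   15h
"""

print(running_trials(sample))
```

For the sample above this lists the single trial `tfjhsjhs-rlcczvwm`.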

With maxTrialCount set to 1, the TFJob container’s log shows accuracy information after deployment, but Katib still reports the trial as Running.

Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting /tmp/tensorflow/mnist/input_data/t10k-labels-idx1-ubyte.gz
Accuracy at step 0: 0.1462
Accuracy at step 10: 0.6781
Accuracy at step 20: 0.8479
Accuracy at step 30: 0.8767
Accuracy at step 40: 0.8919
Accuracy at step 50: 0.8997
Accuracy at step 60: 0.9183
Accuracy at step 70: 0.9173
Accuracy at step 80: 0.9241
Accuracy at step 90: 0.9322
Adding run metadata for 99
...
Accuracy at step 900: 0.9498
Accuracy at step 910: 0.9503
Accuracy at step 920: 0.9494
Accuracy at step 930: 0.9422
Accuracy at step 940: 0.9518
Accuracy at step 950: 0.9411
Accuracy at step 960: 0.9504
Accuracy at step 970: 0.9444
Accuracy at step 980: 0.9498
Accuracy at step 990: 0.949
Adding run metadata for 999
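
To confirm that training really finished even though the trial status did not change, the accuracy values can be pulled out of the container log. This is a rough sketch, assuming the `Accuracy at step N: V` line format shown in the log above; the function name `accuracy_by_step` is hypothetical and this is not how Katib's own metrics collector works.

```python
import re

# Matches lines such as "Accuracy at step 990: 0.949" from the log above.
ACCURACY_RE = re.compile(r"Accuracy at step (\d+): (\d+(?:\.\d+)?)")


def accuracy_by_step(log_text):
    """Map each reported training step to its accuracy value."""
    return {int(step): float(value) for step, value in ACCURACY_RE.findall(log_text)}


log = (
    "Accuracy at step 980: 0.9498\n"
    "Accuracy at step 990: 0.949\n"
    "Adding run metadata for 999\n"
)

metrics = accuracy_by_step(log)
last_step = max(metrics)
print(last_step, metrics[last_step])
```

If the last step and accuracy are present in the log, the training container itself completed, which suggests looking at metrics collection or the controller rather than the training code.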

Issue Analytics

  • State:closed
  • Created 4 years ago
  • Comments:6 (2 by maintainers)

Top GitHub Comments

hyeonsangjeon commented, Mar 12, 2020

After upgrading the VM, it works well with the v1alpha3 images. Thank you. The Vagrant box version and the Katib container versions are listed below.

[vagrant] arrikto/minikf (virtualbox, 20200305.0.1)

[katib docker]

  • gcr.io/kubeflow-images-public/katib/v1alpha3/katib-ui v0.8.0 RepoDigests: sha256:984eda360f3de59b5cd53eab48700780052bf97ec02d3366f681f4a15cff6d1d 540d9308c9f6 4 weeks ago 54.4MB

  • gcr.io/kubeflow-images-public/katib/v1alpha3/tfevent-metrics-collector v0.8.0 sha256:265c7a2a3a3da83637d098dae605bc37d95b574b42f88041c19b685292877af5 b9956282b11e 4 weeks ago 1.31GB

  • gcr.io/kubeflow-images-public/katib/v1alpha3/katib-controller v0.8.0 sha256:1f2ad821a913f19ebc200b54bd59b83ac65e782c8fbc5946c7db7f8aa9db0362 7c5162abd775 4 weeks ago 53.8MB

  • gcr.io/kubeflow-images-public/katib/v1alpha3/suggestion-hyperopt v0.8.0 sha256:803da644fd23bbca068d0b7f309e0b29e7c50cfa25c1c9cd89c3f76b65920019 56c0051f100c 4 weeks ago 1.23GB

  • gcr.io/kubeflow-images-public/katib/v1alpha3/katib-db-manager v0.8.0 sha256:60ace3d4dbb66eb346ee19b5ad08a998ecf668e68cf699065aedb623c48fe767 32229959fe81 4 weeks ago 28.5MB

$ kubectl get experiment -n kubeflow
NAME            STATUS      AGE
tfjob-example   Succeeded   43m

$ kubectl get trials -n kubeflow
NAME                     TYPE        STATUS   AGE
tfjob-example-m7mp8vtp   Succeeded   True     43m

hyeonsangjeon commented, Mar 4, 2020

I tested Katib in two environments: OSX Vagrant and Windows Hyper-V. In both, tfjob-example does not work and stays in Running status in the UI, but the container logs are slightly different.

For comparison, random-example.yaml runs well in both environments. pytorchjob-example does not seem to work either.

1. vagrant virtualbox (arrikto/minikf)

The Vagrant box ships with Katib v1alpha2 Docker images built in.

This is the status timeline of the worker pod for a job run with a trial count of 1:

vagrant@minikf:~$ kubectl get pods -A | grep yee
kubeflow yee-f8lkmvk7-worker-0 1/1 Running 0 6s

vagrant@minikf:~$ kubectl get pods -A | grep yee
kubeflow yee-f8lkmvk7-worker-0 0/1 Completed 0 3m8s

vagrant@minikf:~$ kubectl describe pods yee-f8lkmvk7-worker-0 -n kubeflow

~~~~...
Events:
Type Reason Age From Message
--- ------ ---- ---- -------
Normal Scheduled 5m52s default-scheduler Successfully assigned kubeflow/yee-f8lkmvk7-worker-0 to minikube
Normal Pulled 5m50s kubelet, minikube Container image "ssamoilenko/startup-lock-init" already present on machine
Normal Created 5m50s kubelet, minikube Created container startup-lock-init-container
Normal Started 5m50s kubelet, minikube Started container startup-lock-init-container
Normal Pulled 5m49s kubelet, minikube Container image "gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0" already present on machine
Normal Created 5m49s kubelet, minikube Created container tensorflow
Normal Started 5m49s kubelet, minikube Started container tensorflow

Image version info (Vagrant):

RepoTags(All images below) : v0.6.0-rc.0

  • img : gcr.io/kubeflow-images-public/katib/v1alpha2/katib-ui
    RepoDigests :sha256:d38071758ea0d60a241cc1f75ac447bf7aa8d18667e9b0883bf6e8724b69948a

  • img : gcr.io/kubeflow-images-public/katib/v1alpha2/katib-controller
    RepoDigests :sha256:d4773b412e198a656835e61c55789a79876ad0caae79f26fa5e914cc01bf4531

  • img : gcr.io/kubeflow-images-public/katib/v1alpha2/suggestion-nasrl
    RepoDigests :sha256:95f3565a5af4bda4bbe395b95f5f90cfc0f48ec09b07e5e82f3ca3acab1b867c

  • img : gcr.io/kubeflow-images-public/katib/v1alpha2/suggestion-random
    RepoDigests :sha256:ea2f342ebba83c815ae33817b8c4d2c34cdec70c21771c41eaafb5d315e610ad

  • img : gcr.io/kubeflow-images-public/katib/v1alpha2/bayesianoptimization
    RepoDigests :sha256:c5c3605c383b59c84f8e18292538ad064f42ecc3d34655127be18e0162fdc013

  • img : gcr.io/kubeflow-images-public/katib/v1alpha2/suggestion-hyperband
    RepoDigests :sha256:5f7bc0cf1179152f04b77c902d111ea7b636dd858979cd85f4b887880d6f0fdb

  • img : gcr.io/kubeflow-images-public/katib/v1alpha2/suggestion-grid
    RepoDigests :sha256:5639d6d8dd882d7829ead58f7c82235a39cf37fda509c17677732f1eb9e79da2

  • img : gcr.io/kubeflow-images-public/katib/v1alpha2/katib-manager
    RepoDigests :sha256:8dbe595c3a241ce65d29afb87a99453461b2c82338e54135dc8dfb4cb5ac8fa6

  • img : gcr.io/kubeflow-images-public/katib/v1alpha2/katib-manager-rest
    RepoDigests :sha256:304070956d976385991bfba8d0f30172f6910c7017f0c740121cdae5dae6a53e

  • img : gcr.io/kubeflow-images-public/katib/v1alpha2/metrics-collector
    RepoDigests :sha256:c66eb35d6d73bcb61064f0a94cea45dc0cf13a80fb002a5117efa43d9e60d968


2. windows hyper-v : latest

With the latest Katib images on Windows, the TensorFlow example stays in the Running state and the container shows an error log.

container status

kubeflow jhstf-random-bc59dd654-sm5lg 1/1 Running 0 2m56s

container log

INFO:hyperopt.utils:Failed to load dill, try installing dill via "pip install dill" for enhanced pickling support.
INFO:hyperopt.fmin:Failed to load dill, try installing dill via "pip install dill" for enhanced pickling support.
ERROR:grpc._server:Exception calling application: Method not implemented!
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/grpc/_server.py", line 434, in _call_behavior
    response_or_iterator = behavior(argument, context)
  File "/usr/src/app/github.com/kubeflow/katib/pkg/apis/manager/v1alpha3/python/api_pb2_grpc.py", line 135, in ValidateAlgorithmSettings
    raise NotImplementedError('Method not implemented!')
NotImplementedError: Method not implemented!
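
The traceback points at the generated `api_pb2_grpc.py` file, where servicer base-class methods raise `NotImplementedError` until a concrete service overrides them. The sketch below imitates that pattern in plain Python without importing gRPC. The class names and the `GetSuggestions` method are illustrative assumptions, and whether this error is related to the stuck Running status is not established in this thread.

```python
class SuggestionServicerBase:
    """Stand-in for a generated gRPC servicer base class (hypothetical name)."""

    def GetSuggestions(self, request, context):
        # Generated stubs raise by default for every RPC method.
        raise NotImplementedError("Method not implemented!")

    def ValidateAlgorithmSettings(self, request, context):
        raise NotImplementedError("Method not implemented!")


class PartialService(SuggestionServicerBase):
    """Overrides only one RPC, so the other falls through to the base class."""

    def GetSuggestions(self, request, context):
        return {"assignments": []}


service = PartialService()
print(service.GetSuggestions(None, None))

try:
    service.ValidateAlgorithmSettings(None, None)
except NotImplementedError as exc:
    # Same message as in the container log above.
    print("unimplemented:", exc)
```

In other words, the log line by itself only says that the running suggestion service does not override `ValidateAlgorithmSettings`.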

image ver info. windows hyper-v

RepoTags(All images below) : latest

  • img : gcr.io/kubeflow-images-public/katib/v1alpha3/katib-ui RepoDigests:sha256:ff5ca291ba5bd0f515e2900abb59d522219114ac8d6ddcd1cc9d89905252c001

  • img : gcr.io/kubeflow-images-public/katib/v1alpha3/file-metrics-collector RepoDigests:sha256:48c46186c155ce7975252d13abf6261d0f6182f6b87ee5f85912744ed90ef8c9

  • img : gcr.io/kubeflow-images-public/katib/v1alpha3/katib-controller RepoDigests:sha256:531df1dc2ac8813436eb67895ae62fe2f3f5440df820f8a68d76db567ac21aa4

  • img : gcr.io/kubeflow-images-public/katib/v1alpha3/suggestion-hyperopt RepoDigests:sha256:20b6bdd3fd8171f6a838ad0c480d83349947cb6572695704787d2761700cecf2

  • img : gcr.io/kubeflow-images-public/katib/v1alpha3/katib-db-manager RepoDigests:sha256:416109f83816bace0f6176c2185279b15c50f28a8c7c5d38caebe044e65e1073

Is this caused by the v1alpha2 version of Katib in the Vagrant box? In the Windows environment, with the latest container images, the result is slightly different, but the trial still stays in Running status and the log shows an error. If this is not a bug, can you let me know whether I’m testing in the wrong version environment? Thanks.
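
One quick way to compare the two environments is to read the Katib API version out of the image paths listed above. This is only a sketch: the helper name `katib_api_versions` is hypothetical, and it assumes the `.../katib/<api-version>/<component>` path layout seen in those image names.

```python
import re

# Captures the API version segment, e.g. "v1alpha2" in ".../katib/v1alpha2/katib-ui".
API_VERSION_RE = re.compile(r"/katib/(v\d+(?:alpha|beta)?\d*)/")


def katib_api_versions(images):
    """Return the set of Katib API versions found in a list of image names."""
    versions = set()
    for image in images:
        match = API_VERSION_RE.search(image)
        if match:
            versions.add(match.group(1))
    return versions


vagrant_images = [
    "gcr.io/kubeflow-images-public/katib/v1alpha2/katib-ui",
    "gcr.io/kubeflow-images-public/katib/v1alpha2/katib-controller",
]
hyperv_images = [
    "gcr.io/kubeflow-images-public/katib/v1alpha3/katib-ui",
    "gcr.io/kubeflow-images-public/katib/v1alpha3/katib-controller",
]

print(sorted(katib_api_versions(vagrant_images)))
print(sorted(katib_api_versions(hyperv_images)))
```

An example written for v1alpha3 running against images that report only v1alpha2 would be one thing to rule out first.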

