Doesn't change status tensorflow example (tfjob-example)
After deploying tfjob-example.yaml from the v1alpha3 examples on OSX Vagrant MiniKF, the trial stays in Running status and never changes, so no result is produced.
$ kubectl get trials -n kubeflow
NAME STATUS AGE
tfjhsjhs-rlcczvwm Running 15h
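To spot trials that are stuck like this without reading the table by eye, here is a minimal sketch that parses the plain-text output of `kubectl get trials` and lists the ones still in Running. The helper names `parse_kubectl_table` and `running_trials` are made up for illustration, and the parser assumes whitespace-separated columns with no spaces inside a value.

```python
def parse_kubectl_table(text):
    """Turn whitespace-separated `kubectl get` output into a list of dicts.

    Assumes the first non-empty line is the header and that no column
    value contains spaces (true for the outputs shown in this issue).
    """
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    return [dict(zip(header, ln.split())) for ln in lines[1:]]


def running_trials(text):
    """Return the names of trials whose STATUS column still reads Running."""
    return [row["NAME"] for row in parse_kubectl_table(text)
            if row.get("STATUS") == "Running"]


# Sample copied from the output above.
sample = """NAME STATUS AGE
tfjhsjhs-rlcczvwm Running 15h"""
print(running_trials(sample))  # ['tfjhsjhs-rlcczvwm']
```

On a live cluster the text could come from `subprocess.run(["kubectl", "get", "trials", "-n", "kubeflow"], capture_output=True, text=True).stdout`.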
With maxTrialCount set to 1, the TFJob container's log shows accuracy information after deployment, but Katib still reports Running status.
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting /tmp/tensorflow/mnist/input_data/t10k-labels-idx1-ubyte.gz
Accuracy at step 0: 0.1462
Accuracy at step 10: 0.6781
Accuracy at step 20: 0.8479
Accuracy at step 30: 0.8767
Accuracy at step 40: 0.8919
Accuracy at step 50: 0.8997
Accuracy at step 60: 0.9183
Accuracy at step 70: 0.9173
Accuracy at step 80: 0.9241
Accuracy at step 90: 0.9322
Adding run metadata for 99
...
Accuracy at step 900: 0.9498
Accuracy at step 910: 0.9503
Accuracy at step 920: 0.9494
Accuracy at step 930: 0.9422
Accuracy at step 940: 0.9518
Accuracy at step 950: 0.9411
Accuracy at step 960: 0.9504
Accuracy at step 970: 0.9444
Accuracy at step 980: 0.9498
Accuracy at step 990: 0.949
Adding run metadata for 999
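To confirm that training itself finished and produced a final accuracy even though the trial status never moved, a rough sketch like the following can pull the accuracy values out of the container log. It only illustrates reading the log text shown above; it is not how Katib's metrics collector works, and the function name `accuracy_by_step` is hypothetical.

```python
import re

# Matches lines such as "Accuracy at step 990: 0.949" from the log above.
ACCURACY_RE = re.compile(r"Accuracy at step (\d+): (\d+(?:\.\d+)?)")


def accuracy_by_step(log_text):
    """Return (step, accuracy) pairs in the order they appear in the log."""
    return [(int(step), float(acc)) for step, acc in ACCURACY_RE.findall(log_text)]


# A few entries copied from the log above.
log = (
    "Accuracy at step 0: 0.1462\n"
    "Accuracy at step 10: 0.6781\n"
    "Accuracy at step 990: 0.949\n"
    "Adding run metadata for 999\n"
)
points = accuracy_by_step(log)
last_step, last_acc = points[-1]
print(last_step, last_acc)  # 990 0.949
```

If the last pair is near the final step, the training container did its job, which points the problem at metrics collection or status reporting rather than at the TensorFlow code.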
Issue Analytics
- Created 4 years ago
- Comments:6 (2 by maintainers)
Top GitHub Comments
After upgrading the VM, it works well with the v1alpha3 images. Thank you. Below are the Vagrant box version and the Katib container versions.
[vagrant] arrikto/minikf (virtualbox, 20200305.0.1)
[katib docker]
gcr.io/kubeflow-images-public/katib/v1alpha3/katib-ui v0.8.0 RepoDigests: sha256:984eda360f3de59b5cd53eab48700780052bf97ec02d3366f681f4a15cff6d1d 540d9308c9f6 4 weeks ago 54.4MB
gcr.io/kubeflow-images-public/katib/v1alpha3/tfevent-metrics-collector v0.8.0 sha256:265c7a2a3a3da83637d098dae605bc37d95b574b42f88041c19b685292877af5 b9956282b11e 4 weeks ago 1.31GB
gcr.io/kubeflow-images-public/katib/v1alpha3/katib-controller v0.8.0 sha256:1f2ad821a913f19ebc200b54bd59b83ac65e782c8fbc5946c7db7f8aa9db0362 7c5162abd775 4 weeks ago 53.8MB
gcr.io/kubeflow-images-public/katib/v1alpha3/suggestion-hyperopt v0.8.0 sha256:803da644fd23bbca068d0b7f309e0b29e7c50cfa25c1c9cd89c3f76b65920019 56c0051f100c 4 weeks ago 1.23GB
gcr.io/kubeflow-images-public/katib/v1alpha3/katib-db-manager v0.8.0 sha256:60ace3d4dbb66eb346ee19b5ad08a998ecf668e68cf699065aedb623c48fe767 32229959fe81 4 weeks ago 28.5MB
$ kubectl get experiment -n kubeflow
NAME STATUS AGE
tfjob-example Succeeded 43m
$ kubectl get trials -n kubeflow
NAME TYPE STATUS AGE
tfjob-example-m7mp8vtp Succeeded True 43m
I tested two Katib setups, one on OSX Vagrant and one on Windows Hyper-V. In both, tfjob-example does not work and remains in Running status in the UI, but the container logs differ slightly.
cf.: random-example.yaml runs well in both environments. pytorchjob-example does not seem to work either.
1. vagrant virtualbox (arrikto/minikf)
The Katib Docker images built into the Vagrant box were v1alpha2.
When at the status timeline of a container that has run a job with a trial count of 1,
image ver info.
RepoTags(All images below) : v0.6.0-rc.0
img : gcr.io/kubeflow-images-public/katib/v1alpha2/katib-ui
RepoDigests :sha256:d38071758ea0d60a241cc1f75ac447bf7aa8d18667e9b0883bf6e8724b69948a
img : gcr.io/kubeflow-images-public/katib/v1alpha2/katib-controller
RepoDigests :sha256:d4773b412e198a656835e61c55789a79876ad0caae79f26fa5e914cc01bf4531
img : gcr.io/kubeflow-images-public/katib/v1alpha2/suggestion-nasrl
RepoDigests :sha256:95f3565a5af4bda4bbe395b95f5f90cfc0f48ec09b07e5e82f3ca3acab1b867c
img : gcr.io/kubeflow-images-public/katib/v1alpha2/suggestion-random
RepoDigests :sha256:ea2f342ebba83c815ae33817b8c4d2c34cdec70c21771c41eaafb5d315e610ad
img : gcr.io/kubeflow-images-public/katib/v1alpha2/bayesianoptimization
RepoDigests :sha256:c5c3605c383b59c84f8e18292538ad064f42ecc3d34655127be18e0162fdc013
img : gcr.io/kubeflow-images-public/katib/v1alpha2/suggestion-hyperband
RepoDigests :sha256:5f7bc0cf1179152f04b77c902d111ea7b636dd858979cd85f4b887880d6f0fdb
img : gcr.io/kubeflow-images-public/katib/v1alpha2/suggestion-grid
RepoDigests :sha256:5639d6d8dd882d7829ead58f7c82235a39cf37fda509c17677732f1eb9e79da2
img : gcr.io/kubeflow-images-public/katib/v1alpha2/katib-manager
RepoDigests :sha256:8dbe595c3a241ce65d29afb87a99453461b2c82338e54135dc8dfb4cb5ac8fa6
img : gcr.io/kubeflow-images-public/katib/v1alpha2/katib-manager-rest
RepoDigests :sha256:304070956d976385991bfba8d0f30172f6910c7017f0c740121cdae5dae6a53e
img : gcr.io/kubeflow-images-public/katib/v1alpha2/metrics-collector
RepoDigests :sha256:c66eb35d6d73bcb61064f0a94cea45dc0cf13a80fb002a5117efa43d9e60d968
2. windows hyper-v : latest
While in the Running state, the TensorFlow example on Windows with the latest Katib version showed an error log.
container status
kubeflow jhstf-random-bc59dd654-sm5lg 1/1 Running 0 2m56s
container log
image ver info. windows hyper-v
RepoTags(All images below) : latest
img : gcr.io/kubeflow-images-public/katib/v1alpha3/katib-ui RepoDigests:sha256:ff5ca291ba5bd0f515e2900abb59d522219114ac8d6ddcd1cc9d89905252c001
img : gcr.io/kubeflow-images-public/katib/v1alpha3/file-metrics-collector RepoDigests:sha256:48c46186c155ce7975252d13abf6261d0f6182f6b87ee5f85912744ed90ef8c9
img : gcr.io/kubeflow-images-public/katib/v1alpha3/katib-controller RepoDigests:sha256:531df1dc2ac8813436eb67895ae62fe2f3f5440df820f8a68d76db567ac21aa4
img : gcr.io/kubeflow-images-public/katib/v1alpha3/suggestion-hyperopt RepoDigests:sha256:20b6bdd3fd8171f6a838ad0c480d83349947cb6572695704787d2761700cecf2
img : gcr.io/kubeflow-images-public/katib/v1alpha3/katib-db-manager RepoDigests:sha256:416109f83816bace0f6176c2185279b15c50f28a8c7c5d38caebe044e65e1073
Is this caused by the Katib v1alpha2 version in the Vagrant box? In the Windows environment, with the latest container images, the results are slightly different, but the trial is still in Running status and there is an error log. If this is not a bug, can you let me know whether I'm testing in the wrong version environment? Thanks.
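One quick way to check the version question is to compare the API version embedded in the image paths with the version of the example being deployed. The sketch below assumes the path segment after `/katib/` (for example `v1alpha2`) reflects the API version the deployed components serve; the function names are hypothetical.

```python
import re

# Captures the version segment in paths like ".../katib/v1alpha2/katib-ui".
API_VERSION_RE = re.compile(r"/katib/(v\d+(?:alpha|beta)\d+)/")


def api_versions(image_refs):
    """Collect the distinct Katib API versions found in a list of image paths."""
    found = set()
    for ref in image_refs:
        match = API_VERSION_RE.search(ref)
        if match:
            found.add(match.group(1))
    return found


def matches_example(image_refs, example_version):
    """True only if every versioned image uses the example's API version."""
    return api_versions(image_refs) == {example_version}


# Two of the image paths listed for the Vagrant box above.
vagrant_images = [
    "gcr.io/kubeflow-images-public/katib/v1alpha2/katib-controller",
    "gcr.io/kubeflow-images-public/katib/v1alpha2/katib-manager",
]
print(api_versions(vagrant_images))                 # {'v1alpha2'}
print(matches_example(vagrant_images, "v1alpha3"))  # False
```

A `False` here means the v1alpha3 example is being applied to a cluster running v1alpha2 components, which is consistent with the report that the trial finished correctly after the VM was upgraded to v1alpha3 images.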