Katib v1alpha2 pytorchjob-example.yaml is failing
Katib v1alpha2 pytorchjob-example.yaml is failing. Please find the log below:
asis@paisky2:~/katib/katib/examples/v1alpha2$ kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha2/pytorchjob-example.yaml
experiment.kubeflow.org/random-experiment created
asis@paisky2:~/katib/katib/examples/v1alpha2$ kubectl get experiment -n kubeflow
NAME                STATUS    AGE
random-experiment   Running   14s
asis@paisky2:~/katib/katib/examples/v1alpha2$ kubectl get pods -n kubeflow
NAME                                                    READY   STATUS    RESTARTS   AGE
katib-controller-7f985d6cd6-6cht6                       1/1     Running   17         19h
katib-db-b48df7777-f626v                                1/1     Running   0          19h
katib-manager-7946dd5984-2pszt                          1/1     Running   8          19h
katib-manager-rest-647f694b7d-c99vc                     1/1     Running   0          19h
katib-suggestion-bayesianoptimization-94c87dd64-gwmps   1/1     Running   0          19h
katib-suggestion-grid-58d9dfb5fd-ltw9r                  1/1     Running   0          19h
katib-suggestion-hyperband-778bb768c8-5k4hv             1/1     Running   0          19h
katib-suggestion-nasrl-d84fbb8f4-szzmr                  1/1     Running   0          19h
katib-suggestion-random-7f96c4d77b-llgb5                1/1     Running   0          19h
katib-ui-6cf97db464-85ndk                               1/1     Running   0          19h
random-experiment-8xsvj587-1564470900-rwwnt             0/1     Error     0          21s
random-experiment-dv2fzqvf-1564470900-ggkdv             0/1     Error     0          21s
random-experiment-fbjlt5tq-1564470900-7mc9x             0/1     Error     0          21s
asis@paisky2:~/katib/katib/examples/v1alpha2$ kubectl -n kubeflow logs random-experiment-8xsvj587-1564470900-rwwnt
I0730 07:15:05.651777 1 main.go:61] Experiment Name: random-experiment, Trial Name: random-experiment-8xsvj587, Job Kind: PyTorchJob
F0730 07:15:08.701325 1 main.go:75] Failed to collect logs: No Pods are found in Trial random-experiment-8xsvj587
goroutine 1 [running]:
github.com/kubeflow/katib/vendor/k8s.io/klog.stacks(0xc000392500, 0xc00015e1c0, 0x78, 0xe0)
/go/src/github.com/kubeflow/katib/vendor/k8s.io/klog/klog.go:830 +0xb1
github.com/kubeflow/katib/vendor/k8s.io/klog.(*loggingT).output(0x1e83b20, 0xc000000003, 0xc0003676c0, 0x1e0ea00, 0x7, 0x4b, 0x0)
/go/src/github.com/kubeflow/katib/vendor/k8s.io/klog/klog.go:781 +0x25e
github.com/kubeflow/katib/vendor/k8s.io/klog.(*loggingT).printf(0x1e83b20, 0x7ffe00000003, 0x1255157, 0x1a, 0xc0002a1f18, 0x1, 0x1)
/go/src/github.com/kubeflow/katib/vendor/k8s.io/klog/klog.go:678 +0x14e
github.com/kubeflow/katib/vendor/k8s.io/klog.Fatalf(...)
/go/src/github.com/kubeflow/katib/vendor/k8s.io/klog/klog.go:1209
main.main()
/go/src/github.com/kubeflow/katib/cmd/metricscollector/v1alpha2/main.go:75 +0x4e7
asis@paisky2:~/katib/katib/examples/v1alpha2$
- Kubeflow version: 0.6.1
- Kubernetes version (from `kubectl version`): v1.15.1
- OS (e.g. from `/etc/os-release`): Ubuntu 18.04.2 LTS
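The fatal line `Failed to collect logs: No Pods are found in Trial random-experiment-8xsvj587` comes from the Katib metrics collector (`cmd/metricscollector/v1alpha2/main.go`): it found no pods belonging to the trial's PyTorchJob. The `kubectl get pods -n kubeflow` listing above also contains no PyTorch operator pod, which is consistent with the suggestion in the comments to install the PyTorch CRD and operator. The minimal POSIX shell sketch below checks a pod listing for a running operator pod. It assumes the operator's pods are named with a `pytorch-operator-` prefix; that name is an assumption, not something stated in this issue.

```shell
#!/bin/sh
# Sketch: report whether `kubectl get pods` output contains a running PyTorch
# operator pod. ASSUMPTION: the operator Deployment is named "pytorch-operator",
# so its pods start with "pytorch-operator-".
check_pytorch_operator() {
  # Read the listing on stdin, skip the header row, and require column 3
  # (STATUS) to be "Running" for a pod whose NAME has the expected prefix.
  if awk 'NR > 1 && $1 ~ /^pytorch-operator-/ && $3 == "Running" { found = 1 }
          END { exit !found }'; then
    echo "pytorch-operator: running"
  else
    echo "pytorch-operator: missing"
  fi
}

# Usage:
#   kubectl get pods -n kubeflow | check_pytorch_operator
```

Fed the pod listing from this report, the function prints `pytorch-operator: missing`, since none of the pods carries the assumed prefix.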
Top GitHub Comments
You could install the PyTorch CRD and operator from https://github.com/kubeflow/manifests/tree/master/pytorch-job using Kustomize.
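A minimal POSIX shell sketch of that suggestion. The two kustomization directories (`pytorch-job/pytorch-job-crds/base` and `pytorch-job/pytorch-operator/base`) and the `MANIFESTS_DIR` and `DRY_RUN` variables are assumptions for illustration; check the current layout of the manifests repository before running it. Setting `DRY_RUN=1` only prints the commands.

```shell
#!/bin/sh
# Sketch: install the PyTorchJob CRD and the PyTorch operator with Kustomize.
# ASSUMPTION: these kustomization paths match the kubeflow/manifests layout
# (https://github.com/kubeflow/manifests/tree/master/pytorch-job); verify them first.
install_pytorch_operator() {
  # Local clone of kubeflow/manifests (hypothetical default location).
  manifests_dir="${MANIFESTS_DIR:-./manifests}"
  # Apply the CRD before the operator that watches PyTorchJob resources.
  for dir in "$manifests_dir/pytorch-job/pytorch-job-crds/base" \
             "$manifests_dir/pytorch-job/pytorch-operator/base"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      # Only print what would be applied.
      echo "kustomize build $dir | kubectl apply -f -"
    else
      # Render first, so a kustomize failure stops before kubectl runs.
      rendered=$(kustomize build "$dir") || return 1
      printf '%s\n' "$rendered" | kubectl apply -f - || return 1
    fi
  done
}

# Usage (against a live cluster):
#   git clone https://github.com/kubeflow/manifests.git
#   install_pytorch_operator
#   kubectl get crd pytorchjobs.kubeflow.org
```

Once `pytorchjobs.kubeflow.org` is listed and the operator pod is running, it is probably simplest to delete and re-create the experiment so that fresh trials are scheduled.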
@gaocegege: Closing this issue.