random-example does not work
/kind bug
What steps did you take and what happened:
In katib web ui, I submitted https://github.com/kubeflow/katib/blob/7443f02c21/examples/v1alpha3/random-example.yaml as an experiment.
What did you expect to happen: This example works well.
Anything else you would like to add: In each trial, the pod panics.
Error logs of each trial pod:
I1130 23:53:52.422917 18 main.go:78] INFO:root:Epoch[19] Train-accuracy=0.122044
I1130 23:53:52.422934 18 main.go:78] INFO:root:Epoch[19] Time cost=3.282
I1130 23:53:52.550241 18 main.go:78] INFO:root:Epoch[19] Validation-accuracy=0.113854
F1130 23:53:53.003408 18 main.go:94] Failed to wait for worker container: Process 6 hadn't completed: open /var/log/katib/6.pid: no such file or directory
goroutine 1 [running]:
github.com/kubeflow/katib/vendor/k8s.io/klog.stacks(0xc000186100, 0xc000250000, 0xa0, 0x256)
/go/src/github.com/kubeflow/katib/vendor/k8s.io/klog/klog.go:830 +0xb8
github.com/kubeflow/katib/vendor/k8s.io/klog.(*loggingT).output(0x129ca40, 0xc000000003, 0xc000210000, 0x1236476, 0x7, 0x5e, 0x0)
/go/src/github.com/kubeflow/katib/vendor/k8s.io/klog/klog.go:781 +0x2d0
github.com/kubeflow/katib/vendor/k8s.io/klog.(*loggingT).printf(0x129ca40, 0x3, 0xc77e24, 0x27, 0xc00008dee8, 0x1, 0x1)
/go/src/github.com/kubeflow/katib/vendor/k8s.io/klog/klog.go:678 +0x14b
github.com/kubeflow/katib/vendor/k8s.io/klog.Fatalf(...)
/go/src/github.com/kubeflow/katib/vendor/k8s.io/klog/klog.go:1209
main.main()
/go/src/github.com/kubeflow/katib/cmd/metricscollector/v1alpha3/file-metricscollector/main.go:94 +0x279
Images I use:
docker images | grep suggestion
gcr.io/kubeflow-images-public/katib/v1alpha3/suggestion-hyperopt latest 989d1ed70824 5 days ago 1.22GB
All other katib components are using images with tag v0.7.0.
Environment:
- Kubeflow version:
- Minikube version:
- Kubernetes version (use kubectl version): 1.16.3
- OS (e.g. from /etc/os-release):
Issue Analytics
- State:
- Created 4 years ago
- Comments:9 (9 by maintainers)
Top GitHub Comments
Yeah, the metrics collector captures the logs. The problem I had was that the limits were too wide, so the Pod didn’t get OOMKilled, but the node had SystemOOM warnings and SIGKILLed the container.
@janvdvegt So you can see the metrics collector container on your training job? Try increasing the resources for your training job; that may help.
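As a rough illustration of the suggestion above, resources for the training container can be declared with the standard Kubernetes `resources` block in the trial's pod spec. This is only a sketch: the container name, image placeholder, and all numbers are hypothetical and are not taken from this issue or from random-example.yaml.

```yaml
# Hypothetical fragment of the training container spec inside a trial template.
# Names and values are placeholders; size them to what the training job needs.
containers:
  - name: training-container        # hypothetical name
    image: <your-training-image>    # placeholder, not from the issue
    resources:
      requests:
        memory: "2Gi"               # assumed value
        cpu: "1"                    # assumed value
      limits:
        memory: "2Gi"               # assumed value
        cpu: "1"                    # assumed value
```

Setting the memory request close to the limit makes the scheduler reserve that much memory on the node. A container that outgrows its limit is then OOMKilled against its own limit, rather than the node running out of memory and killing it, which is the situation described in the earlier comment.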