Trials succeed but experiment hangs because metrics are not collected
See original GitHub issue
/kind bug
What steps did you take and what happened: I created an experiment with the stdout metrics collector. The trial container completes and I see the required metrics printed to stdout, but it looks like the metrics collector sidecar is not injected, and the experiment hangs on the first n trials.
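One way to confirm the suspicion above (that the sidecar was never injected) is to inspect the trial pod's container list. The following is a minimal sketch, not Katib code: it assumes the pod manifest has been loaded as a dict (for example from `kubectl get pod <name> -o json`), and the sidecar name `metrics-logger-and-collector` is an assumption that should be checked against the Katib version in use. The two sample manifests are hypothetical.

```python
# Assumption: the container name Katib gives the injected metrics collector
# sidecar. Verify this against your Katib release before relying on it.
ASSUMED_SIDECAR_NAME = "metrics-logger-and-collector"


def has_container(pod: dict, name: str = ASSUMED_SIDECAR_NAME) -> bool:
    """Return True if the pod spec lists a container with the given name."""
    containers = pod.get("spec", {}).get("containers", [])
    return any(c.get("name") == name for c in containers)


# Hypothetical manifests, shaped like `kubectl get pod <name> -o json` output.
pod_without_sidecar = {"spec": {"containers": [{"name": "training-container"}]}}
pod_with_sidecar = {
    "spec": {
        "containers": [
            {"name": "training-container"},
            {"name": ASSUMED_SIDECAR_NAME},
        ]
    }
}

if __name__ == "__main__":
    print("sidecar present:", has_container(pod_without_sidecar))
    print("sidecar present:", has_container(pod_with_sidecar))
```

If the check comes back negative for a trial pod, the problem is on the injection side rather than in the training code.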
What did you expect to happen: the trials would complete, the metrics would be collected and reported back, and the next n trials would start.
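Before blaming the collector, it can help to rule out a formatting problem in the trial output itself. The sketch below is a simplified stand-in and not Katib's actual parser: it assumes metrics are printed one per line as `<name>=<value>`, and the metric names `accuracy` and `loss` and the sample log are hypothetical.

```python
import re

# Assumption: each metric is printed on its own line as "<name>=<value>".
# This is a simplified format check, not the collector's real filter.
METRIC_LINE = re.compile(
    r"^\s*([\w-]+)\s*=\s*([+-]?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)\s*$"
)


def parse_metrics(stdout_text: str, wanted: set) -> dict:
    """Collect the last reported value for each wanted metric name."""
    found = {}
    for line in stdout_text.splitlines():
        match = METRIC_LINE.match(line)
        if match and match.group(1) in wanted:
            found[match.group(1)] = float(match.group(2))  # last value wins
    return found


# Hypothetical trial log and metric names.
sample_log = "epoch 1 done\naccuracy=0.91\nloss=0.25\naccuracy=0.93\n"
wanted_metrics = {"accuracy", "loss"}

if __name__ == "__main__":
    metrics = parse_metrics(sample_log, wanted_metrics)
    missing = wanted_metrics - metrics.keys()
    print("parsed:", metrics)
    print("missing:", sorted(missing))
```

If every expected metric parses cleanly, as the reporter describes, the output format is unlikely to be the cause and attention shifts back to sidecar injection.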
Anything else you would like to add:
Environment:
- Katib version (check the Katib controller image version): release-0.12
- Kubernetes version (kubectl version):
Client Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.9-dispatcher", GitCommit:"2a8027f41d28b788b001389f3091c245cd0a9a60", GitTreeState:"clean", BuildDate:"2022-01-21T20:26:49Z", GoVersion:"go1.16.12", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.6-gke.1500", GitCommit:"7ce0f9f1939dfc1aee910732e84cba03840df91e", GitTreeState:"clean", BuildDate:"2021-11-17T09:30:26Z", GoVersion:"go1.16.9b7", Compiler:"gc", Platform:"linux/amd64"}
- OS (uname -a): Container-Optimized OS from Google
Impacted by this bug? Give it a 👍 We prioritize the issues with the most 👍
Issue Analytics
- State:
- Created 2 years ago
- Comments:14 (7 by maintainers)
Top GitHub Comments
@iantowey Try to specify imagePullPolicy: Always in the Katib controller and redeploy the controller. Maybe the image has been cached on your cluster.
I believe the defaulter webhook is working, because I can see the metrics collector spec in your Experiment:
Since the reported issue is unrelated to Katib, closing this issue
/close
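The imagePullPolicy suggestion from the first comment can be sketched as a small manifest transformation. This is an illustration under assumptions, not the maintainers' procedure: the Deployment name, container name, and image below are hypothetical placeholders, and the function only edits a manifest held in memory as a dict. Applying the result to a cluster is left to your usual tooling.

```python
import copy


def with_pull_policy_always(deployment: dict) -> dict:
    """Return a copy of a Deployment manifest with imagePullPolicy: Always
    set on every container in the pod template."""
    patched = copy.deepcopy(deployment)  # leave the caller's manifest untouched
    pod_spec = patched["spec"]["template"]["spec"]
    for container in pod_spec.get("containers", []):
        container["imagePullPolicy"] = "Always"
    return patched


# Hypothetical manifest: names and image are placeholders, not real values.
controller = {
    "kind": "Deployment",
    "metadata": {"name": "katib-controller"},
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "katib-controller",
                        "image": "example.invalid/katib-controller:tag",
                    }
                ]
            }
        }
    },
}

if __name__ == "__main__":
    patched = with_pull_policy_always(controller)
    for c in patched["spec"]["template"]["spec"]["containers"]:
        print(c["name"], c["imagePullPolicy"])
```

With `Always`, the kubelet checks the registry for the image each time the container starts, which avoids running a stale cached image after redeploying.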