MetricsUnavailable for random example experiment
/kind bug
What steps did you take and what happened:
Installed Kubeflow on a clean EKS cluster using this guide: https://www.kubeflow.org/docs/aws/deploy/install-kubeflow/
Submitted random-experiment from the UI.
After an hour, there is still no status, logs, or metrics.
The pods spawned by the Trial's Job do have logs.
What did you expect to happen: Metrics to show on Katib UI. Experiment to finish.
Anything else you would like to add:
Pod logs:
INFO:root:Epoch[19] Batch [100] Speed: 27953.82 samples/sec accuracy=0.116646
INFO:root:Epoch[19] Batch [200] Speed: 24880.15 samples/sec accuracy=0.111406
INFO:root:Epoch[19] Batch [300] Speed: 23859.95 samples/sec accuracy=0.112344
INFO:root:Epoch[19] Batch [400] Speed: 27594.30 samples/sec accuracy=0.115937
INFO:root:Epoch[19] Batch [500] Speed: 18158.40 samples/sec accuracy=0.115312
INFO:root:Epoch[19] Batch [600] Speed: 26611.55 samples/sec accuracy=0.102188
INFO:root:Epoch[19] Batch [700] Speed: 27180.25 samples/sec accuracy=0.114687
INFO:root:Epoch[19] Batch [800] Speed: 27309.44 samples/sec accuracy=0.113906
INFO:root:Epoch[19] Batch [900] Speed: 26656.13 samples/sec accuracy=0.105313
INFO:root:Epoch[19] Train-accuracy=0.122044
INFO:root:Epoch[19] Time cost=2.383
INFO:root:Epoch[19] Validation-accuracy=0.113854
Description of a Trial:
Name: hptest-bt8nsw2z
Namespace: kubeflow
Labels: experiment=hptest
Annotations: <none>
API Version: kubeflow.org/v1alpha3
Kind: Trial
Metadata:
  Creation Timestamp: 2020-01-28T16:18:39Z
  Finalizers:
    clean-metrics-in-db
  Generation: 1
  Owner References:
    API Version: kubeflow.org/v1alpha3
    Block Owner Deletion: true
    Controller: true
    Kind: Experiment
    Name: hptest
    UID: ab0e3811-41e9-11ea-a0cf-0a9ff0751f4a
  Resource Version: 127230
  Self Link: /apis/kubeflow.org/v1alpha3/namespaces/kubeflow/trials/hptest-bt8nsw2z
  UID: d810a019-41e9-11ea-a0cf-0a9ff0751f4a
Spec:
  Metrics Collector:
  Objective:
    Additional Metric Names:
      accuracy
    Goal: 0.99
    Objective Metric Name: Validation-accuracy
    Type: maximize
  Parameter Assignments:
    Name: --lr
    Value: 0.020744080613308936
    Name: --num-layers
    Value: 3
    Name: --optimizer
    Value: sgd
  Run Spec: apiVersion: batch/v1
kind: Job
metadata:
  name: hptest-bt8nsw2z
  namespace: kubeflow
spec:
  template:
    spec:
      containers:
      - name: hptest-bt8nsw2z
        image: docker.io/katib/mxnet-mnist-example
        command:
        - "python"
        - "/mxnet/example/image-classification/train_mnist.py"
        - "--batch-size=64"
        - "--lr=0.020744080613308936"
        - "--num-layers=3"
        - "--optimizer=sgd"
      restartPolicy: Never
Status:
  Conditions:
    Last Transition Time: 2020-01-28T16:18:39Z
    Last Update Time: 2020-01-28T16:18:39Z
    Message: Trial is created
    Reason: TrialCreated
    Status: True
    Type: Created
    Last Transition Time: 2020-01-28T16:19:43Z
    Last Update Time: 2020-01-28T16:19:43Z
    Message: Trial is running
    Reason: TrialRunning
    Status: False
    Type: Running
    Last Transition Time: 2020-01-28T16:19:43Z
    Last Update Time: 2020-01-28T16:19:43Z
    Message: Metrics are not available
    Reason: MetricsUnavailable
    Status: False
    Type: Succeeded
  Start Time: 2020-01-28T16:18:39Z
Events:
  Type     Reason              Age                 From              Message
  ----     ------              ----                ----              -------
  Warning  MetricsUnavailable  31m (x2 over 139m)  trial-controller  Metrics are not available for Job hptest-bt8nsw2z
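To make the Trial conditions above easier to scan, here is a minimal Python sketch that lists every condition whose status is not "True". It assumes the conditions have been loaded as a list of dicts with the usual Kubernetes condition keys (type, status, reason, message), for example from the JSON form of the Trial. The function name blocking_conditions and the sample data layout are hypothetical and not part of Katib.

```python
def blocking_conditions(conditions):
    """Return (type, reason, message) for each condition whose status is not "True".

    `conditions` is assumed to be a list of dicts shaped like the
    status.conditions entries of the Trial (an assumption, not a Katib API).
    """
    return [
        (c.get("type"), c.get("reason"), c.get("message"))
        for c in conditions
        if c.get("status") != "True"
    ]


# Conditions copied from the Trial description in this report.
sample_conditions = [
    {"type": "Created", "status": "True",
     "reason": "TrialCreated", "message": "Trial is created"},
    {"type": "Running", "status": "False",
     "reason": "TrialRunning", "message": "Trial is running"},
    {"type": "Succeeded", "status": "False",
     "reason": "MetricsUnavailable", "message": "Metrics are not available"},
]

for cond_type, reason, message in blocking_conditions(sample_conditions):
    print(f"{cond_type}: {reason} ({message})")
```

For the conditions in this report, the Succeeded entry with reason MetricsUnavailable is the one that points at metrics collection rather than at the training job itself.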
Environment:
- Kubeflow version: 0.7.1, from https://raw.githubusercontent.com/kubeflow/manifests/v0.7-branch/kfdef/kfctl_aws.0.7.1.yaml
- Minikube version: N/A. Deployed on EKS.
- Kubernetes version: (use kubectl version): version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.9-eks-c0eccc", GitCommit:"c0eccca51d7500bb03b2f163dd8d534ffeb2f7a2", GitTreeState:"clean", BuildDate:"2019-12-22T23:14:11Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}
- OS (e.g. from /etc/os-release): Amazon Linux (?)
Issue Analytics
- Created 4 years ago
- Reactions: 2
- Comments: 28 (9 by maintainers)
Top GitHub Comments
Can you try to do this:
kubectl edit namespace kubeflow
control-plane=kubeflow
This may fix the problem with the MetricsCollector. You don't need to make any changes to the Katib examples.
Sorry @andreyvelich, I misspecified the objective metric name, and that was what caused the issue. Thanks for the help and the rapid responses.
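Since the root cause was a misspecified objective metric name, a quick local check of the pod logs against the configured metric names can catch this kind of mismatch. The Python sketch below assumes metrics are printed as name=value pairs, as in the pod logs in this report. The regular expression and the helper names find_metrics and missing_metrics are illustrative assumptions; they are not taken from Katib's metrics collector.

```python
import re

# Assumed log format: "<name>=<number>", as seen in the pod logs of this report.
METRIC_RE = re.compile(r"([\w-]+)\s*=\s*(-?\d+(?:\.\d+)?)")


def find_metrics(log_text):
    """Return {metric name: [values, ...]} for every name=value pair in the log."""
    found = {}
    for name, value in METRIC_RE.findall(log_text):
        found.setdefault(name, []).append(float(value))
    return found


def missing_metrics(log_text, expected_names):
    """Return the expected metric names that never appear in the log (exact match)."""
    found = find_metrics(log_text)
    return [name for name in expected_names if name not in found]


# A few lines copied from the pod logs in this report.
sample_log = """\
INFO:root:Epoch[19] Batch [900] Speed: 26656.13 samples/sec accuracy=0.105313
INFO:root:Epoch[19] Train-accuracy=0.122044
INFO:root:Epoch[19] Time cost=2.383
INFO:root:Epoch[19] Validation-accuracy=0.113854
"""

# Names as configured in the Trial's objective.
print(missing_metrics(sample_log, ["Validation-accuracy", "accuracy"]))
# A hypothetical misspelled name, to show what a mismatch looks like.
print(missing_metrics(sample_log, ["validation_accuracy"]))
```

The comparison is deliberately exact and case-sensitive, so a name such as validation_accuracy is reported as missing even though Validation-accuracy is present in the log.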