MetricsUnavailable for random example experiment

/kind bug

What steps did you take and what happened: Installed Kubeflow on a clean EKS cluster using this guide: https://www.kubeflow.org/docs/aws/deploy/install-kubeflow/

Submitted random-experiment from UI.

After an hour there is still no status, logs or metrics.

Pods that were spawned by Trial’s Job have logs.

What did you expect to happen: Metrics to show on Katib UI. Experiment to finish.

Anything else you would like to add:

Pod logs:

INFO:root:Epoch[19] Batch [100]	Speed: 27953.82 samples/sec	accuracy=0.116646
INFO:root:Epoch[19] Batch [200]	Speed: 24880.15 samples/sec	accuracy=0.111406
INFO:root:Epoch[19] Batch [300]	Speed: 23859.95 samples/sec	accuracy=0.112344
INFO:root:Epoch[19] Batch [400]	Speed: 27594.30 samples/sec	accuracy=0.115937
INFO:root:Epoch[19] Batch [500]	Speed: 18158.40 samples/sec	accuracy=0.115312
INFO:root:Epoch[19] Batch [600]	Speed: 26611.55 samples/sec	accuracy=0.102188
INFO:root:Epoch[19] Batch [700]	Speed: 27180.25 samples/sec	accuracy=0.114687
INFO:root:Epoch[19] Batch [800]	Speed: 27309.44 samples/sec	accuracy=0.113906
INFO:root:Epoch[19] Batch [900]	Speed: 26656.13 samples/sec	accuracy=0.105313
INFO:root:Epoch[19] Train-accuracy=0.122044
INFO:root:Epoch[19] Time cost=2.383
INFO:root:Epoch[19] Validation-accuracy=0.113854

Description of a Trial:

Name:         hptest-bt8nsw2z
Namespace:    kubeflow
Labels:       experiment=hptest
Annotations:  <none>
API Version:  kubeflow.org/v1alpha3
Kind:         Trial
Metadata:
  Creation Timestamp:  2020-01-28T16:18:39Z
  Finalizers:
    clean-metrics-in-db
  Generation:  1
  Owner References:
    API Version:           kubeflow.org/v1alpha3
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Experiment
    Name:                  hptest
    UID:                   ab0e3811-41e9-11ea-a0cf-0a9ff0751f4a
  Resource Version:        127230
  Self Link:               /apis/kubeflow.org/v1alpha3/namespaces/kubeflow/trials/hptest-bt8nsw2z
  UID:                     d810a019-41e9-11ea-a0cf-0a9ff0751f4a
Spec:
  Metrics Collector:
  Objective:
    Additional Metric Names:
      accuracy
    Goal:                   0.99
    Objective Metric Name:  Validation-accuracy
    Type:                   maximize
  Parameter Assignments:
    Name:    --lr
    Value:   0.020744080613308936
    Name:    --num-layers
    Value:   3
    Name:    --optimizer
    Value:   sgd
  Run Spec:  apiVersion: batch/v1
kind: Job
metadata:
  name: hptest-bt8nsw2z
  namespace: kubeflow
spec:
  template:
    spec:
      containers:
      - name: hptest-bt8nsw2z
        image: docker.io/katib/mxnet-mnist-example
        command:
        - "python"
        - "/mxnet/example/image-classification/train_mnist.py"
        - "--batch-size=64"
        - "--lr=0.020744080613308936"
        - "--num-layers=3"
        - "--optimizer=sgd"
      restartPolicy: Never
Status:
  Conditions:
    Last Transition Time:  2020-01-28T16:18:39Z
    Last Update Time:      2020-01-28T16:18:39Z
    Message:               Trial is created
    Reason:                TrialCreated
    Status:                True
    Type:                  Created
    Last Transition Time:  2020-01-28T16:19:43Z
    Last Update Time:      2020-01-28T16:19:43Z
    Message:               Trial is running
    Reason:                TrialRunning
    Status:                False
    Type:                  Running
    Last Transition Time:  2020-01-28T16:19:43Z
    Last Update Time:      2020-01-28T16:19:43Z
    Message:               Metrics are not available
    Reason:                MetricsUnavailable
    Status:                False
    Type:                  Succeeded
  Start Time:              2020-01-28T16:18:39Z
Events:
  Type     Reason              Age                 From              Message
  ----     ------              ----                ----              -------
  Warning  MetricsUnavailable  31m (x2 over 139m)  trial-controller  Metrics are not available for Job hptest-bt8nsw2z

Environment:

  • Kubeflow version: 0.7.1 from https://raw.githubusercontent.com/kubeflow/manifests/v0.7-branch/kfdef/kfctl_aws.0.7.1.yaml
  • Minikube version: N/A. Deployed on EKS.
  • Kubernetes version: (use kubectl version): version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.9-eks-c0eccc", GitCommit:"c0eccca51d7500bb03b2f163dd8d534ffeb2f7a2", GitTreeState:"clean", BuildDate:"2019-12-22T23:14:11Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}
  • OS (e.g. from /etc/os-release): Amazon Linux (?)

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Reactions: 2
  • Comments: 28 (9 by maintainers)

Top GitHub Comments

3 reactions
andreyvelich commented, Feb 4, 2020

Can you try to do this:

  1. kubectl edit namespace kubeflow
  2. Delete label control-plane=kubeflow
  3. Save changes
  4. Run examples again

This may fix the problem with MetricsCollector. You don’t need to make any changes to the Katib examples.

1 reaction
jimmy-hawkfish commented, Aug 21, 2020

Sorry @andreyvelich, I misspecified the objective metric name, and that was what caused the issue. Thanks for the help and the rapid responses.
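
A misspecified objective metric name, as in the comment above, can be caught by checking a trial pod's logs for `name=value` pairs before submitting an experiment. The sketch below is an illustration only: the regular expression and the helpers `find_metrics` and `missing_metrics` are hypothetical and are not Katib's actual collector logic. The sample lines are copied from the pod logs in the issue, and the match is assumed to be case-sensitive.

```python
import re

# Hypothetical "name=value" filter for illustration; NOT Katib's real parser.
# Names may contain word characters and hyphens (e.g. "Validation-accuracy").
METRIC_PATTERN = re.compile(
    r"([\w|-]+)\s*=\s*([+-]?\d*\.?\d+(?:[eE][+-]?\d+)?)"
)

# Sample lines copied from the pod logs in the issue.
SAMPLE_LOG = [
    "INFO:root:Epoch[19] Batch [900]\tSpeed: 26656.13 samples/sec\taccuracy=0.105313",
    "INFO:root:Epoch[19] Train-accuracy=0.122044",
    "INFO:root:Epoch[19] Time cost=2.383",
    "INFO:root:Epoch[19] Validation-accuracy=0.113854",
]


def find_metrics(log_lines, metric_names):
    """Return {name: [values]} for each requested metric name found in the logs."""
    found = {name: [] for name in metric_names}
    for line in log_lines:
        for name, value in METRIC_PATTERN.findall(line):
            if name in found:  # exact, case-sensitive comparison
                found[name].append(float(value))
    return found


def missing_metrics(log_lines, metric_names):
    """Return the requested metric names that never appear in the logs."""
    found = find_metrics(log_lines, metric_names)
    return [name for name in metric_names if not found[name]]


if __name__ == "__main__":
    # Names from the Trial spec in the issue: objective plus additional metric.
    print(missing_metrics(SAMPLE_LOG, ["Validation-accuracy", "accuracy"]))
    # A differently-cased name is reported as missing under this sketch.
    print(missing_metrics(SAMPLE_LOG, ["validation-accuracy"]))
```

If `missing_metrics` returns a non-empty list for the names in the experiment's objective, the names in the spec likely do not match what the training script prints.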
