
Katib experiment is stuck at creating

See original GitHub issue

/kind bug

What steps did you take and what happened: [A clear and concise description of what the bug is.]

import os
import tensorflow as tf
import numpy as np
import argparse
from datetime import datetime, timezone

def train():
    print("TensorFlow version: ", tf.__version__)

    parser = argparse.ArgumentParser()
    parser.add_argument('--learning_rate', default=0.01, type=float)
    parser.add_argument('--dropout', default=0.2, type=float)
    args = parser.parse_args()

    mnist = tf.keras.datasets.mnist

    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0

    # Reserve 10,000 samples for validation
    x_val = x_train[-10000:]
    y_val = y_train[-10000:]
    x_train = x_train[:-10000]
    y_train = y_train[:-10000]

    model = tf.keras.models.Sequential([
      tf.keras.layers.Flatten(input_shape=(28, 28)),
      tf.keras.layers.Dense(128, activation='relu'),
      tf.keras.layers.Dropout(args.dropout),
      tf.keras.layers.Dense(10, activation='softmax')
    ])

    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=args.learning_rate),
                  loss='sparse_categorical_crossentropy',
                  metrics=['acc'])

    print("Training...")

    katib_metric_log_callback = KatibMetricLog()
    training_history = model.fit(x_train, y_train, batch_size=64, epochs=10,
                                 validation_data=(x_val, y_val),
                                 callbacks=[katib_metric_log_callback])

    print("\\ntraining_history:", training_history.history)

    # Evaluate the model on the test data using `evaluate`
    print('\n# Evaluate on test data')
    results = model.evaluate(x_test, y_test, batch_size=128)
    print('test loss, test acc:', results)


class KatibMetricLog(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        # RFC 3339
        local_time = datetime.now(timezone.utc).astimezone().isoformat()
        print("\\nEpoch {}".format(epoch+1))
        print("{} accuracy={:.4f}".format(local_time, logs['acc']))
        print("{} loss={:.4f}".format(local_time, logs['loss']))
        print("{} Validation-accuracy={:.4f}".format(local_time, logs['val_acc']))
        print("{} Validation-loss={:.4f}".format(local_time, logs['val_loss']))


if __name__ == '__main__':
    train()
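
For context, Katib's StdOut metrics collector (the Collector Kind shown in the describe output further below) scrapes the trial's stdout for lines of the form metric-name=value, which is exactly what the KatibMetricLog callback prints at the end of each epoch. As a quick sanity check before building the image, the script can be run locally; this assumes TensorFlow is installed, and the argument and metric values shown are hypothetical:

python3 katib-mnist-random-job.py --learning_rate=0.05 --dropout=0.3
...
Epoch 10
2021-05-28T16:10:12.482095+09:00 accuracy=0.9784
2021-05-28T16:10:12.482095+09:00 loss=0.0743
2021-05-28T16:10:12.482095+09:00 Validation-accuracy=0.9721
2021-05-28T16:10:12.482095+09:00 Validation-loss=0.0912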

I then applied the following YAML file, which runs the Python file above as the trial job.

apiVersion: "kubeflow.org/v1beta1"
kind: Experiment
metadata:
  namespace: plask
  name: random-job-example
spec:
  parallelTrialCount: 1
  maxTrialCount: 12
  maxFailedTrialCount: 3
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: Validation-accuracy
    additionalMetricNames:
      - accuracy
  algorithm:
    algorithmName: random
  parameters:
    - name: --learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.2"
    - name: --dropout
      parameterType: double
      feasibleSpace:
        min: "0.1"
        max: "0.5"
  trialTemplate:
    failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
    primaryContainerName: random-job-container
    successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
    trialParameters:
    - description: Learning rate for the training model
      name: learningRate
      reference: learning_rate
    - description: Dropout rate of training model layers
      name: dropout
      reference: dropout
    trialSpec:
      apiVersion: "kubeflow.org/v1"
      kind: Job
      metadata:
        annotations:
          sidecar.istio.io/inject: "false"
      spec:
        template:
          spec:
            containers:
            - name: random-job-container
              image: plask/katib-mnist-job:0.0.1
              command:
              - "python3"
              - "/app/katib-mnist-random-job.py"
              - --learning_rate=${trialParameters.learningRate}
              - --dropout=${trialParameters.dropout}
            restartPolicy: Never


The experiment pod is running, but the experiment stays stuck in the Created state. The following is the output of describing the experiment.
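
The output below was presumably produced with a command like this (name and namespace taken from the manifest above):

kubectl describe experiment random-job-example -n plask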

Name:         random-job-example
Namespace:    plask
Labels:       <none>
Annotations:  <none>
API Version:  kubeflow.org/v1beta1
Kind:         Experiment
Metadata:
  Creation Timestamp:  2021-05-28T07:04:37Z
  Finalizers:
    update-prometheus-metrics
  Generation:  1
  Managed Fields:
    API Version:  kubeflow.org/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
      f:status:
        .:
        f:completionTime:
        f:conditions:
        f:currentOptimalTrial:
          .:
          f:bestTrialName:
          f:observation:
            .:
            f:metrics:
          f:parameterAssignments:
        f:startTime:
    Manager:      katib-controller
    Operation:    Update
    Time:         2021-05-28T07:04:37Z
    API Version:  kubeflow.org/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
      f:spec:
        .:
        f:algorithm:
          .:
          f:algorithmName:
        f:maxFailedTrialCount:
        f:maxTrialCount:
        f:objective:
          .:
          f:additionalMetricNames:
          f:goal:
          f:objectiveMetricName:
          f:type:
        f:parallelTrialCount:
        f:parameters:
        f:trialTemplate:
          .:
          f:failureCondition:
          f:primaryContainerName:
          f:successCondition:
          f:trialParameters:
          f:trialSpec:
            .:
            f:apiVersion:
            f:kind:
            f:metadata:
              .:
              f:annotations:
                .:
                f:sidecar.istio.io/inject:
            f:spec:
              .:
              f:template:
                .:
                f:spec:
                  .:
                  f:containers:
                  f:restartPolicy:
    Manager:         kubectl-client-side-apply
    Operation:       Update
    Time:            2021-05-28T07:04:37Z
  Resource Version:  564107
  UID:               6b69ffea-df16-4879-a6ba-87aaf31d7203
Spec:
  Algorithm:
    Algorithm Name:        random
  Max Failed Trial Count:  3
  Max Trial Count:         12
  Metrics Collector Spec:
    Collector:
      Kind:  StdOut
  Objective:
    Additional Metric Names:
      accuracy
    Goal:  0.99
    Metric Strategies:
      Name:                 Validation-accuracy
      Value:                max
      Name:                 accuracy
      Value:                max
    Objective Metric Name:  Validation-accuracy
    Type:                   maximize
  Parallel Trial Count:     1
  Parameters:
    Feasible Space:
      Max:           0.2
      Min:           0.01
    Name:            --learning_rate
    Parameter Type:  double
    Feasible Space:
      Max:           0.5
      Min:           0.1
    Name:            --dropout
    Parameter Type:  double
  Resume Policy:     LongRunning
  Trial Template:
    Failure Condition:       status.conditions.#(type=="Failed")#|#(status=="True")#
    Primary Container Name:  random-job-container
    Success Condition:       status.conditions.#(type=="Complete")#|#(status=="True")#
    Trial Parameters:
      Description:  Learning rate for the training model
      Name:         learningRate
      Reference:    learning_rate
      Description:  Dropout rate of training model layers
      Name:         dropout
      Reference:    dropout
    Trial Spec:
      API Version:  kubeflow.org/v1
      Kind:         Job
      Metadata:
        Annotations:
          sidecar.istio.io/inject:  false
      Spec:
        Template:
          Spec:
            Containers:
              Command:
                python3
                /app/katib-mnist-random-job.py
                --learning_rate=${trialParameters.learningRate}
                --dropout=${trialParameters.dropout}
              Image:         plask/katib-mnist-job:0.0.1
              Name:          random-job-container
            Restart Policy:  Never
Status:
  Completion Time:  <nil>
  Conditions:
    Last Transition Time:  2021-05-28T07:04:37Z
    Last Update Time:      2021-05-28T07:04:37Z
    Message:               Experiment is created
    Reason:                ExperimentCreated
    Status:                True
    Type:                  Created
  Current Optimal Trial:
    Best Trial Name:  
    Observation:
      Metrics:              <nil>
    Parameter Assignments:  <nil>
  Start Time:               2021-05-28T07:04:37Z
Events:                     <none>

How can I get my experiment to run normally?

What did you expect to happen:

Anything else you would like to add: [Miscellaneous information that will assist in solving the issue.]

Environment:

  • Kubeflow version (kfctl version): 1.20
  • Minikube version (minikube version): x
  • Kubernetes version: (use kubectl version): 1.20
  • OS (e.g. from /etc/os-release): ubuntu 18.04

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 9 (6 by maintainers)

Top GitHub Comments

3 reactions
seoha-kim commented, Jun 1, 2021

Yes, thank you so much! Thanks to you, the experiments are now starting...! I also verified that it runs with other tutorials, and it seems to work fine! Matching the format was the problem... Thanks to your help, something that had been giving me trouble for over a week is finally resolved. I'll make good use of it!

2 reactions
gaocegege commented, Jun 2, 2021

Thanks, learned some Korean from the issue 😄
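
The corrected manifest is not shown in the thread, but for anyone hitting the same symptom, two details in the spec posted above are worth checking. The sketch below is an assumption based on the posted YAML, not the confirmed fix from this issue: a plain Kubernetes Job in trialSpec is normally declared with apiVersion batch/v1 rather than kubeflow.org/v1, and the names under spec.parameters should match the trialParameters references exactly (learning_rate / dropout rather than --learning_rate / --dropout), since trial parameters are resolved by that name.

spec:
  parameters:
    - name: learning_rate          # matches trialParameters[].reference
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.2"
    - name: dropout
      parameterType: double
      feasibleSpace:
        min: "0.1"
        max: "0.5"
  trialTemplate:
    primaryContainerName: random-job-container
    trialParameters:
      - name: learningRate
        reference: learning_rate
      - name: dropout
        reference: dropout
    trialSpec:
      apiVersion: batch/v1         # standard apiVersion for kind: Job
      kind: Job
      spec:
        template:
          spec:
            containers:
            - name: random-job-container
              image: plask/katib-mnist-job:0.0.1
              command:
                - "python3"
                - "/app/katib-mnist-random-job.py"
                - "--learning_rate=${trialParameters.learningRate}"
                - "--dropout=${trialParameters.dropout}"
            restartPolicy: Never

If the experiment still stays in Created after a change like this, the katib-controller logs are usually the next place to look for a validation or reconcile error.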

Read more comments on GitHub >

