Katib experiment is stuck at "Creating"
/kind bug
What steps did you take and what happened:
import os
import tensorflow as tf
import numpy as np
import argparse
from datetime import datetime, timezone


def train():
    print("TensorFlow version: ", tf.__version__)
    parser = argparse.ArgumentParser()
    parser.add_argument('--learning_rate', default=0.01, type=float)
    parser.add_argument('--dropout', default=0.2, type=float)
    args = parser.parse_args()

    mnist = tf.keras.datasets.mnist
    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0

    # Reserve 10,000 samples for validation
    x_val = x_train[-10000:]
    y_val = y_train[-10000:]
    x_train = x_train[:-10000]
    y_train = y_train[:-10000]

    model = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dropout(args.dropout),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=args.learning_rate),
                  loss='sparse_categorical_crossentropy',
                  metrics=['acc'])

    print("Training...")
    katib_metric_log_callback = KatibMetricLog()
    training_history = model.fit(x_train, y_train, batch_size=64, epochs=10,
                                 validation_data=(x_val, y_val),
                                 callbacks=[katib_metric_log_callback])
    print("\ntraining_history:", training_history.history)

    # Evaluate the model on the test data using `evaluate`
    print('\n# Evaluate on test data')
    results = model.evaluate(x_test, y_test, batch_size=128)
    print('test loss, test acc:', results)


class KatibMetricLog(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        # RFC 3339 timestamp
        local_time = datetime.now(timezone.utc).astimezone().isoformat()
        print("\nEpoch {}".format(epoch + 1))
        print("{} accuracy={:.4f}".format(local_time, logs['acc']))
        print("{} loss={:.4f}".format(local_time, logs['loss']))
        print("{} Validation-accuracy={:.4f}".format(local_time, logs['val_acc']))
        print("{} Validation-loss={:.4f}".format(local_time, logs['val_loss']))


if __name__ == '__main__':
    train()
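For context on what the callback's output has to look like: the Katib StdOut metrics collector scrapes the trial's log lines with a regex filter. The snippet below is a sketch that assumes Katib's documented default filter, `([\w|-]+)\s*=\s*((-?\d+)(\.\d+)?)` (verify against the katib-config in your deployment); it shows which name/value pairs the collector would extract from a line printed by `KatibMetricLog`:

```python
import re

# Assumed default metrics filter for Katib's StdOut collector
# (from the katib-config defaults; verify against your deployment).
DEFAULT_FILTER = r"([\w|-]+)\s*=\s*((-?\d+)(\.\d+)?)"

def extract_metrics(line):
    """Return the (name, value) pairs a StdOut collector would scrape."""
    return [(m.group(1), float(m.group(2)))
            for m in re.finditer(DEFAULT_FILTER, line)]

# A line in the format printed by KatibMetricLog above:
print(extract_metrics("2021-05-28T07:10:00+00:00 Validation-accuracy=0.9732"))
```

Note that the metric names on the left of `=` ("accuracy", "Validation-accuracy", ...) are what must appear in the experiment's `objectiveMetricName` and `additionalMetricNames`.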
I ran the .py file above through the following Experiment YAML:
apiVersion: "kubeflow.org/v1beta1"
kind: Experiment
metadata:
  namespace: plask
  name: random-job-example
spec:
  parallelTrialCount: 1
  maxTrialCount: 12
  maxFailedTrialCount: 3
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: Validation-accuracy
    additionalMetricNames:
      - accuracy
  algorithm:
    algorithmName: random
  parameters:
    - name: --learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.2"
    - name: --dropout
      parameterType: double
      feasibleSpace:
        min: "0.1"
        max: "0.5"
  trialTemplate:
    failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
    primaryContainerName: random-job-container
    successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
    trialParameters:
      - description: Learning rate for the training model
        name: learningRate
        reference: learning_rate
      - description: Dropout rate of training model layers
        name: dropout
        reference: dropout
    trialSpec:
      apiVersion: "kubeflow.org/v1"
      kind: Job
      metadata:
        annotations:
          sidecar.istio.io/inject: "false"
      spec:
        template:
          spec:
            containers:
              - name: random-job-container
                image: plask/katib-mnist-job:0.0.1
                command:
                  - "python3"
                  - "/app/katib-mnist-random-job.py"
                  - --learning_rate=${trialParameters.learningRate}
                  - --dropout=${trialParameters.dropout}
            restartPolicy: Never
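One thing worth double-checking in the manifest above (this is how Katib's docs describe trial parameters, not a confirmed diagnosis of this issue): the controller substitutes `${trialParameters.learningRate}` by looking up `trialParameters[].reference` in `spec.parameters[].name`, so the two strings must match exactly. Here the parameters are named `--learning_rate`/`--dropout` while the references are `learning_rate`/`dropout`. A matching pair would look like this sketch:

```yaml
spec:
  parameters:
    - name: learning_rate          # must equal trialParameters[].reference
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.2"
  trialTemplate:
    trialParameters:
      - name: learningRate         # exposed as ${trialParameters.learningRate}
        reference: learning_rate   # must equal parameters[].name
```

With this naming, the flag prefix stays in the container command, e.g. `- --learning_rate=${trialParameters.learningRate}`.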
The experiment's pod is running, but the experiment never leaves the Created state. Here is the output of kubectl describe for the experiment:
Name:         random-job-example
Namespace:    plask
Labels:       <none>
Annotations:  <none>
API Version:  kubeflow.org/v1beta1
Kind:         Experiment
Metadata:
  Creation Timestamp:  2021-05-28T07:04:37Z
  Finalizers:
    update-prometheus-metrics
  Generation:  1
  Managed Fields:
    API Version:  kubeflow.org/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
      f:status:
        .:
        f:completionTime:
        f:conditions:
        f:currentOptimalTrial:
          .:
          f:bestTrialName:
          f:observation:
            .:
            f:metrics:
            f:parameterAssignments:
        f:startTime:
    Manager:      katib-controller
    Operation:    Update
    Time:         2021-05-28T07:04:37Z
    API Version:  kubeflow.org/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
      f:spec:
        .:
        f:algorithm:
          .:
          f:algorithmName:
        f:maxFailedTrialCount:
        f:maxTrialCount:
        f:objective:
          .:
          f:additionalMetricNames:
          f:goal:
          f:objectiveMetricName:
          f:type:
        f:parallelTrialCount:
        f:parameters:
        f:trialTemplate:
          .:
          f:failureCondition:
          f:primaryContainerName:
          f:successCondition:
          f:trialParameters:
          f:trialSpec:
            .:
            f:apiVersion:
            f:kind:
            f:metadata:
              .:
              f:annotations:
                .:
                f:sidecar.istio.io/inject:
            f:spec:
              .:
              f:template:
                .:
                f:spec:
                  .:
                  f:containers:
                  f:restartPolicy:
    Manager:         kubectl-client-side-apply
    Operation:       Update
    Time:            2021-05-28T07:04:37Z
  Resource Version:  564107
  UID:               6b69ffea-df16-4879-a6ba-87aaf31d7203
Spec:
  Algorithm:
    Algorithm Name:        random
  Max Failed Trial Count:  3
  Max Trial Count:         12
  Metrics Collector Spec:
    Collector:
      Kind:  StdOut
  Objective:
    Additional Metric Names:
      accuracy
    Goal:  0.99
    Metric Strategies:
      Name:                 Validation-accuracy
      Value:                max
      Name:                 accuracy
      Value:                max
    Objective Metric Name:  Validation-accuracy
    Type:                   maximize
  Parallel Trial Count:     1
  Parameters:
    Feasible Space:
      Max:           0.2
      Min:           0.01
    Name:            --learning_rate
    Parameter Type:  double
    Feasible Space:
      Max:           0.5
      Min:           0.1
    Name:            --dropout
    Parameter Type:  double
  Resume Policy:     LongRunning
  Trial Template:
    Failure Condition:       status.conditions.#(type=="Failed")#|#(status=="True")#
    Primary Container Name:  random-job-container
    Success Condition:       status.conditions.#(type=="Complete")#|#(status=="True")#
    Trial Parameters:
      Description:  Learning rate for the training model
      Name:         learningRate
      Reference:    learning_rate
      Description:  Dropout rate of training model layers
      Name:         dropout
      Reference:    dropout
    Trial Spec:
      API Version:  kubeflow.org/v1
      Kind:         Job
      Metadata:
        Annotations:
          sidecar.istio.io/inject:  false
      Spec:
        Template:
          Spec:
            Containers:
              Command:
                python3
                /app/katib-mnist-random-job.py
                --learning_rate=${trialParameters.learningRate}
                --dropout=${trialParameters.dropout}
              Image:         plask/katib-mnist-job:0.0.1
              Name:          random-job-container
            Restart Policy:  Never
Status:
  Completion Time:  <nil>
  Conditions:
    Last Transition Time:  2021-05-28T07:04:37Z
    Last Update Time:      2021-05-28T07:04:37Z
    Message:               Experiment is created
    Reason:                ExperimentCreated
    Status:                True
    Type:                  Created
  Current Optimal Trial:
    Best Trial Name:
    Observation:
      Metrics:                <nil>
      Parameter Assignments:  <nil>
  Start Time:                 2021-05-28T07:04:37Z
Events:                       <none>
How can I proceed with my experiment normally?
What did you expect to happen:
Anything else you would like to add:
Environment:
- Kubeflow version (kfctl version): 1.20
- Minikube version (minikube version): x
- Kubernetes version (kubectl version): 1.20
- OS (e.g. from /etc/os-release): Ubuntu 18.04
Issue Analytics
- Created: 2 years ago
- Comments: 9 (6 by maintainers)
Top GitHub Comments
Yes, thank you so much! Thanks to you the experiments are starting now…! I also verified it against other tutorials and it seems to run fine! Getting the format right was the problem after all… This resolved something that had been troubling me for over a week. Thank you!
Thanks, learned some Korean from the issue 😄