Katib example in docs is not working
/kind bug
What steps did you take and what happened:
I have a running Kubernetes cluster (two nodes, on-prem) and installed Kubeflow using the kfctl_k8s_istio config. Following Getting Started with Katib, I created the TensorFlow example and went through all 3 steps. This is my tfjob-example.yaml file:
apiVersion: "kubeflow.org/v1beta1"
kind: Experiment
metadata:
  namespace: kubeflow
  name: tfjob-example
spec:
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy_1
  algorithm:
    algorithmName: random
  metricsCollectorSpec:
    source:
      fileSystemPath:
        path: /train
        kind: Directory
    collector:
      kind: TensorFlowEvent
  parameters:
    - name: learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.05"
    - name: batch_size
      parameterType: int
      feasibleSpace:
        min: "100"
        max: "200"
  trialTemplate:
    primaryContainerName: tensorflow
    trialParameters:
      - name: learningRate
        description: Learning rate for the training model
        reference: learning_rate
      - name: batchSize
        description: Batch Size
        reference: batch_size
    trialSpec:
      apiVersion: "kubeflow.org/v1"
      kind: TFJob
      spec:
        tfReplicaSpecs:
          Worker:
            replicas: 2
            restartPolicy: OnFailure
            template:
              metadata:
                annotations:
                  sidecar.istio.io/inject: "false"
              spec:
                containers:
                  - name: tensorflow
                    image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
                    imagePullPolicy: Always
                    command:
                      - "python"
                      - "/var/tf_mnist/mnist_with_summaries.py"
                      - "--log_dir=/train/metrics"
                      - "--learning_rate=${trialParameters.learningRate}"
                      - "--batch_size=${trialParameters.batchSize}"
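For readers unfamiliar with the trial template, here is a rough Python sketch of what the spec above asks for: one random value per parameter inside its feasibleSpace, substituted into the `${trialParameters.<name>}` placeholders of the container command. This is only an illustration, not Katib's implementation; the helper names `sample_assignments` and `render_command` are made up for this example.

```python
import random
import re

# Search space copied from the Experiment spec above.
PARAMETERS = [
    {"name": "learning_rate", "type": "double", "min": "0.01", "max": "0.05"},
    {"name": "batch_size", "type": "int", "min": "100", "max": "200"},
]

# trialParameters: placeholder name -> search-space parameter it references.
TRIAL_PARAMETERS = {"learningRate": "learning_rate", "batchSize": "batch_size"}

COMMAND = [
    "python",
    "/var/tf_mnist/mnist_with_summaries.py",
    "--log_dir=/train/metrics",
    "--learning_rate=${trialParameters.learningRate}",
    "--batch_size=${trialParameters.batchSize}",
]

PLACEHOLDER = re.compile(r"\$\{trialParameters\.(\w+)\}")


def sample_assignments(rng):
    """Draw one random value per parameter (hypothetical helper, not Katib code)."""
    values = {}
    for p in PARAMETERS:
        if p["type"] == "int":
            values[p["name"]] = str(rng.randint(int(p["min"]), int(p["max"])))
        else:
            values[p["name"]] = str(rng.uniform(float(p["min"]), float(p["max"])))
    return values


def render_command(command, assignments):
    """Replace each ${trialParameters.<name>} with the value it references."""
    def substitute(match):
        return assignments[TRIAL_PARAMETERS[match.group(1)]]
    return [PLACEHOLDER.sub(substitute, arg) for arg in command]


if __name__ == "__main__":
    assignments = sample_assignments(random.Random(0))
    print(render_command(COMMAND, assignments))
```

Each trial's TFJob would then run with a concrete `--learning_rate` and `--batch_size`, which is consistent with the worker pods starting and training normally.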
What did you expect to happen:
I expected to see the graphs and results of the experiments in Katib, but all experiments remained in the Running status, although the logs of the experiment containers show that they are Completed.
Anything else you would like to add: It seems the observation_logs table is empty:
$ kubectl -n kubeflow exec -it katib-mysql-5df4dddc57-jzdqs -- bash
root@katib-mysql-5df4dddc57-jzdqs:/# mysql -D ${MYSQL_DATABASE} -u root -p${MYSQL_ROOT_PASSWORD} -e 'show tables;'
mysql: [Warning] Using a password on the command line interface can be insecure.
+------------------+
| Tables_in_katib  |
+------------------+
| observation_logs |
+------------------+
root@katib-mysql-5df4dddc57-jzdqs:/# mysql -D ${MYSQL_DATABASE} -u root -p${MYSQL_ROOT_PASSWORD}
mysql> select * from observation_logs;
Empty set (0.00 sec)
But I don't know why this happened or how to trace it. Everything else seems to be all right. Some other logs and debugging output that I collected:
$ kubectl get pods --all-namespaces | grep tfj
kubeflow tfjob-example-9sxb2jtg-worker-0 0/1 Completed 0 58m
kubeflow tfjob-example-9sxb2jtg-worker-1 0/1 Completed 0 58m
kubeflow tfjob-example-jtf9d96w-worker-0 0/1 Completed 0 58m
kubeflow tfjob-example-jtf9d96w-worker-1 0/1 Completed 0 58m
kubeflow tfjob-example-random-585dfc8499-r9g4x 1/1 Running 0 58m
kubeflow tfjob-example-twd8tsdk-worker-0 0/1 Completed 0 58m
kubeflow tfjob-example-twd8tsdk-worker-1 0/1 Completed 0 58m
$ kubectl -n kubeflow get experiments
NAME TYPE STATUS AGE
tfjob-example Running True 60m
$ kubectl -n kubeflow get trials
NAME TYPE STATUS AGE
tfjob-example-9sxb2jtg Running True 60m
tfjob-example-jtf9d96w Running True 60m
tfjob-example-twd8tsdk Running True 60m
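The mismatch in the three listings above can be checked mechanically. The following Python sketch takes the pasted `kubectl` output as plain text and reports trials that are still Running even though all of their worker pods show Completed. The function names are hypothetical, and the column positions are assumed from the output shown here.

```python
from collections import defaultdict

# Output pasted from the commands above.
PODS_OUTPUT = """\
kubeflow tfjob-example-9sxb2jtg-worker-0 0/1 Completed 0 58m
kubeflow tfjob-example-9sxb2jtg-worker-1 0/1 Completed 0 58m
kubeflow tfjob-example-jtf9d96w-worker-0 0/1 Completed 0 58m
kubeflow tfjob-example-jtf9d96w-worker-1 0/1 Completed 0 58m
kubeflow tfjob-example-random-585dfc8499-r9g4x 1/1 Running 0 58m
kubeflow tfjob-example-twd8tsdk-worker-0 0/1 Completed 0 58m
kubeflow tfjob-example-twd8tsdk-worker-1 0/1 Completed 0 58m
"""

TRIALS_OUTPUT = """\
NAME TYPE STATUS AGE
tfjob-example-9sxb2jtg Running True 60m
tfjob-example-jtf9d96w Running True 60m
tfjob-example-twd8tsdk Running True 60m
"""


def running_trials(trials_output):
    """Return trial names whose TYPE column is Running (header line skipped)."""
    trials = []
    for line in trials_output.splitlines()[1:]:
        fields = line.split()
        if len(fields) >= 2 and fields[1] == "Running":
            trials.append(fields[0])
    return trials


def pod_statuses_by_trial(pods_output, trial_names):
    """Group worker pod statuses by the trial whose name prefixes the pod name."""
    statuses = defaultdict(list)
    for line in pods_output.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        pod_name, status = fields[1], fields[3]
        for trial in trial_names:
            if pod_name.startswith(trial + "-worker-"):
                statuses[trial].append(status)
    return dict(statuses)


def stuck_trials(pods_output, trials_output):
    """Trials still Running although every one of their worker pods Completed."""
    running = running_trials(trials_output)
    statuses = pod_statuses_by_trial(pods_output, running)
    return [
        t for t in running
        if statuses.get(t) and all(s == "Completed" for s in statuses[t])
    ]


if __name__ == "__main__":
    print(stuck_trials(PODS_OUTPUT, TRIALS_OUTPUT))
```

For the output in this report, all three trials are flagged, which matches the symptom: training finished, but the trial status never moved on.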
$ kubectl -n kubeflow logs tfjob-example-9sxb2jtg-worker-0 --all-containers --tail=10
Accuracy at step 910: 0.9444
Accuracy at step 920: 0.9405
Accuracy at step 930: 0.9443
Accuracy at step 940: 0.9459
Accuracy at step 950: 0.9462
Accuracy at step 960: 0.9373
Accuracy at step 970: 0.9404
Accuracy at step 980: 0.945
Accuracy at step 990: 0.9485
Adding run metadata for 999
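As a sanity check on the training itself, the stdout lines above can be parsed to see what accuracy the worker actually reached. Note that the experiment is configured with a TensorFlowEvent collector on /train, so these stdout lines are not what the collector reads; the sketch below only confirms that the training produced sensible values. The regular expression is assumed from the log format shown here.

```python
import re

# Log tail pasted from the command above.
LOG_TAIL = """\
Accuracy at step 910: 0.9444
Accuracy at step 920: 0.9405
Accuracy at step 930: 0.9443
Accuracy at step 940: 0.9459
Accuracy at step 950: 0.9462
Accuracy at step 960: 0.9373
Accuracy at step 970: 0.9404
Accuracy at step 980: 0.945
Accuracy at step 990: 0.9485
Adding run metadata for 999
"""

GOAL = 0.99  # objective.goal from the Experiment spec

ACCURACY_LINE = re.compile(r"^Accuracy at step (\d+): ([0-9.]+)$")


def parse_accuracy(log_text):
    """Return (step, accuracy) pairs for every matching log line."""
    points = []
    for line in log_text.splitlines():
        match = ACCURACY_LINE.match(line.strip())
        if match:
            points.append((int(match.group(1)), float(match.group(2))))
    return points


points = parse_accuracy(LOG_TAIL)
best_step, best_value = max(points, key=lambda p: p[1])
print(f"best accuracy {best_value} at step {best_step}; goal reached: {best_value >= GOAL}")
```

In this tail the best value stays below the 0.99 goal, so the experiment would be expected to continue with further trials up to maxTrialCount rather than stop early. That by itself does not explain why the finished trials never leave Running.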
$ kubectl -n kubeflow logs -f katib-db-manager-59445ff6cb-wkcdp --all-containers
I0125 14:10:19.491012 1 init.go:11] Initializing v1beta1 DB schema
I0125 14:10:19.776431 1 main.go:92] Start Katib manager: 0.0.0.0:6789
$ kubectl -n kubeflow logs katib-controller-545bdfdb46-k6mlr --all-containers --tail=10
{"level":"info","ts":1612085695.0365138,"logger":"trial-controller","msg":"Trial status changed to Running","Trial":"kubeflow/tfjob-example-twd8tsdk"}
2021/01/31 09:34:55 http: TLS handshake error from 10.244.0.0:35509: remote error: tls: bad certificate
2021/01/31 09:34:55 http: TLS handshake error from 10.244.0.0:32642: remote error: tls: bad certificate
{"level":"info","ts":1612085695.1833804,"logger":"trial-controller","msg":"Creating Job","Trial":"kubeflow/tfjob-example-9sxb2jtg","kind":"TFJob","name":"tfjob-example-9sxb2jtg"}
{"level":"info","ts":1612085695.2675023,"logger":"trial-controller","msg":"Trial status changed to Running","Trial":"kubeflow/tfjob-example-9sxb2jtg"}
2021/01/31 09:34:56 http: TLS handshake error from 10.244.0.0:7037: remote error: tls: bad certificate
2021/01/31 09:34:56 http: TLS handshake error from 10.244.0.0:12279: remote error: tls: bad certificate
{"level":"info","ts":1612086144.2860768,"logger":"suggestion-controller","msg":"Sync assignments","Suggestion":"kubeflow/tfjob-example","Suggestion Requests":3,"Suggestion Count":3}
{"level":"info","ts":1612086144.2967129,"logger":"suggestion-controller","msg":"Sync assignments","Suggestion":"kubeflow/tfjob-example","Suggestion Requests":3,"Suggestion Count":3}
{"level":"info","ts":1612086144.3100634,"logger":"suggestion-controller","msg":"Sync assignments","Suggestion":"kubeflow/tfjob-example","Suggestion Requests":3,"Suggestion Count":3}
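The controller output mixes structured JSON lines with plain-text Go HTTP server lines, which makes it hard to scan. A small Python sketch like the one below separates the two and counts them. Whether the TLS handshake errors are related to the stuck trials is not established anywhere in this report, so the count is only a starting point for investigation. The parsing rules are assumptions based on the ten lines shown above.

```python
import json
import re
from collections import Counter

# Log lines pasted from the command above.
CONTROLLER_LOG = """\
{"level":"info","ts":1612085695.0365138,"logger":"trial-controller","msg":"Trial status changed to Running","Trial":"kubeflow/tfjob-example-twd8tsdk"}
2021/01/31 09:34:55 http: TLS handshake error from 10.244.0.0:35509: remote error: tls: bad certificate
2021/01/31 09:34:55 http: TLS handshake error from 10.244.0.0:32642: remote error: tls: bad certificate
{"level":"info","ts":1612085695.1833804,"logger":"trial-controller","msg":"Creating Job","Trial":"kubeflow/tfjob-example-9sxb2jtg","kind":"TFJob","name":"tfjob-example-9sxb2jtg"}
{"level":"info","ts":1612085695.2675023,"logger":"trial-controller","msg":"Trial status changed to Running","Trial":"kubeflow/tfjob-example-9sxb2jtg"}
2021/01/31 09:34:56 http: TLS handshake error from 10.244.0.0:7037: remote error: tls: bad certificate
2021/01/31 09:34:56 http: TLS handshake error from 10.244.0.0:12279: remote error: tls: bad certificate
{"level":"info","ts":1612086144.2860768,"logger":"suggestion-controller","msg":"Sync assignments","Suggestion":"kubeflow/tfjob-example","Suggestion Requests":3,"Suggestion Count":3}
{"level":"info","ts":1612086144.2967129,"logger":"suggestion-controller","msg":"Sync assignments","Suggestion":"kubeflow/tfjob-example","Suggestion Requests":3,"Suggestion Count":3}
{"level":"info","ts":1612086144.3100634,"logger":"suggestion-controller","msg":"Sync assignments","Suggestion":"kubeflow/tfjob-example","Suggestion Requests":3,"Suggestion Count":3}
"""

TLS_ERROR = re.compile(r"http: TLS handshake error from (\S+): (.+)$")


def summarize(log_text):
    """Count JSON log lines per logger and TLS handshake errors per reason."""
    loggers = Counter()
    tls_reasons = Counter()
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("{"):
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue  # not valid JSON after all; ignore the line
            loggers[entry.get("logger", "unknown")] += 1
            continue
        match = TLS_ERROR.search(line)
        if match:
            tls_reasons[match.group(2)] += 1
    return loggers, tls_reasons


if __name__ == "__main__":
    loggers, tls_reasons = summarize(CONTROLLER_LOG)
    print(dict(loggers))
    print(dict(tls_reasons))
```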
$ kubectl -n kubeflow get experiment tfjob-example -o yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"kubeflow.org/v1beta1","kind":"Experiment","metadata":{"annotations":{},"name":"tfjob-example","namespace":"kubeflow"},"spec":{"algorithm":{"algorithmName":"random"},"maxFailedTrialCount":3,"maxTrialCount":12,"metricsCollectorSpec":{"collector":{"kind":"TensorFlowEvent"},"source":{"fileSystemPath":{"kind":"Directory","path":"/train"}}},"objective":{"goal":0.99,"objectiveMetricName":"accuracy_1","type":"maximize"},"parallelTrialCount":3,"parameters":[{"feasibleSpace":{"max":"0.05","min":"0.01"},"name":"learning_rate","parameterType":"double"},{"feasibleSpace":{"max":"200","min":"100"},"name":"batch_size","parameterType":"int"}],"trialTemplate":{"primaryContainerName":"tensorflow","trialParameters":[{"description":"Learning rate for the training model","name":"learningRate","reference":"learning_rate"},{"description":"Batch Size","name":"batchSize","reference":"batch_size"}],"trialSpec":{"apiVersion":"kubeflow.org/v1","kind":"TFJob","spec":{"tfReplicaSpecs":{"Worker":{"replicas":2,"restartPolicy":"OnFailure","template":{"metadata":{"annotations":{"sidecar.istio.io/inject":"false"}},"spec":{"containers":[{"command":["python","/var/tf_mnist/mnist_with_summaries.py","--log_dir=/train/metrics","--learning_rate=${trialParameters.learningRate}","--batch_size=${trialParameters.batchSize}"],"image":"gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0","imagePullPolicy":"Always","name":"tensorflow"}]}}}}}}}}}
  creationTimestamp: "2021-01-31T09:34:38Z"
  finalizers:
  - update-prometheus-metrics
  generation: 1
  managedFields:
  - apiVersion: kubeflow.org/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
      f:spec:
        .: {}
        f:algorithm:
          .: {}
          f:algorithmName: {}
        f:maxFailedTrialCount: {}
        f:maxTrialCount: {}
        f:metricsCollectorSpec:
          .: {}
          f:collector:
            .: {}
            f:kind: {}
          f:source:
            .: {}
            f:fileSystemPath:
              .: {}
              f:kind: {}
              f:path: {}
        f:objective:
          .: {}
          f:goal: {}
          f:objectiveMetricName: {}
          f:type: {}
        f:parallelTrialCount: {}
        f:parameters: {}
        f:trialTemplate:
          .: {}
          f:primaryContainerName: {}
          f:trialParameters: {}
          f:trialSpec:
            .: {}
            f:apiVersion: {}
            f:kind: {}
            f:spec:
              .: {}
              f:tfReplicaSpecs:
                .: {}
                f:Worker:
                  .: {}
                  f:replicas: {}
                  f:restartPolicy: {}
                  f:template:
                    .: {}
                    f:metadata:
                      .: {}
                      f:annotations:
                        .: {}
                        f:sidecar.istio.io/inject: {}
                    f:spec:
                      .: {}
                      f:containers: {}
    manager: kubectl-client-side-apply
    operation: Update
    time: "2021-01-31T09:34:38Z"
  - apiVersion: kubeflow.org/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers: {}
      f:status:
        .: {}
        f:conditions: {}
        f:currentOptimalTrial:
          .: {}
          f:bestTrialName: {}
          f:observation:
            .: {}
            f:metrics: {}
          f:parameterAssignments: {}
        f:runningTrialList: {}
        f:startTime: {}
        f:trials: {}
        f:trialsRunning: {}
    manager: katib-controller
    operation: Update
    time: "2021-01-31T09:34:55Z"
  name: tfjob-example
  namespace: kubeflow
  resourceVersion: "5381129"
  uid: e6aedc20-d3ed-4829-ba49-c2a957427249
spec:
  algorithm:
    algorithmName: random
  maxFailedTrialCount: 3
  maxTrialCount: 12
  metricsCollectorSpec:
    collector:
      kind: TensorFlowEvent
    source:
      fileSystemPath:
        kind: Directory
        path: /train
  objective:
    goal: 0.99
    objectiveMetricName: accuracy_1
    type: maximize
  parallelTrialCount: 3
  parameters:
  - feasibleSpace:
      max: "0.05"
      min: "0.01"
    name: learning_rate
    parameterType: double
  - feasibleSpace:
      max: "200"
      min: "100"
    name: batch_size
    parameterType: int
  trialTemplate:
    primaryContainerName: tensorflow
    trialParameters:
    - description: Learning rate for the training model
      name: learningRate
      reference: learning_rate
    - description: Batch Size
      name: batchSize
      reference: batch_size
    trialSpec:
      apiVersion: kubeflow.org/v1
      kind: TFJob
      spec:
        tfReplicaSpecs:
          Worker:
            replicas: 2
            restartPolicy: OnFailure
            template:
              metadata:
                annotations:
                  sidecar.istio.io/inject: "false"
              spec:
                containers:
                - command:
                  - python
                  - /var/tf_mnist/mnist_with_summaries.py
                  - --log_dir=/train/metrics
                  - --learning_rate=${trialParameters.learningRate}
                  - --batch_size=${trialParameters.batchSize}
                  image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
                  imagePullPolicy: Always
                  name: tensorflow
status:
  conditions:
  - lastTransitionTime: "2021-01-31T09:34:38Z"
    lastUpdateTime: "2021-01-31T09:34:38Z"
    message: Experiment is created
    reason: ExperimentCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2021-01-31T09:34:54Z"
    lastUpdateTime: "2021-01-31T09:34:54Z"
    message: Experiment is running
    reason: ExperimentRunning
    status: "True"
    type: Running
  currentOptimalTrial:
    bestTrialName: ""
    observation:
      metrics: null
    parameterAssignments: null
  runningTrialList:
  - tfjob-example-9sxb2jtg
  - tfjob-example-jtf9d96w
  - tfjob-example-twd8tsdk
  startTime: "2021-01-31T09:34:38Z"
  trials: 3
  trialsRunning: 3
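The status block above contains the key evidence: the latest condition is Running, every trial is still in runningTrialList, and currentOptimalTrial has no metrics. The Python sketch below encodes that reading as a check. The status is copied into a plain dict to avoid a YAML dependency, and the helper names are hypothetical.

```python
# Status section copied from the Experiment output above.
STATUS = {
    "conditions": [
        {
            "type": "Created",
            "status": "True",
            "reason": "ExperimentCreated",
            "lastTransitionTime": "2021-01-31T09:34:38Z",
        },
        {
            "type": "Running",
            "status": "True",
            "reason": "ExperimentRunning",
            "lastTransitionTime": "2021-01-31T09:34:54Z",
        },
    ],
    "currentOptimalTrial": {
        "bestTrialName": "",
        "observation": {"metrics": None},
        "parameterAssignments": None,
    },
    "runningTrialList": [
        "tfjob-example-9sxb2jtg",
        "tfjob-example-jtf9d96w",
        "tfjob-example-twd8tsdk",
    ],
    "trials": 3,
    "trialsRunning": 3,
}


def latest_condition(status):
    """Type of the most recent condition whose status is "True"."""
    active = [c for c in status["conditions"] if c["status"] == "True"]
    # Timestamps share one RFC 3339 format, so string comparison orders them.
    return max(active, key=lambda c: c["lastTransitionTime"])["type"]


def has_observation(status):
    """True if the controller has recorded any metrics for a best trial."""
    optimal = status.get("currentOptimalTrial") or {}
    metrics = (optimal.get("observation") or {}).get("metrics")
    return bool(metrics)


def diagnose(status):
    """Summarize the fields that matter for the 'stuck in Running' symptom."""
    return {
        "latest_condition": latest_condition(status),
        "all_trials_running": status["trials"] == status["trialsRunning"],
        "metrics_reported": has_observation(status),
    }


if __name__ == "__main__":
    print(diagnose(STATUS))
```

For this experiment the check reports that every trial is still counted as running and no metrics were reported, which agrees with the empty observation_logs table.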
Environment:
- Kubeflow version:
kfctl v1.2.0-0-gbc038f9
- Kubernetes version:
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.1", GitCommit:"c4d752765b3bbac2237bf87cf0b1c2e307844666", GitTreeState:"clean", BuildDate:"2020-12-18T12:09:25Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.2", GitCommit:"faecb196815e248d3ecfb03c680a4507229c2a56", GitTreeState:"clean", BuildDate:"2021-01-13T13:20:00Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
- OS :
Ubuntu 20.04.1 LTS
Issue Analytics
- State:
- Created 3 years ago
- Comments:13 (2 by maintainers)
Top GitHub Comments
Thanks @Gorosia. I will close this issue.
@azarezade Check here 👍 I fixed an error that was the same as yours.