Experiment never ends and metrics are not collected
/kind bug
What steps did you take and what happened:
I started by deploying Kubeflow using the UI (https://deploy.kubeflow.cloud). Then I simply applied the random example with kubectl apply -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha3/random-example.yaml
The trial containers have completed, unlike the experiment, which is still running.
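To show what I mean, this is roughly how I checked the status (a small sketch; it assumes the example keeps its default kubeflow namespace and the default experiment name random-example):

```sh
# The experiment stays in Running even though every trial has Succeeded
kubectl -n kubeflow get experiment random-example
kubectl -n kubeflow get trials
kubectl -n kubeflow get pods | grep random-example
```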
I also tried adding a StdOut metricsCollectorSpec. Then (and only then) I get two containers per trial pod (one for the actual model training, the other for collecting the metrics), but no metrics are collected by the latter, and the experiment still never ends.
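For reference, the change I made looks roughly like this (only a sketch of the relevant v1alpha3 fields; everything else stays as in random-example.yaml):

```yaml
apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
metadata:
  namespace: kubeflow
  name: random-example
spec:
  # Explicit StdOut collector (also the default when metricsCollectorSpec is omitted)
  metricsCollectorSpec:
    collector:
      kind: StdOut
  # objective, algorithm, parameters and trialTemplate are unchanged
  # from the original random-example.yaml
```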
What did you expect to happen:
Metrics to be collected from the stdout and plotted on the Kubeflow UI.
Anything else you would like to add:
The logs of the container that is supposed to train the model are empty, whereas the logs of the metrics-logger-and-collector container are:
```
I0204 14:12:28.774382 19 main.go:83] Trial Name: random-example-hzchjbjd
I0204 14:12:29.547190 19 main.go:77] 2020-02-04T14:12:29Z INFO start with arguments Namespace(add_stn=False, batch_size=64, disp_batches=100, dtype='float32', gc_threshold=0.5, gc_type='none', gpus=None, image_shape='1, 28, 28', initializer='default', kv_store='device', load_epoch=None, loss='', lr=0.011479313940528528, lr_factor=0.1, lr_step_epochs='10', macrobatch_size=0, model_prefix=None, mom=0.9, monitor=0, network='mlp', num_classes=10, num_epochs=10, num_examples=60000, num_layers=4, optimizer='adam', profile_server_suffix='', profile_worker_suffix='', save_period=1, test_io=0, top_k=0, use_imagenet_data_augmentation=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
I0204 14:12:29.554785 19 main.go:77] 2020-02-04T14:12:29Z DEBUG Starting new HTTP connection (1): yann.lecun.com:80
I0204 14:12:29.721292 19 main.go:77] 2020-02-04T14:12:29Z DEBUG http://yann.lecun.com:80 "GET /exdb/mnist/train-labels-idx1-ubyte.gz HTTP/1.1" 200 28881
I0204 14:12:29.808254 19 main.go:77] 2020-02-04T14:12:29Z DEBUG Starting new HTTP connection (1): yann.lecun.com:80
I0204 14:12:29.985639 19 main.go:77] 2020-02-04T14:12:29Z DEBUG http://yann.lecun.com:80 "GET /exdb/mnist/train-images-idx3-ubyte.gz HTTP/1.1" 200 9912422
I0204 14:12:31.340536 19 main.go:77] 2020-02-04T14:12:31Z DEBUG Starting new HTTP connection (1): yann.lecun.com:80
I0204 14:12:31.507767 19 main.go:77] 2020-02-04T14:12:31Z DEBUG http://yann.lecun.com:80 "GET /exdb/mnist/t10k-labels-idx1-ubyte.gz HTTP/1.1" 200 4542
I0204 14:12:31.515053 19 main.go:77] 2020-02-04T14:12:31Z DEBUG Starting new HTTP connection (1): yann.lecun.com:80
I0204 14:12:31.700887 19 main.go:77] 2020-02-04T14:12:31Z DEBUG http://yann.lecun.com:80 "GET /exdb/mnist/t10k-images-idx3-ubyte.gz HTTP/1.1" 200 1648877
I0204 14:12:33.214310 19 main.go:77] 2020-02-04T14:12:33Z INFO Epoch[0] Batch [0-100] Speed: 16440.95 samples/sec accuracy=0.827042
I0204 14:12:33.547301 19 main.go:77] 2020-02-04T14:12:33Z INFO Epoch[0] Batch [100-200] Speed: 19240.75 samples/sec accuracy=0.919531
…
I0204 14:13:56.447298 19 main.go:77] 2020-02-04T14:13:56Z INFO Epoch[9] Batch [800-900] Speed: 6905.49 samples/sec accuracy=0.962031
I0204 14:13:56.782314 19 main.go:77] 2020-02-04T14:13:56Z INFO Epoch[9] Train-accuracy=0.960638
I0204 14:13:56.782534 19 main.go:77] 2020-02-04T14:13:56Z INFO Epoch[9] Time cost=9.510
I0204 14:13:57.654289 19 main.go:77] 2020-02-04T14:13:57Z INFO Epoch[9] Validation-accuracy=0.961385
I0204 14:13:58.782090 19 main.go:118] Metrics reported.
```
The training is not supposed to happen inside the metrics-collector container, right?
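For completeness, this is how I listed the containers of one trial pod (the pod name below is a placeholder; pick a real one from kubectl get pods -n kubeflow):

```sh
# With sidecar injection working, this should print the training container
# plus metrics-logger-and-collector
kubectl -n kubeflow get pod <random-example-trial-pod> \
  -o jsonpath='{range .spec.containers[*]}{.name}{"\n"}{end}'
```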
Environment:
- Kubeflow version: 0.7.1
- Minikube version:
- Kubernetes version (use kubectl version):

```
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.2", GitCommit:"59603c6e503c87169aea6106f57b9f242f64df89", GitTreeState:"clean", BuildDate:"2020-01-18T23:30:10Z", GoVersion:"go1.13.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.8-gke.33", GitCommit:"2c6d0ee462cee7609113bf9e175c107599d5213f", GitTreeState:"clean", BuildDate:"2020-01-15T17:47:46Z", GoVersion:"go1.12.11b4", Compiler:"gc", Platform:"linux/amd64"}
```
- OS (e.g. from /etc/os-release): Ubuntu 18.04
Update
I just tried with kubectl v1.14.8 and v1.15.7, but the same problem occurs.
Top GitHub Comments
It finally works!
All the steps I have taken are:
- used the kfctl release from https://github.com/kubeflow/kfctl/releases/tag/v1.0-rc.4
- removed the control-plane label (see the commands sketched below)

Thanks @andreyvelich for the help!
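For the control-plane label step above, a possible way to inspect and remove it (the kubeflow namespace and the exact control-plane label key are assumptions here; check the namespace labels first):

```sh
# Show the labels currently set on the namespace where the experiment runs
kubectl get namespace kubeflow --show-labels

# Remove a label by appending "-" to its key; the key "control-plane" is
# assumed from the comment above and may differ in your deployment
kubectl label namespace kubeflow control-plane-
```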
I have been using Kubeflow v1.3.0 and initially ran into the same issue. After changing the image version of the metrics-collector-sidecar from v0.11.0 to the latest in katib-config, Katib works well.
https://github.com/kubeflow/manifests/blob/v1.3-branch/apps/katib/upstream/components/controller/katib-config.yaml
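In that file the metrics-collector-sidecar entry of the katib-config ConfigMap maps each collector kind to a sidecar image, so the fix amounts to bumping those image tags. A rough sketch (the v0.12.0 tags here are only an example of "the latest"):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: katib-config
  namespace: kubeflow
data:
  metrics-collector-sidecar: |-
    {
      "StdOut": {
        "image": "docker.io/kubeflowkatib/file-metrics-collector:v0.12.0"
      },
      "File": {
        "image": "docker.io/kubeflowkatib/file-metrics-collector:v0.12.0"
      },
      "TfEvent": {
        "image": "docker.io/kubeflowkatib/tfevent-metrics-collector:v0.12.0"
      }
    }
```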