Experiment never ends and metrics are not collected
/kind bug
What steps did you take and what happened:
I started by deploying Kubeflow using the UI (https://deploy.kubeflow.cloud). Then I simply applied the random example with kubectl apply -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha3/random-example.yaml
The trial containers have completed, unlike the experiment, which is still running.
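To show what I mean, this is roughly how I checked the status (a small sketch; it assumes the example keeps its default kubeflow namespace and the default experiment name random-example):

```sh
# The experiment stays in Running even though every trial has Succeeded
kubectl -n kubeflow get experiment random-example
kubectl -n kubeflow get trials
kubectl -n kubeflow get pods | grep random-example
```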
I also tried adding a StdOut metricsCollectorSpec. Then (and only then) I get two containers per trial pod (one for the actual model training, the other for collecting the metrics), but no metrics are collected by the latter, and the experiment still never ends.
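For reference, the change I made looks roughly like this (only a sketch of the relevant v1alpha3 fields; everything else stays as in random-example.yaml):

```yaml
apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
metadata:
  namespace: kubeflow
  name: random-example
spec:
  # Explicit StdOut collector (also the default when metricsCollectorSpec is omitted)
  metricsCollectorSpec:
    collector:
      kind: StdOut
  # objective, algorithm, parameters and trialTemplate are unchanged
  # from the original random-example.yaml
```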
What did you expect to happen:
Metrics to be collected from the stdout and plotted on the Kubeflow UI.
Anything else you would like to add:
The logs of the container that is supposed to train the model are empty, whereas the logs of the metrics-logger-and-collector container are:
```
I0204 14:12:28.774382 19 main.go:83] Trial Name: random-example-hzchjbjd
I0204 14:12:29.547190 19 main.go:77] 2020-02-04T14:12:29Z INFO start with arguments Namespace(add_stn=False, batch_size=64, disp_batches=100, dtype='float32', gc_threshold=0.5, gc_type='none', gpus=None, image_shape='1, 28, 28', initializer='default', kv_store='device', load_epoch=None, loss='', lr=0.011479313940528528, lr_factor=0.1, lr_step_epochs='10', macrobatch_size=0, model_prefix=None, mom=0.9, monitor=0, network='mlp', num_classes=10, num_epochs=10, num_examples=60000, num_layers=4, optimizer='adam', profile_server_suffix='', profile_worker_suffix='', save_period=1, test_io=0, top_k=0, use_imagenet_data_augmentation=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
I0204 14:12:29.554785 19 main.go:77] 2020-02-04T14:12:29Z DEBUG Starting new HTTP connection (1): yann.lecun.com:80
I0204 14:12:29.721292 19 main.go:77] 2020-02-04T14:12:29Z DEBUG http://yann.lecun.com:80 "GET /exdb/mnist/train-labels-idx1-ubyte.gz HTTP/1.1" 200 28881
I0204 14:12:29.808254 19 main.go:77] 2020-02-04T14:12:29Z DEBUG Starting new HTTP connection (1): yann.lecun.com:80
I0204 14:12:29.985639 19 main.go:77] 2020-02-04T14:12:29Z DEBUG http://yann.lecun.com:80 "GET /exdb/mnist/train-images-idx3-ubyte.gz HTTP/1.1" 200 9912422
I0204 14:12:31.340536 19 main.go:77] 2020-02-04T14:12:31Z DEBUG Starting new HTTP connection (1): yann.lecun.com:80
I0204 14:12:31.507767 19 main.go:77] 2020-02-04T14:12:31Z DEBUG http://yann.lecun.com:80 "GET /exdb/mnist/t10k-labels-idx1-ubyte.gz HTTP/1.1" 200 4542
I0204 14:12:31.515053 19 main.go:77] 2020-02-04T14:12:31Z DEBUG Starting new HTTP connection (1): yann.lecun.com:80
I0204 14:12:31.700887 19 main.go:77] 2020-02-04T14:12:31Z DEBUG http://yann.lecun.com:80 "GET /exdb/mnist/t10k-images-idx3-ubyte.gz HTTP/1.1" 200 1648877
I0204 14:12:33.214310 19 main.go:77] 2020-02-04T14:12:33Z INFO Epoch[0] Batch [0-100] Speed: 16440.95 samples/sec accuracy=0.827042
I0204 14:12:33.547301 19 main.go:77] 2020-02-04T14:12:33Z INFO Epoch[0] Batch [100-200] Speed: 19240.75 samples/sec accuracy=0.919531
…
I0204 14:13:56.447298 19 main.go:77] 2020-02-04T14:13:56Z INFO Epoch[9] Batch [800-900] Speed: 6905.49 samples/sec accuracy=0.962031
I0204 14:13:56.782314 19 main.go:77] 2020-02-04T14:13:56Z INFO Epoch[9] Train-accuracy=0.960638
I0204 14:13:56.782534 19 main.go:77] 2020-02-04T14:13:56Z INFO Epoch[9] Time cost=9.510
I0204 14:13:57.654289 19 main.go:77] 2020-02-04T14:13:57Z INFO Epoch[9] Validation-accuracy=0.961385
I0204 14:13:58.782090 19 main.go:118] Metrics reported.
```
The training is not supposed to happen inside the metrics-collector container, right?
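For completeness, this is how I listed the containers of one trial pod (the pod name below is a placeholder; pick a real one from kubectl get pods -n kubeflow):

```sh
# With sidecar injection working, this should print the training container
# plus metrics-logger-and-collector
kubectl -n kubeflow get pod <random-example-trial-pod> \
  -o jsonpath='{range .spec.containers[*]}{.name}{"\n"}{end}'
```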
Environment:
- Kubeflow version: 0.7.1
- Minikube version:
- Kubernetes version (use kubectl version):

```
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.2", GitCommit:"59603c6e503c87169aea6106f57b9f242f64df89", GitTreeState:"clean", BuildDate:"2020-01-18T23:30:10Z", GoVersion:"go1.13.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.8-gke.33", GitCommit:"2c6d0ee462cee7609113bf9e175c107599d5213f", GitTreeState:"clean", BuildDate:"2020-01-15T17:47:46Z", GoVersion:"go1.12.11b4", Compiler:"gc", Platform:"linux/amd64"}
```
- OS (e.g. from /etc/os-release): Ubuntu 18.04
Update
I just tried with kubectl v1.14.8 and v1.15.7, but the same problem occurs.
Top GitHub Comments
It finally works!
All the steps I have taken are:
- used the kfctl release from https://github.com/kubeflow/kfctl/releases/tag/v1.0-rc.4
- removed the control-plane label (see the commands sketched below)

Thanks @andreyvelich for the help!
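For the control-plane label step above, a possible way to inspect and remove it (the kubeflow namespace and the exact control-plane label key are assumptions here; check the namespace labels first):

```sh
# Show the labels currently set on the namespace where the experiment runs
kubectl get namespace kubeflow --show-labels

# Remove a label by appending "-" to its key; the key "control-plane" is
# assumed from the comment above and may differ in your deployment
kubectl label namespace kubeflow control-plane-
```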
I have been using Kubeflow v1.3.0 and initially ran into the same issue. After changing the image version of the metrics-collector-sidecar from v0.11.0 to the latest in katib-config, Katib works well.
https://github.com/kubeflow/manifests/blob/v1.3-branch/apps/katib/upstream/components/controller/katib-config.yaml
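In that file the metrics-collector-sidecar entry of the katib-config ConfigMap maps each collector kind to a sidecar image, so the fix amounts to bumping those image tags. A rough sketch (the v0.12.0 tags here are only an example of "the latest"):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: katib-config
  namespace: kubeflow
data:
  metrics-collector-sidecar: |-
    {
      "StdOut": {
        "image": "docker.io/kubeflowkatib/file-metrics-collector:v0.12.0"
      },
      "File": {
        "image": "docker.io/kubeflowkatib/file-metrics-collector:v0.12.0"
      },
      "TfEvent": {
        "image": "docker.io/kubeflowkatib/tfevent-metrics-collector:v0.12.0"
      }
    }
```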