Kubeflow 1.0.2 Katib NAS default example fails due to overwritten container

/kind bug

What steps did you take and what happened:

Using a vanilla Kubeflow 1.0.2 install.
Click on NAS->Submit->Params
Run the default NAS example in a namespace.

The trial pods fail due to a change in the underlying mnist dockerfile.

k logs nasrl-example-7fs7n5db-g28rq metrics-logger-and-collector
I1022 10:08:22.860482      48 main.go:85] Trial Name: nasrl-example-7fs7n5db
I1022 10:08:26.253857      48 main.go:79] usage: mnist.py [-h] [--num-classes NUM_CLASSES] [--num-examples NUM_EXAMPLES]
I1022 10:08:26.253903      48 main.go:79]                 [--add_stn] [--image_shape IMAGE_SHAPE] [--network NETWORK]
I1022 10:08:26.253923      48 main.go:79]                 [--num-layers NUM_LAYERS] [--gpus GPUS] [--kv-store KV_STORE]
I1022 10:08:26.253930      48 main.go:79]                 [--num-epochs NUM_EPOCHS] [--lr LR] [--lr-factor LR_FACTOR]
I1022 10:08:26.253944      48 main.go:79]                 [--lr-step-epochs LR_STEP_EPOCHS] [--initializer INITIALIZER]
I1022 10:08:26.253951      48 main.go:79]                 [--optimizer OPTIMIZER] [--mom MOM] [--wd WD]
I1022 10:08:26.253963      48 main.go:79]                 [--batch-size BATCH_SIZE] [--disp-batches DISP_BATCHES]
I1022 10:08:26.253969      48 main.go:79]                 [--model-prefix MODEL_PREFIX] [--save-period SAVE_PERIOD]
I1022 10:08:26.253983      48 main.go:79]                 [--monitor MONITOR] [--load-epoch LOAD_EPOCH] [--top-k TOP_K]
I1022 10:08:26.253991      48 main.go:79]                 [--loss LOSS] [--test-io TEST_IO] [--dtype DTYPE]
I1022 10:08:26.254005      48 main.go:79]                 [--gc-type GC_TYPE] [--gc-threshold GC_THRESHOLD]
I1022 10:08:26.254017      48 main.go:79]                 [--macrobatch-size MACROBATCH_SIZE]
I1022 10:08:26.254031      48 main.go:79]                 [--warmup-epochs WARMUP_EPOCHS]
I1022 10:08:26.254038      48 main.go:79]                 [--warmup-strategy WARMUP_STRATEGY]
I1022 10:08:26.254051      48 main.go:79]                 [--profile-worker-suffix PROFILE_WORKER_SUFFIX]
I1022 10:08:26.254058      48 main.go:79]                 [--profile-server-suffix PROFILE_SERVER_SUFFIX]
I1022 10:08:26.254083      48 main.go:79]                 [--use-imagenet-data-augmentation USE_IMAGENET_DATA_AUGMENTATION]
I1022 10:08:26.254091      48 main.go:79] mnist.py: error: unrecognized arguments: architecture=[[100], [63, 1], [38, 0, 0], [7, 0, 1, 1], [56, 1, 0, 1, 0], [96, 1, 1, 1, 1, 0], [14, 1, 1, 1, 1, 0, 1], [6, 0, 1, 1, 1, 1, 1, 0]] nn_config={num_layers: 8, input_sizes: [32, 32, 3], output_sizes: [10], embedding: {100: {opt_id: 100, opt_type: depthwise_convolution, opt_params: {filter_size: 7, stride: 2, depth_multiplier: 1}}, 63: {opt_id: 63, opt_type: separable_convolution, opt_params: {filter_size: 5, num_filter: 96, stride: 1, depth_multiplier: 2}}, 38: {opt_id: 38, opt_type: separable_convolution, opt_params: {filter_size: 3, num_filter: 64, stride: 1, depth_multiplier: 1}}, 7: {opt_id: 7, opt_type: convolution, opt_params: {filter_size: 3, num_filter: 96, stride: 2}}, 56: {opt_id: 56, opt_type: separable_convolution, opt_params: {filter_size: 5, num_filter: 48, stride: 2, depth_multiplier: 1}}, 96: {opt_id: 96, opt_type: depthwise_convolution, opt_params: {filter_size: 5, stride: 2, depth_multiplier: 1}}, 14: {opt_id: 14, opt_type: convolution, opt_params: {filter_size: 5, num_filter: 64, stride: 1}}, 6: {opt_id: 6, opt_type: convolution, opt_params: {filter_size: 3, num_filter: 96, stride: 1}}}}
F1022 10:08:26.862068      48 main.go:95] Failed to wait for worker container: Process 8 hadn't completed: open /var/log/katib/8.pid: no such file or directory
goroutine 1 [running]:
github.com/kubeflow/katib/vendor/k8s.io/klog.stacks(0xc0001e6100, 0xc0002ec000, 0xa0, 0xf5)
	/go/src/github.com/kubeflow/katib/vendor/k8s.io/klog/klog.go:830 +0xb8
github.com/kubeflow/katib/vendor/k8s.io/klog.(*loggingT).output(0x129da40, 0xc000000003, 0xc0002c8000, 0x12378d6, 0x7, 0x5f, 0x0)
	/go/src/github.com/kubeflow/katib/vendor/k8s.io/klog/klog.go:781 +0x2d0
github.com/kubeflow/katib/vendor/k8s.io/klog.(*loggingT).printf(0x129da40, 0x3, 0xc78f77, 0x27, 0xc0000d1ed8, 0x1, 0x1)
	/go/src/github.com/kubeflow/katib/vendor/k8s.io/klog/klog.go:678 +0x14b
github.com/kubeflow/katib/vendor/k8s.io/klog.Fatalf(...)
	/go/src/github.com/kubeflow/katib/vendor/k8s.io/klog/klog.go:1209
main.main()
	/go/src/github.com/kubeflow/katib/cmd/metricscollector/v1alpha3/file-metricscollector/main.go:95 +0x279

I tried to find an older version of the container here: https://hub.docker.com/r/kubeflowkatib/mxnet-mnist/tags, but only one exists, and it was updated 3 months ago. This is likely why it is failing now.
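As a rough illustration of the failure in the log above: the `unrecognized arguments` message is what Python's `argparse` emits when a script receives arguments its parser does not define, after which the process exits with status 2. The sketch below is only an approximation. `build_parser` is a hypothetical stand-in that declares a few of the flags from the usage output, not the real `mnist.py`, and the argument values are shortened from the log.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the updated mnist.py: it declares a few of
    # the flags shown in the usage output and no NAS-specific arguments.
    parser = argparse.ArgumentParser(prog="mnist.py")
    parser.add_argument("--num-layers", type=int)
    parser.add_argument("--lr", type=float)
    parser.add_argument("--num-epochs", type=int)
    return parser


def unknown_args(argv):
    # parse_known_args returns the leftovers instead of exiting.
    _, unknown = build_parser().parse_known_args(argv)
    return unknown


def strict_exit_code(argv):
    # parse_args prints the usage plus an error and raises SystemExit(2)
    # when any argument is left over.
    try:
        build_parser().parse_args(argv)
    except SystemExit as exc:
        return exc.code
    return 0


# Shortened versions of the values seen in the trial log.
trial_args = [
    "--lr", "0.01",
    "architecture=[[100], [63, 1]]",
    "nn_config={num_layers: 8}",
]

print(unknown_args(trial_args))
print(strict_exit_code(trial_args))
```

This matches the symptom in the log: the training script appears to exit before doing any work, and the metrics collector then fails while waiting for the worker process.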

What did you expect to happen:

The default examples should work out of the box. That’s going to be hard to fix now, because the container tag isn’t set.

Ideally, in the future, I’d like properly tagged examples that work forever.
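To make "properly tagged" concrete, here is a minimal sketch of a check that flags image references which can silently change underneath an example, meaning those with no tag or with the `latest` tag. The helper names are hypothetical, the parsing is deliberately simplified, and the tag value in the usage lines is made up for illustration rather than taken from a real published tag.

```python
def image_tag(image: str):
    """Return (tag, digest) for an image reference; either may be None."""
    # A digest, if present, follows "@".
    name, _, digest = image.partition("@")
    # Only look for ":" in the last path component, so that a registry
    # port such as "localhost:5000/..." is not mistaken for a tag.
    last_component = name.rsplit("/", 1)[-1]
    tag = last_component.partition(":")[2] or None
    return tag, (digest or None)


def is_pinned(image: str) -> bool:
    """True if the reference names a digest or a tag other than 'latest'."""
    tag, digest = image_tag(image)
    return digest is not None or (tag is not None and tag != "latest")


# The untagged image from this issue versus a hypothetical tagged one.
print(is_pinned("kubeflowkatib/mxnet-mnist"))
print(is_pinned("kubeflowkatib/mxnet-mnist:v0.0.0-example"))
```

Note that a tag can still be overwritten by the publisher; only a digest reference identifies fixed image content.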

In the meantime, can you suggest the best way of getting a working out-of-the-box NAS example (CPU preferable, for testing), on the 1.0.2 version of Kubeflow?

Thanks, Phil

Top GitHub Comments

andreyvelich commented, Nov 13, 2020

@philwinder I'm closing this issue; feel free to re-open it if you have any other questions.

andreyvelich commented, Nov 3, 2020

> Hi @andreyvelich. Yes, it was the nas-template. The problem is that the container has been overwritten to work with KF 1.1, like you said, which has broken the previous example.

Sorry for that. We have added tags to training container images in this PR: https://github.com/kubeflow/katib/pull/1372 to avoid this problem.

> > use at least Kubeflow 1.1.
>
> Yes, that’s a valid solution, but this particular cluster is stuck on 1.0.2 for the moment.
>
> Thanks again!

Just for your information, you can update the Katib version in your Kubeflow cluster without deleting the other Kubeflow components.

Delete all Katib experiments: kubectl delete experiment --all-namespaces --all
Use these manifests: https://github.com/kubeflow/katib/tree/master/manifests/v1beta1 to delete and then redeploy the Katib components. The Kubeflow namespace should not be re-created.
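The first step above is destructive, so it may be worth listing the experiments before deleting them. The following is a minimal sketch under that assumption: the function names are hypothetical, the listing command is an addition not mentioned in the thread, and the manifest step is left out because the thread does not give an exact command for it. By default nothing is executed; the commands are only printed.

```python
import shlex
import subprocess


def katib_cleanup_commands():
    # The first command is an added safety check (an assumption, not from
    # the thread); the second is the delete command quoted above.
    return [
        ["kubectl", "get", "experiment", "--all-namespaces"],
        ["kubectl", "delete", "experiment", "--all-namespaces", "--all"],
    ]


def run(commands, dry_run=True):
    """Return the commands as shell strings; execute them unless dry_run."""
    lines = []
    for argv in commands:
        lines.append(shlex.join(argv))
        if not dry_run:
            # check=True stops at the first failing command.
            subprocess.run(argv, check=True)
    return lines


for line in run(katib_cleanup_commands()):
    print(line)
```

Calling `run(katib_cleanup_commands(), dry_run=False)` would execute the commands against whatever cluster the current kubectl context points at.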