Kubeflow 1.0.2 Katib NAS default example fails due to overwritten container
/kind bug
What steps did you take and what happened:
- Using a vanilla Kubeflow 1.0.2 install.
- Click on NAS->Submit->Params
- Run the default NAS example in a namespace.
The trial pods fail due to a change in the underlying mnist dockerfile.
kubectl logs nasrl-example-7fs7n5db-g28rq -c metrics-logger-and-collector
I1022 10:08:22.860482 48 main.go:85] Trial Name: nasrl-example-7fs7n5db
I1022 10:08:26.253857 48 main.go:79] usage: mnist.py [-h] [--num-classes NUM_CLASSES] [--num-examples NUM_EXAMPLES]
I1022 10:08:26.253903 48 main.go:79] [--add_stn] [--image_shape IMAGE_SHAPE] [--network NETWORK]
I1022 10:08:26.253923 48 main.go:79] [--num-layers NUM_LAYERS] [--gpus GPUS] [--kv-store KV_STORE]
I1022 10:08:26.253930 48 main.go:79] [--num-epochs NUM_EPOCHS] [--lr LR] [--lr-factor LR_FACTOR]
I1022 10:08:26.253944 48 main.go:79] [--lr-step-epochs LR_STEP_EPOCHS] [--initializer INITIALIZER]
I1022 10:08:26.253951 48 main.go:79] [--optimizer OPTIMIZER] [--mom MOM] [--wd WD]
I1022 10:08:26.253963 48 main.go:79] [--batch-size BATCH_SIZE] [--disp-batches DISP_BATCHES]
I1022 10:08:26.253969 48 main.go:79] [--model-prefix MODEL_PREFIX] [--save-period SAVE_PERIOD]
I1022 10:08:26.253983 48 main.go:79] [--monitor MONITOR] [--load-epoch LOAD_EPOCH] [--top-k TOP_K]
I1022 10:08:26.253991 48 main.go:79] [--loss LOSS] [--test-io TEST_IO] [--dtype DTYPE]
I1022 10:08:26.254005 48 main.go:79] [--gc-type GC_TYPE] [--gc-threshold GC_THRESHOLD]
I1022 10:08:26.254017 48 main.go:79] [--macrobatch-size MACROBATCH_SIZE]
I1022 10:08:26.254031 48 main.go:79] [--warmup-epochs WARMUP_EPOCHS]
I1022 10:08:26.254038 48 main.go:79] [--warmup-strategy WARMUP_STRATEGY]
I1022 10:08:26.254051 48 main.go:79] [--profile-worker-suffix PROFILE_WORKER_SUFFIX]
I1022 10:08:26.254058 48 main.go:79] [--profile-server-suffix PROFILE_SERVER_SUFFIX]
I1022 10:08:26.254083 48 main.go:79] [--use-imagenet-data-augmentation USE_IMAGENET_DATA_AUGMENTATION]
I1022 10:08:26.254091 48 main.go:79] mnist.py: error: unrecognized arguments: architecture=[[100], [63, 1], [38, 0, 0], [7, 0, 1, 1], [56, 1, 0, 1, 0], [96, 1, 1, 1, 1, 0], [14, 1, 1, 1, 1, 0, 1], [6, 0, 1, 1, 1, 1, 1, 0]] nn_config={num_layers: 8, input_sizes: [32, 32, 3], output_sizes: [10], embedding: {100: {opt_id: 100, opt_type: depthwise_convolution, opt_params: {filter_size: 7, stride: 2, depth_multiplier: 1}}, 63: {opt_id: 63, opt_type: separable_convolution, opt_params: {filter_size: 5, num_filter: 96, stride: 1, depth_multiplier: 2}}, 38: {opt_id: 38, opt_type: separable_convolution, opt_params: {filter_size: 3, num_filter: 64, stride: 1, depth_multiplier: 1}}, 7: {opt_id: 7, opt_type: convolution, opt_params: {filter_size: 3, num_filter: 96, stride: 2}}, 56: {opt_id: 56, opt_type: separable_convolution, opt_params: {filter_size: 5, num_filter: 48, stride: 2, depth_multiplier: 1}}, 96: {opt_id: 96, opt_type: depthwise_convolution, opt_params: {filter_size: 5, stride: 2, depth_multiplier: 1}}, 14: {opt_id: 14, opt_type: convolution, opt_params: {filter_size: 5, num_filter: 64, stride: 1}}, 6: {opt_id: 6, opt_type: convolution, opt_params: {filter_size: 3, num_filter: 96, stride: 1}}}}
F1022 10:08:26.862068 48 main.go:95] Failed to wait for worker container: Process 8 hadn't completed: open /var/log/katib/8.pid: no such file or directory
goroutine 1 [running]:
github.com/kubeflow/katib/vendor/k8s.io/klog.stacks(0xc0001e6100, 0xc0002ec000, 0xa0, 0xf5)
/go/src/github.com/kubeflow/katib/vendor/k8s.io/klog/klog.go:830 +0xb8
github.com/kubeflow/katib/vendor/k8s.io/klog.(*loggingT).output(0x129da40, 0xc000000003, 0xc0002c8000, 0x12378d6, 0x7, 0x5f, 0x0)
/go/src/github.com/kubeflow/katib/vendor/k8s.io/klog/klog.go:781 +0x2d0
github.com/kubeflow/katib/vendor/k8s.io/klog.(*loggingT).printf(0x129da40, 0x3, 0xc78f77, 0x27, 0xc0000d1ed8, 0x1, 0x1)
/go/src/github.com/kubeflow/katib/vendor/k8s.io/klog/klog.go:678 +0x14b
github.com/kubeflow/katib/vendor/k8s.io/klog.Fatalf(...)
/go/src/github.com/kubeflow/katib/vendor/k8s.io/klog/klog.go:1209
main.main()
/go/src/github.com/kubeflow/katib/cmd/metricscollector/v1alpha3/file-metricscollector/main.go:95 +0x279
I tried to find an older version of the container here: https://hub.docker.com/r/kubeflowkatib/mxnet-mnist/tags, but only one exists, and it was updated 3 months ago. This is likely why it is failing now.
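The `usage:` and `error: unrecognized arguments` lines in the log match the standard output format of Python's `argparse`, which exits with status 2 when it is handed a flag the script does not define. The following is a minimal sketch of that failure mode, not the actual `mnist.py`: the parser below is a hypothetical stand-in that reproduces only two of the flags from the usage message, and the helper names `unknown_arguments` and `exit_code` are made up for illustration.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the mnist.py parser: only two of the flags
    # listed in the usage message are reproduced, with assumed defaults.
    parser = argparse.ArgumentParser(prog="mnist.py")
    parser.add_argument("--num-epochs", type=int, default=10)
    parser.add_argument("--lr", type=float, default=0.05)
    return parser


def unknown_arguments(argv: List[str]) -> List[str]:
    """Return the arguments that the parser does not recognize."""
    _, unknown = build_parser().parse_known_args(argv)
    return unknown


def exit_code(argv: List[str]) -> int:
    """Return the exit status strict parsing would produce (0 on success)."""
    try:
        build_parser().parse_args(argv)
    except SystemExit as exc:
        # argparse prints usage plus the error to stderr, then exits.
        return int(exc.code)
    return 0


if __name__ == "__main__":
    # Shortened versions of the NAS arguments seen in the log.
    nas_args = ["--architecture=[[100]]", "--nn_config={num_layers:8}"]
    print(unknown_arguments(nas_args))
    print(exit_code(nas_args))
```

Under this reading, the trial fails before any training starts: the NAS suggestion passes `architecture` and `nn_config`, but the script inside the image no longer accepts them.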
What did you expect to happen:
The default examples should work out of the box. That’s going to be hard to fix now, because the container tag isn’t set.
Ideally, in the future, I’d like properly tagged examples that work forever.
In the meantime, can you suggest the best way of getting a working out-of-the-box NAS example (CPU preferable, for testing), on the 1.0.2 version of Kubeflow?
Thanks, Phil
Issue Analytics
- Created 3 years ago
- Comments: 6 (3 by maintainers)
Top GitHub Comments
@philwinder I'm closing this issue; feel free to re-open it if you have any other questions.
Sorry for that. We have added tags to training container images in this PR: https://github.com/kubeflow/katib/pull/1372 to avoid this problem.
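To illustrate what a tagged image reference looks like in practice, here is a minimal Python sketch that checks whether a reference is pinned to an explicit tag or digest. The helper `is_pinned` and the example tag `v0.0.0-example` are hypothetical and not part of Katib; the parsing is a simplification of the image reference syntax, not a full implementation.

```python
def is_pinned(image: str) -> bool:
    """Return True if the reference has a digest or a tag other than 'latest'."""
    if "@" in image:
        # Digest form (name@sha256:...), which always resolves to the same image.
        return True
    # Look for a tag only in the last path component, so that a registry
    # port such as localhost:5000/name is not mistaken for a tag.
    last_component = image.rsplit("/", 1)[-1]
    if ":" not in last_component:
        # No tag at all: the runtime falls back to 'latest'.
        return False
    tag = last_component.rsplit(":", 1)[1]
    # An explicit tag can still be overwritten by the publisher;
    # only a digest is guaranteed not to change.
    return tag not in ("", "latest")


if __name__ == "__main__":
    for ref in (
        "kubeflowkatib/mxnet-mnist",                 # untagged, as in this issue
        "kubeflowkatib/mxnet-mnist:v0.0.0-example",  # hypothetical tag
    ):
        print(ref, is_pinned(ref))
```

A check like this could be run over the image fields of example manifests before a release, so that an untagged reference is caught before the image behind it changes.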
For your information, you can update the Katib version in your Kubeflow cluster without deleting the other Kubeflow components.
Delete all Katib experiments:
kubectl delete experiment --all-namespaces --all
Use these manifests: https://github.com/kubeflow/katib/tree/master/manifests/v1beta1 to delete and then redeploy the Katib components. The Kubeflow namespace should not be re-created.