Katib doesn't support MPIJob
/kind bug
What steps did you take and what happened: I deployed Katib and mpi-operator in my local Kubernetes cluster:
kubectl get po -n kubeflow
NAME                                READY   STATUS    RESTARTS   AGE
katib-controller-b6dc87fcb-2lrtj    1/1     Running   0          26h
katib-db-manager-79fd46648b-scxx8   1/1     Running   0          2d3h
katib-mysql-7f8bc6956f-fxkgl        1/1     Running   0          13d
katib-ui-74bcbd8b75-bwppw           1/1     Running   0          13d
I then used kubectl to create an experiment whose trial template is an MPIJob. Creation failed with the following error:
Error from server: error when creating "tt-katib.yaml": admission webhook "validating.experiment.katib.kubeflow.org" denied the request: Invalid spec.trialTemplate: Job type kubeflow.org/v1alpha2, Kind=MPIJob not supported.
What did you expect to happen: Experiment created successfully, Trial and MPIJob can run properly.
Anything else you would like to add: Currently only Job, TFJob, and PyTorchJob are supported. Please consider supporting mpi-operator (MPIJob) as well.
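The rejection above behaves like an allow-list check on the trial template's kind. The following is only a minimal Python sketch of that idea, not Katib's actual (Go) webhook code: the function name `validate_trial_template` and the set `SUPPORTED_KINDS` are hypothetical, and the three kinds are the ones this issue says are supported.

```python
# Hypothetical illustration only; Katib's real validating webhook is written in Go.
# The supported kinds below are the ones named in this issue.
SUPPORTED_KINDS = frozenset({"Job", "TFJob", "PyTorchJob"})


def validate_trial_template(api_version, kind, supported=SUPPORTED_KINDS):
    """Raise ValueError if the trial template's kind is not in the allow-list."""
    if kind not in supported:
        # Message format mirrors the error reported in this issue.
        raise ValueError(
            "Invalid spec.trialTemplate: Job type "
            f"{api_version}, Kind={kind} not supported."
        )


if __name__ == "__main__":
    try:
        validate_trial_template("kubeflow.org/v1alpha2", "MPIJob")
    except ValueError as err:
        print(err)
```

Under this sketch, supporting MPIJob amounts to adding the kind to the allow-list, for example `validate_trial_template("kubeflow.org/v1alpha2", "MPIJob", SUPPORTED_KINDS | {"MPIJob"})`. The real change presumably involves more than this check.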
Environment:
- Kubernetes version (use kubectl version): 1.14.1
- OS (e.g. from /etc/os-release): Ubuntu 16.04.4
Issue Analytics
- State:
- Created 3 years ago
- Comments:6 (3 by maintainers)
Top GitHub Comments
@gaocegege Thanks for your reply.
OK, thanks to https://github.com/kubeflow/katib/issues/341, supporting MPIJob or other Kubeflow jobs is no longer that complicated. For MPIJob, the modifications are listed as follows:
I've run some tests; here are some results, just FYI. My experiment configuration is like this:
After 8 trials my experiment reached the Succeeded state. Its status detail is:
LGTM. Thanks @YuxiJin-tobeyjin for your contribution