Katib doesn't support mpijob

See original GitHub issue

/kind bug

What steps did you take and what happened: I deployed Katib and mpi-operator in my local Kubernetes cluster:

kubectl get po -n kubeflow
NAME                                   READY   STATUS    RESTARTS   AGE
katib-controller-b6dc87fcb-2lrtj       1/1     Running   0          26h
katib-db-manager-79fd46648b-scxx8      1/1     Running   0          2d3h
katib-mysql-7f8bc6956f-fxkgl           1/1     Running   0          13d
katib-ui-74bcbd8b75-bwppw              1/1     Running   0          13d

I then used kubectl to create an Experiment whose trial template is an MPIJob. Creation failed with the following error:

Error from server: error when creating "tt-katib.yaml": admission webhook "validating.experiment.katib.kubeflow.org" denied the request: Invalid spec.trialTemplate: Job type kubeflow.org/v1alpha2, Kind=MPIJob not supported.

What did you expect to happen: The Experiment is created successfully, and the Trials and MPIJobs run properly.

Anything else you would like to add: Currently only Job, TFJob, and PyTorchJob are supported. Please consider supporting mpi-operator.

Environment:

  • Kubernetes version: (use kubectl version): 1.14.1
  • OS (e.g. from /etc/os-release): Ubuntu 16.04.4

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

YuxiJin-tobeyjin commented, May 9, 2020

@gaocegege Thanks for your reply.

OK. Thanks to https://github.com/kubeflow/katib/issues/341, supporting MPIJob or other Kubeflow jobs is no longer that complicated. For MPIJob, the required modifications are:

  1. Modify the katib-controller ClusterRole to add mpijobs.
  2. Add the MPIJob definition to the Katib constants, along with the related handling during job initialization.
  3. MPIJob has no master; it consists only of a launcher and workers. The metrics sidecar should therefore be added to the launcher, which requires some additional logic.
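
Steps 2 and 3 can be pictured with a minimal Python sketch. Katib itself is written in Go, so this is not the controller's code: `SIDECAR_ROLE_BY_KIND`, `validate_job_kind`, and `sidecar_replica` are hypothetical names, and the `Master` entries for TFJob and PyTorchJob are an assumption based on the remark that MPIJob, unlike them, has no master. The error text mirrors the webhook message quoted in the issue.

```python
from typing import Optional

# Hypothetical registry: job kind -> replica role that receives the metrics
# sidecar. None means the job has a single pod template (plain batch Job).
# The "Master" entries are an assumption; "Launcher" follows step 3 above.
SIDECAR_ROLE_BY_KIND = {
    "Job": None,
    "TFJob": "Master",
    "PyTorchJob": "Master",
    "MPIJob": "Launcher",
}


def validate_job_kind(api_version: str, kind: str) -> None:
    """Reject trial templates whose kind is not registered (step 2)."""
    if kind not in SIDECAR_ROLE_BY_KIND:
        raise ValueError(
            f"Invalid spec.trialTemplate: Job type {api_version}, "
            f"Kind={kind} not supported."
        )


def sidecar_replica(kind: str) -> Optional[str]:
    """Return the replica role the metrics sidecar is injected into (step 3)."""
    return SIDECAR_ROLE_BY_KIND[kind]


# Passes without raising once MPIJob is in the registry.
validate_job_kind("kubeflow.org/v1alpha2", "MPIJob")
print(sidecar_replica("MPIJob"))  # Launcher
```

Keeping the kind-to-role mapping in one table means that registering a new job type and choosing its sidecar target happen in the same place.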

I ran some tests; here are the results, just FYI. My experiment configuration looks like this:

apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
metadata:
  namespace: kubeflow
  name: mpi-example1
spec:
  parallelTrialCount: 2
  maxTrialCount: 8
  maxFailedTrialCount: 2
  objective:
    type: maximize
    goal: 98
    objectiveMetricName: Accuracy
  algorithm:
    algorithmName: random
  trialTemplate:
    goTemplate:
        rawTemplate: |-
          apiVersion: "kubeflow.org/v1alpha2"
          kind: MPIJob   
          metadata:
            name: {{.Trial}}
            namespace: {{.NameSpace}}
          spec:
            slotsPerWorker: 1
            cleanPodPolicy: None
            mpiReplicaSpecs:
              Launcher:
                replicas: 1
                template:
                  spec:
                    schedulerName: kube-batch
                    containers:
                    - image: ***
                      name: pytorch-mnist
                      command:
                      - mpirun
                      ***
                      - python
                      - pytorch_mnist.py
                      - --epochs=2                                                                                                          
                      - --batch-size=64
                      {{- with .HyperParameters}}                                                                                     
                      {{- range .}}
                      - "{{.Name}}={{.Value}}"
                      {{- end}}
                      {{- end}}

              Worker:
              ***
  parameters:                                                                                                                             
    - name: --lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.03"
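
The `{{- range .}}` loop in the template appends one argument per hyperparameter to the launcher command. A rough Python equivalent is sketched below, assuming each suggested hyperparameter arrives as a name/value pair; `render_args` and the sample value `0.02` are illustrative only and are not part of Katib.

```python
from typing import Dict, List


def render_args(base_args: List[str], hyper_parameters: List[Dict[str, str]]) -> List[str]:
    """Mimic the template loop: emit one "Name=Value" argument per hyperparameter."""
    extra = [f'{hp["name"]}={hp["value"]}' for hp in hyper_parameters]
    return base_args + extra


# Fixed arguments taken from the launcher command in the template above.
base = ["python", "pytorch_mnist.py", "--epochs=2", "--batch-size=64"]
args = render_args(base, [{"name": "--lr", "value": "0.02"}])
print(args[-1])  # --lr=0.02
```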

After 8 trials, my experiment reached the Succeeded state. Its status is:

  status:
    completionTime: "2020-05-07T09:02:42Z"
    conditions:
    - lastTransitionTime: "2020-05-07T08:56:34Z"
      lastUpdateTime: "2020-05-07T08:56:34Z"
      message: Experiment is created
      reason: ExperimentCreated
      status: "True"
      type: Created
    - lastTransitionTime: "2020-05-07T09:02:42Z"
      lastUpdateTime: "2020-05-07T09:02:42Z"
      message: Experiment is running
      reason: ExperimentRunning
      status: "False"
      type: Running
    - lastTransitionTime: "2020-05-07T09:02:42Z"
      lastUpdateTime: "2020-05-07T09:02:42Z"
      message: Experiment has succeeded because max trial count has reached
      reason: ExperimentMaxTrialsReached
      status: "True"
      type: Succeeded
    currentOptimalTrial:
      bestTrialName: mpi-example1-dzhq62b5
      observation:
        metrics:
        - name: Accuracy
          value: 96.95
      parameterAssignments:
      - name: --lr
        value: "0.022062715753755423"
    startTime: "2020-05-07T08:56:34Z"
    succeededTrialList:
    - mpi-example1-5qw8hp9g
    - mpi-example1-7zpz4hmv
    - mpi-example1-9vxv2dks
    - mpi-example1-dzhq62b5
    - mpi-example1-kn6plkg7
    - mpi-example1-rfqwgmxh
    - mpi-example1-tbg2bkdx
    - mpi-example1-vtxtrjnd
    trials: 8
    trialsSucceeded: 8
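
For anyone checking a run programmatically, here is a small Python sketch that reads the same fields. It assumes the status has been obtained as JSON (for example from `kubectl get experiment mpi-example1 -n kubeflow -o json`); the embedded document is an abbreviated copy of the status above, and `has_succeeded` and `best_trial` are hypothetical helper names.

```python
import json
from typing import Any, Dict, Tuple

# Abbreviated copy of the experiment status shown above, expressed as JSON.
STATUS_JSON = """
{
  "conditions": [
    {"type": "Created", "status": "True"},
    {"type": "Running", "status": "False"},
    {"type": "Succeeded", "status": "True"}
  ],
  "currentOptimalTrial": {
    "bestTrialName": "mpi-example1-dzhq62b5",
    "observation": {"metrics": [{"name": "Accuracy", "value": 96.95}]},
    "parameterAssignments": [{"name": "--lr", "value": "0.022062715753755423"}]
  },
  "trials": 8,
  "trialsSucceeded": 8
}
"""


def has_succeeded(status: Dict[str, Any]) -> bool:
    """True if a condition of type Succeeded has status "True"."""
    return any(
        c["type"] == "Succeeded" and c["status"] == "True"
        for c in status.get("conditions", [])
    )


def best_trial(status: Dict[str, Any]) -> Tuple[str, Dict[str, float], Dict[str, str]]:
    """Return the best trial name, its metrics, and its parameter assignments."""
    optimal = status["currentOptimalTrial"]
    metrics = {m["name"]: m["value"] for m in optimal["observation"]["metrics"]}
    params = {p["name"]: p["value"] for p in optimal["parameterAssignments"]}
    return optimal["bestTrialName"], metrics, params


status = json.loads(STATUS_JSON)
name, metrics, params = best_trial(status)
print(has_succeeded(status), name, metrics["Accuracy"], params["--lr"])
```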

johnugeorge commented, May 11, 2020

LGTM. Thanks @YuxiJin-tobeyjin for your contribution


