Cannot create pod when using xgboost operator

/kind bug

What steps did you take and what happened: Hi guys, I’m trying to run Katib with the XGBoost operator. My Katib works with the PyTorch, TF, Job, etc. operators, but not with the XGBoost operator: no pods are created after the trials are generated. I’m using this YAML file: xgboostjob-lightgbm.yaml https://github.com/kubeflow/katib/blob/master/examples/v1beta1/kubeflow-training-operator/xgboostjob-lightgbm.yaml I created a separate Katib in the namespace katib (using katib-standalone, with changes to the cluster role and katib-controller) and installed the XGBoost operator (it passes the web test). Here is the Katib and xgboost-operator information:

[root@ip-172-31-38-13 wallace]# kubectl  get all
NAME                                              READY   STATUS        RESTARTS   AGE
pod/katib-cert-generator-m2ksx                    0/1     Completed     0          4h25m
pod/katib-controller-6d6fdd9c84-4wnkx             1/1     Running       0          4h25m
pod/katib-db-manager-b6f785f69-44wkp              1/1     Running       0          4h25m
pod/katib-mysql-6dcb447c6f-smhk2                  1/1     Running       0          4h25m
pod/katib-ui-5767cfccdc-kpd77                     1/1     Running       0          4h25m 

NAME                       TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)            AGE
service/katib-controller   ClusterIP   10.100.56.32     <none>        443/TCP,8080/TCP   4h25m
service/katib-db-manager   ClusterIP   10.100.152.185   <none>        6789/TCP           4h25m
service/katib-mysql        ClusterIP   10.100.30.233    <none>        3306/TCP           4h25m
service/katib-ui           ClusterIP   10.100.148.82    <none>        80/TCP             4h25m

NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/katib-controller   1/1     1            1           4h25m
deployment.apps/katib-db-manager   1/1     1            1           4h25m
deployment.apps/katib-mysql        1/1     1            1           4h25m
deployment.apps/katib-ui           1/1     1            1           4h25m

NAME                                          DESIRED   CURRENT   READY   AGE
replicaset.apps/katib-controller-6d6fdd9c84   1         1         1       4h25m
replicaset.apps/katib-db-manager-b6f785f69    1         1         1       4h25m
replicaset.apps/katib-mysql-6dcb447c6f        1         1         1       4h25m
replicaset.apps/katib-ui-5767cfccdc           1         1         1       4h25m

NAME                             COMPLETIONS   DURATION   AGE
job.batch/katib-cert-generator   1/1           7s         4h25m
[root@ip-172-31-38-13 wallace]# kubectl get crd xgboostjobs.xgboostjob.kubeflow.org
NAME                                  CREATED AT
xgboostjobs.xgboostjob.kubeflow.org   2021-11-18T15:12:31Z
[root@ip-172-31-38-13 wallace]# kubectl logs $(kubectl get pods -n katib -o name | grep katib-controller) -n katib | grep '"CRD Kind":"XGBoostJob"'
{"level":"info","ts":1637287330.0504189,"logger":"trial-controller","msg":"Job watch added successfully","CRD Group":"xgboostjob.kubeflow.org","CRD Version":"v1","CRD Kind":"XGBoostJob"}
kubectl edit clusterroles xgboost-operator-cluster-role
- apiGroups:
  - xgboostjob.kubeflow.org
  resources:
  - xgboostjobs
  - xgboostjobs/status
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
  - delete
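
One way to confirm that rules like the ones above are actually in effect is to ask the API server directly with `kubectl auth can-i`. The sketch below is not part of the original report. The helper name `can_manage_xgboostjobs` is hypothetical, and the service account passed to it (for example `system:serviceaccount:katib:katib-controller`) is an assumption about how the standalone install names things. Using `--as` also assumes the caller is allowed to impersonate that account.

```shell
# Hypothetical helper: print, per verb, whether a subject may act on XGBoostJobs.
# $1: subject to impersonate, e.g. system:serviceaccount:katib:katib-controller (assumed name)
# $2: namespace to check in
can_manage_xgboostjobs() {
  sa="$1"
  ns="$2"
  for verb in get list watch create delete; do
    # "kubectl auth can-i" prints yes/no and exits non-zero on "no",
    # so the exit status is ignored and only the printed answer is kept.
    answer=$(kubectl auth can-i "$verb" xgboostjobs.xgboostjob.kubeflow.org \
      --as="$sa" -n "$ns" 2>/dev/null || true)
    echo "$verb: ${answer:-unknown}"
  done
}

# Example call (names are assumptions):
#   can_manage_xgboostjobs system:serviceaccount:katib:katib-controller katib
```

Any verb reported as `no` would point to a rule missing for that subject. `unknown` means the command produced no answer at all.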

Then I apply xgboostjob-lightgbm.yaml:

[root@ip-172-31-38-13 wallace]# kubectl apply -f xgboostjob-lightgbm.yaml
experiment.kubeflow.org/xgboostjob-lightgbm created

The whole process gets stuck here:

[root@ip-172-31-38-13 wallace]# kubectl get all
NAME                                              READY   STATUS      RESTARTS   AGE
pod/katib-cert-generator-m2ksx                    0/1     Completed   0          4h31m
pod/katib-controller-6d6fdd9c84-4wnkx             1/1     Running     0          4h31m
pod/katib-db-manager-b6f785f69-44wkp              1/1     Running     0          4h31m
pod/katib-mysql-6dcb447c6f-smhk2                  1/1     Running     0          4h31m
pod/katib-ui-5767cfccdc-kpd77                     1/1     Running     0          4h31m
pod/xgboostjob-lightgbm-random-67c4f69b9c-69vwz   1/1     Running     0          31s

NAME                                 TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)            AGE
service/katib-controller             ClusterIP   10.100.56.32     <none>        443/TCP,8080/TCP   4h31m
service/katib-db-manager             ClusterIP   10.100.152.185   <none>        6789/TCP           4h31m
service/katib-mysql                  ClusterIP   10.100.30.233    <none>        3306/TCP           4h31m
service/katib-ui                     ClusterIP   10.100.148.82    <none>        80/TCP             4h31m
service/xgboostjob-lightgbm-random   ClusterIP   10.100.125.219   <none>        6789/TCP           31s

NAME                                         READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/katib-controller             1/1     1            1           4h31m
deployment.apps/katib-db-manager             1/1     1            1           4h31m
deployment.apps/katib-mysql                  1/1     1            1           4h31m
deployment.apps/katib-ui                     1/1     1            1           4h31m
deployment.apps/xgboostjob-lightgbm-random   1/1     1            1           31s

NAME                                                    DESIRED   CURRENT   READY   AGE
replicaset.apps/katib-controller-6d6fdd9c84             1         1         1       4h31m
replicaset.apps/katib-db-manager-b6f785f69              1         1         1       4h31m
replicaset.apps/katib-mysql-6dcb447c6f                  1         1         1       4h31m
replicaset.apps/katib-ui-5767cfccdc                     1         1         1       4h31m
replicaset.apps/xgboostjob-lightgbm-random-67c4f69b9c   1         1         1       31s

NAME                             COMPLETIONS   DURATION   AGE
job.batch/katib-cert-generator   1/1           7s         4h31m

NAME                                          TYPE      STATUS   REQUESTED   ASSIGNED   AGE
suggestion.kubeflow.org/xgboostjob-lightgbm   Running   True     6           6          31s

NAME                                          TYPE      STATUS   AGE
experiment.kubeflow.org/xgboostjob-lightgbm   Running   True     31s

NAME                                              TYPE      STATUS   AGE
trial.kubeflow.org/xgboostjob-lightgbm-5vgdtcs8   Running   True     10s
trial.kubeflow.org/xgboostjob-lightgbm-fxsclkkt   Running   True     10s
trial.kubeflow.org/xgboostjob-lightgbm-htls9mrz   Running   True     10s
trial.kubeflow.org/xgboostjob-lightgbm-kq8wblpm   Running   True     10s
trial.kubeflow.org/xgboostjob-lightgbm-lv8w2tsn   Running   True     10s
trial.kubeflow.org/xgboostjob-lightgbm-wtjwxpzj   Running   True     10s

What did you expect to happen: The trials should create pods, but they did not. I do not know where the mistake is.
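
A possible way to narrow this down, sketched under assumptions rather than taken from the thread: check whether the XGBoostJob object for a trial exists, and whether any pod in the namespace carries the trial name. If the XGBoostJob exists but no pod matches, Katib has probably done its part and the operator side (its pod, its logs, the namespaces it watches) is the next place to look. The helper name `check_trial` is hypothetical, and the pod check assumes that pods created for a job contain the job name.

```shell
# Hypothetical helper: report whether a trial's XGBoostJob exists and how many
# pods in the namespace contain the trial name.
# $1: namespace, $2: trial name (Katib names the job after the trial)
check_trial() {
  ns="$1"
  trial="$2"
  if kubectl get xgboostjobs.xgboostjob.kubeflow.org "$trial" -n "$ns" >/dev/null 2>&1; then
    echo "XGBoostJob $trial exists"
  else
    echo "XGBoostJob $trial missing"
  fi
  # grep -c prints 0 and exits non-zero when nothing matches; keep the count only.
  pods=$(kubectl get pods -n "$ns" -o name 2>/dev/null | grep -c "$trial" || true)
  echo "pods matching $trial: $pods"
}

# Example call with a trial name from the listing above:
#   check_trial katib xgboostjob-lightgbm-lv8w2tsn
```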

Anything else you would like to add: Here is some additional information:

[root@ip-172-31-38-13 wallace]# kubectl get experiment.kubeflow.org/xgboostjob-lightgbm -o yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"kubeflow.org/v1beta1","kind":"Experiment","metadata":{"annotations":{},"name":"xgboostjob-lightgbm","namespace":"katib"},"spec":{"algorithm":{"algorithmName":"random"},"maxFailedTrialCount":3,"maxTrialCount":6,"metricsCollectorSpec":{"source":{"filter":{"metricsFormat":["(\\w+\\s\\w+)\\s:\\s((-?\\d+)(\\.\\d+)?)"]}}},"objective":{"additionalMetricNames":["valid_1 binary_logloss","training auc","training binary_logloss"],"goal":0.99,"objectiveMetricName":"valid_1 auc","type":"maximize"},"parallelTrialCount":7,"parameters":[{"feasibleSpace":{"max":"0.1","min":"0.01"},"name":"lr","parameterType":"double"},{"feasibleSpace":{"max":"60","min":"50","step":"1"},"name":"num-leaves","parameterType":"int"}],"trialTemplate":{"primaryContainerName":"xgboost","trialParameters":[{"description":"Learning rate for the training model","name":"learningRate","reference":"lr"},{"description":"Number of leaves for one tree","name":"numberLeaves","reference":"num-leaves"}],"trialSpec":{"apiVersion":"xgboostjob.kubeflow.org/v1","kind":"XGBoostJob","spec":{"xgbReplicaSpecs":{"Master":{"replicas":1,"restartPolicy":"Never","template":{"spec":{"containers":[{"args":["--job_type=Train","--metric=binary_logloss,auc","--learning_rate=${trialParameters.learningRate}","--num_leaves=${trialParameters.numberLeaves}","--num_trees=100","--boosting_type=gbdt","--objective=binary","--metric_freq=1","--is_training_metric=true","--max_bin=255","--data=data/binary.train","--valid_data=data/binary.test","--tree_learner=feature","--feature_fraction=0.8","--bagging_freq=5","--bagging_fraction=0.8","--min_data_in_leaf=50","--min_sum_hessian_in_leaf=50","--is_enable_sparse=true","--use_two_round_loading=false","--is_save_binary_file=false"],"image":"docker.io/kubeflowkatib/xgboost-lightgbm:1.0","imagePullPolicy":"Always","name":"xgboost","ports":[{"containerPort":9991,"name":"xgboostjob-port"}]}]}}},"Worker":{"replicas":2,"restartPolicy":"ExitCode","template":{"spec":{"containers":[{"args":["
--job_type=Train","--metric=binary_logloss,auc","--learning_rate=${trialParameters.learningRate}","--num_leaves=${trialParameters.numberLeaves}","--num_trees=100","--boosting_type=gbdt","--objective=binary","--metric_freq=1","--is_training_metric=true","--max_bin=255","--data=data/binary.train","--valid_data=data/binary.test","--tree_learner=feature","--feature_fraction=0.8","--bagging_freq=5","--bagging_fraction=0.8","--min_data_in_leaf=50","--min_sum_hessian_in_leaf=50","--is_enable_sparse=true","--use_two_round_loading=false","--is_save_binary_file=false"],"image":"docker.io/kubeflowkatib/xgboost-lightgbm:1.0","imagePullPolicy":"Always","name":"xgboost","ports":[{"containerPort":9991,"name":"xgboostjob-port"}]}]}}}}}}}}}
  creationTimestamp: "2021-11-19T06:32:26Z"
  finalizers:
  - update-prometheus-metrics
  generation: 1
  name: xgboostjob-lightgbm
  namespace: katib
  resourceVersion: "132795443"
  selfLink: /apis/kubeflow.org/v1beta1/namespaces/katib/experiments/xgboostjob-lightgbm
  uid: 10ed205c-14e0-439e-949f-d0649f433ba7
spec:
  algorithm:
    algorithmName: random
  maxFailedTrialCount: 3
  maxTrialCount: 6
  metricsCollectorSpec:
    source:
      filter:
        metricsFormat:
        - (\w+\s\w+)\s:\s((-?\d+)(\.\d+)?)
  objective:
    additionalMetricNames:
    - valid_1 binary_logloss
    - training auc
    - training binary_logloss
    goal: 0.99
    objectiveMetricName: valid_1 auc
    type: maximize
  parallelTrialCount: 7
  parameters:
  - feasibleSpace:
      max: "0.1"
      min: "0.01"
    name: lr
    parameterType: double
  - feasibleSpace:
      max: "60"
      min: "50"
      step: "1"
    name: num-leaves
    parameterType: int
  trialTemplate:
    primaryContainerName: xgboost
    trialParameters:
    - description: Learning rate for the training model
      name: learningRate
      reference: lr
    - description: Number of leaves for one tree
      name: numberLeaves
      reference: num-leaves
    trialSpec:
      apiVersion: xgboostjob.kubeflow.org/v1
      kind: XGBoostJob
      spec:
        xgbReplicaSpecs:
          Master:
            replicas: 1
            restartPolicy: Never
            template:
              spec:
                containers:
                - args:
                  - --job_type=Train
                  - --metric=binary_logloss,auc
                  - --learning_rate=${trialParameters.learningRate}
                  - --num_leaves=${trialParameters.numberLeaves}
                  - --num_trees=100
                  - --boosting_type=gbdt
                  - --objective=binary
                  - --metric_freq=1
                  - --is_training_metric=true
                  - --max_bin=255
                  - --data=data/binary.train
                  - --valid_data=data/binary.test
                  - --tree_learner=feature
                  - --feature_fraction=0.8
                  - --bagging_freq=5
                  - --bagging_fraction=0.8
                  - --min_data_in_leaf=50
                  - --min_sum_hessian_in_leaf=50
                  - --is_enable_sparse=true
                  - --use_two_round_loading=false
                  - --is_save_binary_file=false
                  image: docker.io/kubeflowkatib/xgboost-lightgbm:1.0
                  imagePullPolicy: Always
                  name: xgboost
                  ports:
                  - containerPort: 9991
                    name: xgboostjob-port
          Worker:
            replicas: 2
            restartPolicy: ExitCode
            template:
              spec:
                containers:
                - args:
                  - --job_type=Train
                  - --metric=binary_logloss,auc
                  - --learning_rate=${trialParameters.learningRate}
                  - --num_leaves=${trialParameters.numberLeaves}
                  - --num_trees=100
                  - --boosting_type=gbdt
                  - --objective=binary
                  - --metric_freq=1
                  - --is_training_metric=true
                  - --max_bin=255
                  - --data=data/binary.train
                  - --valid_data=data/binary.test
                  - --tree_learner=feature
                  - --feature_fraction=0.8
                  - --bagging_freq=5
                  - --bagging_fraction=0.8
                  - --min_data_in_leaf=50
                  - --min_sum_hessian_in_leaf=50
                  - --is_enable_sparse=true
                  - --use_two_round_loading=false
                  - --is_save_binary_file=false
                  image: docker.io/kubeflowkatib/xgboost-lightgbm:1.0
                  imagePullPolicy: Always
                  name: xgboost
                  ports:
                  - containerPort: 9991
                    name: xgboostjob-port
status:
  conditions:
  - lastTransitionTime: "2021-11-19T06:32:26Z"
    lastUpdateTime: "2021-11-19T06:32:26Z"
    message: Experiment is created
    reason: ExperimentCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2021-11-19T06:32:47Z"
    lastUpdateTime: "2021-11-19T06:32:47Z"
    message: Experiment is running
    reason: ExperimentRunning
    status: "True"
    type: Running
  currentOptimalTrial:
    observation: {}
  runningTrialList:
  - xgboostjob-lightgbm-fxsclkkt
  - xgboostjob-lightgbm-lv8w2tsn
  - xgboostjob-lightgbm-wtjwxpzj
  - xgboostjob-lightgbm-kq8wblpm
  - xgboostjob-lightgbm-htls9mrz
  - xgboostjob-lightgbm-5vgdtcs8
  startTime: "2021-11-19T06:32:26Z"
  trials: 6
  trialsRunning: 6
[root@ip-172-31-38-13 wallace]# kubectl logs pod/katib-controller-6d6fdd9c84-4wnkx
...
{"level":"info","ts":1637303567.322073,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib/xgboostjob-lightgbm","requiredActiveCount":6,"parallelCount":7,"activeCount":6,"completedCount":0}
{"level":"info","ts":1637303567.3305497,"logger":"experiment-controller","msg":"Update experiment instance status failed, reconciler requeued","Experiment":"katib/xgboostjob-lightgbm","err":"Operation cannot be fulfilled on experiments.kubeflow.org \"xgboostjob-lightgbm\": the object has been modified; please apply your changes to the latest version and try again"}
{"level":"info","ts":1637303567.330741,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib/xgboostjob-lightgbm","requiredActiveCount":6,"parallelCount":7,"activeCount":6,"completedCount":0}
{"level":"info","ts":1637303567.3357322,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib/xgboostjob-lightgbm","requiredActiveCount":6,"parallelCount":7,"activeCount":6,"completedCount":0}
{"level":"info","ts":1637303567.3385773,"logger":"trial-controller","msg":"Creating Job","Trial":"katib/xgboostjob-lightgbm-lv8w2tsn","kind":"XGBoostJob","name":"xgboostjob-lightgbm-lv8w2tsn"}
{"level":"info","ts":1637303567.3468902,"logger":"trial-controller","msg":"Trial status changed to Running","Trial":"katib/xgboostjob-lightgbm-lv8w2tsn"}
{"level":"info","ts":1637303567.3594291,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib/xgboostjob-lightgbm","requiredActiveCount":6,"parallelCount":7,"activeCount":6,"completedCount":0}
{"level":"info","ts":1637303567.373519,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib/xgboostjob-lightgbm","requiredActiveCount":6,"parallelCount":7,"activeCount":6,"completedCount":0}
{"level":"info","ts":1637303567.3863587,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib/xgboostjob-lightgbm","requiredActiveCount":6,"parallelCount":7,"activeCount":6,"completedCount":0}
{"level":"info","ts":1637303567.3966305,"logger":"experiment-controller","msg":"Update experiment instance status failed, reconciler requeued","Experiment":"katib/xgboostjob-lightgbm","err":"Operation cannot be fulfilled on experiments.kubeflow.org \"xgboostjob-lightgbm\": the object has been modified; please apply your changes to the latest version and try again"}
{"level":"info","ts":1637303567.396777,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib/xgboostjob-lightgbm","requiredActiveCount":6,"parallelCount":7,"activeCount":6,"completedCount":0}
{"level":"info","ts":1637303567.3985944,"logger":"trial-controller","msg":"Creating Job","Trial":"katib/xgboostjob-lightgbm-wtjwxpzj","kind":"XGBoostJob","name":"xgboostjob-lightgbm-wtjwxpzj"}
{"level":"info","ts":1637303567.4087048,"logger":"trial-controller","msg":"Trial status changed to Running","Trial":"katib/xgboostjob-lightgbm-wtjwxpzj"}
{"level":"info","ts":1637303567.4103403,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib/xgboostjob-lightgbm","requiredActiveCount":6,"parallelCount":7,"activeCount":6,"completedCount":0}
{"level":"info","ts":1637303567.419407,"logger":"experiment-controller","msg":"Update experiment instance status failed, reconciler requeued","Experiment":"katib/xgboostjob-lightgbm","err":"Operation cannot be fulfilled on experiments.kubeflow.org \"xgboostjob-lightgbm\": the object has been modified; please apply your changes to the latest version and try again"}
{"level":"info","ts":1637303567.4195657,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib/xgboostjob-lightgbm","requiredActiveCount":6,"parallelCount":7,"activeCount":6,"completedCount":0}
{"level":"info","ts":1637303567.4320982,"logger":"trial-controller","msg":"Creating Job","Trial":"katib/xgboostjob-lightgbm-kq8wblpm","kind":"XGBoostJob","name":"xgboostjob-lightgbm-kq8wblpm"}
{"level":"info","ts":1637303567.4334705,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib/xgboostjob-lightgbm","requiredActiveCount":6,"parallelCount":7,"activeCount":6,"completedCount":0}
{"level":"info","ts":1637303567.4415863,"logger":"trial-controller","msg":"Trial status changed to Running","Trial":"katib/xgboostjob-lightgbm-kq8wblpm"}
{"level":"info","ts":1637303567.443757,"logger":"experiment-controller","msg":"Update experiment instance status failed, reconciler requeued","Experiment":"katib/xgboostjob-lightgbm","err":"Operation cannot be fulfilled on experiments.kubeflow.org \"xgboostjob-lightgbm\": the object has been modified; please apply your changes to the latest version and try again"}
{"level":"info","ts":1637303567.443909,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib/xgboostjob-lightgbm","requiredActiveCount":6,"parallelCount":7,"activeCount":6,"completedCount":0}
{"level":"info","ts":1637303567.4563413,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib/xgboostjob-lightgbm","requiredActiveCount":6,"parallelCount":7,"activeCount":6,"completedCount":0}
{"level":"info","ts":1637303567.4618397,"logger":"trial-controller","msg":"Creating Job","Trial":"katib/xgboostjob-lightgbm-htls9mrz","kind":"XGBoostJob","name":"xgboostjob-lightgbm-htls9mrz"}
{"level":"info","ts":1637303567.4651172,"logger":"experiment-controller","msg":"Update experiment instance status failed, reconciler requeued","Experiment":"katib/xgboostjob-lightgbm","err":"Operation cannot be fulfilled on experiments.kubeflow.org \"xgboostjob-lightgbm\": the object has been modified; please apply your changes to the latest version and try again"}
{"level":"info","ts":1637303567.4652743,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib/xgboostjob-lightgbm","requiredActiveCount":6,"parallelCount":7,"activeCount":6,"completedCount":0}
{"level":"info","ts":1637303567.4728572,"logger":"trial-controller","msg":"Trial status changed to Running","Trial":"katib/xgboostjob-lightgbm-htls9mrz"}
{"level":"info","ts":1637303567.4796686,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib/xgboostjob-lightgbm","requiredActiveCount":6,"parallelCount":7,"activeCount":6,"completedCount":0}
{"level":"info","ts":1637303567.4899561,"logger":"trial-controller","msg":"Creating Job","Trial":"katib/xgboostjob-lightgbm-5vgdtcs8","kind":"XGBoostJob","name":"xgboostjob-lightgbm-5vgdtcs8"}
{"level":"info","ts":1637303567.4920142,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib/xgboostjob-lightgbm","requiredActiveCount":6,"parallelCount":7,"activeCount":6,"completedCount":0}
{"level":"info","ts":1637303567.4994686,"logger":"trial-controller","msg":"Trial status changed to Running","Trial":"katib/xgboostjob-lightgbm-5vgdtcs8"}
{"level":"info","ts":1637303567.5010126,"logger":"experiment-controller","msg":"Update experiment instance status failed, reconciler requeued","Experiment":"katib/xgboostjob-lightgbm","err":"Operation cannot be fulfilled on experiments.kubeflow.org \"xgboostjob-lightgbm\": the object has been modified; please apply your changes to the latest version and try again"}
{"level":"info","ts":1637303567.5011945,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib/xgboostjob-lightgbm","requiredActiveCount":6,"parallelCount":7,"activeCount":6,"completedCount":0}
{"level":"info","ts":1637303567.5156927,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib/xgboostjob-lightgbm","requiredActiveCount":6,"parallelCount":7,"activeCount":6,"completedCount":0}
{"level":"info","ts":1637303567.5287273,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib/xgboostjob-lightgbm","requiredActiveCount":6,"parallelCount":7,"activeCount":6,"completedCount":0}
{"level":"info","ts":1637303567.541383,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib/xgboostjob-lightgbm","requiredActiveCount":6,"parallelCount":7,"activeCount":6,"completedCount":0}
{"level":"info","ts":1637303567.553308,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib/xgboostjob-lightgbm","requiredActiveCount":6,"parallelCount":7,"activeCount":6,"completedCount":0}
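
The log above can be summarized mechanically. As a small sketch (the helper name `created_jobs` and the file name are hypothetical), the following lists the job names for which the trial controller logged "Creating Job", assuming the log was saved to a file with one JSON object per line:

```shell
# Hypothetical helper: print the "name" field of every "Creating Job" log entry.
# $1: file containing katib-controller log lines, one JSON object per line
created_jobs() {
  grep '"msg":"Creating Job"' "$1" |
    sed -n 's/.*"name":"\([^"]*\)".*/\1/p'
}

# Example (file name is an assumption):
#   kubectl logs pod/katib-controller-6d6fdd9c84-4wnkx -n katib > controller.log
#   created_jobs controller.log
```

Run against the excerpt above, this would list five of the six trial names. The entry for xgboostjob-lightgbm-fxsclkkt presumably falls in the elided part of the log. Since the controller reports creating the XGBoostJob objects, the missing pods are more likely an issue on the operator side than in Katib's trial creation, though the excerpt alone does not prove that.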
[root@ip-172-31-38-13 wallace]# kubectl get pod/xgboostjob-lightgbm-random-67c4f69b9c-69vwz -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"kubeflow.org/v1beta1","kind":"Experiment","metadata":{"annotations":{},"name":"xgboostjob-lightgbm","namespace":"katib"},"spec":{"algorithm":{"algorithmName":"random"},"maxFailedTrialCount":3,"maxTrialCount":6,"metricsCollectorSpec":{"source":{"filter":{"metricsFormat":["(\\w+\\s\\w+)\\s:\\s((-?\\d+)(\\.\\d+)?)"]}}},"objective":{"additionalMetricNames":["valid_1 binary_logloss","training auc","training binary_logloss"],"goal":0.99,"objectiveMetricName":"valid_1 auc","type":"maximize"},"parallelTrialCount":7,"parameters":[{"feasibleSpace":{"max":"0.1","min":"0.01"},"name":"lr","parameterType":"double"},{"feasibleSpace":{"max":"60","min":"50","step":"1"},"name":"num-leaves","parameterType":"int"}],"trialTemplate":{"primaryContainerName":"xgboost","trialParameters":[{"description":"Learning rate for the training model","name":"learningRate","reference":"lr"},{"description":"Number of leaves for one tree","name":"numberLeaves","reference":"num-leaves"}],"trialSpec":{"apiVersion":"xgboostjob.kubeflow.org/v1","kind":"XGBoostJob","spec":{"xgbReplicaSpecs":{"Master":{"replicas":1,"restartPolicy":"Never","template":{"spec":{"containers":[{"args":["--job_type=Train","--metric=binary_logloss,auc","--learning_rate=${trialParameters.learningRate}","--num_leaves=${trialParameters.numberLeaves}","--num_trees=100","--boosting_type=gbdt","--objective=binary","--metric_freq=1","--is_training_metric=true","--max_bin=255","--data=data/binary.train","--valid_data=data/binary.test","--tree_learner=feature","--feature_fraction=0.8","--bagging_freq=5","--bagging_fraction=0.8","--min_data_in_leaf=50","--min_sum_hessian_in_leaf=50","--is_enable_sparse=true","--use_two_round_loading=false","--is_save_binary_file=false"],"image":"docker.io/kubeflowkatib/xgboost-lightgbm:1.0","imagePullPolicy":"Always","name":"xgboost","ports":[{"containerPort":9991,"name":"xgboostjob-port"}]}]}}},"Worker":{"replicas":2,"restartPolicy":"ExitCode","template":{"spec":{"containers":[{"args":["
--job_type=Train","--metric=binary_logloss,auc","--learning_rate=${trialParameters.learningRate}","--num_leaves=${trialParameters.numberLeaves}","--num_trees=100","--boosting_type=gbdt","--objective=binary","--metric_freq=1","--is_training_metric=true","--max_bin=255","--data=data/binary.train","--valid_data=data/binary.test","--tree_learner=feature","--feature_fraction=0.8","--bagging_freq=5","--bagging_fraction=0.8","--min_data_in_leaf=50","--min_sum_hessian_in_leaf=50","--is_enable_sparse=true","--use_two_round_loading=false","--is_save_binary_file=false"],"image":"docker.io/kubeflowkatib/xgboost-lightgbm:1.0","imagePullPolicy":"Always","name":"xgboost","ports":[{"containerPort":9991,"name":"xgboostjob-port"}]}]}}}}}}}}}
    kubernetes.io/psp: eks.privileged
    sidecar.istio.io/inject: "false"
  creationTimestamp: "2021-11-19T06:32:26Z"
  generateName: xgboostjob-lightgbm-random-67c4f69b9c-
  labels:
    katib.kubeflow.org/deployment: xgboostjob-lightgbm-random
    katib.kubeflow.org/experiment: xgboostjob-lightgbm
    katib.kubeflow.org/suggestion: xgboostjob-lightgbm
    pod-template-hash: 67c4f69b9c
  name: xgboostjob-lightgbm-random-67c4f69b9c-69vwz
  namespace: katib
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: xgboostjob-lightgbm-random-67c4f69b9c
    uid: 2691ab09-f3dc-4d64-843c-33f93bf878b2
  resourceVersion: "132795355"
  selfLink: /api/v1/namespaces/katib/pods/xgboostjob-lightgbm-random-67c4f69b9c-69vwz
  uid: d73e70b8-6dde-42b5-82e7-68532947afcb
spec:
  containers:
  - image: docker.io/kubeflowkatib/suggestion-hyperopt:latest
    imagePullPolicy: IfNotPresent
    livenessProbe:
      exec:
        command:
        - /bin/grpc_health_probe
        - -addr=:6789
        - -service=manager.v1beta1.Suggestion
      failureThreshold: 12
      initialDelaySeconds: 10
      periodSeconds: 120
      successThreshold: 1
      timeoutSeconds: 1
    name: suggestion
    ports:
    - containerPort: 6789
      name: suggestion-api
      protocol: TCP
    readinessProbe:
      exec:
        command:
        - /bin/grpc_health_probe
        - -addr=:6789
        - -service=manager.v1beta1.Suggestion
      failureThreshold: 3
      initialDelaySeconds: 10
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    resources:
      limits:
        cpu: 500m
        ephemeral-storage: 5Gi
        memory: 100Mi
      requests:
        cpu: 50m
        ephemeral-storage: 500Mi
        memory: 10Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-v4f7l
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: ip-172-31-22-247.cn-northwest-1.compute.internal
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: default-token-v4f7l
    secret:
      defaultMode: 420
      secretName: default-token-v4f7l
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2021-11-19T06:32:26Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2021-11-19T06:32:46Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2021-11-19T06:32:46Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2021-11-19T06:32:26Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://76dc050863f5c57f1c0958d312776be72ff98eb041e3ecbc790ff41b0955a954
    image: kubeflowkatib/suggestion-hyperopt:latest
    imageID: docker-pullable://kubeflowkatib/suggestion-hyperopt@sha256:0590de0df777c29814181c8c6bd9f014b20479fc3ad599c367c23b0c12735e8d
    lastState: {}
    name: suggestion
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2021-11-19T06:32:27Z"
  hostIP: 172.31.22.247
  phase: Running
  podIP: 172.31.22.66
  podIPs:
  - ip: 172.31.22.66
  qosClass: Burstable
  startTime: "2021-11-19T06:32:26Z"

Environment:

  • Kubeflow version (kfctl version):
  • Minikube version (minikube version):
  • Kubernetes version: (use kubectl version):
[root@ip-172-31-38-13 wallace]# kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.2", GitCommit:"092fbfbf53427de67cac1e9fa54aaa09a28371d7", GitTreeState:"clean", BuildDate:"2021-06-16T12:59:11Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"19+", GitVersion:"v1.19.13-eks-8df270", GitCommit:"8df2700a72a2598fa3a67c05126fa158fd839620", GitTreeState:"clean", BuildDate:"2021-07-31T01:36:57Z", GoVersion:"go1.15.14", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.21) and server (1.19) exceeds the supported minor version skew of +/-1
  • OS (e.g. from /etc/os-release):
[root@ip-172-31-38-13 wallace]# cat /etc/os-release
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 11 (6 by maintainers)

Top GitHub Comments

3 reactions
tenzen-y commented, Nov 22, 2021

I’m glad I could help you solve your problem.

1 reaction
gaocegege commented, Nov 22, 2021

Cool!
