Cannot create pod when using xgboost operator
/kind bug
What steps did you take and what happened: Hi all, I'm trying to run Katib with the XGBoost operator. My Katib installation works with the PyTorch, TF, Job, etc. operators, but it does not work with the XGBoost operator: no pods are created after the trials are generated. I'm using this YAML file: xgboostjob-lightgbm.yaml https://github.com/kubeflow/katib/blob/master/examples/v1beta1/kubeflow-training-operator/xgboostjob-lightgbm.yaml
I created a separate Katib installation in the namespace katib (using katib-standalone, with changes to the cluster role and katib-controller) and installed the XGBoost operator (it passes the web test). Here is the Katib and xgboost-operator information:
[root@ip-172-31-38-13 wallace]# kubectl get all
NAME READY STATUS RESTARTS AGE
pod/katib-cert-generator-m2ksx 0/1 Completed 0 4h25m
pod/katib-controller-6d6fdd9c84-4wnkx 1/1 Running 0 4h25m
pod/katib-db-manager-b6f785f69-44wkp 1/1 Running 0 4h25m
pod/katib-mysql-6dcb447c6f-smhk2 1/1 Running 0 4h25m
pod/katib-ui-5767cfccdc-kpd77 1/1 Running 0 4h25m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/katib-controller ClusterIP 10.100.56.32 <none> 443/TCP,8080/TCP 4h25m
service/katib-db-manager ClusterIP 10.100.152.185 <none> 6789/TCP 4h25m
service/katib-mysql ClusterIP 10.100.30.233 <none> 3306/TCP 4h25m
service/katib-ui ClusterIP 10.100.148.82 <none> 80/TCP 4h25m
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/katib-controller 1/1 1 1 4h25m
deployment.apps/katib-db-manager 1/1 1 1 4h25m
deployment.apps/katib-mysql 1/1 1 1 4h25m
deployment.apps/katib-ui 1/1 1 1 4h25m
NAME DESIRED CURRENT READY AGE
replicaset.apps/katib-controller-6d6fdd9c84 1 1 1 4h25m
replicaset.apps/katib-db-manager-b6f785f69 1 1 1 4h25m
replicaset.apps/katib-mysql-6dcb447c6f 1 1 1 4h25m
replicaset.apps/katib-ui-5767cfccdc 1 1 1 4h25m
NAME COMPLETIONS DURATION AGE
job.batch/katib-cert-generator 1/1 7s 4h25m
[root@ip-172-31-38-13 wallace]# kubectl get crd xgboostjobs.xgboostjob.kubeflow.org
NAME CREATED AT
xgboostjobs.xgboostjob.kubeflow.org 2021-11-18T15:12:31Z
[root@ip-172-31-38-13 wallace]# kubectl logs $(kubectl get pods -n katib -o name | grep katib-controller) -n katib | grep '"CRD Kind":"XGBoostJob"'
{"level":"info","ts":1637287330.0504189,"logger":"trial-controller","msg":"Job watch added successfully","CRD Group":"xgboostjob.kubeflow.org","CRD Version":"v1","CRD Kind":"XGBoostJob"}
kubectl edit clusterroles xgboost-operator-cluster-role
- apiGroups:
- xgboostjob.kubeflow.org
resources:
- xgboostjobs
- xgboostjobs/status
verbs:
- get
- list
- watch
- create
- update
- patch
- delete
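As a quick sanity check on the rule above, here is a minimal Python sketch that tests whether a list of RBAC rules grants a set of verbs on xgboostjobs. The `rules` literal mirrors the rule shown above; the `allows` helper and the list of required verbs are illustrative assumptions, not part of Katib or Kubernetes.

```python
from typing import Iterable

# Rule copied from the ClusterRole excerpt above.
rules = [
    {
        "apiGroups": ["xgboostjob.kubeflow.org"],
        "resources": ["xgboostjobs", "xgboostjobs/status"],
        "verbs": ["get", "list", "watch", "create", "update", "patch", "delete"],
    }
]


def allows(rules: Iterable[dict], group: str, resource: str, verb: str) -> bool:
    """Return True if any rule grants `verb` on `resource` in `group`.

    Simplified on purpose: it honors the "*" wildcard but ignores
    resourceNames and nonResourceURLs.
    """
    for rule in rules:
        groups = rule.get("apiGroups", [])
        resources = rule.get("resources", [])
        verbs = rule.get("verbs", [])
        if ((group in groups or "*" in groups)
                and (resource in resources or "*" in resources)
                and (verb in verbs or "*" in verbs)):
            return True
    return False


# Verbs assumed to be needed to create and track XGBoostJobs (an assumption).
required = ["get", "list", "watch", "create", "delete"]
missing = [v for v in required
           if not allows(rules, "xgboostjob.kubeflow.org", "xgboostjobs", v)]
print("missing verbs:", missing)
```

On a live cluster, `kubectl auth can-i create xgboostjobs.xgboostjob.kubeflow.org -n katib --as=system:serviceaccount:katib:<service-account>` asks the API server the same question directly; the service account to substitute depends on the installation.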
Then I apply xgboostjob-lightgbm.yaml:
[root@ip-172-31-38-13 wallace]# kubectl apply -f xgboostjob-lightgbm.yaml
experiment.kubeflow.org/xgboostjob-lightgbm created
The whole process gets stuck here:
[root@ip-172-31-38-13 wallace]# kubectl get all
NAME READY STATUS RESTARTS AGE
pod/katib-cert-generator-m2ksx 0/1 Completed 0 4h31m
pod/katib-controller-6d6fdd9c84-4wnkx 1/1 Running 0 4h31m
pod/katib-db-manager-b6f785f69-44wkp 1/1 Running 0 4h31m
pod/katib-mysql-6dcb447c6f-smhk2 1/1 Running 0 4h31m
pod/katib-ui-5767cfccdc-kpd77 1/1 Running 0 4h31m
pod/xgboostjob-lightgbm-random-67c4f69b9c-69vwz 1/1 Running 0 31s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/katib-controller ClusterIP 10.100.56.32 <none> 443/TCP,8080/TCP 4h31m
service/katib-db-manager ClusterIP 10.100.152.185 <none> 6789/TCP 4h31m
service/katib-mysql ClusterIP 10.100.30.233 <none> 3306/TCP 4h31m
service/katib-ui ClusterIP 10.100.148.82 <none> 80/TCP 4h31m
service/xgboostjob-lightgbm-random ClusterIP 10.100.125.219 <none> 6789/TCP 31s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/katib-controller 1/1 1 1 4h31m
deployment.apps/katib-db-manager 1/1 1 1 4h31m
deployment.apps/katib-mysql 1/1 1 1 4h31m
deployment.apps/katib-ui 1/1 1 1 4h31m
deployment.apps/xgboostjob-lightgbm-random 1/1 1 1 31s
NAME DESIRED CURRENT READY AGE
replicaset.apps/katib-controller-6d6fdd9c84 1 1 1 4h31m
replicaset.apps/katib-db-manager-b6f785f69 1 1 1 4h31m
replicaset.apps/katib-mysql-6dcb447c6f 1 1 1 4h31m
replicaset.apps/katib-ui-5767cfccdc 1 1 1 4h31m
replicaset.apps/xgboostjob-lightgbm-random-67c4f69b9c 1 1 1 31s
NAME COMPLETIONS DURATION AGE
job.batch/katib-cert-generator 1/1 7s 4h31m
NAME TYPE STATUS REQUESTED ASSIGNED AGE
suggestion.kubeflow.org/xgboostjob-lightgbm Running True 6 6 31s
NAME TYPE STATUS AGE
experiment.kubeflow.org/xgboostjob-lightgbm Running True 31s
NAME TYPE STATUS AGE
trial.kubeflow.org/xgboostjob-lightgbm-5vgdtcs8 Running True 10s
trial.kubeflow.org/xgboostjob-lightgbm-fxsclkkt Running True 10s
trial.kubeflow.org/xgboostjob-lightgbm-htls9mrz Running True 10s
trial.kubeflow.org/xgboostjob-lightgbm-kq8wblpm Running True 10s
trial.kubeflow.org/xgboostjob-lightgbm-lv8w2tsn Running True 10s
trial.kubeflow.org/xgboostjob-lightgbm-wtjwxpzj Running True 10s
What did you expect to happen: The trials should create pods, but they did not. I don't know where the mistake is.
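To make the symptom concrete, the following Python sketch lists the trials in a `kubectl get all` listing that have no matching pod. It uses a short excerpt of the output above as input. The `names` helper is hypothetical, and the matching rule (a trial's pods are named `<trial-name>-...`) is an assumption about how the operator names its pods.

```python
from typing import List

# Excerpt of the `kubectl get all` output above; only the first column matters.
output = """\
pod/katib-controller-6d6fdd9c84-4wnkx 1/1 Running 0 4h31m
pod/xgboostjob-lightgbm-random-67c4f69b9c-69vwz 1/1 Running 0 31s
trial.kubeflow.org/xgboostjob-lightgbm-5vgdtcs8 Running True 10s
trial.kubeflow.org/xgboostjob-lightgbm-fxsclkkt Running True 10s
"""


def names(text: str, prefix: str) -> List[str]:
    """Collect resource names from lines whose first column starts with prefix."""
    found = []
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0].startswith(prefix):
            found.append(fields[0][len(prefix):])
    return found


pods = names(output, "pod/")
trials = names(output, "trial.kubeflow.org/")

# Assumption: pods belonging to a trial are named "<trial-name>-...".
trials_without_pods = [t for t in trials
                       if not any(p.startswith(t + "-") for p in pods)]
print(trials_without_pods)
```

Both trials in the excerpt are reported, since the only experiment-related pod is the suggestion pod. When every trial is in this state, the XGBoostJob objects and the operator are a reasonable next place to look, for example with `kubectl get xgboostjobs -n katib` and `kubectl describe xgboostjob <name> -n katib`.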
Anything else you would like to add: Here is some additional information:
[root@ip-172-31-38-13 wallace]# kubectl get experiment.kubeflow.org/xgboostjob-lightgbm -o yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
annotations:
kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"kubeflow.org/v1beta1","kind":"Experiment","metadata":{"annotations":{},"name":"xgboostjob-lightgbm","namespace":"katib"},"spec":{"algorithm":{"algorithmName":"random"},"maxFailedTrialCount":3,"maxTrialCount":6,"metricsCollectorSpec":{"source":{"filter":{"metricsFormat":["(\\w+\\s\\w+)\\s:\\s((-?\\d+)(\\.\\d+)?)"]}}},"objective":{"additionalMetricNames":["valid_1 binary_logloss","training auc","training binary_logloss"],"goal":0.99,"objectiveMetricName":"valid_1 auc","type":"maximize"},"parallelTrialCount":7,"parameters":[{"feasibleSpace":{"max":"0.1","min":"0.01"},"name":"lr","parameterType":"double"},{"feasibleSpace":{"max":"60","min":"50","step":"1"},"name":"num-leaves","parameterType":"int"}],"trialTemplate":{"primaryContainerName":"xgboost","trialParameters":[{"description":"Learning rate for the training model","name":"learningRate","reference":"lr"},{"description":"Number of leaves for one tree","name":"numberLeaves","reference":"num-leaves"}],"trialSpec":{"apiVersion":"xgboostjob.kubeflow.org/v1","kind":"XGBoostJob","spec":{"xgbReplicaSpecs":{"Master":{"replicas":1,"restartPolicy":"Never","template":{"spec":{"containers":[{"args":["--job_type=Train","--metric=binary_logloss,auc","--learning_rate=${trialParameters.learningRate}","--num_leaves=${trialParameters.numberLeaves}","--num_trees=100","--boosting_type=gbdt","--objective=binary","--metric_freq=1","--is_training_metric=true","--max_bin=255","--data=data/binary.train","--valid_data=data/binary.test","--tree_learner=feature","--feature_fraction=0.8","--bagging_freq=5","--bagging_fraction=0.8","--min_data_in_leaf=50","--min_sum_hessian_in_leaf=50","--is_enable_sparse=true","--use_two_round_loading=false","--is_save_binary_file=false"],"image":"docker.io/kubeflowkatib/xgboost-lightgbm:1.0","imagePullPolicy":"Always","name":"xgboost","ports":[{"containerPort":9991,"name":"xgboostjob-port"}]}]}}},"Worker":{"replicas":2,"restartPolicy":"ExitCode","template":{"spec":{"containers":[{"args":["--job_
type=Train","--metric=binary_logloss,auc","--learning_rate=${trialParameters.learningRate}","--num_leaves=${trialParameters.numberLeaves}","--num_trees=100","--boosting_type=gbdt","--objective=binary","--metric_freq=1","--is_training_metric=true","--max_bin=255","--data=data/binary.train","--valid_data=data/binary.test","--tree_learner=feature","--feature_fraction=0.8","--bagging_freq=5","--bagging_fraction=0.8","--min_data_in_leaf=50","--min_sum_hessian_in_leaf=50","--is_enable_sparse=true","--use_two_round_loading=false","--is_save_binary_file=false"],"image":"docker.io/kubeflowkatib/xgboost-lightgbm:1.0","imagePullPolicy":"Always","name":"xgboost","ports":[{"containerPort":9991,"name":"xgboostjob-port"}]}]}}}}}}}}}
creationTimestamp: "2021-11-19T06:32:26Z"
finalizers:
- update-prometheus-metrics
generation: 1
name: xgboostjob-lightgbm
namespace: katib
resourceVersion: "132795443"
selfLink: /apis/kubeflow.org/v1beta1/namespaces/katib/experiments/xgboostjob-lightgbm
uid: 10ed205c-14e0-439e-949f-d0649f433ba7
spec:
algorithm:
algorithmName: random
maxFailedTrialCount: 3
maxTrialCount: 6
metricsCollectorSpec:
source:
filter:
metricsFormat:
- (\w+\s\w+)\s:\s((-?\d+)(\.\d+)?)
objective:
additionalMetricNames:
- valid_1 binary_logloss
- training auc
- training binary_logloss
goal: 0.99
objectiveMetricName: valid_1 auc
type: maximize
parallelTrialCount: 7
parameters:
- feasibleSpace:
max: "0.1"
min: "0.01"
name: lr
parameterType: double
- feasibleSpace:
max: "60"
min: "50"
step: "1"
name: num-leaves
parameterType: int
trialTemplate:
primaryContainerName: xgboost
trialParameters:
- description: Learning rate for the training model
name: learningRate
reference: lr
- description: Number of leaves for one tree
name: numberLeaves
reference: num-leaves
trialSpec:
apiVersion: xgboostjob.kubeflow.org/v1
kind: XGBoostJob
spec:
xgbReplicaSpecs:
Master:
replicas: 1
restartPolicy: Never
template:
spec:
containers:
- args:
- --job_type=Train
- --metric=binary_logloss,auc
- --learning_rate=${trialParameters.learningRate}
- --num_leaves=${trialParameters.numberLeaves}
- --num_trees=100
- --boosting_type=gbdt
- --objective=binary
- --metric_freq=1
- --is_training_metric=true
- --max_bin=255
- --data=data/binary.train
- --valid_data=data/binary.test
- --tree_learner=feature
- --feature_fraction=0.8
- --bagging_freq=5
- --bagging_fraction=0.8
- --min_data_in_leaf=50
- --min_sum_hessian_in_leaf=50
- --is_enable_sparse=true
- --use_two_round_loading=false
- --is_save_binary_file=false
image: docker.io/kubeflowkatib/xgboost-lightgbm:1.0
imagePullPolicy: Always
name: xgboost
ports:
- containerPort: 9991
name: xgboostjob-port
Worker:
replicas: 2
restartPolicy: ExitCode
template:
spec:
containers:
- args:
- --job_type=Train
- --metric=binary_logloss,auc
- --learning_rate=${trialParameters.learningRate}
- --num_leaves=${trialParameters.numberLeaves}
- --num_trees=100
- --boosting_type=gbdt
- --objective=binary
- --metric_freq=1
- --is_training_metric=true
- --max_bin=255
- --data=data/binary.train
- --valid_data=data/binary.test
- --tree_learner=feature
- --feature_fraction=0.8
- --bagging_freq=5
- --bagging_fraction=0.8
- --min_data_in_leaf=50
- --min_sum_hessian_in_leaf=50
- --is_enable_sparse=true
- --use_two_round_loading=false
- --is_save_binary_file=false
image: docker.io/kubeflowkatib/xgboost-lightgbm:1.0
imagePullPolicy: Always
name: xgboost
ports:
- containerPort: 9991
name: xgboostjob-port
status:
conditions:
- lastTransitionTime: "2021-11-19T06:32:26Z"
lastUpdateTime: "2021-11-19T06:32:26Z"
message: Experiment is created
reason: ExperimentCreated
status: "True"
type: Created
- lastTransitionTime: "2021-11-19T06:32:47Z"
lastUpdateTime: "2021-11-19T06:32:47Z"
message: Experiment is running
reason: ExperimentRunning
status: "True"
type: Running
currentOptimalTrial:
observation: {}
runningTrialList:
- xgboostjob-lightgbm-fxsclkkt
- xgboostjob-lightgbm-lv8w2tsn
- xgboostjob-lightgbm-wtjwxpzj
- xgboostjob-lightgbm-kq8wblpm
- xgboostjob-lightgbm-htls9mrz
- xgboostjob-lightgbm-5vgdtcs8
startTime: "2021-11-19T06:32:26Z"
trials: 6
trialsRunning: 6
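The metricsFormat pattern in the experiment spec above decides which log lines count as metrics. The Python sketch below shows what it would capture, assuming the collector takes the first capture group as the metric name and the second as the value. The sample lines are invented for illustration, and Python's `re` module stands in for whatever regex engine the collector actually uses.

```python
import re

# Pattern copied from metricsCollectorSpec.source.filter.metricsFormat above.
METRICS_FORMAT = re.compile(r"(\w+\s\w+)\s:\s((-?\d+)(\.\d+)?)")

# Hypothetical log lines, invented for illustration only.
sample_lines = [
    "valid_1 auc : 0.845",
    "training binary_logloss : 0.6123",
    "auc: 0.9",  # one-word name and no space before the colon: no match
]

parsed = {}
for line in sample_lines:
    match = METRICS_FORMAT.search(line)
    if match:
        # Group 1 is the two-word metric name, group 2 the numeric value.
        parsed[match.group(1)] = float(match.group(2))

print(parsed)
```

The pattern requires a two-word name followed by ` : `, which is why names such as `valid_1 auc` in objectiveMetricName contain a space. None of this applies until a training pod exists to produce log output.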
[root@ip-172-31-38-13 wallace]# kubectl logs pod/katib-controller-6d6fdd9c84-4wnkx
...
{"level":"info","ts":1637303567.322073,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib/xgboostjob-lightgbm","requiredActiveCount":6,"parallelCount":7,"activeCount":6,"completedCount":0}
{"level":"info","ts":1637303567.3305497,"logger":"experiment-controller","msg":"Update experiment instance status failed, reconciler requeued","Experiment":"katib/xgboostjob-lightgbm","err":"Operation cannot be fulfilled on experiments.kubeflow.org \"xgboostjob-lightgbm\": the object has been modified; please apply your changes to the latest version and try again"}
{"level":"info","ts":1637303567.330741,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib/xgboostjob-lightgbm","requiredActiveCount":6,"parallelCount":7,"activeCount":6,"completedCount":0}
{"level":"info","ts":1637303567.3357322,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib/xgboostjob-lightgbm","requiredActiveCount":6,"parallelCount":7,"activeCount":6,"completedCount":0}
{"level":"info","ts":1637303567.3385773,"logger":"trial-controller","msg":"Creating Job","Trial":"katib/xgboostjob-lightgbm-lv8w2tsn","kind":"XGBoostJob","name":"xgboostjob-lightgbm-lv8w2tsn"}
{"level":"info","ts":1637303567.3468902,"logger":"trial-controller","msg":"Trial status changed to Running","Trial":"katib/xgboostjob-lightgbm-lv8w2tsn"}
{"level":"info","ts":1637303567.3594291,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib/xgboostjob-lightgbm","requiredActiveCount":6,"parallelCount":7,"activeCount":6,"completedCount":0}
{"level":"info","ts":1637303567.373519,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib/xgboostjob-lightgbm","requiredActiveCount":6,"parallelCount":7,"activeCount":6,"completedCount":0}
{"level":"info","ts":1637303567.3863587,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib/xgboostjob-lightgbm","requiredActiveCount":6,"parallelCount":7,"activeCount":6,"completedCount":0}
{"level":"info","ts":1637303567.3966305,"logger":"experiment-controller","msg":"Update experiment instance status failed, reconciler requeued","Experiment":"katib/xgboostjob-lightgbm","err":"Operation cannot be fulfilled on experiments.kubeflow.org \"xgboostjob-lightgbm\": the object has been modified; please apply your changes to the latest version and try again"}
{"level":"info","ts":1637303567.396777,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib/xgboostjob-lightgbm","requiredActiveCount":6,"parallelCount":7,"activeCount":6,"completedCount":0}
{"level":"info","ts":1637303567.3985944,"logger":"trial-controller","msg":"Creating Job","Trial":"katib/xgboostjob-lightgbm-wtjwxpzj","kind":"XGBoostJob","name":"xgboostjob-lightgbm-wtjwxpzj"}
{"level":"info","ts":1637303567.4087048,"logger":"trial-controller","msg":"Trial status changed to Running","Trial":"katib/xgboostjob-lightgbm-wtjwxpzj"}
{"level":"info","ts":1637303567.4103403,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib/xgboostjob-lightgbm","requiredActiveCount":6,"parallelCount":7,"activeCount":6,"completedCount":0}
{"level":"info","ts":1637303567.419407,"logger":"experiment-controller","msg":"Update experiment instance status failed, reconciler requeued","Experiment":"katib/xgboostjob-lightgbm","err":"Operation cannot be fulfilled on experiments.kubeflow.org \"xgboostjob-lightgbm\": the object has been modified; please apply your changes to the latest version and try again"}
{"level":"info","ts":1637303567.4195657,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib/xgboostjob-lightgbm","requiredActiveCount":6,"parallelCount":7,"activeCount":6,"completedCount":0}
{"level":"info","ts":1637303567.4320982,"logger":"trial-controller","msg":"Creating Job","Trial":"katib/xgboostjob-lightgbm-kq8wblpm","kind":"XGBoostJob","name":"xgboostjob-lightgbm-kq8wblpm"}
{"level":"info","ts":1637303567.4334705,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib/xgboostjob-lightgbm","requiredActiveCount":6,"parallelCount":7,"activeCount":6,"completedCount":0}
{"level":"info","ts":1637303567.4415863,"logger":"trial-controller","msg":"Trial status changed to Running","Trial":"katib/xgboostjob-lightgbm-kq8wblpm"}
{"level":"info","ts":1637303567.443757,"logger":"experiment-controller","msg":"Update experiment instance status failed, reconciler requeued","Experiment":"katib/xgboostjob-lightgbm","err":"Operation cannot be fulfilled on experiments.kubeflow.org \"xgboostjob-lightgbm\": the object has been modified; please apply your changes to the latest version and try again"}
{"level":"info","ts":1637303567.443909,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib/xgboostjob-lightgbm","requiredActiveCount":6,"parallelCount":7,"activeCount":6,"completedCount":0}
{"level":"info","ts":1637303567.4563413,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib/xgboostjob-lightgbm","requiredActiveCount":6,"parallelCount":7,"activeCount":6,"completedCount":0}
{"level":"info","ts":1637303567.4618397,"logger":"trial-controller","msg":"Creating Job","Trial":"katib/xgboostjob-lightgbm-htls9mrz","kind":"XGBoostJob","name":"xgboostjob-lightgbm-htls9mrz"}
{"level":"info","ts":1637303567.4651172,"logger":"experiment-controller","msg":"Update experiment instance status failed, reconciler requeued","Experiment":"katib/xgboostjob-lightgbm","err":"Operation cannot be fulfilled on experiments.kubeflow.org \"xgboostjob-lightgbm\": the object has been modified; please apply your changes to the latest version and try again"}
{"level":"info","ts":1637303567.4652743,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib/xgboostjob-lightgbm","requiredActiveCount":6,"parallelCount":7,"activeCount":6,"completedCount":0}
{"level":"info","ts":1637303567.4728572,"logger":"trial-controller","msg":"Trial status changed to Running","Trial":"katib/xgboostjob-lightgbm-htls9mrz"}
{"level":"info","ts":1637303567.4796686,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib/xgboostjob-lightgbm","requiredActiveCount":6,"parallelCount":7,"activeCount":6,"completedCount":0}
{"level":"info","ts":1637303567.4899561,"logger":"trial-controller","msg":"Creating Job","Trial":"katib/xgboostjob-lightgbm-5vgdtcs8","kind":"XGBoostJob","name":"xgboostjob-lightgbm-5vgdtcs8"}
{"level":"info","ts":1637303567.4920142,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib/xgboostjob-lightgbm","requiredActiveCount":6,"parallelCount":7,"activeCount":6,"completedCount":0}
{"level":"info","ts":1637303567.4994686,"logger":"trial-controller","msg":"Trial status changed to Running","Trial":"katib/xgboostjob-lightgbm-5vgdtcs8"}
{"level":"info","ts":1637303567.5010126,"logger":"experiment-controller","msg":"Update experiment instance status failed, reconciler requeued","Experiment":"katib/xgboostjob-lightgbm","err":"Operation cannot be fulfilled on experiments.kubeflow.org \"xgboostjob-lightgbm\": the object has been modified; please apply your changes to the latest version and try again"}
{"level":"info","ts":1637303567.5011945,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib/xgboostjob-lightgbm","requiredActiveCount":6,"parallelCount":7,"activeCount":6,"completedCount":0}
{"level":"info","ts":1637303567.5156927,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib/xgboostjob-lightgbm","requiredActiveCount":6,"parallelCount":7,"activeCount":6,"completedCount":0}
{"level":"info","ts":1637303567.5287273,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib/xgboostjob-lightgbm","requiredActiveCount":6,"parallelCount":7,"activeCount":6,"completedCount":0}
{"level":"info","ts":1637303567.541383,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib/xgboostjob-lightgbm","requiredActiveCount":6,"parallelCount":7,"activeCount":6,"completedCount":0}
{"level":"info","ts":1637303567.553308,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib/xgboostjob-lightgbm","requiredActiveCount":6,"parallelCount":7,"activeCount":6,"completedCount":0}
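The controller log above is easier to read once summarized. The Python sketch below parses a few of the JSON lines (copied verbatim from the log) and counts messages, created jobs, and update conflicts. The variable names are illustrative. Running it over the full log would only require reading the lines from a file instead.

```python
import json
from collections import Counter

# Three lines copied from the katib-controller log above.
log_lines = [
    r'{"level":"info","ts":1637303567.3385773,"logger":"trial-controller","msg":"Creating Job","Trial":"katib/xgboostjob-lightgbm-lv8w2tsn","kind":"XGBoostJob","name":"xgboostjob-lightgbm-lv8w2tsn"}',
    r'{"level":"info","ts":1637303567.3468902,"logger":"trial-controller","msg":"Trial status changed to Running","Trial":"katib/xgboostjob-lightgbm-lv8w2tsn"}',
    r'{"level":"info","ts":1637303567.3966305,"logger":"experiment-controller","msg":"Update experiment instance status failed, reconciler requeued","Experiment":"katib/xgboostjob-lightgbm","err":"Operation cannot be fulfilled on experiments.kubeflow.org \"xgboostjob-lightgbm\": the object has been modified; please apply your changes to the latest version and try again"}',
]

messages = Counter()
jobs_created = []
conflicts = 0
for raw in log_lines:
    entry = json.loads(raw)
    messages[(entry["logger"], entry["msg"])] += 1
    if entry["msg"] == "Creating Job":
        jobs_created.append((entry["kind"], entry["name"]))
    # Optimistic-concurrency conflicts carry this phrase in the "err" field.
    if "the object has been modified" in entry.get("err", ""):
        conflicts += 1

for (logger, msg), count in messages.most_common():
    print(f"{count:3d}  {logger}: {msg}")
print("jobs created:", jobs_created)
print("update conflicts:", conflicts)
```

In the log shown, every entry is at info level. The "object has been modified" entries are update conflicts that the reconciler requeues, and the trial controller reports "Creating Job" for each XGBoostJob. That suggests, without proving it, that the Katib side submitted the jobs and that the XGBoost operator's own logs are worth checking next.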
[root@ip-172-31-38-13 wallace]# kubectl get pod/xgboostjob-lightgbm-random-67c4f69b9c-69vwz -o yaml
apiVersion: v1
kind: Pod
metadata:
annotations:
kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"kubeflow.org/v1beta1","kind":"Experiment","metadata":{"annotations":{},"name":"xgboostjob-lightgbm","namespace":"katib"},"spec":{"algorithm":{"algorithmName":"random"},"maxFailedTrialCount":3,"maxTrialCount":6,"metricsCollectorSpec":{"source":{"filter":{"metricsFormat":["(\\w+\\s\\w+)\\s:\\s((-?\\d+)(\\.\\d+)?)"]}}},"objective":{"additionalMetricNames":["valid_1 binary_logloss","training auc","training binary_logloss"],"goal":0.99,"objectiveMetricName":"valid_1 auc","type":"maximize"},"parallelTrialCount":7,"parameters":[{"feasibleSpace":{"max":"0.1","min":"0.01"},"name":"lr","parameterType":"double"},{"feasibleSpace":{"max":"60","min":"50","step":"1"},"name":"num-leaves","parameterType":"int"}],"trialTemplate":{"primaryContainerName":"xgboost","trialParameters":[{"description":"Learning rate for the training model","name":"learningRate","reference":"lr"},{"description":"Number of leaves for one tree","name":"numberLeaves","reference":"num-leaves"}],"trialSpec":{"apiVersion":"xgboostjob.kubeflow.org/v1","kind":"XGBoostJob","spec":{"xgbReplicaSpecs":{"Master":{"replicas":1,"restartPolicy":"Never","template":{"spec":{"containers":[{"args":["--job_type=Train","--metric=binary_logloss,auc","--learning_rate=${trialParameters.learningRate}","--num_leaves=${trialParameters.numberLeaves}","--num_trees=100","--boosting_type=gbdt","--objective=binary","--metric_freq=1","--is_training_metric=true","--max_bin=255","--data=data/binary.train","--valid_data=data/binary.test","--tree_learner=feature","--feature_fraction=0.8","--bagging_freq=5","--bagging_fraction=0.8","--min_data_in_leaf=50","--min_sum_hessian_in_leaf=50","--is_enable_sparse=true","--use_two_round_loading=false","--is_save_binary_file=false"],"image":"docker.io/kubeflowkatib/xgboost-lightgbm:1.0","imagePullPolicy":"Always","name":"xgboost","ports":[{"containerPort":9991,"name":"xgboostjob-port"}]}]}}},"Worker":{"replicas":2,"restartPolicy":"ExitCode","template":{"spec":{"containers":[{"args":["--job_
type=Train","--metric=binary_logloss,auc","--learning_rate=${trialParameters.learningRate}","--num_leaves=${trialParameters.numberLeaves}","--num_trees=100","--boosting_type=gbdt","--objective=binary","--metric_freq=1","--is_training_metric=true","--max_bin=255","--data=data/binary.train","--valid_data=data/binary.test","--tree_learner=feature","--feature_fraction=0.8","--bagging_freq=5","--bagging_fraction=0.8","--min_data_in_leaf=50","--min_sum_hessian_in_leaf=50","--is_enable_sparse=true","--use_two_round_loading=false","--is_save_binary_file=false"],"image":"docker.io/kubeflowkatib/xgboost-lightgbm:1.0","imagePullPolicy":"Always","name":"xgboost","ports":[{"containerPort":9991,"name":"xgboostjob-port"}]}]}}}}}}}}}
kubernetes.io/psp: eks.privileged
sidecar.istio.io/inject: "false"
creationTimestamp: "2021-11-19T06:32:26Z"
generateName: xgboostjob-lightgbm-random-67c4f69b9c-
labels:
katib.kubeflow.org/deployment: xgboostjob-lightgbm-random
katib.kubeflow.org/experiment: xgboostjob-lightgbm
katib.kubeflow.org/suggestion: xgboostjob-lightgbm
pod-template-hash: 67c4f69b9c
name: xgboostjob-lightgbm-random-67c4f69b9c-69vwz
namespace: katib
ownerReferences:
- apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: ReplicaSet
name: xgboostjob-lightgbm-random-67c4f69b9c
uid: 2691ab09-f3dc-4d64-843c-33f93bf878b2
resourceVersion: "132795355"
selfLink: /api/v1/namespaces/katib/pods/xgboostjob-lightgbm-random-67c4f69b9c-69vwz
uid: d73e70b8-6dde-42b5-82e7-68532947afcb
spec:
containers:
- image: docker.io/kubeflowkatib/suggestion-hyperopt:latest
imagePullPolicy: IfNotPresent
livenessProbe:
exec:
command:
- /bin/grpc_health_probe
- -addr=:6789
- -service=manager.v1beta1.Suggestion
failureThreshold: 12
initialDelaySeconds: 10
periodSeconds: 120
successThreshold: 1
timeoutSeconds: 1
name: suggestion
ports:
- containerPort: 6789
name: suggestion-api
protocol: TCP
readinessProbe:
exec:
command:
- /bin/grpc_health_probe
- -addr=:6789
- -service=manager.v1beta1.Suggestion
failureThreshold: 3
initialDelaySeconds: 10
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
resources:
limits:
cpu: 500m
ephemeral-storage: 5Gi
memory: 100Mi
requests:
cpu: 50m
ephemeral-storage: 500Mi
memory: 10Mi
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: default-token-v4f7l
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
nodeName: ip-172-31-22-247.cn-northwest-1.compute.internal
preemptionPolicy: PreemptLowerPriority
priority: 0
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: default
serviceAccountName: default
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
volumes:
- name: default-token-v4f7l
secret:
defaultMode: 420
secretName: default-token-v4f7l
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2021-11-19T06:32:26Z"
status: "True"
type: Initialized
- lastProbeTime: null
lastTransitionTime: "2021-11-19T06:32:46Z"
status: "True"
type: Ready
- lastProbeTime: null
lastTransitionTime: "2021-11-19T06:32:46Z"
status: "True"
type: ContainersReady
- lastProbeTime: null
lastTransitionTime: "2021-11-19T06:32:26Z"
status: "True"
type: PodScheduled
containerStatuses:
- containerID: docker://76dc050863f5c57f1c0958d312776be72ff98eb041e3ecbc790ff41b0955a954
image: kubeflowkatib/suggestion-hyperopt:latest
imageID: docker-pullable://kubeflowkatib/suggestion-hyperopt@sha256:0590de0df777c29814181c8c6bd9f014b20479fc3ad599c367c23b0c12735e8d
lastState: {}
name: suggestion
ready: true
restartCount: 0
started: true
state:
running:
startedAt: "2021-11-19T06:32:27Z"
hostIP: 172.31.22.247
phase: Running
podIP: 172.31.22.66
podIPs:
- ip: 172.31.22.66
qosClass: Burstable
startTime: "2021-11-19T06:32:26Z"
Environment:
- Kubeflow version (kfctl version):
- Minikube version (minikube version):
- Kubernetes version: (use kubectl version):
[root@ip-172-31-38-13 wallace]# kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.2", GitCommit:"092fbfbf53427de67cac1e9fa54aaa09a28371d7", GitTreeState:"clean", BuildDate:"2021-06-16T12:59:11Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"19+", GitVersion:"v1.19.13-eks-8df270", GitCommit:"8df2700a72a2598fa3a67c05126fa158fd839620", GitTreeState:"clean", BuildDate:"2021-07-31T01:36:57Z", GoVersion:"go1.15.14", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.21) and server (1.19) exceeds the supported minor version skew of +/-1
- OS (e.g. from /etc/os-release):
[root@ip-172-31-38-13 wallace]# cat /etc/os-release
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
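The kubectl warning in the environment details above can be reproduced with a small Python sketch that computes the minor-version skew from the two GitVersion strings. The `minor` helper is hypothetical. Nothing in the report indicates that this skew is related to the missing pods.

```python
import re

# GitVersion strings copied from the `kubectl version` output above.
client_git_version = "v1.21.2"
server_git_version = "v1.19.13-eks-8df270"


def minor(git_version: str) -> int:
    """Extract the minor number from a 'vMAJOR.MINOR.PATCH...' string."""
    match = re.match(r"v(\d+)\.(\d+)", git_version)
    if match is None:
        raise ValueError(f"unrecognized version string: {git_version!r}")
    return int(match.group(2))


skew = abs(minor(client_git_version) - minor(server_git_version))
print(f"minor version skew: {skew} (kubectl warns when it exceeds 1)")
```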
Top GitHub Comments
I’m glad I could help you solve your problem.
Cool!