Start command changed by webhook, causing the worker pod to fail
/kind bug
What steps did you take and what happened:
The trial pod failed when the trialSpec was set as follows:
trialSpec:
  apiVersion: batch/v1
  kind: Job
  spec:
    template:
      spec:
        containers:
          - name: training-container
            image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
            args:
              - cd /code && python3 /code/pytorch_mnist.py --epochs=2 --log-path=/log/mnist.log --lr=${trialParameters.learningRate} --momentum=${trialParameters.momentum} --data_dir=/data
            command:
              - /bin/bash
              - -c
            imagePullPolicy: Always
            resources:
              limits:
                cpu: "7"
                memory: 29Gi
                nvidia.com/gpu: "1"
              requests:
                cpu: 6125m
                memory: 29Gi
                nvidia.com/gpu: "1"
            volumeMounts:
              - mountPath: /data/
                name: kubeflow-test-pvc
                subPath: data/pytorch-mnist
              - mountPath: /code/
                name: kubeflow-test-pvc
                subPath: code/
              - mountPath: /dev/shm
                name: dshm
        restartPolicy: Never
        imagePullSecrets:
          - name: default-secret
        volumes:
          - name: kubeflow-test-pvc
            persistentVolumeClaim:
              claimName: kubeflow-test-pvc
          - emptyDir:
              medium: Memory
            name: dshm
I checked the Job's args and command in spec.containers:
- args:
  - cd /code && python3 pytorch_mnist.py --epochs=2 --log-path=/log/mnist.log
    --lr=0.027190230182843826 --momentum=0.3976265101287227 --data_dir=/data
  command:
  - /bin/bash
  - -c
But the pod's args and command in spec.containers were changed to this:
- args:
  - /bin/bash -c cd /code && python3 pytorch_mnist.py --epochs=2 --log-path=/log/mnist.log
    --lr=0.027190230182843826 --momentum=0.3976265101287227 --data_dir=/data &&
    echo completed > /log/$$$$.pid
  command:
  - sh
  - -c
The pod's error log:
python3: can't open file 'pytorch_mnist.py': [Errno 2] No such file or directory
It looks like the Katib webhook changed these values.
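A likely explanation for the missing file, based on how `sh -c` and `bash -c` parse their arguments: once `/bin/bash -c` is folded into the single string handed to `sh -c`, the inner shell receives only `cd` as its command string (with `/code` becoming `$0`), so the directory change happens in a child process and goes to `$HOME`. The outer `sh` then runs the part after `&&` from the container's original working directory, where the relative path `pytorch_mnist.py` does not exist. The sketch below reproduces this with temporary directories. The helper name `compare_working_dirs` and the directories are illustrative assumptions, and `pwd` stands in for the training script.

```python
import os
import shlex
import shutil
import subprocess
import tempfile

# Use bash as the inner shell when available; any POSIX shell behaves the same here.
INNER_SHELL = shutil.which("bash") or "sh"


def compare_working_dirs():
    """Return (intended_cwd, flattened_cwd, code_dir, start_dir) as real paths."""
    with tempfile.TemporaryDirectory() as start_dir, \
            tempfile.TemporaryDirectory() as code_dir, \
            tempfile.TemporaryDirectory() as home_dir:
        # A valid HOME lets the argument-less `cd` succeed, as it did in the pod.
        env = dict(os.environ, HOME=home_dir)
        script = f"cd {shlex.quote(code_dir)} && pwd"

        def run(argv):
            out = subprocess.run(argv, cwd=start_dir, env=env,
                                 capture_output=True, text=True, check=True)
            return os.path.realpath(out.stdout.strip())

        # Job form: command=[shell, "-c"], args=[script]. The script is ONE argument.
        intended = run([INNER_SHELL, "-c", script])
        # Pod form: command=["sh", "-c"], args=["<shell> -c <script>"] as one string.
        flattened = run(["sh", "-c", f"{INNER_SHELL} -c {script}"])
        return (intended, flattened,
                os.path.realpath(code_dir), os.path.realpath(start_dir))


intended, flattened, code_dir, start_dir = compare_working_dirs()
print("Job form ran pwd inside the code directory:", intended == code_dir)
print("Pod form ran pwd in the starting directory:", flattened == start_dir)
```

This would also explain why the original trialSpec, which calls `python3 /code/pytorch_mnist.py` with an absolute path, and the Job shown above, which uses a relative path, can behave differently after the rewrite.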
What did you expect to happen:
The webhook should not modify the args and command that the user set in trialSpec.
Anything else you would like to add:
Environment:
- Kubeflow version (kfctl version):
- Minikube version (minikube version):
- Kubernetes version: (use kubectl version): 1.17
- OS (e.g. from /etc/os-release):
Issue Analytics
- State:
- Created 2 years ago
- Comments:7 (3 by maintainers)
Top GitHub Comments
@tingweiwu We wrap all container command values into args here: https://github.com/kubeflow/katib/blob/master/pkg/webhook/v1beta1/pod/utils.go#L88-L109. If the first two values in args are equal to bin or sh, we should populate these values to command (check this: https://github.com/kubeflow/katib/blob/master/pkg/webhook/v1beta1/pod/utils.go#L157-L162).

@gaocegege Should we modify this condition to also support /bin/bash and /bin/sh?

I am not sure if we can use this command. Only if the container command succeeds should the Metrics Collector search for the completed string in the $$$$.pid file. Check this: https://github.com/kubeflow/katib/blob/master/pkg/metricscollector/v1beta1/common/pns.go#L151-L155

This issue has been automatically closed because it has not had recent activity. Please comment “/reopen” to reopen it.
file. Check this: https://github.com/kubeflow/katib/blob/master/pkg/metricscollector/v1beta1/common/pns.go#L151-L155This issue has been automatically closed because it has not had recent activity. Please comment “/reopen” to reopen it.