Start command changed by webhook, causing the worker pod to fail

/kind bug

What steps did you take and what happened:

The trial pod fails when the trialSpec is set as follows:

    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
              - name: training-container
                image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
                args:
                  - cd /code && python3 /code/pytorch_mnist.py --epochs=2 --log-path=/log/mnist.log --lr=${trialParameters.learningRate} --momentum=${trialParameters.momentum} --data_dir=/data
                command:
                  - /bin/bash
                  - -c
                imagePullPolicy: Always
                resources:
                  limits:
                    cpu: "7"
                    memory: 29Gi
                    nvidia.com/gpu: "1"
                  requests:
                    cpu: 6125m
                    memory: 29Gi
                    nvidia.com/gpu: "1"
                volumeMounts:
                - mountPath: /data/
                  name: kubeflow-test-pvc
                  subPath: data/pytorch-mnist
                - mountPath: /code/
                  name: kubeflow-test-pvc
                  subPath: code/                  
                - mountPath: /dev/shm
                  name: dshm
            restartPolicy: Never
            imagePullSecrets:
            - name: default-secret  
            volumes:
            - name: kubeflow-test-pvc
              persistentVolumeClaim:
                claimName: kubeflow-test-pvc
            - emptyDir:
                medium: Memory
              name: dshm 

I checked the Job's args and command in spec.containers:

      - args:
        - cd /code && python3 pytorch_mnist.py --epochs=2 --log-path=/log/mnist.log
          --lr=0.027190230182843826 --momentum=0.3976265101287227 --data_dir=/data
        command:
        - /bin/bash
        - -c

But the pod's args and command in spec.containers were changed to this:

  - args:
    - /bin/bash -c cd /code && python3 pytorch_mnist.py --epochs=2 --log-path=/log/mnist.log
      --lr=0.027190230182843826 --momentum=0.3976265101287227 --data_dir=/data &&
      echo completed > /log/$$$$.pid
    command:
    - sh
    - -c

The pod's error log:

python3: can't open file 'pytorch_mnist.py': [Errno 2] No such file or directory

It looks like the Katib webhook changed these values.
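A likely reason for the "No such file or directory" error can be seen by looking at how the outer `sh -c` reads the rewritten args string. The sketch below is only an illustration, not Katib code: `split_and_list` is a hypothetical helper, the script is shortened to the relevant part, and it assumes `&&` never appears inside quotes. It shows that `bash -c` receives only `cd` as its command string, that `/code` becomes a positional parameter, and that `python3` is a separate command run by the outer shell rather than after `cd /code`.

```python
import shlex
from typing import List


def split_and_list(script: str) -> List[List[str]]:
    """Tokenize each '&&'-separated segment of a simple shell script.

    Simplification: assumes '&&' never appears inside quotes.
    """
    return [shlex.split(segment) for segment in script.split("&&")]


# The single args string from the failing pod, shortened to the relevant part.
wrapped = "/bin/bash -c cd /code && python3 pytorch_mnist.py --epochs=2"

first, second = split_and_list(wrapped)

# `-c` takes only the next word as the command string;
# any following words become $0, $1, ... of the inner shell.
c_index = first.index("-c")
bash_command_string = first[c_index + 1]
bash_positional = first[c_index + 2:]

print(bash_command_string)  # cd
print(bash_positional)      # ['/code']
print(second)               # ['python3', 'pytorch_mnist.py', '--epochs=2']
```

Under this reading, the inner bash runs a bare `cd` and exits, so the outer shell's working directory never changes and `python3` looks for `pytorch_mnist.py` in the container's default directory.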

What did you expect to happen:

The webhook should not modify the user's args and command settings in trialSpec.

Environment:

  • Kubeflow version (kfctl version):
  • Minikube version (minikube version):
  • Kubernetes version: (use kubectl version): 1.17
  • OS (e.g. from /etc/os-release):

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

andreyvelich commented, Nov 19, 2021

@tingweiwu We wrap the whole container command into args here: https://github.com/kubeflow/katib/blob/master/pkg/webhook/v1beta1/pod/utils.go#L88-L109.

If the first two values in args are equal to bin or sh, we should move these values into command (see https://github.com/kubeflow/katib/blob/master/pkg/webhook/v1beta1/pod/utils.go#L157-L162).

@gaocegege Should we modify this condition to also support /bin/bash and /bin/sh?

When the metrics collector kind is StdOut, is it better to change the user's start command like

I am not sure we can use this command. The Metrics Collector should search for the completed string in the $$$$.pid file only if the container command succeeds. See https://github.com/kubeflow/katib/blob/master/pkg/metricscollector/v1beta1/common/pns.go#L151-L155
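The wrapping behaviour described in this comment can be modelled with a short Python sketch. This is not the Katib Go implementation: `wrap_container` and `known_shells` are hypothetical names, and the set of shell names the current condition recognizes (`sh` and `bash` here) is an assumption, since the comment does not list them exactly. The completion suffix is copied from the failing pod spec in the issue.

```python
from typing import List, Sequence, Set, Tuple

# Completion marker as it appears in the failing pod spec from the issue.
MARK_COMPLETED = "echo completed > /log/$$$$.pid"


def wrap_container(
    command: Sequence[str], args: Sequence[str], known_shells: Set[str]
) -> Tuple[List[str], List[str]]:
    """Model of the wrapping step: returns the new (command, args) pair."""
    words = list(command) + list(args)
    wrapper = ["sh", "-c"]  # default wrapper when no known shell prefix is found
    # Keep the user's own shell prefix only if it is a recognized `<shell> -c`.
    if len(words) >= 2 and words[0] in known_shells and words[1] == "-c":
        wrapper, words = words[:2], words[2:]
    script = " ".join(words + ["&&", MARK_COMPLETED])
    return wrapper, [script]


user_command = ["/bin/bash", "-c"]
user_args = ["cd /code && python3 pytorch_mnist.py"]

# Assumed current condition: /bin/bash is not recognized, so it is folded into args.
print(wrap_container(user_command, user_args, {"sh", "bash"}))

# Proposed condition: absolute paths are recognized and kept as the command.
print(wrap_container(user_command, user_args, {"sh", "bash", "/bin/sh", "/bin/bash"}))
```

With the narrower set, the model reproduces the `sh -c` command and the `/bin/bash -c cd /code && ...` args string seen in the failing pod. With the wider set, the user's `/bin/bash -c` prefix stays in command and only the completion marker is appended to the script.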

stale[bot] commented, Apr 16, 2022

This issue has been automatically closed because it has not had recent activity. Please comment “/reopen” to reopen it.