katib-manager Pod crashes (Kubeflow 0.6)

See original GitHub issue

What steps did you take and what happened: I’ve installed Kubeflow 0.6 on an on-premise Kubernetes cluster (1.14) with MetalLB. The katib-manager Pod crashes with the following log:

E0725 13:27:46.751356       1 interface.go:98] Ping to Katib db failed: dial tcp 10.99.115.60:3306: i/o timeout
E0725 13:27:51.751508       1 interface.go:98] Ping to Katib db failed: dial tcp 10.99.115.60:3306: i/o timeout
E0725 13:27:56.751668       1 interface.go:98] Ping to Katib db failed: dial tcp 10.99.115.60:3306: i/o timeout 
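
The dial target 10.99.115.60:3306 is the ClusterIP of the katib-db Service (confirmed by the nslookup output further down), so the manager never gets a TCP connection through the Service. A quick first check — a sketch, assuming the default kubeflow namespace and a Service named katib-db — is whether the Service has any endpoints at all:

$ kubectl -n kubeflow get svc katib-db          # ClusterIP should be 10.99.115.60
$ kubectl -n kubeflow get endpoints katib-db    # should list the db Pod IP on port 3306

An empty ENDPOINTS column would mean the Service selector doesn’t match the db Pod’s labels; a populated one shifts suspicion to kube-proxy or the CNI.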

What did you expect to happen: The katib-manager Pod should start successfully.

Anything else you would like to add: This is the manifest (JSON) of the katib-manager Pod:

{
  "kind": "Pod",
  "apiVersion": "v1",
  "metadata": {
    "name": "katib-manager-574c8c67f9-xtz4m",
    "generateName": "katib-manager-574c8c67f9-",
    "namespace": "kubeflow",
    "selfLink": "/api/v1/namespaces/kubeflow/pods/katib-manager-574c8c67f9-xtz4m",
    "uid": "ff8bd34e-aec1-11e9-b38a-f40343df0215",
    "resourceVersion": "5593656",
    "creationTimestamp": "2019-07-25T09:53:05Z",
    "labels": {
      "app": "katib",
      "component": "manager",
      "pod-template-hash": "574c8c67f9"
    },
    "ownerReferences": [
      {
        "apiVersion": "apps/v1",
        "kind": "ReplicaSet",
        "name": "katib-manager-574c8c67f9",
        "uid": "fe768ad0-aec1-11e9-b38a-f40343df0215",
        "controller": true,
        "blockOwnerDeletion": true
      }
    ]
  },
  "spec": {
    "volumes": [
      {
        "name": "default-token-p4pkj",
        "secret": {
          "secretName": "default-token-p4pkj",
          "defaultMode": 420
        }
      }
    ],
    "containers": [
      {
        "name": "katib-manager",
        "image": "gcr.io/kubeflow-images-public/katib/v1alpha2/katib-manager:v0.6.0-rc.0",
        "command": [
          "./katib-manager"
        ],
        "ports": [
          {
            "name": "api",
            "containerPort": 6789,
            "protocol": "TCP"
          }
        ],
        "env": [
          {
            "name": "MYSQL_ROOT_PASSWORD",
            "valueFrom": {
              "secretKeyRef": {
                "name": "katib-db-secrets",
                "key": "MYSQL_ROOT_PASSWORD"
              }
            }
          }
        ],
        "resources": {},
        "volumeMounts": [
          {
            "name": "default-token-p4pkj",
            "readOnly": true,
            "mountPath": "/var/run/secrets/kubernetes.io/serviceaccount"
          }
        ],
        "livenessProbe": {
          "exec": {
            "command": [
              "/bin/grpc_health_probe",
              "-addr=:6789"
            ]
          },
          "initialDelaySeconds": 10,
          "timeoutSeconds": 1,
          "periodSeconds": 10,
          "successThreshold": 1,
          "failureThreshold": 3
        },
        "readinessProbe": {
          "exec": {
            "command": [
              "/bin/grpc_health_probe",
              "-addr=:6789"
            ]
          },
          "initialDelaySeconds": 5,
          "timeoutSeconds": 1,
          "periodSeconds": 10,
          "successThreshold": 1,
          "failureThreshold": 3
        },
        "terminationMessagePath": "/dev/termination-log",
        "terminationMessagePolicy": "File",
        "imagePullPolicy": "IfNotPresent"
      }
    ],
    "restartPolicy": "Always",
    "terminationGracePeriodSeconds": 30,
    "dnsPolicy": "ClusterFirst",
    "serviceAccountName": "default",
    "serviceAccount": "default",
    "nodeName": "moonshot01",
    "securityContext": {},
    "schedulerName": "default-scheduler",
    "tolerations": [
      {
        "key": "node.kubernetes.io/not-ready",
        "operator": "Exists",
        "effect": "NoExecute",
        "tolerationSeconds": 300
      },
      {
        "key": "node.kubernetes.io/unreachable",
        "operator": "Exists",
        "effect": "NoExecute",
        "tolerationSeconds": 300
      }
    ],
    "priority": 0,
    "enableServiceLinks": true
  },
  "status": {
    "phase": "Running",
    "conditions": [
      {
        "type": "Initialized",
        "status": "True",
        "lastProbeTime": null,
        "lastTransitionTime": "2019-07-25T09:53:05Z"
      },
      {
        "type": "Ready",
        "status": "False",
        "lastProbeTime": null,
        "lastTransitionTime": "2019-07-25T09:53:05Z",
        "reason": "ContainersNotReady",
        "message": "containers with unready status: [katib-manager]"
      },
      {
        "type": "ContainersReady",
        "status": "False",
        "lastProbeTime": null,
        "lastTransitionTime": "2019-07-25T09:53:05Z",
        "reason": "ContainersNotReady",
        "message": "containers with unready status: [katib-manager]"
      },
      {
        "type": "PodScheduled",
        "status": "True",
        "lastProbeTime": null,
        "lastTransitionTime": "2019-07-25T09:53:05Z"
      }
    ],
    "hostIP": "192.168.110.111",
    "podIP": "10.244.1.127",
    "startTime": "2019-07-25T09:53:05Z",
    "containerStatuses": [
      {
        "name": "katib-manager",
        "state": {
          "waiting": {
            "reason": "CrashLoopBackOff",
            "message": "Back-off 5m0s restarting failed container=katib-manager pod=katib-manager-574c8c67f9-xtz4m_kubeflow(ff8bd34e-aec1-11e9-b38a-f40343df0215)"
          }
        },
        "lastState": {
          "terminated": {
            "exitCode": 2,
            "reason": "Error",
            "startedAt": "2019-07-25T13:33:51Z",
            "finishedAt": "2019-07-25T13:34:31Z",
            "containerID": "docker://0421271e49588add3924cb7cc39c203eb810a7ee14c8e68108fd690f207dccdd"
          }
        },
        "ready": false,
        "restartCount": 73,
        "image": "gcr.io/kubeflow-images-public/katib/v1alpha2/katib-manager:v0.6.0-rc.0",
        "imageID": "docker-pullable://gcr.io/kubeflow-images-public/katib/v1alpha2/katib-manager@sha256:8dbe595c3a241ce65d29afb87a99453461b2c82338e54135dc8dfb4cb5ac8fa6",
        "containerID": "docker://0421271e49588add3924cb7cc39c203eb810a7ee14c8e68108fd690f207dccdd"
      }
    ],
    "qosClass": "BestEffort"
  }
}
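
For reference, the status block above shows CrashLoopBackOff with 73 restarts, and lastState records exit code 2 — presumably the manager gives up on its own after the DB ping keeps failing. The usual way to pull the probe events and the log of the previous, crashed run (Pod name taken from the manifest above):

$ kubectl -n kubeflow describe pod katib-manager-574c8c67f9-xtz4m    # probe failures appear under Events
$ kubectl -n kubeflow logs --previous katib-manager-574c8c67f9-xtz4m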

And the manifest (JSON) of the katib-db Pod:

{
  "kind": "Pod",
  "apiVersion": "v1",
  "metadata": {
    "name": "katib-db-8598468fd8-5tpl2",
    "generateName": "katib-db-8598468fd8-",
    "namespace": "kubeflow",
    "selfLink": "/api/v1/namespaces/kubeflow/pods/katib-db-8598468fd8-5tpl2",
    "uid": "ff05d7e8-aec1-11e9-b38a-f40343df0215",
    "resourceVersion": "5562403",
    "creationTimestamp": "2019-07-25T09:53:04Z",
    "labels": {
      "app": "katib",
      "component": "db",
      "pod-template-hash": "8598468fd8"
    },
    "ownerReferences": [
      {
        "apiVersion": "apps/v1",
        "kind": "ReplicaSet",
        "name": "katib-db-8598468fd8",
        "uid": "fe6c21f4-aec1-11e9-b38a-f40343df0215",
        "controller": true,
        "blockOwnerDeletion": true
      }
    ]
  },
  "spec": {
    "volumes": [
      {
        "name": "katib-mysql",
        "persistentVolumeClaim": {
          "claimName": "katib-mysql"
        }
      },
      {
        "name": "default-token-p4pkj",
        "secret": {
          "secretName": "default-token-p4pkj",
          "defaultMode": 420
        }
      }
    ],
    "containers": [
      {
        "name": "katib-db",
        "image": "mysql:8.0.3",
        "args": [
          "--datadir",
          "/var/lib/mysql/datadir"
        ],
        "ports": [
          {
            "name": "dbapi",
            "containerPort": 3306,
            "protocol": "TCP"
          }
        ],
        "env": [
          {
            "name": "MYSQL_ROOT_PASSWORD",
            "valueFrom": {
              "secretKeyRef": {
                "name": "katib-db-secrets",
                "key": "MYSQL_ROOT_PASSWORD"
              }
            }
          },
          {
            "name": "MYSQL_ALLOW_EMPTY_PASSWORD",
            "value": "true"
          },
          {
            "name": "MYSQL_DATABASE",
            "value": "katib"
          }
        ],
        "resources": {},
        "volumeMounts": [
          {
            "name": "katib-mysql",
            "mountPath": "/var/lib/mysql"
          },
          {
            "name": "default-token-p4pkj",
            "readOnly": true,
            "mountPath": "/var/run/secrets/kubernetes.io/serviceaccount"
          }
        ],
        "readinessProbe": {
          "exec": {
            "command": [
              "/bin/bash",
              "-c",
              "mysql -D $$MYSQL_DATABASE -p$$MYSQL_ROOT_PASSWORD -e 'SELECT 1'"
            ]
          },
          "initialDelaySeconds": 5,
          "timeoutSeconds": 1,
          "periodSeconds": 2,
          "successThreshold": 1,
          "failureThreshold": 3
        },
        "terminationMessagePath": "/dev/termination-log",
        "terminationMessagePolicy": "File",
        "imagePullPolicy": "IfNotPresent"
      }
    ],
    "restartPolicy": "Always",
    "terminationGracePeriodSeconds": 30,
    "dnsPolicy": "ClusterFirst",
    "serviceAccountName": "default",
    "serviceAccount": "default",
    "nodeName": "moonshot01",
    "securityContext": {},
    "schedulerName": "default-scheduler",
    "tolerations": [
      {
        "key": "node.kubernetes.io/not-ready",
        "operator": "Exists",
        "effect": "NoExecute",
        "tolerationSeconds": 300
      },
      {
        "key": "node.kubernetes.io/unreachable",
        "operator": "Exists",
        "effect": "NoExecute",
        "tolerationSeconds": 300
      }
    ],
    "priority": 0,
    "enableServiceLinks": true
  },
  "status": {
    "phase": "Running",
    "conditions": [
      {
        "type": "Initialized",
        "status": "True",
        "lastProbeTime": null,
        "lastTransitionTime": "2019-07-25T09:53:04Z"
      },
      {
        "type": "Ready",
        "status": "True",
        "lastProbeTime": null,
        "lastTransitionTime": "2019-07-25T09:53:25Z"
      },
      {
        "type": "ContainersReady",
        "status": "True",
        "lastProbeTime": null,
        "lastTransitionTime": "2019-07-25T09:53:25Z"
      },
      {
        "type": "PodScheduled",
        "status": "True",
        "lastProbeTime": null,
        "lastTransitionTime": "2019-07-25T09:53:04Z"
      }
    ],
    "hostIP": "192.168.110.111",
    "podIP": "10.244.1.114",
    "startTime": "2019-07-25T09:53:04Z",
    "containerStatuses": [
      {
        "name": "katib-db",
        "state": {
          "running": {
            "startedAt": "2019-07-25T09:53:09Z"
          }
        },
        "lastState": {},
        "ready": true,
        "restartCount": 0,
        "image": "mysql:8.0.3",
        "imageID": "docker-pullable://mysql@sha256:cf12c8d3b7bcff1c4395305e5a39e7d912bf81c59938a4f1c65c91ced66f985b",
        "containerID": "docker://58807248eb9aaaf63bcdfe06eefcf8c5e4c2acfbe02021c1d8e55180a288936f"
      }
    ],
    "qosClass": "BestEffort"
  }
}
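
Unlike the manager, the db Pod reports Ready — its readiness probe already runs SELECT 1 against MySQL every two seconds. To double-check by hand that MySQL answers inside the Pod (a sketch using the Pod name and environment variables from the manifest above):

$ kubectl -n kubeflow exec -it katib-db-8598468fd8-5tpl2 -- \
    bash -c 'mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -e "SELECT 1"'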

This is the log from the katib-db Pod:

2019-07-25T09:53:23.831752Z 0 [Note] --secure-file-priv is set to NULL. Operations related to importing and exporting data are disabled
2019-07-25T09:53:23.831786Z 0 [Note] /usr/sbin/mysqld (mysqld 8.0.3-rc-log) starting as process 1 ...
2019-07-25T09:53:23.833987Z 0 [Warning] No argument was provided to --log-bin, and --log-bin-index was not used; so replication may break when this MySQL server acts as a master and has his hostname changed!! Please use '--log-bin=katib-db-8598468fd8-5tpl2-bin' to avoid this problem.
2019-07-25T09:53:23.836253Z 0 [Note] InnoDB: Using Linux native AIO
2019-07-25T09:53:23.836374Z 0 [Note] Plugin 'FEDERATED' is disabled.
2019-07-25T09:53:23.840007Z 1 [Note] InnoDB: PUNCH HOLE support available
2019-07-25T09:53:23.840031Z 1 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2019-07-25T09:53:23.840036Z 1 [Note] InnoDB: Uses event mutexes
2019-07-25T09:53:23.840040Z 1 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
2019-07-25T09:53:23.840046Z 1 [Note] InnoDB: Compressed tables use zlib 1.2.11
2019-07-25T09:53:23.840409Z 1 [Note] InnoDB: Number of pools: 1
2019-07-25T09:53:23.840522Z 1 [Note] InnoDB: Using CPU crc32 instructions
2019-07-25T09:53:23.842398Z 1 [Note] InnoDB: Initializing buffer pool, total size = 128M, instances = 1, chunk size = 128M
2019-07-25T09:53:23.855511Z 1 [Note] InnoDB: Completed initialization of buffer pool
2019-07-25T09:53:23.857171Z 0 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
2019-07-25T09:53:23.871940Z 1 [Note] InnoDB: Using 'tablespaces.open.2' max LSN: 15050811
2019-07-25T09:53:23.874918Z 1 [Note] InnoDB: Applying a batch of 0 redo log records ...
2019-07-25T09:53:23.874936Z 1 [Note] InnoDB: Apply batch completed!
2019-07-25T09:53:23.875842Z 1 [Note] InnoDB: Opened 2 existing undo tablespaces.
2019-07-25T09:53:23.886778Z 1 [Note] InnoDB: Creating shared tablespace for temporary tables
2019-07-25T09:53:23.886857Z 1 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2019-07-25T09:53:23.919128Z 1 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
2019-07-25T09:53:23.923261Z 1 [Note] InnoDB: Created 128 and tracked 128 new rollback segment(s) in the temporary tablespace. 128 are now active.
2019-07-25T09:53:23.923455Z 1 [Note] InnoDB: 8.0.3 started; log sequence number 26008934
2019-07-25T09:53:24.012041Z 0 [Note] InnoDB: Loading buffer pool(s) from /var/lib/mysql/datadir/ib_buffer_pool
2019-07-25T09:53:24.023296Z 0 [Note] InnoDB: Buffer pool(s) load completed at 190725  9:53:24
2019-07-25T09:53:24.042511Z 1 [Note] Found data dictionary with version 1
2019-07-25T09:53:24.080051Z 0 [Note] InnoDB: DDL log recovery : begin
2019-07-25T09:53:24.080107Z 0 [Note] InnoDB: DDL log recovery : end
2019-07-25T09:53:24.080251Z 0 [Note] InnoDB: Waiting for purge to start
2019-07-25T09:53:24.140632Z 0 [Warning] You have not provided a mandatory server-id. Servers in a replication topology must have unique server-ids. Please refer to the proper server start-up parameters documentation.
2019-07-25T09:53:24.164594Z 0 [Note] Found ca.pem, server-cert.pem and server-key.pem in data directory. Trying to enable SSL support using them.
2019-07-25T09:53:24.164841Z 0 [Warning] CA certificate ca.pem is self signed.
2019-07-25T09:53:24.167219Z 0 [Note] Server hostname (bind-address): '*'; port: 3306
2019-07-25T09:53:24.170898Z 0 [Note] IPv6 is available.
2019-07-25T09:53:24.170925Z 0 [Note]   - '::' resolves to '::';
2019-07-25T09:53:24.170983Z 0 [Note] Server socket created on IP: '::'.
2019-07-25T09:53:24.186265Z 0 [Warning] 'user' entry 'mysql.session@localhost' ignored in --skip-name-resolve mode.
2019-07-25T09:53:24.186297Z 0 [Warning] 'user' entry 'mysql.sys@localhost' ignored in --skip-name-resolve mode.
2019-07-25T09:53:24.186310Z 0 [Warning] 'user' entry 'root@localhost' ignored in --skip-name-resolve mode.
2019-07-25T09:53:24.186560Z 0 [Warning] 'db' entry 'performance_schema mysql.session@localhost' ignored in --skip-name-resolve mode.
2019-07-25T09:53:24.186587Z 0 [Warning] 'db' entry 'sys mysql.sys@localhost' ignored in --skip-name-resolve mode.
2019-07-25T09:53:24.186604Z 0 [Warning] 'proxies_priv' entry '@ root@localhost' ignored in --skip-name-resolve mode.
2019-07-25T09:53:24.192401Z 0 [Warning] 'tables_priv' entry 'user mysql.session@localhost' ignored in --skip-name-resolve mode.
2019-07-25T09:53:24.192426Z 0 [Warning] 'tables_priv' entry 'sys_config mysql.sys@localhost' ignored in --skip-name-resolve mode.
2019-07-25T09:53:24.199342Z 4 [Note] Event Scheduler: scheduler thread started with id 4
2019-07-25T09:53:24.199935Z 0 [Note] /usr/sbin/mysqld: ready for connections. Version: '8.0.3-rc-log'  socket: '/var/run/mysqld/mysqld.sock'  port: 3306  MySQL Community Server (GPL)  
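
MySQL reports “ready for connections” on port 3306 and binds to '::' (all interfaces), so the server side looks healthy; the failing hop is most likely between the Service’s ClusterIP and the Pod. A TCP-level check from a throwaway client Pod — a sketch reusing the IPs shown above, with curl’s telnet mode used purely as a connect test (assuming the image’s curl was built with telnet support):

$ kubectl -n kubeflow run tcp-test --rm -i --restart=Never \
    --image=radial/busyboxplus:curl -- \
    sh -c 'curl -v -m 3 telnet://10.244.1.114:3306; curl -v -m 3 telnet://10.99.115.60:3306'

A “Connected to …” line in the verbose output means the handshake succeeded for that address; if the Pod IP connects but the ClusterIP doesn’t, the problem is Service routing rather than MySQL.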

Environment:

  • Kubeflow version 0.6
  • Kubernetes version 1.14 on-premise (with MetalLB)
  • OS: CentOS 7.6

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 14 (5 by maintainers)

Top GitHub Comments

5 reactions
wdhorton commented, Oct 8, 2019

I’m facing a similar issue with my AWS EKS installation: katib-manager times out even though katib-db is running.

1 reaction
kimchitsigai commented, Jul 26, 2019

Apologies. Output of nslookup from the pod:

$ kubectl run curl --image=radial/busyboxplus:curl -n kubeflow -i --tty
kubectl run --generator=deployment/apps.v1 is DEPRECATED and will be removed in a future version. Use kubectl run --generator=run-pod/v1 or kubectl create instead.
Error from server (AlreadyExists): deployments.apps "curl" already exists
[rim@moonshot-master ~]$ kubectl run curl --image=radial/busyboxplus:curl -n kubeflow -i --tty
kubectl run --generator=deployment/apps.v1 is DEPRECATED and will be removed in a future version. Use kubectl run --generator=run-pod/v1 or kubectl create instead.
If you don't see a command prompt, try pressing enter.
[ root@curl-66bdcf564-4zfvj:/ ]$ nslookup katib-db
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:      katib-db
Address 1: 10.99.115.60 katib-db.kubeflow.svc.cluster.local
[ root@curl-66bdcf564-4zfvj:/ ]$
[ root@curl-66bdcf564-4zfvj:/ ]$ nslookup katib-db.kubeflow.svc.cluster.local
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:      katib-db.kubeflow.svc.cluster.local
Address 1: 10.99.115.60 katib-db.kubeflow.svc.cluster.local
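
So DNS is fine: katib-db resolves to the Service ClusterIP 10.99.115.60 that the manager is timing out against, which on an on-premise cluster points at Service routing (kube-proxy or the CNI) rather than name resolution. Two follow-up checks — a sketch assuming a kubeadm-style cluster with kube-proxy in iptables mode:

$ kubectl -n kube-system get pods -l k8s-app=kube-proxy    # is kube-proxy healthy on every node?

And on the node itself (moonshot01), kube-proxy tags its rules with the Service name:

$ sudo iptables-save | grep 'kubeflow/katib-db'            # KUBE-SVC/KUBE-SEP entries for 10.99.115.60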