katib-manager Pod crashes (Kubeflow 0.6)
What steps did you take and what happened: I installed Kubeflow 0.6 on an on-premise Kubernetes cluster (1.14) with MetalLB. The katib-manager pod crashes with the following log:
E0725 13:27:46.751356 1 interface.go:98] Ping to Katib db failed: dial tcp 10.99.115.60:3306: i/o timeout
E0725 13:27:51.751508 1 interface.go:98] Ping to Katib db failed: dial tcp 10.99.115.60:3306: i/o timeout
E0725 13:27:56.751668 1 interface.go:98] Ping to Katib db failed: dial tcp 10.99.115.60:3306: i/o timeout
What did you expect to happen: katib-manager pod starts successfully.
Anything else you would like to add: This is the Pod manifest (in JSON) for katib-manager:
{
"kind": "Pod",
"apiVersion": "v1",
"metadata": {
"name": "katib-manager-574c8c67f9-xtz4m",
"generateName": "katib-manager-574c8c67f9-",
"namespace": "kubeflow",
"selfLink": "/api/v1/namespaces/kubeflow/pods/katib-manager-574c8c67f9-xtz4m",
"uid": "ff8bd34e-aec1-11e9-b38a-f40343df0215",
"resourceVersion": "5593656",
"creationTimestamp": "2019-07-25T09:53:05Z",
"labels": {
"app": "katib",
"component": "manager",
"pod-template-hash": "574c8c67f9"
},
"ownerReferences": [
{
"apiVersion": "apps/v1",
"kind": "ReplicaSet",
"name": "katib-manager-574c8c67f9",
"uid": "fe768ad0-aec1-11e9-b38a-f40343df0215",
"controller": true,
"blockOwnerDeletion": true
}
]
},
"spec": {
"volumes": [
{
"name": "default-token-p4pkj",
"secret": {
"secretName": "default-token-p4pkj",
"defaultMode": 420
}
}
],
"containers": [
{
"name": "katib-manager",
"image": "gcr.io/kubeflow-images-public/katib/v1alpha2/katib-manager:v0.6.0-rc.0",
"command": [
"./katib-manager"
],
"ports": [
{
"name": "api",
"containerPort": 6789,
"protocol": "TCP"
}
],
"env": [
{
"name": "MYSQL_ROOT_PASSWORD",
"valueFrom": {
"secretKeyRef": {
"name": "katib-db-secrets",
"key": "MYSQL_ROOT_PASSWORD"
}
}
}
],
"resources": {},
"volumeMounts": [
{
"name": "default-token-p4pkj",
"readOnly": true,
"mountPath": "/var/run/secrets/kubernetes.io/serviceaccount"
}
],
"livenessProbe": {
"exec": {
"command": [
"/bin/grpc_health_probe",
"-addr=:6789"
]
},
"initialDelaySeconds": 10,
"timeoutSeconds": 1,
"periodSeconds": 10,
"successThreshold": 1,
"failureThreshold": 3
},
"readinessProbe": {
"exec": {
"command": [
"/bin/grpc_health_probe",
"-addr=:6789"
]
},
"initialDelaySeconds": 5,
"timeoutSeconds": 1,
"periodSeconds": 10,
"successThreshold": 1,
"failureThreshold": 3
},
"terminationMessagePath": "/dev/termination-log",
"terminationMessagePolicy": "File",
"imagePullPolicy": "IfNotPresent"
}
],
"restartPolicy": "Always",
"terminationGracePeriodSeconds": 30,
"dnsPolicy": "ClusterFirst",
"serviceAccountName": "default",
"serviceAccount": "default",
"nodeName": "moonshot01",
"securityContext": {},
"schedulerName": "default-scheduler",
"tolerations": [
{
"key": "node.kubernetes.io/not-ready",
"operator": "Exists",
"effect": "NoExecute",
"tolerationSeconds": 300
},
{
"key": "node.kubernetes.io/unreachable",
"operator": "Exists",
"effect": "NoExecute",
"tolerationSeconds": 300
}
],
"priority": 0,
"enableServiceLinks": true
},
"status": {
"phase": "Running",
"conditions": [
{
"type": "Initialized",
"status": "True",
"lastProbeTime": null,
"lastTransitionTime": "2019-07-25T09:53:05Z"
},
{
"type": "Ready",
"status": "False",
"lastProbeTime": null,
"lastTransitionTime": "2019-07-25T09:53:05Z",
"reason": "ContainersNotReady",
"message": "containers with unready status: [katib-manager]"
},
{
"type": "ContainersReady",
"status": "False",
"lastProbeTime": null,
"lastTransitionTime": "2019-07-25T09:53:05Z",
"reason": "ContainersNotReady",
"message": "containers with unready status: [katib-manager]"
},
{
"type": "PodScheduled",
"status": "True",
"lastProbeTime": null,
"lastTransitionTime": "2019-07-25T09:53:05Z"
}
],
"hostIP": "192.168.110.111",
"podIP": "10.244.1.127",
"startTime": "2019-07-25T09:53:05Z",
"containerStatuses": [
{
"name": "katib-manager",
"state": {
"waiting": {
"reason": "CrashLoopBackOff",
"message": "Back-off 5m0s restarting failed container=katib-manager pod=katib-manager-574c8c67f9-xtz4m_kubeflow(ff8bd34e-aec1-11e9-b38a-f40343df0215)"
}
},
"lastState": {
"terminated": {
"exitCode": 2,
"reason": "Error",
"startedAt": "2019-07-25T13:33:51Z",
"finishedAt": "2019-07-25T13:34:31Z",
"containerID": "docker://0421271e49588add3924cb7cc39c203eb810a7ee14c8e68108fd690f207dccdd"
}
},
"ready": false,
"restartCount": 73,
"image": "gcr.io/kubeflow-images-public/katib/v1alpha2/katib-manager:v0.6.0-rc.0",
"imageID": "docker-pullable://gcr.io/kubeflow-images-public/katib/v1alpha2/katib-manager@sha256:8dbe595c3a241ce65d29afb87a99453461b2c82338e54135dc8dfb4cb5ac8fa6",
"containerID": "docker://0421271e49588add3924cb7cc39c203eb810a7ee14c8e68108fd690f207dccdd"
}
],
"qosClass": "BestEffort"
}
}
Pod manifest (in JSON) for katib-db:
{
"kind": "Pod",
"apiVersion": "v1",
"metadata": {
"name": "katib-db-8598468fd8-5tpl2",
"generateName": "katib-db-8598468fd8-",
"namespace": "kubeflow",
"selfLink": "/api/v1/namespaces/kubeflow/pods/katib-db-8598468fd8-5tpl2",
"uid": "ff05d7e8-aec1-11e9-b38a-f40343df0215",
"resourceVersion": "5562403",
"creationTimestamp": "2019-07-25T09:53:04Z",
"labels": {
"app": "katib",
"component": "db",
"pod-template-hash": "8598468fd8"
},
"ownerReferences": [
{
"apiVersion": "apps/v1",
"kind": "ReplicaSet",
"name": "katib-db-8598468fd8",
"uid": "fe6c21f4-aec1-11e9-b38a-f40343df0215",
"controller": true,
"blockOwnerDeletion": true
}
]
},
"spec": {
"volumes": [
{
"name": "katib-mysql",
"persistentVolumeClaim": {
"claimName": "katib-mysql"
}
},
{
"name": "default-token-p4pkj",
"secret": {
"secretName": "default-token-p4pkj",
"defaultMode": 420
}
}
],
"containers": [
{
"name": "katib-db",
"image": "mysql:8.0.3",
"args": [
"--datadir",
"/var/lib/mysql/datadir"
],
"ports": [
{
"name": "dbapi",
"containerPort": 3306,
"protocol": "TCP"
}
],
"env": [
{
"name": "MYSQL_ROOT_PASSWORD",
"valueFrom": {
"secretKeyRef": {
"name": "katib-db-secrets",
"key": "MYSQL_ROOT_PASSWORD"
}
}
},
{
"name": "MYSQL_ALLOW_EMPTY_PASSWORD",
"value": "true"
},
{
"name": "MYSQL_DATABASE",
"value": "katib"
}
],
"resources": {},
"volumeMounts": [
{
"name": "katib-mysql",
"mountPath": "/var/lib/mysql"
},
{
"name": "default-token-p4pkj",
"readOnly": true,
"mountPath": "/var/run/secrets/kubernetes.io/serviceaccount"
}
],
"readinessProbe": {
"exec": {
"command": [
"/bin/bash",
"-c",
"mysql -D $$MYSQL_DATABASE -p$$MYSQL_ROOT_PASSWORD -e 'SELECT 1'"
]
},
"initialDelaySeconds": 5,
"timeoutSeconds": 1,
"periodSeconds": 2,
"successThreshold": 1,
"failureThreshold": 3
},
"terminationMessagePath": "/dev/termination-log",
"terminationMessagePolicy": "File",
"imagePullPolicy": "IfNotPresent"
}
],
"restartPolicy": "Always",
"terminationGracePeriodSeconds": 30,
"dnsPolicy": "ClusterFirst",
"serviceAccountName": "default",
"serviceAccount": "default",
"nodeName": "moonshot01",
"securityContext": {},
"schedulerName": "default-scheduler",
"tolerations": [
{
"key": "node.kubernetes.io/not-ready",
"operator": "Exists",
"effect": "NoExecute",
"tolerationSeconds": 300
},
{
"key": "node.kubernetes.io/unreachable",
"operator": "Exists",
"effect": "NoExecute",
"tolerationSeconds": 300
}
],
"priority": 0,
"enableServiceLinks": true
},
"status": {
"phase": "Running",
"conditions": [
{
"type": "Initialized",
"status": "True",
"lastProbeTime": null,
"lastTransitionTime": "2019-07-25T09:53:04Z"
},
{
"type": "Ready",
"status": "True",
"lastProbeTime": null,
"lastTransitionTime": "2019-07-25T09:53:25Z"
},
{
"type": "ContainersReady",
"status": "True",
"lastProbeTime": null,
"lastTransitionTime": "2019-07-25T09:53:25Z"
},
{
"type": "PodScheduled",
"status": "True",
"lastProbeTime": null,
"lastTransitionTime": "2019-07-25T09:53:04Z"
}
],
"hostIP": "192.168.110.111",
"podIP": "10.244.1.114",
"startTime": "2019-07-25T09:53:04Z",
"containerStatuses": [
{
"name": "katib-db",
"state": {
"running": {
"startedAt": "2019-07-25T09:53:09Z"
}
},
"lastState": {},
"ready": true,
"restartCount": 0,
"image": "mysql:8.0.3",
"imageID": "docker-pullable://mysql@sha256:cf12c8d3b7bcff1c4395305e5a39e7d912bf81c59938a4f1c65c91ced66f985b",
"containerID": "docker://58807248eb9aaaf63bcdfe06eefcf8c5e4c2acfbe02021c1d8e55180a288936f"
}
],
"qosClass": "BestEffort"
}
}
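The two manifests above are long, and the fields that matter for this report sit in `status.containerStatuses`. As a rough sketch (the helper name `summarize_pod` is made up for illustration and is not part of Kubeflow or kubectl), the relevant values can be pulled out of a Pod object like this. The sample input is trimmed from the katib-manager manifest above:

```python
import json


def summarize_pod(pod: dict) -> list:
    """Return one summary dict per container from a Pod object's status."""
    rows = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        state = cs.get("state", {})
        # "state" holds a single key: running, waiting, or terminated.
        state_name = next(iter(state), "unknown")
        last_term = cs.get("lastState", {}).get("terminated", {})
        rows.append({
            "name": cs.get("name"),
            "ready": cs.get("ready"),
            "restarts": cs.get("restartCount"),
            "state": state_name,
            "reason": state.get(state_name, {}).get("reason"),
            "last_exit_code": last_term.get("exitCode"),
        })
    return rows


# Trimmed copy of the katib-manager status shown above.
sample = json.loads("""
{"status": {"containerStatuses": [{
    "name": "katib-manager",
    "state": {"waiting": {"reason": "CrashLoopBackOff"}},
    "lastState": {"terminated": {"exitCode": 2, "reason": "Error"}},
    "ready": false,
    "restartCount": 73
}]}}
""")
print(summarize_pod(sample))
```

For the manifests in this issue, that reduces to: katib-manager is waiting in CrashLoopBackOff after 73 restarts with a last exit code of 2, while katib-db is running, ready, and has never restarted.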
This is the log from the katib-db Pod:
2019-07-25T09:53:23.831752Z 0 [Note] --secure-file-priv is set to NULL. Operations related to importing and exporting data are disabled
2019-07-25T09:53:23.831786Z 0 [Note] /usr/sbin/mysqld (mysqld 8.0.3-rc-log) starting as process 1 ...
2019-07-25T09:53:23.833987Z 0 [Warning] No argument was provided to --log-bin, and --log-bin-index was not used; so replication may break when this MySQL server acts as a master and has his hostname changed!! Please use '--log-bin=katib-db-8598468fd8-5tpl2-bin' to avoid this problem.
2019-07-25T09:53:23.836253Z 0 [Note] InnoDB: Using Linux native AIO
2019-07-25T09:53:23.836374Z 0 [Note] Plugin 'FEDERATED' is disabled.
2019-07-25T09:53:23.840007Z 1 [Note] InnoDB: PUNCH HOLE support available
2019-07-25T09:53:23.840031Z 1 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2019-07-25T09:53:23.840036Z 1 [Note] InnoDB: Uses event mutexes
2019-07-25T09:53:23.840040Z 1 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
2019-07-25T09:53:23.840046Z 1 [Note] InnoDB: Compressed tables use zlib 1.2.11
2019-07-25T09:53:23.840409Z 1 [Note] InnoDB: Number of pools: 1
2019-07-25T09:53:23.840522Z 1 [Note] InnoDB: Using CPU crc32 instructions
2019-07-25T09:53:23.842398Z 1 [Note] InnoDB: Initializing buffer pool, total size = 128M, instances = 1, chunk size = 128M
2019-07-25T09:53:23.855511Z 1 [Note] InnoDB: Completed initialization of buffer pool
2019-07-25T09:53:23.857171Z 0 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
2019-07-25T09:53:23.871940Z 1 [Note] InnoDB: Using 'tablespaces.open.2' max LSN: 15050811
2019-07-25T09:53:23.874918Z 1 [Note] InnoDB: Applying a batch of 0 redo log records ...
2019-07-25T09:53:23.874936Z 1 [Note] InnoDB: Apply batch completed!
2019-07-25T09:53:23.875842Z 1 [Note] InnoDB: Opened 2 existing undo tablespaces.
2019-07-25T09:53:23.886778Z 1 [Note] InnoDB: Creating shared tablespace for temporary tables
2019-07-25T09:53:23.886857Z 1 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2019-07-25T09:53:23.919128Z 1 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
2019-07-25T09:53:23.923261Z 1 [Note] InnoDB: Created 128 and tracked 128 new rollback segment(s) in the temporary tablespace. 128 are now active.
2019-07-25T09:53:23.923455Z 1 [Note] InnoDB: 8.0.3 started; log sequence number 26008934
2019-07-25T09:53:24.012041Z 0 [Note] InnoDB: Loading buffer pool(s) from /var/lib/mysql/datadir/ib_buffer_pool
2019-07-25T09:53:24.023296Z 0 [Note] InnoDB: Buffer pool(s) load completed at 190725 9:53:24
2019-07-25T09:53:24.042511Z 1 [Note] Found data dictionary with version 1
2019-07-25T09:53:24.080051Z 0 [Note] InnoDB: DDL log recovery : begin
2019-07-25T09:53:24.080107Z 0 [Note] InnoDB: DDL log recovery : end
2019-07-25T09:53:24.080251Z 0 [Note] InnoDB: Waiting for purge to start
2019-07-25T09:53:24.140632Z 0 [Warning] You have not provided a mandatory server-id. Servers in a replication topology must have unique server-ids. Please refer to the proper server start-up parameters documentation.
2019-07-25T09:53:24.164594Z 0 [Note] Found ca.pem, server-cert.pem and server-key.pem in data directory. Trying to enable SSL support using them.
2019-07-25T09:53:24.164841Z 0 [Warning] CA certificate ca.pem is self signed.
2019-07-25T09:53:24.167219Z 0 [Note] Server hostname (bind-address): '*'; port: 3306
2019-07-25T09:53:24.170898Z 0 [Note] IPv6 is available.
2019-07-25T09:53:24.170925Z 0 [Note] - '::' resolves to '::';
2019-07-25T09:53:24.170983Z 0 [Note] Server socket created on IP: '::'.
2019-07-25T09:53:24.186265Z 0 [Warning] 'user' entry 'mysql.session@localhost' ignored in --skip-name-resolve mode.
2019-07-25T09:53:24.186297Z 0 [Warning] 'user' entry 'mysql.sys@localhost' ignored in --skip-name-resolve mode.
2019-07-25T09:53:24.186310Z 0 [Warning] 'user' entry 'root@localhost' ignored in --skip-name-resolve mode.
2019-07-25T09:53:24.186560Z 0 [Warning] 'db' entry 'performance_schema mysql.session@localhost' ignored in --skip-name-resolve mode.
2019-07-25T09:53:24.186587Z 0 [Warning] 'db' entry 'sys mysql.sys@localhost' ignored in --skip-name-resolve mode.
2019-07-25T09:53:24.186604Z 0 [Warning] 'proxies_priv' entry '@ root@localhost' ignored in --skip-name-resolve mode.
2019-07-25T09:53:24.192401Z 0 [Warning] 'tables_priv' entry 'user mysql.session@localhost' ignored in --skip-name-resolve mode.
2019-07-25T09:53:24.192426Z 0 [Warning] 'tables_priv' entry 'sys_config mysql.sys@localhost' ignored in --skip-name-resolve mode.
2019-07-25T09:53:24.199342Z 4 [Note] Event Scheduler: scheduler thread started with id 4
2019-07-25T09:53:24.199935Z 0 [Note] /usr/sbin/mysqld: ready for connections. Version: '8.0.3-rc-log' socket: '/var/run/mysqld/mysqld.sock' port: 3306 MySQL Community Server (GPL)
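The db log shows mysqld accepting connections on port 3306, yet katib-manager reports `i/o timeout` when dialing 10.99.115.60:3306. To narrow this down, one option is to repeat the TCP connection attempt from another pod in the cluster. Below is a minimal sketch, assuming Python is available in that pod; `check_tcp` is a hypothetical helper, not a Katib tool. A "timeout" result, as opposed to "refused", usually suggests that packets never reach a listener (for example a Service routing or network plugin problem) rather than MySQL being down, though that is only a heuristic.

```python
import socket


def check_tcp(host: str, port: int, timeout: float = 5.0) -> str:
    """Classify a TCP connection attempt as open, refused, timeout, or error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except (socket.timeout, TimeoutError):
        # No answer within the timeout: packets are likely being dropped.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except OSError as exc:
        # Anything else, e.g. no route to host or a name resolution failure.
        return f"error: {exc}"


# Example call, using the address from the katib-manager log above:
# print(check_tcp("10.99.115.60", 3306))
```

Running the same check against the katib-db pod IP from the manifest (10.244.1.114) as well as the address in the manager log can help show whether the pod itself is reachable while the address the manager dials is not.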
Environment:
- Kubeflow version 0.6
- Kubernetes version 1.14 on-premise (with MetalLB)
- OS: CentOS 7.6
Top GitHub Comments
Facing a similar issue with my AWS EKS installation, seeing timeout on katib-manager even though katib-db is running
Apologies. Output of nslookup from the pod: