ML-Pipelines API Server and Metadata Writer in CrashLoopBackOff
What steps did you take:
I deployed Kubeflow 1.3 using the manifests approach, then repaired an issue with dex running on Kubernetes v1.21.
What happened:
The installation succeeded and all pods started up except two: both the Metadata Writer and the ml-pipeline API server crash constantly and are restarted. ML-Pipeline always reports 1 of 2 containers ready, and Metadata-writer sometimes appears to be fully running and then fails. No other Kubeflow pods are having problems like this; even the mysql pod seems stable. I can only assume the metadata writer failures are a consequence of the continued failure of the ml-pipeline api-server.
The pod keeps getting terminated with exit code 137 (SIGKILL). See the last image provided for details on the cycle time.
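Since the screenshot is not reproduced here, a minimal CLI sketch for observing the same restart cycle (the pod name is the one from the describe output below and will differ in other installs):
# Watch the READY column and restart counter cycle (Ctrl-C to stop)
kubectl -n kubeflow get pod ml-pipeline-9b68d49cb-x67mp -w
# Print only the last termination state of the api-server container (shows the exit code 137)
kubectl -n kubeflow get pod ml-pipeline-9b68d49cb-x67mp \
  -o jsonpath='{.status.containerStatuses[?(@.name=="ml-pipeline-api-server")].lastState.terminated}'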
What did you expect to happen:
I expected the pipeline tools to install and operate normally. This has been a consistent problem going back to KF 1.1, with no adequate resolution.
Environment:
- How do you deploy Kubeflow Pipelines (KFP)?
I use the Kubeflow 1.3 manifests deployment approach.
- KFP version:
This install is via the Kubeflow 1.3 manifests; the api-server image is gcr.io/ml-pipeline/api-server:1.5.0.
- KFP SDK version:
NOT APPLICABLE
Anything else you would like to add:
kubectl logs ml-pipeline-9b68d49cb-x67mp ml-pipeline-api-server
I0723 15:20:24.692579 8 client_manager.go:154] Initializing client manager
I0723 15:20:24.692646 8 config.go:57] Config DBConfig.ExtraParams not specified, skipping
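The live log stops after these two lines on every cycle, so the more useful output comes from the container instance that was just killed rather than the one currently starting (same pod and container names as above):
# Log of the most recently terminated ml-pipeline-api-server container (the one that exited with 137)
kubectl logs ml-pipeline-9b68d49cb-x67mp ml-pipeline-api-server --previous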
kubectl describe pod ml-pipeline-9b68d49cb-x67mp
Name: ml-pipeline-9b68d49cb-x67mp
Namespace: kubeflow
Priority: 0
Node: cpu-compute-09/10.164.208.67
Start Time: Tue, 20 Jul 2021 21:27:57 -0400
Labels: app=ml-pipeline
app.kubernetes.io/component=ml-pipeline
app.kubernetes.io/name=kubeflow-pipelines
application-crd-id=kubeflow-pipelines
istio.io/rev=default
pod-template-hash=9b68d49cb
security.istio.io/tlsMode=istio
service.istio.io/canonical-name=kubeflow-pipelines
service.istio.io/canonical-revision=latest
Annotations: cluster-autoscaler.kubernetes.io/safe-to-evict: true
kubectl.kubernetes.io/default-logs-container: ml-pipeline-api-server
prometheus.io/path: /stats/prometheus
prometheus.io/port: 15020
prometheus.io/scrape: true
sidecar.istio.io/status:
{"initContainers":["istio-init"],"containers":["istio-proxy"],"volumes":["istio-envoy","istio-data","istio-podinfo","istio-token","istiod-...
Status: Running
IP: 10.0.4.21
IPs:
IP: 10.0.4.21
Controlled By: ReplicaSet/ml-pipeline-9b68d49cb
Init Containers:
istio-init:
Container ID: docker://db62120288183c6d962e0bfb60db7780fa7bb8c9e231bc9f48976a10c1b29587
Image: docker.io/istio/proxyv2:1.9.0
Image ID: docker-pullable://istio/proxyv2@sha256:286b821197d7a9233d1d889119f090cd9a9394468d3a312f66ea24f6e16b2294
Port: <none>
Host Port: <none>
Args:
istio-iptables
-p
15001
-z
15006
-u
1337
-m
REDIRECT
-i
*
-x
-b
*
-d
15090,15021,15020
State: Terminated
Reason: Completed
Exit Code: 0
Started: Tue, 20 Jul 2021 21:28:19 -0400
Finished: Tue, 20 Jul 2021 21:28:19 -0400
Ready: True
Restart Count: 0
Limits:
cpu: 2
memory: 1Gi
Requests:
cpu: 10m
memory: 40Mi
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8csrh (ro)
Containers:
ml-pipeline-api-server:
Container ID: docker://0a4b0d31179f67cc38ddb5ebb8eb31b32344c80fe9e4789ef20c073b02c5335b
Image: gcr.io/ml-pipeline/api-server:1.5.0
Image ID: docker-pullable://gcr.io/ml-pipeline/api-server@sha256:0d90705712e201ca7102336e4bd6ff794e7f76facdac2c6e82134294706d78ca
Ports: 8888/TCP, 8887/TCP
Host Ports: 0/TCP, 0/TCP
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 137
Started: Fri, 23 Jul 2021 11:13:49 -0400
Finished: Fri, 23 Jul 2021 11:14:34 -0400
Ready: False
Restart Count: 1117
Requests:
cpu: 250m
memory: 500Mi
Liveness: exec [wget -q -S -O - http://localhost:8888/apis/v1beta1/healthz] delay=3s timeout=2s period=5s #success=1 #failure=3
Readiness: exec [wget -q -S -O - http://localhost:8888/apis/v1beta1/healthz] delay=3s timeout=2s period=5s #success=1 #failure=3
Environment Variables from:
pipeline-api-server-config-dc9hkg52h6 ConfigMap Optional: false
Environment:
KUBEFLOW_USERID_HEADER: kubeflow-userid
KUBEFLOW_USERID_PREFIX:
AUTO_UPDATE_PIPELINE_DEFAULT_VERSION: <set to the key 'autoUpdatePipelineDefaultVersion' of config map 'pipeline-install-config'> Optional: false
POD_NAMESPACE: kubeflow (v1:metadata.namespace)
OBJECTSTORECONFIG_SECURE: false
OBJECTSTORECONFIG_BUCKETNAME: <set to the key 'bucketName' of config map 'pipeline-install-config'> Optional: false
DBCONFIG_USER: <set to the key 'username' in secret 'mysql-secret'> Optional: false
DBCONFIG_PASSWORD: <set to the key 'password' in secret 'mysql-secret'> Optional: false
DBCONFIG_DBNAME: <set to the key 'pipelineDb' of config map 'pipeline-install-config'> Optional: false
DBCONFIG_HOST: <set to the key 'dbHost' of config map 'pipeline-install-config'> Optional: false
DBCONFIG_PORT: <set to the key 'dbPort' of config map 'pipeline-install-config'> Optional: false
OBJECTSTORECONFIG_ACCESSKEY: <set to the key 'accesskey' in secret 'mlpipeline-minio-artifact'> Optional: false
OBJECTSTORECONFIG_SECRETACCESSKEY: <set to the key 'secretkey' in secret 'mlpipeline-minio-artifact'> Optional: false
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8csrh (ro)
istio-proxy:
Container ID: docker://6cd34842733729c0743c0ce153a6b15614da748e72a2352616cdf6d10eb9a997
Image: docker.io/istio/proxyv2:1.9.0
Image ID: docker-pullable://istio/proxyv2@sha256:286b821197d7a9233d1d889119f090cd9a9394468d3a312f66ea24f6e16b2294
Port: 15090/TCP
Host Port: 0/TCP
Args:
proxy
sidecar
--domain
$(POD_NAMESPACE).svc.cluster.local
--serviceCluster
ml-pipeline.$(POD_NAMESPACE)
--proxyLogLevel=warning
--proxyComponentLogLevel=misc:error
--log_output_level=default:info
--concurrency
2
State: Running
Started: Tue, 20 Jul 2021 21:28:36 -0400
Ready: True
Restart Count: 0
Limits:
cpu: 2
memory: 1Gi
Requests:
cpu: 10m
memory: 40Mi
Readiness: http-get http://:15021/healthz/ready delay=1s timeout=3s period=2s #success=1 #failure=30
Environment:
JWT_POLICY: third-party-jwt
PILOT_CERT_PROVIDER: istiod
CA_ADDR: istiod.istio-system.svc:15012
POD_NAME: ml-pipeline-9b68d49cb-x67mp (v1:metadata.name)
POD_NAMESPACE: kubeflow (v1:metadata.namespace)
INSTANCE_IP: (v1:status.podIP)
SERVICE_ACCOUNT: (v1:spec.serviceAccountName)
HOST_IP: (v1:status.hostIP)
CANONICAL_SERVICE: (v1:metadata.labels['service.istio.io/canonical-name'])
CANONICAL_REVISION: (v1:metadata.labels['service.istio.io/canonical-revision'])
PROXY_CONFIG: {}
ISTIO_META_POD_PORTS: [
{"name":"http","containerPort":8888,"protocol":"TCP"}
,{"name":"grpc","containerPort":8887,"protocol":"TCP"}
]
ISTIO_META_APP_CONTAINERS: ml-pipeline-api-server
ISTIO_META_CLUSTER_ID: Kubernetes
ISTIO_META_INTERCEPTION_MODE: REDIRECT
ISTIO_METAJSON_ANNOTATIONS: {"cluster-autoscaler.kubernetes.io/safe-to-evict":"true"}
ISTIO_META_WORKLOAD_NAME: ml-pipeline
ISTIO_META_OWNER: kubernetes://apis/apps/v1/namespaces/kubeflow/deployments/ml-pipeline
ISTIO_META_MESH_ID: cluster.local
TRUST_DOMAIN: cluster.local
Mounts:
/etc/istio/pod from istio-podinfo (rw)
/etc/istio/proxy from istio-envoy (rw)
/var/lib/istio/data from istio-data (rw)
/var/run/secrets/istio from istiod-ca-cert (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8csrh (ro)
/var/run/secrets/tokens from istio-token (rw)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
istio-envoy:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium: Memory
SizeLimit: <unset>
istio-data:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
istio-podinfo:
Type: DownwardAPI (a volume populated by information about the pod)
Items:
metadata.labels -> labels
metadata.annotations -> annotations
limits.cpu -> cpu-limit
requests.cpu -> cpu-request
istio-token:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 43200
istiod-ca-cert:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: istio-ca-root-cert
Optional: false
kube-api-access-8csrh:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackOff 12m (x13622 over 2d13h) kubelet Back-off restarting failed container
Warning Unhealthy 8m2s (x9931 over 2d13h) kubelet Readiness probe failed:
Normal Pulled 2m58s (x1116 over 2d13h) kubelet Container image "gcr.io/ml-pipeline/api-server:1.5.0" already present on machine
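Exit code 137 is a SIGKILL, and the events show thousands of failed readiness probes, so one plausible (unconfirmed) reading is that the api-server hangs on its MySQL connection behind the Istio sidecar until the kubelet's liveness probe kills it. The probe endpoint and per-container readiness can be checked by hand during one of the container's roughly 45-second run windows, reusing the exact probe command shown above:
# Run the same wget the kubelet uses for the liveness/readiness probes
kubectl -n kubeflow exec ml-pipeline-9b68d49cb-x67mp -c ml-pipeline-api-server -- \
  wget -q -S -O - http://localhost:8888/apis/v1beta1/healthz
# Show which of the two containers is Ready (istio-proxy reports True above, the api-server does not)
kubectl -n kubeflow get pod ml-pipeline-9b68d49cb-x67mp \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.ready}{"\n"}{end}'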
Metadata-Writer Logs:
Failed to access the Metadata store. Exception: "upstream connect error or disconnect/reset before headers. reset reason: connection failure"
Failed to access the Metadata store. Exception: "upstream connect error or disconnect/reset before headers. reset reason: connection failure"
Failed to access the Metadata store. Exception: "upstream connect error or disconnect/reset before headers. reset reason: connection failure"
Failed to access the Metadata store. Exception: "upstream connect error or disconnect/reset before headers. reset reason: connection failure"
Traceback (most recent call last):
File "/kfp/metadata_writer/metadata_writer.py", line 63, in <module>
mlmd_store = connect_to_mlmd()
File "/kfp/metadata_writer/metadata_helpers.py", line 62, in connect_to_mlmd
raise RuntimeError('Could not connect to the Metadata store.')
RuntimeError: Could not connect to the Metadata store.
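The "upstream connect error or disconnect/reset before headers" message comes from the Envoy sidecar, i.e. the metadata-writer's istio-proxy cannot reach the MLMD gRPC backend. A rough check of that backend (the service and label names below are the Kubeflow 1.3 manifest defaults and are assumptions here; adjust if your install differs):
kubectl -n kubeflow get svc metadata-grpc-service
kubectl -n kubeflow get pods -l component=metadata-grpc-server
kubectl -n kubeflow describe pod -l component=metadata-grpc-server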
Labels
/area backend
Top GitHub Comments
When you disable sidecar injection, also find all the destination rules and delete the destination rule for mysql. Otherwise, all other clients will fail to access MySQL assuming mTLS is turned on.
Edit: this is a workaround by pulling MySQL out of the mesh.
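A sketch of that workaround with kubectl, assuming the MySQL Deployment is named mysql in the kubeflow namespace and that a DestinationRule exists for it (both names are assumptions; list first and adjust):
# List DestinationRules and find the one targeting the MySQL service
kubectl -n kubeflow get destinationrules
# Take MySQL out of the mesh by disabling sidecar injection on its pod template (Deployment name assumed)
kubectl -n kubeflow patch deployment mysql --type merge -p \
  '{"spec":{"template":{"metadata":{"annotations":{"sidecar.istio.io/inject":"false"}}}}}'
# Delete the DestinationRule that forces mTLS toward MySQL (substitute the name found above)
kubectl -n kubeflow delete destinationrule <mysql-destination-rule>
Afterwards, deleting the crash-looping ml-pipeline, metadata-grpc and metadata-writer pods lets them retry against the now-reachable MySQL.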
Ohh, sorry this fell through the cracks. Let me take a look tomorrow.