
ML-Pipelines API Server and Metadata Writer in CrashLoopBackoff


What steps did you take

I deployed Kubeflow 1.3 using the manifests approach, then fixed an issue with Dex running on Kubernetes v1.21.

What happened:

The installation succeeded and all pods started up except two: metadata-writer and ml-pipeline both crash constantly and are restarted. ml-pipeline always reports 1/2 containers ready; metadata-writer sometimes appears to be fully running and then fails. No other Kubeflow pods have problems like this - even the mysql pod seems stable. I can only assume the metadata-writer failure is a consequence of the continued failure of the ml-pipeline api-server.

The container keeps getting terminated with exit code 137 (SIGKILL, i.e. 128 + 9), which in practice means either the kubelet killed it after repeated liveness-probe failures or it was OOM-killed. See the last image provided for details on the cycle time.
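
A quick way to tell the two apart is to read the container's last termination state (a sketch using the pod name from this report; substitute your own):

```sh
# An OOM kill reports reason "OOMKilled"; a liveness-probe kill reports
# reason "Error" with exit code 137 plus matching Unhealthy events in
# `kubectl describe pod`.
kubectl -n kubeflow get pod ml-pipeline-9b68d49cb-x67mp \
  -o jsonpath='{.status.containerStatuses[?(@.name=="ml-pipeline-api-server")].lastState.terminated}'
```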

What did you expect to happen:

I expect the pipeline tools to install and operate normally. This has been a consistent problem going back to KF 1.1, with no adequate resolution.

Environment:

  • How do you deploy Kubeflow Pipelines (KFP)?

I used the Kubeflow 1.3 manifests deployment approach.
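
For context, the single-command install from the kubeflow/manifests repo looks roughly like this (a sketch based on the v1.3 README; exact paths can differ by release):

```sh
# From a checkout of github.com/kubeflow/manifests on the v1.3 branch.
# The retry loop is needed because some CRDs take a while to register
# before the resources that depend on them can be applied.
while ! kustomize build example | kubectl apply -f -; do
  echo "Retrying to apply resources"
  sleep 10
done
```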

  • KFP version:

Whatever ships with the Kubeflow 1.3 manifests (the pod runs gcr.io/ml-pipeline/api-server:1.5.0, per the describe output below).

  • KFP SDK version:

Not applicable.

Anything else you would like to add:

kubectl logs ml-pipeline-9b68d49cb-x67mp ml-pipeline-api-server

I0723 15:20:24.692579       8 client_manager.go:154] Initializing client manager
I0723 15:20:24.692646       8 config.go:57] Config DBConfig.ExtraParams not specified, skipping
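
The log ending right after the DBConfig line suggests the server hangs while opening its MySQL connection (commonly because the Istio sidecar intercepts or mTLS-blocks the traffic) and is then SIGKILLed by the failing liveness probe. Two hedged checks, using names taken from this report:

```sh
# Does the health endpoint ever come up? This is the same wget the
# liveness/readiness probes run (see the describe output below).
kubectl -n kubeflow exec ml-pipeline-9b68d49cb-x67mp -c ml-pipeline-api-server -- \
  wget -q -S -O - http://localhost:8888/apis/v1beta1/healthz

# Which DB host/port is the server configured to reach? The api-server
# pulls these from the pipeline-install-config ConfigMap.
kubectl -n kubeflow get configmap pipeline-install-config \
  -o jsonpath='{.data.dbHost}:{.data.dbPort}{"\n"}'
```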

kubectl describe pod ml-pipeline-9b68d49cb-x67mp

Name:         ml-pipeline-9b68d49cb-x67mp
Namespace:    kubeflow
Priority:     0
Node:         cpu-compute-09/10.164.208.67
Start Time:   Tue, 20 Jul 2021 21:27:57 -0400
Labels:       app=ml-pipeline
              app.kubernetes.io/component=ml-pipeline
              app.kubernetes.io/name=kubeflow-pipelines
              application-crd-id=kubeflow-pipelines
              istio.io/rev=default
              pod-template-hash=9b68d49cb
              security.istio.io/tlsMode=istio
              service.istio.io/canonical-name=kubeflow-pipelines
              service.istio.io/canonical-revision=latest
Annotations:  cluster-autoscaler.kubernetes.io/safe-to-evict: true
              kubectl.kubernetes.io/default-logs-container: ml-pipeline-api-server
              prometheus.io/path: /stats/prometheus
              prometheus.io/port: 15020
              prometheus.io/scrape: true
              sidecar.istio.io/status:
                {"initContainers":["istio-init"],"containers":["istio-proxy"],"volumes":["istio-envoy","istio-data","istio-podinfo","istio-token","istiod-...
Status:       Running
IP:           10.0.4.21
IPs:
  IP:           10.0.4.21
Controlled By:  ReplicaSet/ml-pipeline-9b68d49cb
Init Containers:
  istio-init:
    Container ID:  docker://db62120288183c6d962e0bfb60db7780fa7bb8c9e231bc9f48976a10c1b29587
    Image:         docker.io/istio/proxyv2:1.9.0
    Image ID:      docker-pullable://istio/proxyv2@sha256:286b821197d7a9233d1d889119f090cd9a9394468d3a312f66ea24f6e16b2294
    Port:          <none>
    Host Port:     <none>
    Args:
      istio-iptables
      -p
      15001
      -z
      15006
      -u
      1337
      -m
      REDIRECT
      -i
      *
      -x
      
      -b
      *
      -d
      15090,15021,15020
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 20 Jul 2021 21:28:19 -0400
      Finished:     Tue, 20 Jul 2021 21:28:19 -0400
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:        10m
      memory:     40Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8csrh (ro)
Containers:
  ml-pipeline-api-server:
    Container ID:   docker://0a4b0d31179f67cc38ddb5ebb8eb31b32344c80fe9e4789ef20c073b02c5335b
    Image:          gcr.io/ml-pipeline/api-server:1.5.0
    Image ID:       docker-pullable://gcr.io/ml-pipeline/api-server@sha256:0d90705712e201ca7102336e4bd6ff794e7f76facdac2c6e82134294706d78ca
    Ports:          8888/TCP, 8887/TCP
    Host Ports:     0/TCP, 0/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Fri, 23 Jul 2021 11:13:49 -0400
      Finished:     Fri, 23 Jul 2021 11:14:34 -0400
    Ready:          False
    Restart Count:  1117
    Requests:
      cpu:      250m
      memory:   500Mi
    Liveness:   exec [wget -q -S -O - http://localhost:8888/apis/v1beta1/healthz] delay=3s timeout=2s period=5s #success=1 #failure=3
    Readiness:  exec [wget -q -S -O - http://localhost:8888/apis/v1beta1/healthz] delay=3s timeout=2s period=5s #success=1 #failure=3
    Environment Variables from:
      pipeline-api-server-config-dc9hkg52h6  ConfigMap  Optional: false
    Environment:
      KUBEFLOW_USERID_HEADER:                kubeflow-userid
      KUBEFLOW_USERID_PREFIX:                
      AUTO_UPDATE_PIPELINE_DEFAULT_VERSION:  <set to the key 'autoUpdatePipelineDefaultVersion' of config map 'pipeline-install-config'>  Optional: false
      POD_NAMESPACE:                         kubeflow (v1:metadata.namespace)
      OBJECTSTORECONFIG_SECURE:              false
      OBJECTSTORECONFIG_BUCKETNAME:          <set to the key 'bucketName' of config map 'pipeline-install-config'>  Optional: false
      DBCONFIG_USER:                         <set to the key 'username' in secret 'mysql-secret'>                   Optional: false
      DBCONFIG_PASSWORD:                     <set to the key 'password' in secret 'mysql-secret'>                   Optional: false
      DBCONFIG_DBNAME:                       <set to the key 'pipelineDb' of config map 'pipeline-install-config'>  Optional: false
      DBCONFIG_HOST:                         <set to the key 'dbHost' of config map 'pipeline-install-config'>      Optional: false
      DBCONFIG_PORT:                         <set to the key 'dbPort' of config map 'pipeline-install-config'>      Optional: false
      OBJECTSTORECONFIG_ACCESSKEY:           <set to the key 'accesskey' in secret 'mlpipeline-minio-artifact'>     Optional: false
      OBJECTSTORECONFIG_SECRETACCESSKEY:     <set to the key 'secretkey' in secret 'mlpipeline-minio-artifact'>     Optional: false
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8csrh (ro)
  istio-proxy:
    Container ID:  docker://6cd34842733729c0743c0ce153a6b15614da748e72a2352616cdf6d10eb9a997
    Image:         docker.io/istio/proxyv2:1.9.0
    Image ID:      docker-pullable://istio/proxyv2@sha256:286b821197d7a9233d1d889119f090cd9a9394468d3a312f66ea24f6e16b2294
    Port:          15090/TCP
    Host Port:     0/TCP
    Args:
      proxy
      sidecar
      --domain
      $(POD_NAMESPACE).svc.cluster.local
      --serviceCluster
      ml-pipeline.$(POD_NAMESPACE)
      --proxyLogLevel=warning
      --proxyComponentLogLevel=misc:error
      --log_output_level=default:info
      --concurrency
      2
    State:          Running
      Started:      Tue, 20 Jul 2021 21:28:36 -0400
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:      10m
      memory:   40Mi
    Readiness:  http-get http://:15021/healthz/ready delay=1s timeout=3s period=2s #success=1 #failure=30
    Environment:
      JWT_POLICY:                    third-party-jwt
      PILOT_CERT_PROVIDER:           istiod
      CA_ADDR:                       istiod.istio-system.svc:15012
      POD_NAME:                      ml-pipeline-9b68d49cb-x67mp (v1:metadata.name)
      POD_NAMESPACE:                 kubeflow (v1:metadata.namespace)
      INSTANCE_IP:                    (v1:status.podIP)
      SERVICE_ACCOUNT:                (v1:spec.serviceAccountName)
      HOST_IP:                        (v1:status.hostIP)
      CANONICAL_SERVICE:              (v1:metadata.labels['service.istio.io/canonical-name'])
      CANONICAL_REVISION:             (v1:metadata.labels['service.istio.io/canonical-revision'])
      PROXY_CONFIG:                  {}
                                     
      ISTIO_META_POD_PORTS:          [
                                         {"name":"http","containerPort":8888,"protocol":"TCP"}
                                         ,{"name":"grpc","containerPort":8887,"protocol":"TCP"}
                                     ]
      ISTIO_META_APP_CONTAINERS:     ml-pipeline-api-server
      ISTIO_META_CLUSTER_ID:         Kubernetes
      ISTIO_META_INTERCEPTION_MODE:  REDIRECT
      ISTIO_METAJSON_ANNOTATIONS:    {"cluster-autoscaler.kubernetes.io/safe-to-evict":"true"}
                                     
      ISTIO_META_WORKLOAD_NAME:      ml-pipeline
      ISTIO_META_OWNER:              kubernetes://apis/apps/v1/namespaces/kubeflow/deployments/ml-pipeline
      ISTIO_META_MESH_ID:            cluster.local
      TRUST_DOMAIN:                  cluster.local
    Mounts:
      /etc/istio/pod from istio-podinfo (rw)
      /etc/istio/proxy from istio-envoy (rw)
      /var/lib/istio/data from istio-data (rw)
      /var/run/secrets/istio from istiod-ca-cert (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8csrh (ro)
      /var/run/secrets/tokens from istio-token (rw)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  istio-envoy:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  istio-data:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  istio-podinfo:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.labels -> labels
      metadata.annotations -> annotations
      limits.cpu -> cpu-limit
      requests.cpu -> cpu-request
  istio-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  43200
  istiod-ca-cert:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      istio-ca-root-cert
    Optional:  false
  kube-api-access-8csrh:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                       From     Message
  ----     ------     ----                      ----     -------
  Warning  BackOff    12m (x13622 over 2d13h)   kubelet  Back-off restarting failed container
  Warning  Unhealthy  8m2s (x9931 over 2d13h)   kubelet  Readiness probe failed:
  Normal   Pulled     2m58s (x1116 over 2d13h)  kubelet  Container image "gcr.io/ml-pipeline/api-server:1.5.0" already present on machine
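
The back-off and readiness-failure counts line up with the 1117 restarts above. Since the istio-proxy sidecar itself stays healthy, its logs are a good place to look for the api-server's outbound MySQL connections being reset (a sketch; pod name from this report):

```sh
# Envoy logs for the sidecar in front of the api-server; look for
# connection resets or TLS errors toward the MySQL upstream.
kubectl -n kubeflow logs ml-pipeline-9b68d49cb-x67mp -c istio-proxy --tail=100

# If istioctl is available, check for conflicting mTLS configuration
# (DestinationRules vs. PeerAuthentication) affecting this pod.
istioctl experimental describe pod ml-pipeline-9b68d49cb-x67mp -n kubeflow
```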

[images: pod status and restart/crash cycle timing]

Metadata-Writer Logs:

Failed to access the Metadata store. Exception: "upstream connect error or disconnect/reset before headers. reset reason: connection failure"
Failed to access the Metadata store. Exception: "upstream connect error or disconnect/reset before headers. reset reason: connection failure"
Failed to access the Metadata store. Exception: "upstream connect error or disconnect/reset before headers. reset reason: connection failure"
Failed to access the Metadata store. Exception: "upstream connect error or disconnect/reset before headers. reset reason: connection failure"
Traceback (most recent call last):
  File "/kfp/metadata_writer/metadata_writer.py", line 63, in <module>
    mlmd_store = connect_to_mlmd()
  File "/kfp/metadata_writer/metadata_helpers.py", line 62, in connect_to_mlmd
    raise RuntimeError('Could not connect to the Metadata store.')
RuntimeError: Could not connect to the Metadata store.
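
`upstream connect error or disconnect/reset before headers` is an Envoy error string: the writer's request reached its own sidecar, but the connection onward to the MLMD gRPC backend failed. A quick check, assuming the standard KFP names for the MLMD deployment and service:

```sh
# Is the MLMD gRPC server running, and does its service have endpoints?
kubectl -n kubeflow get deployment metadata-grpc-deployment
kubectl -n kubeflow get endpoints metadata-grpc-service
```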

Labels

/area backend


Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Reactions: 4
  • Comments: 15 (8 by maintainers)

Top GitHub Comments

1 reaction
Bobgy commented, Sep 10, 2021

When you disable sidecar injection, also find all the DestinationRules and delete the one for MySQL. Otherwise, with mTLS turned on, every other client in the mesh will fail to reach MySQL (see the sketch below).

Edit: this is a workaround that pulls MySQL out of the mesh.
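
A minimal sketch of that workaround, assuming the Deployment and DestinationRule names used by the Kubeflow 1.3 manifests (verify with the list command first):

```sh
# 1. Stop injecting the Istio sidecar into the MySQL pod.
kubectl -n kubeflow patch deployment mysql --type merge -p \
  '{"spec":{"template":{"metadata":{"annotations":{"sidecar.istio.io/inject":"false"}}}}}'

# 2. Find and delete the DestinationRule that forces mTLS toward MySQL,
#    so the now sidecar-less MySQL is reachable in plaintext again.
kubectl -n kubeflow get destinationrules
kubectl -n kubeflow delete destinationrule mysql   # name may differ; check the list output
```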

1 reaction
Bobgy commented, Sep 8, 2021

Ohh, sorry this fell through the cracks. Let me take a look tmr.


