Storage initializer goes into CrashLoopBackOff, can't copy model from KMS enabled S3 bucket

/kind bug

What steps did you take and what happened: Created an inference service using the following:

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: abalone-xgboost
  namespace: kubeflow
spec:
  predictor:
    serviceAccountName: pipeline-runner
    xgboost:
      protocolVersion: "v2"
      storageUri: "s3://bucket-name/abalone-xgboost/model.tar.gz"

The pipeline-runner service account is bound to an IAM role (IRSA) and is configured as follows:

apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/kfp-example-pod-role
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"ServiceAccount","metadata":{"annotations":{},"labels":{"application-crd-id":"kubeflow-pipelines"},"name":"pipeline-runner","namespace":"kubeflow"}}
  creationTimestamp: "2021-10-01T13:38:27Z"
  labels:
    application-crd-id: kubeflow-pipelines
  name: pipeline-runner
  namespace: kubeflow
  ownerReferences:
  - apiVersion: app.k8s.io/v1beta1
    blockOwnerDeletion: true
    controller: false
    kind: Application
    name: pipeline
    uid: 072e21eb-eaa7-4b3b-a353-4b2342e23477
  resourceVersion: "712344"
  selfLink: /api/v1/namespaces/kubeflow/serviceaccounts/pipeline-runner
  uid: d123b35b-9232-4379-b133-aee3245e34ff2
secrets:
- name: pipeline-runner-token-bn2hf

The role kfp-example-pod-role has full permissions to read S3 buckets and objects, and also has KMS permissions.

When I apply the inference service defined above, the deployment stays at 0/1, with the pods in CrashLoopBackOff state. When I check the logs of the storage-initializer container inside the pod, this is what I get:

/usr/local/lib/python3.7/site-packages/ray/autoscaler/_private/cli_logger.py:61: FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`. Please update your install command.
  "update your install command.", FutureWarning)
[I 220121 09:47:24 initializer-entrypoint:13] Initializing, args: src_uri [s3://bucket-name/abalone-xgboost/model.tar.gz] dest_path[ [/mnt/models]
[I 220121 09:47:24 storage:52] Copying contents of s3://bucket-name/abalone-xgboost/model.tar.gz to local
Traceback (most recent call last):
  File "/storage-initializer/scripts/initializer-entrypoint", line 14, in <module>
    kserve.Storage.download(src_uri, dest_path)
  File "/usr/local/lib/python3.7/site-packages/kserve/storage.py", line 69, in download
    Storage._download_s3(uri, out_dir)
  File "/usr/local/lib/python3.7/site-packages/kserve/storage.py", line 128, in _download_s3
    bucket.download_file(obj.key, target)
  File "/usr/local/lib/python3.7/site-packages/boto3/s3/inject.py", line 247, in bucket_download_file
    ExtraArgs=ExtraArgs, Callback=Callback, Config=Config)
  File "/usr/local/lib/python3.7/site-packages/boto3/s3/inject.py", line 173, in download_file
    extra_args=ExtraArgs, callback=Callback)
  File "/usr/local/lib/python3.7/site-packages/boto3/s3/transfer.py", line 307, in download_file
    future.result()
  File "/usr/local/lib/python3.7/site-packages/s3transfer/futures.py", line 106, in result
    return self._coordinator.result()
  File "/usr/local/lib/python3.7/site-packages/s3transfer/futures.py", line 265, in result
    raise self._exception
  File "/usr/local/lib/python3.7/site-packages/s3transfer/tasks.py", line 126, in __call__
    return self._execute_main(kwargs)
  File "/usr/local/lib/python3.7/site-packages/s3transfer/tasks.py", line 150, in _execute_main
    return_value = self._main(**kwargs)
  File "/usr/local/lib/python3.7/site-packages/s3transfer/download.py", line 512, in _main
    Bucket=bucket, Key=key, **extra_args)
  File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 386, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 705, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (InvalidArgument) when calling the GetObject operation: Requests specifying Server Side Encryption with AWS KMS managed keys must be made over a secure connection.
Stream closed EOF for kubeflow/abalone-xgboost-predictor-default-56123c34b-b123b (storage-initializer)

What did you expect to happen: The model is downloaded successfully and the inference service is deployed successfully.

Anything else you would like to add: Upon checking the storage.py code, I found a code snippet (posted as a screenshot in the original issue, not reproduced here).

It suggests that the config cannot be set to signature_version = 'v4', which, if I am not wrong, is required for requests that read data from KMS-enabled S3 buckets.

Is this evaluation correct? Is there a need to add such a signature version in case we need to interact with models in KMS secured S3 buckets? I will try to patch this then.
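
For reference, a minimal sketch of what an explicit SigV4 configuration could look like with boto3. It is not the code in kserve's storage.py. The helper names `s3_resource_kwargs` and `make_s3_resource` are hypothetical, and reading the endpoint from `AWS_ENDPOINT_URL` and the region from `AWS_DEFAULT_REGION` is an assumption. Note that botocore names the S3 flavour of Signature Version 4 `s3v4`, and that the error in the traceback mentions a secure connection, so the endpoint scheme may matter as much as the signature version.

```python
import os
from urllib.parse import urlparse


def s3_resource_kwargs(env):
    """Build endpoint/region kwargs for boto3.resource("s3", ...).

    Hypothetical helper: the environment variable names are assumptions,
    not something confirmed by the issue text.
    """
    endpoint = env.get("AWS_ENDPOINT_URL", "https://s3.amazonaws.com")
    # The GetObject error complains about an insecure connection, so
    # reject plain-HTTP endpoints before any request is made.
    if urlparse(endpoint).scheme != "https":
        raise ValueError(f"S3 endpoint must use HTTPS, got: {endpoint}")
    return {
        "endpoint_url": endpoint,
        "region_name": env.get("AWS_DEFAULT_REGION"),
    }


def make_s3_resource(env=os.environ):
    """Create an S3 resource that signs requests with SigV4 ("s3v4")."""
    # Imported lazily so the helper above stays usable without boto3.
    import boto3
    from botocore.config import Config

    return boto3.resource(
        "s3",
        config=Config(signature_version="s3v4"),
        **s3_resource_kwargs(env),
    )
```

Calling `make_s3_resource()` inside the pod would pick up the IRSA credentials through boto3's default credential chain, assuming no static keys are set in the environment.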

Environment:

  • Istio Version: N/A
  • Knative Version: N/A
  • KFServing Version: 0.7.0
  • Kubeflow version: N/A
  • Kfdef: aws
  • Minikube version: N/A
  • Kubernetes version: (use kubectl version): 1.18
  • OS (e.g. from /etc/os-release): macOS Big Sur ver 11.6

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 9 (1 by maintainers)

Top GitHub Comments

matty-rose commented, Feb 27, 2022 (4 reactions)

@revolutionisme I did some digging into the code and found this line: https://github.com/kserve/kserve/blob/ad557b0d5e5427256d566cf0b6a08117d40f4073/pkg/credentials/service_account_credentials.go#L100. It means the required HTTPS environment variables are not added to the storage initializer init container if the secret has an empty data segment.

I got it to successfully pull a model encrypted with a KMS key from a bucket using IRSA with the following config (I'm running kserve v0.8.0, for reference). The empty keys for AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY force it to add the env vars to the init containers but don't override the actual credentials from the IAM role (which happens if you add dummy values).

apiVersion: v1
kind: ServiceAccount
metadata:
  name: model-serving
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::etc
secrets:
- name: aws-secret
---
apiVersion: v1
kind: Secret
metadata:
  name: aws-secret
  annotations:
    serving.kserve.io/s3-endpoint: s3.amazonaws.com
    serving.kserve.io/s3-usehttps: "1"
    serving.kserve.io/s3-region: ap-southeast-2
type: Opaque
data:
  AWS_ACCESS_KEY_ID:
  AWS_SECRET_ACCESS_KEY:
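
To illustrate why the empty keys matter, here is a rough Python model of the behaviour described above. It is not the actual Go code in service_account_credentials.go. The function `build_s3_envs` is hypothetical, and the output variable names (`S3_USE_HTTPS`, `AWS_ENDPOINT_URL`, `AWS_DEFAULT_REGION`) are assumptions; only the annotation keys come from the config above.

```python
def build_s3_envs(secret_data, annotations):
    """Hypothetical model of how Secret annotations become env vars.

    The S3 settings are gated on the *presence* of the credential key in
    the Secret's data, not on its value, so an empty value still counts.
    """
    envs = {}
    if "AWS_SECRET_ACCESS_KEY" not in secret_data:
        # No data segment: annotations are ignored and no HTTPS-related
        # env vars reach the storage initializer.
        return envs

    endpoint = annotations.get("serving.kserve.io/s3-endpoint")
    if endpoint:
        # Assumed default: HTTPS unless explicitly disabled with "0".
        use_https = annotations.get("serving.kserve.io/s3-usehttps", "1")
        scheme = "http" if use_https == "0" else "https"
        envs["S3_USE_HTTPS"] = use_https
        envs["AWS_ENDPOINT_URL"] = f"{scheme}://{endpoint}"

    region = annotations.get("serving.kserve.io/s3-region")
    if region:
        envs["AWS_DEFAULT_REGION"] = region
    return envs


annotations = {
    "serving.kserve.io/s3-endpoint": "s3.amazonaws.com",
    "serving.kserve.io/s3-usehttps": "1",
    "serving.kserve.io/s3-region": "ap-southeast-2",
}

# Empty keys present: the HTTPS endpoint is generated.
with_keys = build_s3_envs(
    {"AWS_ACCESS_KEY_ID": "", "AWS_SECRET_ACCESS_KEY": ""}, annotations
)
# No data segment at all: nothing is generated.
without_keys = build_s3_envs({}, annotations)
```

In this model `with_keys` carries an `https://` endpoint while `without_keys` is empty, which matches the observation that a Secret without a data segment leaves the init container on its defaults.
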
revolutionisme commented, May 16, 2022 (1 reaction)

@matty-rose Thanks a lot for this. I missed the tag last time, but I got back to it now and it works great. I will close the ticket as it solves the problem, but I still support the idea of making this more explicit.
