Storage initializer goes into CrashLoopBackOff, can't copy model from KMS enabled S3 bucket
kind/bug
What steps did you take and what happened:
Created an inference service using the following:

```yaml
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: abalone-xgboost
  namespace: kubeflow
spec:
  predictor:
    serviceAccountName: pipeline-runner
    xgboost:
      protocolVersion: "v2"
      storageUri: "s3://bucket-name/abalone-xgboost/model.tar.gz"
```
The `pipeline-runner` service account is annotated with an IAM role (IRSA) and is configured as follows:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/kfp-example-pod-role
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"ServiceAccount","metadata":{"annotations":{},"labels":{"application-crd-id":"kubeflow-pipelines"},"name":"pipeline-runner","namespace":"kubeflow"}}
  creationTimestamp: "2021-10-01T13:38:27Z"
  labels:
    application-crd-id: kubeflow-pipelines
  name: pipeline-runner
  namespace: kubeflow
  ownerReferences:
  - apiVersion: app.k8s.io/v1beta1
    blockOwnerDeletion: true
    controller: false
    kind: Application
    name: pipeline
    uid: 072e21eb-eaa7-4b3b-a353-4b2342e23477
  resourceVersion: "712344"
  selfLink: /api/v1/namespaces/kubeflow/serviceaccounts/pipeline-runner
  uid: d123b35b-9232-4379-b133-aee3245e34ff2
secrets:
- name: pipeline-runner-token-bn2hf
```
The role `kfp-example-pod-role` has full permissions to read S3 buckets and objects, and even KMS permissions.
When I apply the inference service defined above, the deployment stays at 0/1, with the pods in CrashLoopBackOff. When I check the logs of the `storage-initializer` container inside the pod, I get the following:
```
/usr/local/lib/python3.7/site-packages/ray/autoscaler/_private/cli_logger.py:61: FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`. Please update your install command.
  "update your install command.", FutureWarning)
[I 220121 09:47:24 initializer-entrypoint:13] Initializing, args: src_uri [s3://bucket-name/abalone-xgboost/model.tar.gz] dest_path[ [/mnt/models]
[I 220121 09:47:24 storage:52] Copying contents of s3://bucket-name/abalone-xgboost/model.tar.gz to local
Traceback (most recent call last):
  File "/storage-initializer/scripts/initializer-entrypoint", line 14, in <module>
    kserve.Storage.download(src_uri, dest_path)
  File "/usr/local/lib/python3.7/site-packages/kserve/storage.py", line 69, in download
    Storage._download_s3(uri, out_dir)
  File "/usr/local/lib/python3.7/site-packages/kserve/storage.py", line 128, in _download_s3
    bucket.download_file(obj.key, target)
  File "/usr/local/lib/python3.7/site-packages/boto3/s3/inject.py", line 247, in bucket_download_file
    ExtraArgs=ExtraArgs, Callback=Callback, Config=Config)
  File "/usr/local/lib/python3.7/site-packages/boto3/s3/inject.py", line 173, in download_file
    extra_args=ExtraArgs, callback=Callback)
  File "/usr/local/lib/python3.7/site-packages/boto3/s3/transfer.py", line 307, in download_file
    future.result()
  File "/usr/local/lib/python3.7/site-packages/s3transfer/futures.py", line 106, in result
    return self._coordinator.result()
  File "/usr/local/lib/python3.7/site-packages/s3transfer/futures.py", line 265, in result
    raise self._exception
  File "/usr/local/lib/python3.7/site-packages/s3transfer/tasks.py", line 126, in __call__
    return self._execute_main(kwargs)
  File "/usr/local/lib/python3.7/site-packages/s3transfer/tasks.py", line 150, in _execute_main
    return_value = self._main(**kwargs)
  File "/usr/local/lib/python3.7/site-packages/s3transfer/download.py", line 512, in _main
    Bucket=bucket, Key=key, **extra_args)
  File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 386, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 705, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (InvalidArgument) when calling the GetObject operation: Requests specifying Server Side Encryption with AWS KMS managed keys must be made over a secure connection.
Stream closed EOF for kubeflow/abalone-xgboost-predictor-default-56123c34b-b123b (storage-initializer)
```
What did you expect to happen:
The model is downloaded successfully and the inference service is deployed successfully.
Anything else you would like to add:
Upon checking the `storage.py` code, I found that there is no way to set `signature_version = 'v4'` in the S3 client config, which, if I am not wrong, is required for requests fetching data from KMS-enabled S3 buckets. Is this evaluation correct? Is there a need to add such a signature version for interacting with models in KMS-secured S3 buckets? I will try to patch this then.
Environment:
- Istio Version: N/A
- Knative Version: N/A
- KFServing Version: 0.7.0
- Kubeflow version: N/A
- Kfdef: aws
- Minikube version: N/A
- Kubernetes version (use `kubectl version`): 1.18
- OS (e.g. from `/etc/os-release`): macOS Big Sur 11.6
Issue Analytics
- Created: 2 years ago
- Comments: 9 (1 by maintainers)
Top GitHub Comments
@revolutionisme I did some digging into the code and found this line https://github.com/kserve/kserve/blob/ad557b0d5e5427256d566cf0b6a08117d40f4073/pkg/credentials/service_account_credentials.go#L100, which means it doesn't actually trigger adding the required HTTPS environment variables to the storage-initializer init-container if the secret has an empty `data` segment.

I got it to successfully pull a model encrypted with a KMS key from a bucket using IRSA with the following config (I'm running KServe v0.8.0, for reference). The empty keys for `AWS_ACCESS_KEY_ID`/`AWS_SECRET_ACCESS_KEY` force it to add the env vars to the init containers but don't override the actual credentials from the IAM role (which happens if you add dummy values).

@matty-rose Thanks a lot for this, I missed the tag last time, but got back to it now and it works great. I will close the ticket as it solves the problem, but I still support the idea of making it more explicit.