leaking AWS credentials in pytest suite
In a project with a large pytest suite that uses random test ordering and several different AWS profiles, the test suite has become fragile and flaky. Every instance of the failure lies in s3fs, regardless of whether it is pyarrow or pandas that is using it.
Since https://github.com/dask/s3fs/pull/244 (released in 0.4), is there a persistent cache of AWS credentials somewhere in s3fs that is not cleared when a pytest fixture changes the AWS credential env-vars (using monkeypatch.setenv and monkeypatch.delenv)? When the flaky tests are run in isolation, they use the mocked AWS env-var credentials, but when they are mixed into a full randomized suite that may run a different test with different credentials first, the flaky tests fail with credential errors.
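To illustrate the suspected behavior: fsspec caches filesystem instances keyed by their constructor arguments, so two no-argument constructions return the same object even if the env-vars change in between. A minimal sketch (assuming an s3fs release where fsspec instance caching applies; the key values are placeholders):

import os
import s3fs

# First construction: credentials are resolved from the current env-vars
# and the instance is stored in fsspec's class-level instance cache.
os.environ["AWS_ACCESS_KEY_ID"] = "first-key"
fs1 = s3fs.S3FileSystem()

# Second construction with identical (empty) kwargs: fsspec returns the
# cached instance, which still holds the first credentials.
os.environ["AWS_ACCESS_KEY_ID"] = "second-key"
fs2 = s3fs.S3FileSystem()

assert fs1 is fs2  # same object, stale credentials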
The code under test is usually vanilla pyarrow or pandas, with no explicit use of s3fs and no kwargs that would pass any explicit credentials. In every case, it is assumed that the env-vars will be read by a default botocore (aiobotocore) session init.
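For context, the code under test looks roughly like this (bucket and key taken from the traceback below); there is no storage_options, so credentials are resolved implicitly:

import pandas as pd

# No storage_options: pandas hands the URL to fsspec/s3fs, and the
# underlying botocore session resolves credentials from the environment.
csv_df = pd.read_csv("s3://a-private-bucket/a-private.csv")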
Example pytest fixtures that set AWS env-vars:
import boto3
import pytest

AWS_REGION = "us-west-2"
AWS_ACCESS_KEY_ID = "test_AWS_ACCESS_KEY_ID"
AWS_SECRET_ACCESS_KEY = "test_AWS_SECRET_ACCESS_KEY"


@pytest.fixture
def aws_region(monkeypatch) -> str:
    monkeypatch.setenv("AWS_DEFAULT_REGION", AWS_REGION)
    yield AWS_REGION
    monkeypatch.delenv("AWS_DEFAULT_REGION", raising=False)


@pytest.fixture
def aws_credentials(aws_region, monkeypatch):
    """Mocked AWS Credentials for moto."""
    boto3.DEFAULT_SESSION = None
    monkeypatch.delenv("AWS_DEFAULT_PROFILE", raising=False)
    monkeypatch.delenv("AWS_PROFILE", raising=False)
    monkeypatch.setenv("AWS_ACCESS_KEY_ID", AWS_ACCESS_KEY_ID)
    monkeypatch.setenv("AWS_SECRET_ACCESS_KEY", AWS_SECRET_ACCESS_KEY)
    monkeypatch.setenv("AWS_SECURITY_TOKEN", "testing")
    monkeypatch.setenv("AWS_SESSION_TOKEN", "testing")
    yield
    monkeypatch.delenv("AWS_DEFAULT_PROFILE", raising=False)
    monkeypatch.delenv("AWS_DEFAULT_REGION", raising=False)
    monkeypatch.delenv("AWS_ACCOUNT", raising=False)
    monkeypatch.delenv("AWS_ACCESS_KEY_ID", raising=False)
    monkeypatch.delenv("AWS_SECRET_ACCESS_KEY", raising=False)
Any given test can use these fixtures to manipulate the credentials available for that test. While most tests use mocks with moto, some tests use actual AWS-live credentials on an account dedicated to AWS-live unit tests. The AWS-live tests need to switch credentials, so they use a similar fixture that manipulates env-vars. With randomized ordering, tests that use pyarrow and pandas with S3 objects are flaky, because s3fs seems to be caching a session or something under the hood.
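A sketch of how such a moto-based test might look (the bucket and file names are hypothetical, and it assumes moto's mock_s3 intercepts the low-level S3 calls s3fs issues, which holds for the sync s3fs releases in the traceback below; the async aiobotocore-based releases may need moto's server mode instead):

import boto3
import pandas as pd
from moto import mock_s3  # moto >= 5 renames this to mock_aws


@mock_s3
def test_read_csv_from_mocked_s3(aws_credentials):
    # aws_credentials guarantees the env-vars point at moto's fake account.
    s3 = boto3.client("s3", region_name="us-west-2")
    s3.create_bucket(
        Bucket="test-bucket",
        CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
    )
    s3.put_object(Bucket="test-bucket", Key="data.csv", Body=b"a,b\n1,2\n")

    # Flaky step: if s3fs cached a session from an earlier test, this read
    # uses stale credentials instead of the mocked ones.
    csv_df = pd.read_csv("s3://test-bucket/data.csv")
    assert csv_df.shape == (1, 2)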
Could a test fixture like this somehow reset or clear any cached session anywhere in s3fs? The API docs don't say much about how clear_instance_cache works. Are there any class methods on S3FileSystem that could clear any cached sessions? How would a test fixture like the above clear s3fs sessions, analogous to boto3.DEFAULT_SESSION = None?
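Something like the following is what I'm imagining. S3FileSystem inherits clear_instance_cache from fsspec.AbstractFileSystem, which empties the class-level instance cache so the next construction re-reads the (monkeypatched) env-vars; whether that covers every cached session in s3fs is exactly the question, so treat this as an assumption, not a confirmed fix:

import pytest
import s3fs


@pytest.fixture
def clean_s3fs(aws_credentials):
    # Drop any cached S3FileSystem instances so the next S3FileSystem()
    # (including the implicit one inside pandas/pyarrow) builds a fresh
    # session from the monkeypatched env-vars.
    s3fs.S3FileSystem.clear_instance_cache()
    yield
    # Clear again so this test's instance cannot leak into later tests.
    s3fs.S3FileSystem.clear_instance_cache()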
Example traceback:
csv_df = pd.read_csv(csv_file)
/opt/conda/envs/gis/lib/python3.7/site-packages/pandas/io/parsers.py:610: in read_csv
return _read(filepath_or_buffer, kwds)
/opt/conda/envs/gis/lib/python3.7/site-packages/pandas/io/parsers.py:462: in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
/opt/conda/envs/gis/lib/python3.7/site-packages/pandas/io/parsers.py:819: in __init__
self._engine = self._make_engine(self.engine)
/opt/conda/envs/gis/lib/python3.7/site-packages/pandas/io/parsers.py:1050: in _make_engine
return mapping[engine](self.f, **self.options) # type: ignore[call-arg]
/opt/conda/envs/gis/lib/python3.7/site-packages/pandas/io/parsers.py:1867: in __init__
self._open_handles(src, kwds)
/opt/conda/envs/gis/lib/python3.7/site-packages/pandas/io/parsers.py:1368: in _open_handles
storage_options=kwds.get("storage_options", None),
/opt/conda/envs/gis/lib/python3.7/site-packages/pandas/io/common.py:563: in get_handle
storage_options=storage_options,
/opt/conda/envs/gis/lib/python3.7/site-packages/pandas/io/common.py:345: in _get_filepath_or_buffer
filepath_or_buffer, mode=fsspec_mode, **(storage_options or {})
/opt/conda/envs/gis/lib/python3.7/site-packages/fsspec/core.py:134: in open
out = self.__enter__()
/opt/conda/envs/gis/lib/python3.7/site-packages/fsspec/core.py:102: in __enter__
f = self.fs.open(self.path, mode=mode)
/opt/conda/envs/gis/lib/python3.7/site-packages/fsspec/spec.py:943: in open
**kwargs,
/opt/conda/envs/gis/lib/python3.7/site-packages/s3fs/core.py:378: in _open
autocommit=autocommit, requester_pays=requester_pays)
/opt/conda/envs/gis/lib/python3.7/site-packages/s3fs/core.py:1097: in __init__
cache_type=cache_type)
/opt/conda/envs/gis/lib/python3.7/site-packages/fsspec/spec.py:1265: in __init__
self.details = fs.info(path)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <s3fs.core.S3FileSystem object at 0x7fd7f1d4b890>, path = 'a-private-bucket/a-private.csv', version_id = None, refresh = False
def info(self, path, version_id=None, refresh=False):
path = self._strip_protocol(path)
if path in ['/', '']:
return {'name': path, 'size': 0, 'type': 'directory'}
kwargs = self.kwargs.copy()
if version_id is not None:
if not self.version_aware:
raise ValueError("version_id cannot be specified if the "
"filesystem is not version aware")
bucket, key, path_version_id = self.split_path(path)
version_id = _coalesce_version_id(path_version_id, version_id)
if self.version_aware or (key and self._ls_from_cache(path) is None) or refresh:
try:
out = self._call_s3(self.s3.head_object, kwargs, Bucket=bucket,
Key=key, **version_id_kw(version_id), **self.req_kw)
return {
'ETag': out['ETag'],
'Key': '/'.join([bucket, key]),
'LastModified': out['LastModified'],
'Size': out['ContentLength'],
'size': out['ContentLength'],
'name': '/'.join([bucket, key]),
'type': 'file',
'StorageClass': "STANDARD",
'VersionId': out.get('VersionId')
}
except ClientError as e:
ee = translate_boto_error(e)
# This could have failed since the thing we are looking for is a prefix.
if isinstance(ee, FileNotFoundError):
return super(S3FileSystem, self).info(path)
else:
> raise ee
E PermissionError: Forbidden
/opt/conda/envs/gis/lib/python3.7/site-packages/s3fs/core.py:548: PermissionError
OK to close this?
Yes, that looks fine. Without passing explicit parameters to the storage backend (storage_options= in read_csv), the instance would be the equivalent of S3FileSystem() without kwargs, if you wanted to examine its state for some reason.
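For example (a sketch; both calls below resolve through fsspec's instance cache with no kwargs):

import fsspec
import s3fs

# With no kwargs, fsspec.filesystem("s3") and S3FileSystem() resolve to
# the same cached instance -- the one read_csv used implicitly above.
fs = fsspec.filesystem("s3")
assert fs is s3fs.S3FileSystem()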