
leaking AWS credentials in pytest suite

See original GitHub issue

In a project with a large pytest suite that uses randomized test order and several different AWS profiles, the tests have become fragile and flaky. Every failure originates in s3fs, regardless of whether it is pyarrow or pandas doing the calling.

Since https://github.com/dask/s3fs/pull/244 (released in 0.4), is there a persistent cache of AWS credentials somewhere in s3fs that is not cleared when a pytest fixture changes the AWS credential env-vars (using monkeypatch.setenv and monkeypatch.delenv)? When the flaky tests are run in isolation, they use the mocked AWS env-var credentials; but when they are mixed into a full randomized suite, where a different test with different credentials may run first, they fail with credential errors.

The code under test is usually vanilla pyarrow or pandas, with no explicit use of s3fs and no kwargs that would pass explicit credentials. In every case, it is assumed that the env-vars will be read when a default botocore (aiobotocore) session is initialized.

Example pytest fixtures that set AWS env-vars:


import boto3
import pytest

AWS_REGION = "us-west-2"
AWS_ACCESS_KEY_ID = "test_AWS_ACCESS_KEY_ID"
AWS_SECRET_ACCESS_KEY = "test_AWS_SECRET_ACCESS_KEY"


@pytest.fixture
def aws_region(monkeypatch) -> str:
    monkeypatch.setenv("AWS_DEFAULT_REGION", AWS_REGION)
    yield AWS_REGION
    monkeypatch.delenv("AWS_DEFAULT_REGION", raising=False)


@pytest.fixture
def aws_credentials(aws_region, monkeypatch):
    """Mocked AWS Credentials for moto."""
    boto3.DEFAULT_SESSION = None
    monkeypatch.delenv("AWS_DEFAULT_PROFILE", raising=False)
    monkeypatch.delenv("AWS_PROFILE", raising=False)
    monkeypatch.setenv("AWS_ACCESS_KEY_ID", AWS_ACCESS_KEY_ID)
    monkeypatch.setenv("AWS_SECRET_ACCESS_KEY", AWS_SECRET_ACCESS_KEY)
    monkeypatch.setenv("AWS_SECURITY_TOKEN", "testing")
    monkeypatch.setenv("AWS_SESSION_TOKEN", "testing")

    yield

    monkeypatch.delenv("AWS_DEFAULT_PROFILE", raising=False)
    monkeypatch.delenv("AWS_DEFAULT_REGION", raising=False)
    monkeypatch.delenv("AWS_ACCOUNT", raising=False)
    monkeypatch.delenv("AWS_ACCESS_KEY_ID", raising=False)
    monkeypatch.delenv("AWS_SECRET_ACCESS_KEY", raising=False)

Any given test can use that fixture to control the credentials available to it. While most tests use mocks with moto, some use live AWS credentials on an account dedicated to live unit tests; those tests switch credentials with a similar env-var fixture. Under randomized test order, use of pyarrow and pandas with S3 objects is flaky because s3fs appears to be caching a session (or something like one) under the hood.
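That symptom is consistent with fsspec-style instance caching: filesystem classes memoize instances keyed by their constructor arguments, so a later no-argument construction returns an object whose session was built from whatever env-vars were set the first time. A toy stand-in (plain Python, not the real fsspec code) illustrates the failure mode:

```python
import os

class CachedFS:
    """Toy stand-in for fsspec's instance cache: identical constructor
    arguments yield the same cached instance, with credentials frozen
    at first construction."""
    _cache = {}

    def __new__(cls, **kwargs):
        key = tuple(sorted(kwargs.items()))
        if key not in cls._cache:
            inst = super().__new__(cls)
            # Credentials are captured once, at construction time,
            # mimicking a botocore session reading env-vars on init.
            inst.key_id = os.environ.get("AWS_ACCESS_KEY_ID")
            cls._cache[key] = inst
        return cls._cache[key]

    @classmethod
    def clear_instance_cache(cls):
        cls._cache.clear()

# An earlier test sets live-ish credentials and touches "S3":
os.environ["AWS_ACCESS_KEY_ID"] = "live-key"
fs1 = CachedFS()

# A later test monkeypatches mock credentials -- but gets the stale instance:
os.environ["AWS_ACCESS_KEY_ID"] = "mock-key"
fs2 = CachedFS()
assert fs2 is fs1 and fs2.key_id == "live-key"   # stale credentials

# Clearing the cache forces a rebuild from the current env-vars:
CachedFS.clear_instance_cache()
fs3 = CachedFS()
assert fs3.key_id == "mock-key"
```

The names here (`CachedFS`, `key_id`) are illustrative only; the real mechanism lives in fsspec's `AbstractFileSystem`, which s3fs inherits from.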

Could a test fixture like this somehow reset or clear any cached session anywhere in s3fs? The API docs say little about how clear_instance_cache works. Are there class methods on S3FileSystem that can clear cached sessions? How would a test fixture like the one above clear s3fs sessions, analogous to boto3.DEFAULT_SESSION = None?

Example traceback:


    csv_df = pd.read_csv(csv_file)
/opt/conda/envs/gis/lib/python3.7/site-packages/pandas/io/parsers.py:610: in read_csv
    return _read(filepath_or_buffer, kwds)
/opt/conda/envs/gis/lib/python3.7/site-packages/pandas/io/parsers.py:462: in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
/opt/conda/envs/gis/lib/python3.7/site-packages/pandas/io/parsers.py:819: in __init__
    self._engine = self._make_engine(self.engine)
/opt/conda/envs/gis/lib/python3.7/site-packages/pandas/io/parsers.py:1050: in _make_engine
    return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
/opt/conda/envs/gis/lib/python3.7/site-packages/pandas/io/parsers.py:1867: in __init__
    self._open_handles(src, kwds)
/opt/conda/envs/gis/lib/python3.7/site-packages/pandas/io/parsers.py:1368: in _open_handles
    storage_options=kwds.get("storage_options", None),
/opt/conda/envs/gis/lib/python3.7/site-packages/pandas/io/common.py:563: in get_handle
    storage_options=storage_options,
/opt/conda/envs/gis/lib/python3.7/site-packages/pandas/io/common.py:345: in _get_filepath_or_buffer
    filepath_or_buffer, mode=fsspec_mode, **(storage_options or {})
/opt/conda/envs/gis/lib/python3.7/site-packages/fsspec/core.py:134: in open
    out = self.__enter__()
/opt/conda/envs/gis/lib/python3.7/site-packages/fsspec/core.py:102: in __enter__
    f = self.fs.open(self.path, mode=mode)
/opt/conda/envs/gis/lib/python3.7/site-packages/fsspec/spec.py:943: in open
    **kwargs,
/opt/conda/envs/gis/lib/python3.7/site-packages/s3fs/core.py:378: in _open
    autocommit=autocommit, requester_pays=requester_pays)
/opt/conda/envs/gis/lib/python3.7/site-packages/s3fs/core.py:1097: in __init__
    cache_type=cache_type)
/opt/conda/envs/gis/lib/python3.7/site-packages/fsspec/spec.py:1265: in __init__
    self.details = fs.info(path)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <s3fs.core.S3FileSystem object at 0x7fd7f1d4b890>, path = 'a-private-bucket/a-private.csv', version_id = None, refresh = False

    def info(self, path, version_id=None, refresh=False):
        path = self._strip_protocol(path)
        if path in ['/', '']:
            return {'name': path, 'size': 0, 'type': 'directory'}
        kwargs = self.kwargs.copy()
        if version_id is not None:
            if not self.version_aware:
                raise ValueError("version_id cannot be specified if the "
                                 "filesystem is not version aware")
        bucket, key, path_version_id = self.split_path(path)
        version_id = _coalesce_version_id(path_version_id, version_id)
        if self.version_aware or (key and self._ls_from_cache(path) is None) or refresh:
            try:
                out = self._call_s3(self.s3.head_object, kwargs, Bucket=bucket,
                                    Key=key, **version_id_kw(version_id), **self.req_kw)
                return {
                    'ETag': out['ETag'],
                    'Key': '/'.join([bucket, key]),
                    'LastModified': out['LastModified'],
                    'Size': out['ContentLength'],
                    'size': out['ContentLength'],
                    'name': '/'.join([bucket, key]),
                    'type': 'file',
                    'StorageClass': "STANDARD",
                    'VersionId': out.get('VersionId')
                }
            except ClientError as e:
                ee = translate_boto_error(e)
                # This could have failed since the thing we are looking for is a prefix.
                if isinstance(ee, FileNotFoundError):
                    return super(S3FileSystem, self).info(path)
                else:
>                   raise ee
E                   PermissionError: Forbidden

/opt/conda/envs/gis/lib/python3.7/site-packages/s3fs/core.py:548: PermissionError

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

martindurant commented, Apr 20, 2021 (1 reaction)

OK to close this?

martindurant commented, Apr 6, 2021 (1 reaction)

Yes, that looks fine. Without passing explicit parameters to the storage backend (storage_options=) in read_csv, the instance would be the equivalent of S3FileSystem() without kwargs - if you wanted to examine its state for some reason.
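Following the maintainer's point, the instance pandas used implicitly can be retrieved for inspection simply by constructing it again with no kwargs. A sketch (the guarded import and helper name are assumptions of this sketch):

```python
import importlib.util

def peek_default_s3fs():
    """Return the default-kwargs S3FileSystem, or None if s3fs is absent.

    With fsspec's instance cache, S3FileSystem() with no kwargs returns
    the same cached object that pandas/pyarrow used implicitly, so its
    state (e.g. its underlying session) can be examined in a test.
    """
    if importlib.util.find_spec("s3fs") is None:
        return None
    import s3fs
    return s3fs.S3FileSystem()

fs = peek_default_s3fs()
# fs is None when s3fs is not installed; otherwise it is the (possibly
# cached) default instance.
```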
