
`filecache` and `simplecache` both ignore `version_id` for versioned S3 filesystems


Caching S3 data with, e.g., `simplecache` works fine for unversioned access. However, s3fs also supports versioned buckets.

When a `version_id` is passed to `open` through a caching filesystem, the behavior is incorrect: the cache appears to drop the version argument and always requests, and subsequently caches, the latest version.

The code below reproduces the issue, provided you supply the name of a versioned bucket. Removing the caching layer makes it behave correctly.

Possible fixes:

  • make caching aware of S3 version IDs (may need to “hack” the local disk cache paths to include the version IDs)
  • make caches raise an error when they detect an attempt to wrap a versioned bucket, in order to protect the user from this surprising behavior

Please let me know if I can provide any additional information or code!

import random
import time

import fsspec
import numpy as np
import s3fs


def eval_version_caching():
    bucket = "MY_TEST_BUCKET"

    fs = s3fs.S3FileSystem(version_aware=True)
    print(fs.ls(bucket))

    version_dir = f"tmp/dbg-version-eval-{random.randint(0, 100000):010d}"
    data_path = f"{bucket}/{version_dir}/data.npy"

    # Two 2000 x 2000 float64 arrays: 2000 * 2000 * 8 bytes = 32 MB each
    some_data = np.ones((2000, 2000), dtype=np.float64)
    some_other_data = np.ones((2000, 2000), dtype=np.float64) * 42.0

    with fs.open(data_path, "wb") as f_out:
        np.save(f_out, some_data)

    print("Saved first version... Waiting to flush.")
    time.sleep(1.0)

    with fs.open(data_path, "wb") as f_out:
        np.save(f_out, some_other_data)

    print("Saved second version... Waiting to flush.")
    time.sleep(1.0)

    versions = fs.object_version_info(data_path)
    assert 2 == len(versions)

    # Versions are listed in reverse chronological order. In our case, we call the first version "a", and the second
    # version "b".
    id_rev_a = versions[1]["VersionId"]
    id_rev_b = versions[0]["VersionId"]
    print(id_rev_a)
    print(id_rev_b)

    # The cached FS always returns the latest version, no matter what version_id we provide.
    cached_fs = fsspec.filesystem(
        "filecache", target_protocol="s3", target_options={"version_aware": True}, cache_storage="/tmp/s3_file_cache"
    )
    # Using a non-cached filesystem instead retrieves the correct versions:
    # non_cached_fs = s3fs.S3FileSystem(version_aware=True)
    with cached_fs.open(data_path, "rb", version_id=id_rev_a) as f_in:
        read_data = np.load(f_in)
        print("Revision A:")
        print(read_data.mean())
        rev_a_mean = read_data.mean()

    with cached_fs.open(data_path, "rb", version_id=id_rev_b) as f_in:
        read_data = np.load(f_in)
        print("Revision B:")
        print(read_data.mean())
        rev_b_mean = read_data.mean()

    # With the buggy cached filesystem both reads return the latest version,
    # so the means are equal and this assertion fails.
    assert abs(rev_a_mean - rev_b_mean) > 1e-5



def main():
    eval_version_caching()


if __name__ == "__main__":
    main()

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 11 (11 by maintainers)

Top GitHub Comments

1 reaction
martindurant commented, Aug 4, 2022

I believe this is still the case. If you have a HTTP URL embedding the version, that should be fine.

1 reaction
martindurant commented, May 12, 2021

Yes, you are quite right: the cache only uses the filename as input to derive the local path, and (apparently) doesn’t pass arguments on. For the specific case of s3, you can embed the version into the filename “bucket/path/key?versionId=…”, so that might be enough.
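Based on that suggestion, a sketch of the workaround: embed the version in the key itself before handing the path to the cached filesystem. The `?versionId=` suffix syntax comes from the comment above; whether the caching layers pass such paths through unchanged is an assumption to verify.

```python
def versioned_path(path: str, version_id: str) -> str:
    # Embed the S3 version in the filename, e.g. "bucket/key?versionId=abc",
    # so the cache derives a distinct local path per version instead of
    # collapsing all versions onto one cached file.
    return f"{path}?versionId={version_id}"


# Hypothetical usage with the cached filesystem from the reproduction above:
# with cached_fs.open(versioned_path(data_path, id_rev_a), "rb") as f:
#     read_data = np.load(f)
```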


