`filecache` and `simplecache` both ignore `version_id` for versioned S3 filesystems
Caching S3 data with, e.g., `simplecache` works fine. However, S3FS also supports versioned buckets. When a `version_id` is provided to `open`, using a cache leads to incorrect behavior: the cache seems to drop the version argument, so it always requests, and subsequently caches, the latest version.
The code below should reproduce the issue, as long as you provide it with the bucket name for a versioned bucket. Removing the caching fixes the issue.
Possible fixes:
- make caching aware of S3 version IDs (may need to “hack” the local disk cache paths to include the version IDs)
- make caches raise an error when they detect an attempt to wrap a versioned bucket, in order to protect the user from this surprising behavior
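Both options can be sketched in a few lines. The helper names below (`versioned_cache_key`, `reject_version_id`) are hypothetical and not part of fsspec or s3fs; the sketch only assumes that the cache derives its local file name by hashing a string.

```python
import hashlib
from typing import Any, Dict, Optional


def versioned_cache_key(path: str, version_id: Optional[str] = None) -> str:
    """Hypothetical helper: derive a local cache file name from path and version."""
    digest = hashlib.sha256(path.encode("utf-8"))
    if version_id is not None:
        # Mix the version ID into the hash so each version gets its own cache file.
        digest.update(b"\0" + version_id.encode("utf-8"))
    return digest.hexdigest()


def reject_version_id(open_kwargs: Dict[str, Any]) -> None:
    """Hypothetical guard: fail loudly rather than silently serve the latest version."""
    if open_kwargs.get("version_id") is not None:
        raise ValueError("version_id is not supported when opening files through this cache")
```

With the first helper, two versions of the same object map to different local files; with the second, a call such as `open(path, version_id=...)` through the cache would raise instead of returning the wrong data.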
Please let me know if I can provide any additional information or code!
import random
import time

import fsspec
import numpy as np
import s3fs


def eval_version_caching():
    bucket = "MY_TEST_BUCKET"
    fs = s3fs.S3FileSystem(version_aware=True)
    print(fs.ls(bucket))
    version_dir = f"tmp/dbg-version-eval-{random.randint(0, 100000):010d}"
    data_path = f"{bucket}/{version_dir}/data.npy"
    # 8 * 2 * 2 Mb, roughly = 32 Mb
    some_data = np.ones((2000, 2000), dtype=np.float64) * 1.0
    some_other_data = np.ones((2000, 2000), dtype=np.float64) * 42.0
    with fs.open(data_path, "wb") as f_out:
        np.save(f_out, some_data)
    print("Saved first version... Waiting to flush.")
    time.sleep(1.0)
    with fs.open(data_path, "wb") as f_out:
        np.save(f_out, some_other_data)
    print("Saved second version... Waiting to flush.")
    time.sleep(1.0)
    versions = fs.object_version_info(data_path)
    assert 2 == len(versions)
    # Versions are listed in reverse chronological order. In our case, we call the first version "a", and the second
    # version "b".
    id_rev_a = versions[1]["VersionId"]
    id_rev_b = versions[0]["VersionId"]
    print(id_rev_a)
    print(id_rev_b)
    # The cached FS always returns the latest version, no matter what version_id we provide.
    cached_fs = fsspec.filesystem(
        "filecache", target_protocol="s3", target_options={"version_aware": True}, cache_storage="/tmp/s3_file_cache"
    )
    # Using this non cached filesystem correctly retrieves the correct versions
    # non_cached_fs = s3fs.S3FileSystem(version_aware=True)
    with cached_fs.open(data_path, "rb", version_id=id_rev_a) as f_out:
        read_data = np.load(f_out)
    print("Revision A:")
    print(read_data.mean())
    rev_a_mean = read_data.mean()
    with cached_fs.open(data_path, "rb", version_id=id_rev_b) as f_out:
        read_data = np.load(f_out)
    print("Revision B:")
    print(read_data.mean())
    rev_b_mean = read_data.mean()
    assert abs(rev_a_mean - rev_b_mean) > 1e-5


def main():
    eval_version_caching()


if __name__ == "__main__":
    main()
I believe this is still the case. If you have an HTTP URL embedding the version, that should be fine.
Yes, you are quite right: the cache only uses the filename as input to derive the local path, and (apparently) doesn’t pass arguments on. For the specific case of s3, you can embed the version into the filename “bucket/path/key?versionId=…”, so that might be enough.
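A minimal sketch of that workaround, assuming the `?versionId=` path suffix described above. `with_version` is a hypothetical helper, not an s3fs or fsspec function, and whether the cache then keeps versions apart is the suggestion made in this comment, not something verified here.

```python
def with_version(path: str, version_id: str) -> str:
    """Hypothetical helper: embed an S3 version ID into the path string."""
    if "?versionId=" in path:
        raise ValueError(f"path already carries a version ID: {path!r}")
    return f"{path}?versionId={version_id}"


# Possible usage with the names from the reproduction script (assumption:
# the cache derives its local path from this full string, so each version
# would be stored separately):
#
#   with cached_fs.open(with_version(data_path, id_rev_a), "rb") as f_in:
#       rev_a = np.load(f_in)
#   with cached_fs.open(with_version(data_path, id_rev_b), "rb") as f_in:
#       rev_b = np.load(f_in)
```

Because the version is part of the filename rather than a keyword argument, it no longer depends on the cache passing `version_id` through to the target filesystem.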