Inconsistent file downloads from s3fs 0.2.* -> 0.3.*
I'm noticing corrupted downloads for some files when using s3fs 0.3.*. The problem surfaced when I was unable to load certain part files of a Parquet dataset into either Spark or Dask after downloading them from an S3 bucket. In Spark it showed up as a java.io.IOException: Could not read footer for file error.
After some investigation I determined that a file downloaded with s3fs produced a different MD5 hash than the same file downloaded with boto3 directly. I further isolated the issue to the 0.3.* update. The following shows the MD5 digest of the same file downloaded with boto3 and with s3fs, for each s3fs version:
boto3==1.9.202
s3fs==0.2.0
Hash produced with boto3: 66793fed3e21e38ebc46493a9e8c46c4
Hash produced with s3fs: 66793fed3e21e38ebc46493a9e8c46c4
boto3==1.9.202
s3fs==0.2.1
Hash produced with boto3: 66793fed3e21e38ebc46493a9e8c46c4
Hash produced with s3fs: 66793fed3e21e38ebc46493a9e8c46c4
boto3==1.9.202
s3fs==0.2.2
Hash produced with boto3: 66793fed3e21e38ebc46493a9e8c46c4
Hash produced with s3fs: 66793fed3e21e38ebc46493a9e8c46c4
boto3==1.9.202
s3fs==0.3.0
Hash produced with boto3: 66793fed3e21e38ebc46493a9e8c46c4
Hash produced with s3fs: c86cf89971dfb3806166afc92a050492
boto3==1.9.202
s3fs==0.3.1
Hash produced with boto3: 66793fed3e21e38ebc46493a9e8c46c4
Hash produced with s3fs: c86cf89971dfb3806166afc92a050492
boto3==1.9.202
s3fs==0.3.2
Hash produced with boto3: 66793fed3e21e38ebc46493a9e8c46c4
Hash produced with s3fs: c86cf89971dfb3806166afc92a050492
boto3==1.9.202
s3fs==0.3.3
Hash produced with boto3: 66793fed3e21e38ebc46493a9e8c46c4
Hash produced with s3fs: c86cf89971dfb3806166afc92a050492
Each of these was produced with the following code:
import hashlib

import boto3
import s3fs


def md5(fname):
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()


print(f"boto3=={boto3.__version__}")
print(f"s3fs=={s3fs.__version__}")

s3 = boto3.client("s3")
s3.download_file("****", "****/part-0-0", "/home/app/data/testing.boto")
print(f"Hash produced with boto3: {md5('/home/app/data/testing.boto')}")

fs = s3fs.S3FileSystem()
fs.get("s3://****/part-0-0", "/home/app/data/testing.s3fs")
print(f"Hash produced with s3fs: {md5('/home/app/data/testing.s3fs')}")
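To narrow down where the two downloads diverge, it may help to compare the files byte by byte instead of only comparing digests. The following is a minimal sketch using only the standard library; the helper names (file_md5, first_difference, compare_downloads) are hypothetical and not part of s3fs or boto3, and it assumes both files already exist on local disk.

```python
import hashlib
import os


def file_md5(path, chunk_size=4096):
    """Return the hex MD5 digest of a file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def first_difference(path_a, path_b, chunk_size=4096):
    """Return the byte offset of the first difference, or None if identical.

    If one file is a strict prefix of the other, the returned offset is
    the length of the shorter file.
    """
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                # Locate the exact differing byte inside this chunk.
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        return offset + i
                # No differing byte in the overlap: one chunk is shorter.
                return offset + min(len(a), len(b))
            if not a:
                # Both files reached EOF at the same point.
                return None
            offset += len(a)


def compare_downloads(path_a, path_b):
    """Summarize size, digest equality, and first differing offset."""
    return {
        "size_a": os.path.getsize(path_a),
        "size_b": os.path.getsize(path_b),
        "md5_match": file_md5(path_a) == file_md5(path_b),
        "first_difference": first_difference(path_a, path_b),
    }
```

With the paths from the script above, this would be called as compare_downloads("/home/app/data/testing.boto", "/home/app/data/testing.s3fs"). Differing sizes would suggest a truncated or padded download, while equal sizes with a differing offset would point at corrupted content somewhere in the middle of the file.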
Issue Analytics
- State:
- Created 4 years ago
- Comments:8 (4 by maintainers)
OK, could you also try with s3fs==0.3.4 (released yesterday)?
Looks like fsspec==0.4.3 for all of those outputs
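Since the fsspec version matters here as well, a report like the one above could print every relevant package version in one go. This is a minimal sketch that assumes Python 3.8+ (for importlib.metadata). The function name installed_versions and the default package list are illustrative choices, not something taken from the original script.

```python
from importlib import metadata


def installed_versions(packages=("s3fs", "fsspec", "boto3", "botocore")):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}=={version}" if version else f"{name}: not installed")
```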