Inconsistent file downloads from s3fs 0.2.* -> 0.3.*

I'm seeing corrupted downloads for some files when using s3fs 0.3.*. The problem first showed up when I was unable to load certain part files of a Parquet dataset into either Spark or Dask after downloading them from an S3 bucket. In Spark it surfaced as a "java.io.IOException: Could not read footer for file" error.
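
As a quick sanity check on a downloaded part file, the sketch below tests for the 4-byte "PAR1" marker that an unencrypted Parquet file carries at both its start and its end; a damaged tail is one plausible cause of a footer error like the one above. The helper name check_parquet_magic is hypothetical, and the check says nothing about whether the bytes in between are intact.

```python
import os

# Marker found at both the start and the end of an unencrypted Parquet file.
PARQUET_MAGIC = b"PAR1"


def check_parquet_magic(path):
    """Return (head_ok, tail_ok) for the Parquet magic bytes of a local file."""
    size = os.path.getsize(path)
    if size < 2 * len(PARQUET_MAGIC):
        # Too short to hold both markers, so it cannot be a complete file.
        return (False, False)
    with open(path, "rb") as f:
        head = f.read(len(PARQUET_MAGIC))
        f.seek(-len(PARQUET_MAGIC), os.SEEK_END)
        tail = f.read(len(PARQUET_MAGIC))
    return (head == PARQUET_MAGIC, tail == PARQUET_MAGIC)
```

If the boto3 copy returns (True, True) and the s3fs copy does not, the download itself is the likely culprit rather than the reader.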

After some investigation I determined that files downloaded with s3fs produce different MD5 hashes than the same files downloaded with boto3 directly. I further isolated the issue to the 0.3.* releases. The following shows the same file downloaded with boto3 and with s3fs under each s3fs version, then run through an MD5 digest:

boto3==1.9.202
s3fs==0.2.0
Hash produced with boto3: 66793fed3e21e38ebc46493a9e8c46c4
Hash produced with s3fs: 66793fed3e21e38ebc46493a9e8c46c4

boto3==1.9.202
s3fs==0.2.1
Hash produced with boto3: 66793fed3e21e38ebc46493a9e8c46c4
Hash produced with s3fs: 66793fed3e21e38ebc46493a9e8c46c4

boto3==1.9.202
s3fs==0.2.2
Hash produced with boto3: 66793fed3e21e38ebc46493a9e8c46c4
Hash produced with s3fs: 66793fed3e21e38ebc46493a9e8c46c4

boto3==1.9.202
s3fs==0.3.0
Hash produced with boto3: 66793fed3e21e38ebc46493a9e8c46c4
Hash produced with s3fs: c86cf89971dfb3806166afc92a050492

boto3==1.9.202
s3fs==0.3.1
Hash produced with boto3: 66793fed3e21e38ebc46493a9e8c46c4
Hash produced with s3fs: c86cf89971dfb3806166afc92a050492

boto3==1.9.202
s3fs==0.3.2
Hash produced with boto3: 66793fed3e21e38ebc46493a9e8c46c4
Hash produced with s3fs: c86cf89971dfb3806166afc92a050492

boto3==1.9.202
s3fs==0.3.3
Hash produced with boto3: 66793fed3e21e38ebc46493a9e8c46c4
Hash produced with s3fs: c86cf89971dfb3806166afc92a050492

Each of these was produced with the following code:

import hashlib

import boto3
import s3fs

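def md5(fname):
    """Return the hex MD5 digest of a local file."""
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        # Read in 4 KiB chunks so large files are never held in memory at once.
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()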

print(f"boto3=={boto3.__version__}")
print(f"s3fs=={s3fs.__version__}")

s3 = boto3.client("s3")
s3.download_file("****", "****/part-0-0", "/home/app/data/testing.boto")
print(f"Hash produced with boto3: {md5('/home/app/data/testing.boto')}")

fs = s3fs.S3FileSystem()
fs.get("s3://****/part-0-0", "/home/app/data/testing.s3fs")
print(f"Hash produced with s3fs: {md5('/home/app/data/testing.s3fs')}")
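
A hash mismatch only says that the two files differ, not where. As a possible next step, here is a minimal sketch that reports the offset of the first differing byte between two local files. The helper name first_difference and its chunk_size parameter are hypothetical and not part of boto3 or s3fs.

```python
def first_difference(path_a, path_b, chunk_size=1 << 20):
    """Return the offset of the first differing byte, or None if identical.

    If one file is a strict prefix of the other, the returned offset is the
    length of the shorter file.
    """
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                # Walk the mismatching chunk to find the exact byte.
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        return offset + i
                # No differing byte in the overlap: one file ended early.
                return offset + min(len(a), len(b))
            if not a:
                # Both files are exhausted with no mismatch.
                return None
            offset += len(a)
```

Calling first_difference("/home/app/data/testing.boto", "/home/app/data/testing.s3fs") and comparing os.path.getsize for the two paths would show whether the s3fs copy is truncated, padded, or altered somewhere in the middle.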

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

1 reaction
TomAugspurger commented, Aug 30, 2019

OK, could you also try with s3fs==0.3.4 (released yesterday)?

1 reaction
ewellinger commented, Aug 29, 2019

Looks like fsspec==0.4.3 for all of those outputs
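
Because the thread tracks the fsspec version alongside s3fs, it can help to capture every relevant version in one step when reporting results. This is a minimal sketch using the standard library's importlib.metadata (Python 3.8+). The helper name installed_versions and its default package list are assumptions chosen for this issue.

```python
from importlib import metadata


def installed_versions(packages=("boto3", "botocore", "s3fs", "fsspec")):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Not installed in the current environment.
            versions[name] = None
    return versions
```

Printing the result of installed_versions() next to each pair of hashes makes it clear which combination of packages produced them.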
