Upload file with boto, download it with boto3: file gets corrupted (wrong md5 sum)
Hi,
The following code uploads a file to a mock S3 bucket using boto, then downloads the same file to local disk using boto3. I apologize for bringing both libraries into this, but the code I am testing in real life still uses both (I am definitely trying to get rid of all the boto code and fully migrate to boto3, but that isn't going to happen right away).
What happens is that the resulting file does not have the same md5 sum as the original, so it has been corrupted at some point (I am not sure whether during the boto upload or the boto3 download).
This seems to be an issue with moto, because if I comment out the @moto.mock_s3 line (using "real" S3) the script works fine (I also need to change the bucket name to a unique one to avoid collisions).
The script keeps looping (doing the upload/download/md5sum comparison) until it fails, because in my real project the corruption does not happen every time; this test script, however, fails (for me anyway) on the first attempt every time.
The test file that it uploads/downloads can be fetched with:
curl -O https://s3-us-west-2.amazonaws.com/demonstrate-moto-problem/K158154-Mi001716_S1_L001_R1_001.fastq.gz
At this point, running md5sum on it should give 6083801a29ef4ebf78fbbed806e6ab2c:
$ md5sum K158154-Mi001716_S1_L001_R1_001.fastq.gz
6083801a29ef4ebf78fbbed806e6ab2c K158154-Mi001716_S1_L001_R1_001.fastq.gz
Here is the test script (motoprob.py):
import sys
import os
import hashlib

import moto
import boto
import boto3


def md5(fname):
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()


@moto.mock_s3
def doit():
    # upload file to s3
    conn = boto.connect_s3()
    bkt = conn.create_bucket("mybucket")
    key = boto.s3.key.Key(bkt)
    key.key = "foo/bar.fastq.gz"
    print("Uploading...")
    # You can get this file from:
    # https://s3-us-west-2.amazonaws.com/demonstrate-moto-problem/K158154-Mi001716_S1_L001_R1_001.fastq.gz
    key.set_contents_from_filename("K158154-Mi001716_S1_L001_R1_001.fastq.gz")
    # download it again
    dlfile = "bar.fastq.gz"
    if os.path.exists(dlfile):
        os.remove(dlfile)
    print("Downloading...")
    client = boto3.client('s3')
    client.download_file(Bucket="mybucket",
                         Key="foo/bar.fastq.gz", Filename=dlfile)
    md5sum = md5(dlfile)
    if md5sum != "6083801a29ef4ebf78fbbed806e6ab2c":
        print("Incorrect md5sum! {}".format(md5sum))
        sys.exit(1)


while True:
    doit()
Version info:
$ pip freeze |grep oto
boto==2.42.0
boto3==1.4.0
botocore==1.4.48
moto==0.4.29
$ python --version
Python 2.7.12
$ uname -a
Linux f51bec2ad3be 4.9.4-moby #1 SMP Wed Jan 18 17:04:43 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
$ more /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.1 LTS"
Other ways to see that the resulting file is not the same as the original:
$ diff bar.fastq.gz /data/2016-10-27-PT140/K158154-Mi001716_S1_L001_R1_001.fastq.gz
Binary files bar.fastq.gz and /data/2016-10-27-PT140/K158154-Mi001716_S1_L001_R1_001.fastq.gz differ
$ zcat bar.fastq.gz > bar.fastq # this works for the original file
gzip: bar.fastq.gz: invalid compressed data--crc error
gzip: bar.fastq.gz: invalid compressed data--length error
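The same CRC/length failure that zcat reports can also be detected programmatically from Python's standard library, which is handy inside a test loop like the one above. A minimal sketch (the helper name gzip_is_valid is mine, not from the issue):

```python
import gzip
import io


def gzip_is_valid(data: bytes) -> bool:
    """Return True if the bytes decompress cleanly (CRC and length both check out)."""
    try:
        with gzip.GzipFile(fileobj=io.BytesIO(data)) as f:
            while f.read(65536):  # read to EOF so the trailer is verified
                pass
        return True
    except (OSError, EOFError):
        return False


good = gzip.compress(b"some fastq data")
bad = good[:-4] + b"\x00\x00\x00\x00"  # corrupt the trailing ISIZE field

print(gzip_is_valid(good))  # True
print(gzip_is_valid(bad))   # False
```

gzip only validates the CRC32 and uncompressed-length trailer once the stream has been fully read, which is why the helper drains the file rather than reading a single chunk.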
Issue Analytics: created 7 years ago; 16 comments (14 by maintainers).
Top GitHub Comments
There have been a few improvements in how we handle md5sums/etags since 2020 - is anyone still running into issues using the latest version of Moto?
Turns out that disabling multi-threading in boto3's managed transfer methods is quite easy. With this change in moto, your original code works fine.
@spulec: Should I create a pull request for that or do you see a more elegant way of doing this?