upload file with boto, download it with boto3: file gets corrupted (wrong md5 sum)

Hi,

The following code uploads a file to a mock S3 bucket using boto, and downloads the same file to the local disk using boto3. I apologize for bringing both of the libraries into this, but the code I am testing in real life still uses both (definitely trying to get rid of all the boto code and fully migrate to boto3 but that isn’t going to happen right away).

What happens is that the resulting file does not have the same md5 sum as the original file so it has been corrupted at some point (not sure if it was during the boto upload or the boto3 download).

This seems to be an issue with moto because if I comment out the line @moto.mock_s3 (using ‘real’ S3) the script works fine (I also need to change the bucket name to a unique one to avoid collisions).

The script loops (upload, download, md5sum comparison) until it fails, because in my real project the corruption does not happen every time. This test script, however, fails on the first attempt every time (for me, anyway).

The test file that it uploads/downloads is available at the URL below.

You can download it with:

curl -O  https://s3-us-west-2.amazonaws.com/demonstrate-moto-problem/K158154-Mi001716_S1_L001_R1_001.fastq.gz

At this point if you run md5sum on it you should get 6083801a29ef4ebf78fbbed806e6ab2c:

$ md5sum K158154-Mi001716_S1_L001_R1_001.fastq.gz
6083801a29ef4ebf78fbbed806e6ab2c  K158154-Mi001716_S1_L001_R1_001.fastq.gz

Here is the test script (motoprob.py):

import sys
import os
import hashlib
import moto
import boto
import boto3

def md5(fname):
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()



@moto.mock_s3
def doit():
    # upload file to s3
    conn = boto.connect_s3()
    bkt = conn.create_bucket("mybucket")
    key = boto.s3.key.Key(bkt)
    key.key = "foo/bar.fastq.gz"
    print("Uploading...")

    # You can get this file from:
    #  https://s3-us-west-2.amazonaws.com/demonstrate-moto-problem/K158154-Mi001716_S1_L001_R1_001.fastq.gz
    key.set_contents_from_filename("K158154-Mi001716_S1_L001_R1_001.fastq.gz")

    # download it again
    dlfile = "bar.fastq.gz"
    if os.path.exists(dlfile):
        os.remove(dlfile)

    print("Downloading...")

    client = boto3.client('s3')
    client.download_file(Bucket="mybucket",
      Key="foo/bar.fastq.gz", Filename="bar.fastq.gz")


    md5sum = md5(dlfile)
    if md5sum != "6083801a29ef4ebf78fbbed806e6ab2c":
        print("Incorrect md5sum! {}".format(md5sum))
        sys.exit(1)


while True:
    doit()

Version info:

$ pip freeze |grep oto
boto==2.42.0
boto3==1.4.0
botocore==1.4.48
moto==0.4.29

$ python --version
Python 2.7.12

$ uname -a
Linux f51bec2ad3be 4.9.4-moby #1 SMP Wed Jan 18 17:04:43 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

$ more /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.1 LTS"

Other ways to see that the resulting file is not the same as the original:

$ diff bar.fastq.gz /data/2016-10-27-PT140/K158154-Mi001716_S1_L001_R1_001.fastq.gz
Binary files bar.fastq.gz and /data/2016-10-27-PT140/K158154-Mi001716_S1_L001_R1_001.fastq.gz differ


$ zcat bar.fastq.gz > bar.fastq # this works for the original file

gzip: bar.fastq.gz: invalid compressed data--crc error

gzip: bar.fastq.gz: invalid compressed data--length error
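
To narrow down where the two files start to diverge, a byte-level comparison can complement diff and zcat. The following is a minimal sketch using only the standard library. The helper name first_difference and the 4096-byte chunk size are illustrative choices and are not part of the original report.

```python
def first_difference(path_a, path_b, chunk_size=4096):
    """Return the offset of the first differing byte, or None if the files match.

    If one file is a strict prefix of the other, the returned offset is the
    length of the shorter file.
    """
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                # Walk the mismatching chunks to find the exact byte.
                pairs = zip(bytearray(chunk_a), bytearray(chunk_b))
                for i, (a, b) in enumerate(pairs):
                    if a != b:
                        return offset + i
                # Common part is equal, so one chunk is shorter than the other.
                return offset + min(len(chunk_a), len(chunk_b))
            if not chunk_a:
                # Both files ended at the same point without any difference.
                return None
            offset += len(chunk_a)
```

For the files in this report it could be called as first_difference("bar.fastq.gz", "K158154-Mi001716_S1_L001_R1_001.fastq.gz"). The offset alone does not say whether the upload or the download is at fault, but it shows how far into the file the data is still intact.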

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Comments: 16 (14 by maintainers)

Top GitHub Comments

1 reaction
bblommers commented, May 28, 2022

There have been a few improvements in how we handle md5sums/etags since 2020 - is anyone still running into issues using the latest version of Moto?

1 reaction
snordhausen commented, Jan 26, 2017

Turns out that disabling multi-threading in managed transfer methods is quite easy. With this change in moto, your original code works fine:

diff --git a/moto/core/models.py b/moto/core/models.py
index 60e744f..03d1390 100644
--- a/moto/core/models.py
+++ b/moto/core/models.py
@@ -4,6 +4,7 @@ import functools
 import inspect
 import re
 
+import boto3
 from httpretty import HTTPretty
 from .responses import metadata_response
 from .utils import convert_regex_to_flask_path
@@ -11,6 +12,7 @@ from .utils import convert_regex_to_flask_path
 
 class MockAWS(object):
     nested_count = 0
+    original_create_transfer_manager = None
 
     def __init__(self, backends):
         self.backends = backends
@@ -38,6 +40,15 @@ class MockAWS(object):
         if not HTTPretty.is_enabled():
             HTTPretty.enable()
 
+        if self.__class__.original_create_transfer_manager is None:
+            boto3.client('s3') # Ensure that boto3.s3 exists
+            original_create_transfer_manager = boto3.s3.transfer.create_transfer_manager
+            self.__class__.original_create_transfer_manager = original_create_transfer_manager
+            def patched_create_transfer_manager(client, config, *args, **kwargs):
+                config.use_threads = False
+                return original_create_transfer_manager(client, config, *args, **kwargs)
+            boto3.s3.transfer.create_transfer_manager = patched_create_transfer_manager
+
         for method in HTTPretty.METHODS:
             backend = list(self.backends.values())[0]
             for key, value in backend.urls.items():
@@ -63,6 +74,8 @@ class MockAWS(object):
         if self.__class__.nested_count == 0:
             HTTPretty.disable()
             HTTPretty.reset()
+            boto3.s3.transfer.create_transfer_manager = self.__class__.original_create_transfer_manager
+            self.__class__.original_create_transfer_manager = None
 
     def decorate_callable(self, func, reset):
         def wrapper(*args, **kwargs):

@spulec: Should I create a pull request for that or do you see a more elegant way of doing this?
