
[Storage] Blob download: short reads

See original GitHub issue
  • Package Name: azure-storage-blob
  • Package Version: 12.5.0
  • Operating System: Debian 9
  • Python Version: python3.6

Other relevant packages: azure-core==1.10.0

Describe the bug

Reading the blob content can return too few bytes, leading to data corruption.

To Reproduce

Steps to reproduce the behavior:

  1. Create some blobs with test data (not too small: ideally larger than the maximum get-blob size, which defaults to 32 MB).
  2. Download these blobs completely via BlobClient.download_blob().readall(). If the download returns without error, compare the content with what was uploaded in step 1.

Note that you have to repeat step 2 about one million times (more or less, depending on the network; that is roughly the order of the failure probability we observe); alternatively, take other measures to increase the likelihood of a connection error while downloading the blob content. A minimal reproduction sketch follows.
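For concreteness, a hedged sketch of the reproduction; the connection-string environment variable, the container name, and the blob name are placeholders, and the 48 MB payload is just one size beyond the 32 MB default:

```python
# Reproduction sketch. Placeholders: the connection string env var, the
# "repro" container (assumed to exist), and the blob name.
import os

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"]
)
blob = service.get_blob_client(container="repro", blob="big-blob")

# Larger than the 32 MB default for a single GET, so the download is chunked.
payload = os.urandom(48 * 1024 * 1024)
blob.upload_blob(payload, overwrite=True)

for attempt in range(1_000_000):  # failure rate is roughly one in a million
    data = blob.download_blob().readall()
    if data != payload:
        raise RuntimeError(
            "short read on attempt %d: got %d bytes, expected %d"
            % (attempt, len(data), len(payload))
        )
```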

Expected behavior

When the blob download completes without an exception, the downloaded content must match exactly what was uploaded, including the length.

Additional context

We observe this mainly when downloading parts of blobs, i.e. when passing non-zero offset and length parameters to BlobClient.download_blob(), for example:
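Continuing the client setup from the sketch above (the offset and length values are illustrative):

```python
# A ranged read of the kind where we mainly observe short reads.
offset, length = 10 * 1024 * 1024, 8 * 1024 * 1024  # illustrative values
chunk = blob.download_blob(offset=offset, length=length).readall()
if len(chunk) != length:
    raise RuntimeError("short read: got %d of %d bytes" % (len(chunk), length))
```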

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 16 (8 by maintainers)

Top GitHub Comments

1 reaction
jochen-ott-by commented, Feb 22, 2021

I’m very concerned about the retry at this point of the library stack. Let me take some time to explain why, and how complex a correct solution would be. I hope this convinces you to simply remove the retries altogether, which in my opinion is the proper way to deal with the problem at this point of the code.

First, as some context, consider what this code is used for: it is the basic building block for a lot of azure python sdk libraries. As such, it should be extremely careful about making assumptions about:

  • how it is used by other azure python sdk libraries
  • how servers behave
  • how underlying libraries (requests, urllib3 in this case) behave.

So this core library should avoid any assumptions beyond what the HTTP standard allows.

When sticking to what the HTTP standard says, a correct retry implementation is pretty involved. I listed some of the things to consider below; I make no claim that this list is complete. Some of the points are not blockers, in that they merely limit the usefulness of the retry to certain situations (such as the signature problem), or they make the retry much less efficient (when re-submitting the original request after all).

  • Not all requests should be retried; retries should be limited to safe HTTP methods. As this is about streaming the body, the only HTTP method this applies to is GET.
  • Making the request again with a certain “range” header requires choosing proper parameters for the requested range. I discussed in some detail in this comment: https://github.com/Azure/azure-sdk-for-python/pull/16783#discussion_r579411043 why this is not possible in general with the requests/urllib3 stack. So one either has to restrict this “retry with subrange” to cases that are known to work (such as: the content encoding is the identity encoding and the transfer encoding is not chunked). Alternatively, the retry could re-submit the original request without attempting to set a new “range” parameter, and discard any data already returned to the caller (note that this also side-steps the signature and range vs. x-ms-range problems discussed below).
  • A “range” header set in the retried request can be ignored by the service. This is true e.g. for the Get Blob operation, which gives precedence to x-ms-range; see https://docs.microsoft.com/en-us/rest/api/storageservices/specifying-the-range-header-for-blob-service-operations
  • setting a “range” header can invalidate authentication headers, see https://docs.microsoft.com/en-us/rest/api/storageservices/authorize-with-shared-key#blob-queue-and-file-services-shared-key-authorization
  • The server might ignore the “range” header and respond with 200, rather than 206. This must be checked in the client.
  • The retry will attempt to combine data from different requests (the bytes already streamed plus the body of the new response). Such combination of partial data is prone to data corruption, e.g. when the server creates the response dynamically: responses can have byte-level differences (say, the order of JSON keys) even if the semantics are the same. This is considered in the HTTP standard, see section 3.3 of RFC 7234 (“Combining Partial Content”) and section 4.3 of RFC 7233 (“Combining Ranges”). In short, ranges must not be combined unless the responses have the same (non-weak) ETag.
  • The client MUST check the headers of the response: not only for status code 206, but also that the specific range served is what was expected, as other range fields (such as x-ms-range) can take precedence over the “range” header field. A sketch of these checks follows this list.
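To illustrate how involved just the validation part is, here is a hypothetical helper (not part of any SDK) for a requests-style response; may_combine, first_etag, and the expected offsets are names introduced only for this sketch:

```python
# Hypothetical checks a range retry would need before splicing partial
# responses together. `resp` is a requests-style response to the retried
# request; `first_etag` is the ETag of the response whose prefix we already
# consumed; the expected byte offsets are what the Range header asked for.
def may_combine(resp, first_etag, expected_start, expected_end):
    # The server may ignore Range and answer 200 with the full body.
    if resp.status_code != 206:
        return False
    # RFC 7233 §4.3 / RFC 7234 §3.3: only combine ranges when both
    # responses carry the same strong ETag.
    etag = resp.headers.get("ETag", "")
    if not etag or etag.startswith("W/") or etag != first_etag:
        return False
    # Other fields (e.g. x-ms-range) can take precedence over Range, so
    # verify that the range actually served is the one we asked for.
    content_range = resp.headers.get("Content-Range", "")
    return content_range.startswith(
        "bytes %d-%d/" % (expected_start, expected_end)
    )
```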

Overall, the implementation of retries at this level of the library will be pretty complex, and will still only apply to a certain subset of cases (for example, a retry that sets ranges will never be effective for Get Blob operations, for the reasons discussed above). I therefore think the retry at this point should be removed. It’s better to let the exception propagate and to retry at a higher level, where new requests for ranges can be constructed correctly (e.g. with the correct signature). Also, at that level there is usually a lot more context information available, which means that retries at higher levels do not need to consider all of the special cases listed above. For example, they might already be in the code path of a GET operation, so there would be no branching concerning the first point. A sketch of such a caller-level retry is below.
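As a sketch of what such a higher-level retry could look like (download_range_with_retry is a hypothetical helper, not an SDK API; it simply re-issues the whole ranged request and discards any partial data):

```python
# Hypothetical caller-level retry: re-issue a fresh, correctly signed
# ranged request instead of splicing partial bodies together.
def download_range_with_retry(blob_client, offset, length, attempts=3):
    last_error = None
    for _ in range(attempts):
        try:
            data = blob_client.download_blob(
                offset=offset, length=length
            ).readall()
        except Exception as exc:  # e.g. a connection error mid-stream
            last_error = exc
            continue
        if len(data) == length:
            return data
        last_error = RuntimeError(
            "short read: %d of %d bytes" % (len(data), length)
        )
    raise last_error
```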

One final thought: even if you do implement retries at some point, addressing all the subtleties I listed above, I think you should give some priority to fixing this bug quickly first, on a shorter time frame. After all, the current code has a bug resulting in data corruption; that should get more priority than implementing a retry (which, so far, was effectively not there).

1 reaction
peter-hoffmann-by commented, Feb 17, 2021

Hi @xiangyan99,

https://github.com/Azure/azure-sdk-for-python/commit/5dee530a2b2533ec1f934678ceda5e3f0a24b4db has a test reproducing the wrong behaviour, so it’s best to start with that one.

