
Retry not working correctly for large files

See original GitHub issue

Which service (blob, file, queue) does this issue concern?

Blob

What problem was encountered?

During the upload of a large file, one block upload failed. The Azure Storage client says the block was successfully re-uploaded on retry, but according to Storage Analytics that did not happen (a sketch of the upload call follows the parameter list below):

  • <Agent>=Azure-Storage/1.1.0-1.1.0 (Python CPython 2.7.9; Linux 3.16.0-5-amd64)
  • MAX_BLOCK_SIZE=100000000 = 100 MB
  • max_connections=8
  • timeout=30
  • Total blob size: 4294903296 bytes ≈ 4 GiB
  • Only 4249585152 bytes were uploaded (checked with az storage blob show)
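
For reference, here is a minimal sketch of how an upload with these parameters could be issued with azure-storage 1.1.0. The maintainer comment further down confirms create_blob_from_path was used; everything else (the placeholder account/container/blob names, the SAS handling and the file path) is illustrative and not taken from the issue.

# Sketch only: reproduces the reported upload parameters with azure-storage 1.1.0.
# <account>, <container>, <blob>, <SAS> and the file path are placeholders.
from azure.storage.blob import BlockBlobService

service = BlockBlobService(account_name='<account>', sas_token='<SAS>')
service.MAX_BLOCK_SIZE = 100000000   # 100 MB blocks, as reported

service.create_blob_from_path(
    container_name='<container>',
    blob_name='<blob>',
    file_path='/path/to/large-file.img',   # ~4 GiB source file
    max_connections=8,                     # as reported
    timeout=30,                            # 30 s socket timeout, as reported
)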

Azure Storage client log of the particular block:

2018-04-18 16:33:04,807 <az> INFO  Client-Request-ID=2b3d9568-4326-11e8-8ef0-000d3a29b453 Outgoing request: Method=PUT, Path=/<container>/<blob>?<SAS>, Query={'comp': 'block', 'blockid': u'QmxvY2tJZDAwMDAx', 'timeout': '30'}, Headers={'Content-Length': '100000000', 'x-ms-client-request-id': '2b3d9568-4326-11e8-8ef0-000d3a29b453', 'User-Agent': '<Agent>', 'x-ms-version': '2017-07-29', 'x-ms-lease-id': None, 'x-ms-date': 'Wed, 18 Apr 2018 16:33:04 GMT'}.
2018-04-18 16:33:38,838 <az> INFO  Client-Request-ID=2b3d9568-4326-11e8-8ef0-000d3a29b453 Operation failed: checking if the operation should be retried. Current retry count=0, , HTTP status code=Unknown, Exception=SSLError: HTTPSConnectionPool(host='<account>.blob.core.windows.net', port=443): Max retries exceeded with url: /<container>/<blob>?<SAS>&comp=block&blockid=QmxvY2tJZDAwMDAx&timeout=30 (Caused by SSLError(SSLError('The write operation timed out',),)).
2018-04-18 16:33:53,265 <az> INFO  Client-Request-ID=2b3d9568-4326-11e8-8ef0-000d3a29b453 Outgoing request: Method=PUT, Path=/<container>/<blob>?<SAS>, Query={'comp': 'block', 'blockid': u'QmxvY2tJZDAwMDAx', 'timeout': '30'}, Headers={'Content-Length': '100000000', 'x-ms-client-request-id': '2b3d9568-4326-11e8-8ef0-000d3a29b453', 'User-Agent': '<Agent>', 'x-ms-version': '2017-07-29', 'x-ms-lease-id': None, 'x-ms-date': 'Wed, 18 Apr 2018 16:33:53 GMT'}.
2018-04-18 16:33:59,569 <az> INFO  Client-Request-ID=2b3d9568-4326-11e8-8ef0-000d3a29b453 Receiving Response: Server-Timestamp=Wed, 18 Apr 2018 16:33:58 GMT, Server-Request-ID=2c9b33ae-701e-0109-6133-d7286a000000, HTTP Status Code=201, Message=Created, Headers={'Content-Length': '100000000', 'x-ms-client-request-id': '2b3d9568-4326-11e8-8ef0-000d3a29b453', 'User-Agent': '<Agent>', 'x-ms-version': '2017-07-29', 'x-ms-lease-id': None, 'x-ms-date': 'Wed, 18 Apr 2018 16:33:53 GMT'}.

Storage Analytics of the particular block:

1.0;2018-04-18T16:33:53.3070387Z;PutBlock;SASSuccess;201;6260;5669;sas;;<account>;blob;"https://<account>.blob.core.windows.net:443/<container>/<blob>?<SAS>&comp=block&blockid=QmxvY2tJZDAwMDAx&timeout=30";"/<account>/<container>/<blob>";2c9b33ae-701e-0109-6133-d7286a000000;0;<internal IP address>:53156;2017-07-29;567;54681856;193;0;54681856;;"zNZ1j2PNDLDEV5szpt0DKg==";;;;"<Agent>";;"2b3d9568-4326-11e8-8ef0-000d3a29b453"
  • The client sends 'Content-Length': '100000000', while Storage Analytics reports that only 54681856 bytes were received.
    • 100000000 - 54681856 = 45318144
    • 4294903296 - 4249585152 = 45318144
  • Storage Analytics only registers the block after the retry started.
  • Storage Analytics shows no data at all for the first, allegedly successful upload.

I can also provide the logs of the other, successfully uploaded blocks, etc., but they contain little information: Content-Length is consistent, return codes are OK (201), timestamps are reasonable, the block list/head upload happens in the correct order, etc.

I cannot reliably reproduce the error, but I see it occasionally when uploading several TiB both from on-premises machines and from Azure VMs.

Have you found a mitigation/solution?

Not yet, but a complete re-upload of the block after the block list has been committed seems sensible (a verification sketch follows below).
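
As a hedged illustration of that idea, the sketch below checks the committed blob size after the upload and falls back to a full re-upload on a mismatch. get_blob_properties and create_blob_from_path exist in azure-storage 1.x, but the retry loop, the function name and its parameters are assumptions, not the reporter's actual mitigation.

import os

def upload_and_verify(service, container, blob, path, attempts=3):
    # 'service' is a BlockBlobService, e.g. configured as in the sketch above.
    # Upload the file, then compare the committed blob's content_length with the
    # local file size (the mismatch observed in this issue). On a mismatch,
    # re-upload the whole blob and check again.
    expected = os.path.getsize(path)
    for _ in range(attempts):
        service.create_blob_from_path(container, blob, path,
                                      max_connections=8, timeout=30)
        props = service.get_blob_properties(container, blob)
        if props.properties.content_length == expected:
            return True
    return False

# e.g.: upload_and_verify(service, '<container>', '<blob>', '/path/to/large-file.img')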

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 7 (5 by maintainers)

Top GitHub Comments

1 reaction
zezha-msft commented, Apr 27, 2018

Hi @Fra-nk, I’m glad to hear that!

The release is going out in 1-2 weeks, as we have a few more changes to merge in. Thank you!

1 reaction
zezha-msft commented, Apr 24, 2018

Hi @Fra-nk, calling create_blob_from_path is perfectly fine, I was just making sure that the issue occurred when a seekable stream was used as body. Basically the bug was that we weren’t rewinding the body (seekable stream) properly in the case of a retry. And with larger files, it is more likely that we have retries occurring and thus encountering this bug. It should now work properly after the fix on dev branch.

Please let me know if you encounter any other problem or have any question. I’ll keep this issue open until the fix gets released. Thank you!
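
To make the described root cause concrete, here is a simplified sketch of the rewind-before-retry pattern the fix implies. It is not the SDK's actual retry hook; the helper name, the use of requests, and the error handling are illustrative assumptions.

import requests

def put_block_with_retry(url, stream, offset, length, headers, max_retries=3):
    # The bug: on a retry, the seekable source stream was not rewound, so the
    # retried request read from wherever the failed attempt had left the stream,
    # and the service received fewer bytes than Content-Length claimed.
    # Seeking back to the block's start offset before every attempt avoids that.
    for attempt in range(max_retries + 1):
        stream.seek(offset)            # rewind; without this a retry would read
        body = stream.read(length)     # the next block's bytes (or hit EOF)
        try:
            resp = requests.put(url, data=body, headers=headers, timeout=30)
            resp.raise_for_status()
            return resp
        except requests.RequestException:
            if attempt == max_retries:
                raise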
