
Allow failing on potential data loss to trigger a retry

See original GitHub issue

Description

With the default value of the DOWNLOAD_FAIL_ON_DATALOSS setting (implemented in #2590), a ResponseFailed([_DataLoss]) error is raised whenever it occurs. But the very similar PotentialDataLoss error can neither be raised nor retried.
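For reference, the two settings involved, shown with their Scrapy defaults as a minimal settings.py fragment (the comments summarize the behavior described in this issue):

```python
# settings.py — Scrapy defaults relevant to this issue
DOWNLOAD_FAIL_ON_DATALOSS = True  # hard _DataLoss failures raise ResponseFailed
RETRY_ENABLED = True              # retry middleware; it never sees 'partial' responses
```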

Instead, all such responses are passed through to the callback function after the 'partial' flag is added to the response object. This is where the flag is added, after checking for PotentialDataLoss:

# Excerpt from Scrapy's HTTP/1.1 download handler (http11.py)
elif reason.check(PotentialDataLoss):
    self._finished.callback((self._txresponse, body, ['partial']))

This prevents the retry middleware from stepping in and retrying the failed request (depending on settings).
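To illustrate how this surfaces in user code, here is a minimal, Scrapy-free sketch of a spider callback detecting the flag. `FakeResponse` and `parse` are illustrative stand-ins; in a real spider the callback receives a scrapy Response, whose `flags` attribute carries the same 'partial' marker:

```python
# Minimal stand-in for scrapy.http.Response, just enough to show how
# the 'partial' flag reaches the spider callback (illustrative only).
class FakeResponse:
    def __init__(self, url, status=200, flags=None):
        self.url = url
        self.status = status
        self.flags = flags or []

def parse(response):
    """Callback sketch: detect a potentially truncated body via the flag."""
    if "partial" in response.flags:
        # The engine delivered the response anyway; the retry middleware
        # never saw a failure, so any retry must happen here.
        return {"url": response.url, "truncated": True}
    return {"url": response.url, "truncated": False}

print(parse(FakeResponse("https://example.com", flags=["partial"])))
# → {'url': 'https://example.com', 'truncated': True}
```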

Steps to Reproduce

Fetch the following URL around 150 times; at some point the host server starts sending partial content or a bad Content-Length header, resulting in a PotentialDataLoss error and the addition of 'partial' to the response flags.

URL: https://ecorp.azcc.gov/BusinessSearch/BusinessInfo?entityNumber=21816333

The above URL sometimes returns a 404 before providing a response with a missing Content-Length header, which consequently results in DataLoss / PotentialDataLoss.

Expected behavior: ResponseFailed([_PotentialDataLoss]) exception should be raised.

Actual behavior: The partial response passes through the engine to the callback function after the 'partial' flag is added. The following is one such log line generated by the example URL above.

DEBUG: scrapy.core.engine: _on_success:  Crawled (200) <GET https://ecorp.azcc.gov/BusinessSearch/BusinessInfo?entityNumber=21816333> ['partial']

Reproduces how often: Every time the server’s response is missing the Content-Length header.

Versions

Scrapy       : 1.7.3
lxml         : 4.4.1.0
libxml2      : 2.9.5
cssselect    : 1.1.0
parsel       : 1.5.2
w3lib        : 1.21.0
Twisted      : 19.7.0
Python       : 3.6.0 (v3.6.0:41df79263a11, Dec 23 2016, 08:06:12) [MSC v.1900 64 bit (AMD64)]
pyOpenSSL    : 19.0.0 (OpenSSL 1.1.1c 28 May 2019)
cryptography : 2.7
Platform     : Windows-10-10.0.18362-SP0

I could not find any URL that reproduces the ResponseFailed([_DataLoss]) error on the very first request. If anyone knows of such a URL, please let me know so I can update the Steps to Reproduce above.

@rmax, @nyov

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 11 (6 by maintainers)

Top GitHub Comments

Gallaecio commented, Nov 8, 2019 (1 reaction)

For that case, I believe solving #3590 would be better, since the problem is not the Content-Length header but the fact that the response is a CAPTCHA response. If they started including a Content-Length header in those responses, you would still want to retry the corresponding request.

nyov commented, Nov 6, 2019 (1 reaction)

Hey @royahsan. This looks a bit tricky.

“My understanding that the ['partial'] keyword is a sure indication of ResponseFailed([_DataLoss]) is based on the code above.”

From the commit data, this is not what’s happening. ['partial'] is added on PotentialDataLoss, not _DataLoss, as you said yourself:

“This is where we add the flag after checking the PotentialDataLoss.”

I can see why you would get confused there (I was too, just now), when the docs state:

DOWNLOAD_FAIL_ON_DATALOSS

Whether or not to fail on broken responses, that is, when the declared Content-Length does not match the content sent by the server or a chunked response was not properly finished.

This could be explained a bit more explicitly.

The difference is this: a missing Content-Length header doesn’t really matter; it’s like a missing checksum on the transferred data (though Scrapy will warn with the 'partial' flag). The body may or may not be complete, and there is no way to verify it without the “checksum”. But when the header is given, Scrapy checks the Content-Length value against the actual calculated body length, and if the two differ it raises ResponseFailed([_DataLoss]), unless fail_on_dataloss=False.
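The decision described above can be sketched as a small pure function. This is not Scrapy's actual code, just an illustration of the three outcomes (the function name and return values are made up for this sketch):

```python
def classify_body(expected_length, actual_length):
    """Sketch of the decision nyov describes (not Scrapy's actual code).

    expected_length: value of the Content-Length header, or None if absent.
    actual_length:   number of body bytes actually received.
    """
    if expected_length is None:
        # No "checksum" to verify against: potential data loss.
        # Scrapy flags the response 'partial' and passes it through.
        return "potential-dataloss"
    if actual_length != expected_length:
        # Verifiable mismatch: hard failure, ResponseFailed([_DataLoss]),
        # unless DOWNLOAD_FAIL_ON_DATALOSS is False.
        return "dataloss"
    return "ok"

assert classify_body(None, 1024) == "potential-dataloss"
assert classify_body(2048, 1024) == "dataloss"
assert classify_body(1024, 1024) == "ok"
```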

If you want to treat potential data loss ('partial') as a hard failure, I don’t know if that’s actually possible right now. So if that’s what you want, let us know.
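One conceivable workaround, sketched below, is a custom downloader middleware that re-schedules requests whose responses carry the 'partial' flag. The `process_response` hook and its return semantics follow Scrapy's downloader-middleware interface, but the `Request`/`Response` classes here are minimal stand-ins (and `RetryPartialMiddleware`, `MAX_RETRIES`, and the `partial_retries` meta key are invented names) so the sketch runs without Scrapy installed:

```python
# Hypothetical workaround: a downloader middleware that turns 'partial'
# responses into retries. Request/Response are minimal fakes for the demo.
class Request:
    def __init__(self, url, meta=None, dont_filter=False):
        self.url, self.meta, self.dont_filter = url, dict(meta or {}), dont_filter

class Response:
    def __init__(self, url, request, flags=None):
        self.url, self.request, self.flags = url, request, flags or []

class RetryPartialMiddleware:
    MAX_RETRIES = 2  # illustrative limit

    def process_response(self, request, response, spider):
        if "partial" in response.flags:
            retries = request.meta.get("partial_retries", 0)
            if retries < self.MAX_RETRIES:
                # Returning a Request makes the downloader re-schedule it
                # instead of delivering the truncated response.
                return Request(
                    request.url,
                    meta={**request.meta, "partial_retries": retries + 1},
                    dont_filter=True,
                )
        return response  # retries exhausted, or response is fine

mw = RetryPartialMiddleware()
req = Request("https://example.com")
out = mw.process_response(req, Response(req.url, req, flags=["partial"]), spider=None)
print(type(out).__name__)  # → Request
```

In a real project the middleware would be registered in DOWNLOADER_MIDDLEWARES and use Scrapy's own `request.replace()`; the key idea is only that returning a Request from `process_response` bypasses the delivered response.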


