Allow failing on potential data loss to trigger a retry
Description
Under the default settings of `DOWNLOAD_FAIL_ON_DATALOSS` (implemented in #2590), whenever a
`ResponseFailed([_DataLoss])` error occurs, it is raised. But the very similar `PotentialDataLoss`
error can be neither raised nor retried.
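For reference, `DOWNLOAD_FAIL_ON_DATALOSS` is a real Scrapy setting (default `True`) that can be toggled per project; when disabled, broken responses are passed through with the `dataloss` flag instead of raising:

```python
# settings.py
# Default: broken responses raise ResponseFailed([_DataLoss]).
# Set to False to pass them to the callback with the 'dataloss' flag instead.
DOWNLOAD_FAIL_ON_DATALOSS = True
```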
Instead, we pass all such responses through to the callback function after adding the `'partial'`
flag to the response object. This is where the flag is added, after checking for `PotentialDataLoss`:

```python
elif reason.check(PotentialDataLoss):
    self._finished.callback((self._txresponse, body, ['partial']))
```
This prevents the retry middleware from stepping in and retrying (depending on settings) the failed request.
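As a workaround under the current behavior, the callback itself can detect the `'partial'` flag and re-issue the request. A minimal sketch; `should_retry_partial` is a hypothetical helper (not Scrapy API), and the `meta`-based bookkeeping merely mirrors what `RetryMiddleware` does with `retry_times`:

```python
# Sketch: manual retry of responses flagged as partial.
# `flags` and `meta` stand in for scrapy.http.Response.flags / Request.meta.

def should_retry_partial(flags, meta, max_retries=2):
    """Return True if a response flagged 'partial' should be retried.

    Tracks the attempt count in meta['partial_retries'], similar to how
    Scrapy's RetryMiddleware tracks meta['retry_times'].
    """
    if 'partial' not in flags:
        return False
    retries = meta.get('partial_retries', 0)
    if retries >= max_retries:
        return False
    meta['partial_retries'] = retries + 1
    return True

# In a spider callback this would look roughly like:
#
#     def parse(self, response):
#         if should_retry_partial(response.flags, response.meta):
#             yield response.request.replace(dont_filter=True)
#             return
#         ...  # normal parsing
```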
Steps to Reproduce
Fetch the following URL around 150 times; somewhere along the way the host server starts sending a partial response / bad `Content-Length` header, resulting in a `PotentialDataLoss` error and the addition of `'partial'` to the response flags.
URL: https://ecorp.azcc.gov/BusinessSearch/BusinessInfo?entityNumber=21816333
The above URL sometimes returns a 404 before producing a response with a missing `Content-Length` header,
which consequently results in `DataLoss` / `PotentialDataLoss`.
Expected behavior: a `ResponseFailed([_PotentialDataLoss])` exception should be raised.
Actual behavior: the partial response passes through the engine to the callback function after the `'partial'` flag is added.
Following is one such example log generated by the example URL above:

```
DEBUG: scrapy.core.engine: _on_success: Crawled (200) <GET https://ecorp.azcc.gov/BusinessSearch/BusinessInfo?entityNumber=21816333> ['partial']
```
Reproduces how often: every time the server’s response is missing the `Content-Length` header.
Versions
- Scrapy: 1.7.3
- lxml: 4.4.1.0
- libxml2: 2.9.5
- cssselect: 1.1.0
- parsel: 1.5.2
- w3lib: 1.21.0
- Twisted: 19.7.0
- Python: 3.6.0 (v3.6.0:41df79263a11, Dec 23 2016, 08:06:12) [MSC v.1900 64 bit (AMD64)]
- pyOpenSSL: 19.0.0 (OpenSSL 1.1.1c 28 May 2019)
- cryptography: 2.7
- Platform: Windows-10-10.0.18362-SP0
–
I could not find any URL that reproduces the `ResponseFailed([_DataLoss])`
error on the very first request; if anyone knows of such a URL, please let me know so I can update the steps to reproduce here.
Issue Analytics
- Created 4 years ago
- Comments: 11 (6 by maintainers)
Top GitHub Comments
For that case, I believe solving #3590 would be better, since the problem is not the `Content-Type` header, but the fact that the response is a CAPTCHA response. If they started including a `Content-Type` header in those responses, you would still want to retry the corresponding request.

Hey @royahsan. This looks a bit tricky.
From the commit data, this is not what’s happening. `['partial']` is added on `PotentialDataLoss`, not on `_DataLoss`, as you said yourself. I can see why you would get confused there (I was too, just now), given how the docs state it; this could be explained a bit more explicitly.
The difference is this: a missing `Content-Length` header doesn’t really matter; it’s just like a missing checksum on the transferred data length (but it will warn with the `partial` flag). The body may or may not be complete, and there is no way to verify without the “checksum”. But when the header is given, Scrapy checks `Content-Length` against the actual calculated body length, and if those differ, it raises `ResponseFailed([_DataLoss])`, unless `fail_on_dataloss=False`.

If you want to treat potential data loss (partial responses) as hard failures, I don’t know if that’s actually possible right now. So if that’s what you want, let us know.
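The distinction described above can be sketched as a small decision function (an illustration of the rule, not Scrapy’s actual code):

```python
# Sketch: how a received body relates to the Content-Length header.
def classify_body(content_length, body):
    """Classify a received body against the Content-Length header.

    Returns:
      'potential-dataloss' - no header, no way to verify completeness
                             (Scrapy flags the response 'partial');
      'dataloss'           - header present but body length differs
                             (raises ResponseFailed([_DataLoss]) unless
                             fail_on_dataloss=False);
      'ok'                 - header matches the body length.
    """
    if content_length is None:
        return 'potential-dataloss'
    if len(body) != content_length:
        return 'dataloss'
    return 'ok'
```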