More possibilities to cancel downloads inside HTTP downloader handler
Hello,
Currently, the HTTP downloader cancels a download if the expected size of the response is greater than the DOWNLOAD_MAXSIZE setting. However, there is no way to cancel a download after the headers are received and before the body is downloaded based on other conditions, such as the value of a specific header, and I see some cases where this could be useful.
For instance, one cannot rely on LinkExtractor to filter out media links (images, videos, etc.), since a link without a media extension could still point to a media file. With a way to obtain the headers of a response as soon as they are received, one could check the value of the Content-Type header and cancel the download if necessary.
I thought about an implementation and came up with this. The main idea is that when the headers of the response are received, the downloader handler sends a headers_received signal with the txresponse and the request, and cancels the download based on the return value of the first receiver's callback. It is a quick hack, but it is not very intrusive. The main drawback is that when connecting to this signal in a spider, one must specify sender=Any, as the crawler's signal manager is not available in the downloader handler.
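To make the proposed flow concrete, here is a minimal standard-library sketch of the idea. It does not use Scrapy or Twisted, and it is not the linked implementation: the names download, skip_media and on_headers_received are hypothetical and only model "notify a receiver once the headers arrive, then cancel based on its return value".

```python
from typing import Callable, Dict, Iterable, Optional

# Hypothetical receiver signature: return True to cancel the download.
HeadersReceiver = Callable[[Dict[str, str], str], bool]

# Content-Type prefixes treated as media in this illustration.
MEDIA_PREFIXES = ("image/", "video/", "audio/")


def skip_media(headers: Dict[str, str], url: str) -> bool:
    """Example receiver: cancel when the Content-Type looks like media."""
    content_type = headers.get("content-type", "").lower()
    return content_type.startswith(MEDIA_PREFIXES)


def download(
    url: str,
    headers: Dict[str, str],
    body_chunks: Iterable[bytes],
    on_headers_received: Optional[HeadersReceiver] = None,
) -> Optional[bytes]:
    """Toy handler: consult the receiver after the headers, before the body."""
    # Header names are case-insensitive, so normalise them once.
    normalised = {name.lower(): value for name, value in headers.items()}
    if on_headers_received is not None and on_headers_received(normalised, url):
        return None  # cancelled: body_chunks is never consumed
    return b"".join(body_chunks)
```

In this sketch, a URL without a media extension whose response carries Content-Type: image/png is dropped before any body chunk is read, which is the case LinkExtractor cannot catch.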
I’m waiting for your remarks and ideas.
Many thanks,
JB.
Issue Analytics
- Created 8 years ago
- Comments:7 (5 by maintainers)
Top GitHub Comments
Hey @kmike. Interesting, I wasn’t aware of this thread, thanks for pointing this out. #4205 adds a way to stop downloads, but this issue still seems valid to me because headers are not sent as arguments in the signal. They are available though, it would be a matter of doing something like:
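As a rough, standard-library illustration of that option (not Scrapy's actual code; send_bytes_received, stop_on_pdf and the local StopDownload class are hypothetical stand-ins), the sender passes the response headers as an extra keyword argument, and a receiver stops the download by raising:

```python
class StopDownload(Exception):
    """Local stand-in for an exception that aborts a download."""


def send_bytes_received(receivers, data, request, headers):
    """Call each receiver with the chunk and headers; return True to stop."""
    for receiver in receivers:
        try:
            # Headers travel alongside the data, so receivers can inspect them.
            receiver(data=data, request=request, headers=headers)
        except StopDownload:
            return True
    return False


def stop_on_pdf(data, request, headers):
    """Example receiver: abort as soon as a PDF response is detected."""
    if headers.get("Content-Type", "").startswith("application/pdf"):
        raise StopDownload()
```

With this shape, a receiver only needs the headers argument to decide, and the first chunk is enough to trigger the stop.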
Or adding a new signal as originally proposed, or both; I don’t really have a strong preference either way. A good thing is that the sender issue should not be a problem anymore, now that the download handler has access to the crawler instance since #4205.
//cc @elacuesta - is it fixed by the signals you introduced?