More possibilities to cancel downloads inside HTTP downloader handler

Hello,

Currently, a download is cancelled in the HTTP downloader if the expected size of the response is greater than the DOWNLOAD_MAXSIZE setting. However, there is no way to cancel a download after the headers are received and before the body is downloaded based on other conditions, such as the value of a specific header, and I see some cases where it could be useful.

For instance, one cannot rely on LinkExtractor to filter out media links (images, videos, etc.), since a link without a media extension can still point to a media file. With a way to obtain the response headers as soon as they are received, one could check the value of the Content-Type header and cancel the download if necessary.
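
A minimal sketch of that Content-Type check, assuming the response headers are available as a plain mapping of header names to string values. The names `should_cancel_download` and `MEDIA_PREFIXES` are hypothetical and not part of Scrapy.

```python
# Hypothetical helper: decide from the Content-Type header alone whether
# a download is worth continuing. Not part of Scrapy.
MEDIA_PREFIXES = ("image/", "video/", "audio/")


def should_cancel_download(headers):
    """Return True if the Content-Type header announces a media type."""
    # Header names are case-insensitive, so normalise them before lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    content_type = normalised.get("content-type", "")
    # Drop parameters such as "; charset=utf-8" and compare the bare type.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type.startswith(MEDIA_PREFIXES)
```

A URL with no media extension whose response carries `Content-Type: image/png` would be flagged for cancellation here, which is exactly the case LinkExtractor cannot catch.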

I thought about an implementation and came up with this. The main idea is that when the response headers are received, the downloader handler sends a headers_received signal with the txresponse and the request, and cancels the download based on the return value of the first receiver’s callback. It is a quick hack, but it is not very intrusive. The main drawback is that when connecting to this signal in a spider, one must specify sender=Any, as the crawler’s signal manager is not available in the downloader handler.
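
The flow described above could look roughly like the following. This is a self-contained stand-in, not the linked implementation and not Scrapy code: the `HeadersReceivedSignal` class, its methods, and the `skip_images` receiver are all hypothetical, and the txresponse and request are represented by a plain dict and a string.

```python
# Hypothetical stand-in for the proposed flow; none of these names exist in Scrapy.
class HeadersReceivedSignal:
    def __init__(self):
        self._receivers = []

    def connect(self, receiver):
        """Register a callable taking (txresponse_headers, request)."""
        self._receivers.append(receiver)

    def should_cancel(self, txresponse_headers, request):
        """Ask only the first connected receiver whether to cancel."""
        if not self._receivers:
            return False  # nobody is listening, so the download proceeds
        first_receiver = self._receivers[0]
        return bool(first_receiver(txresponse_headers, request))


def skip_images(txresponse_headers, request):
    # A truthy return value means "cancel the download".
    return txresponse_headers.get("Content-Type", "").startswith("image/")


headers_received = HeadersReceivedSignal()
headers_received.connect(skip_images)
```

Only the first receiver's return value is consulted, mirroring the description; any receiver connected later has no say in the decision.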

I’m waiting for your remarks and ideas.

Many thanks,

JB.

Issue Analytics

  • State: closed
  • Created 8 years ago
  • Comments: 7 (5 by maintainers)

Top GitHub Comments

1 reaction
elacuesta commented, Aug 23, 2020

Hey @kmike. Interesting, I wasn’t aware of this thread, thanks for pointing this out. #4205 adds a way to stop downloads, but this issue still seems valid to me because headers are not sent as arguments in the signal. They are available, though; it would be a matter of doing something like:

diff --git scrapy/core/downloader/handlers/http11.py scrapy/core/downloader/handlers/http11.py
index fb04d1fb..c419b195 100644
--- scrapy/core/downloader/handlers/http11.py
+++ scrapy/core/downloader/handlers/http11.py
@@ -513,6 +513,7 @@ class _ResponseReader(protocol.Protocol):
             data=bodyBytes,
             request=self._request,
             spider=self._crawler.spider,
+            headers=Headers(self._txresponse.headers.getAllRawHeaders()),
         )
         for handler, result in bytes_received_result:
             if isinstance(result, Failure) and isinstance(result.value, StopDownload):

Or adding a new signal as originally proposed, or both; I don’t really have a strong preference either way. A good thing is that the sender issue should not be a problem anymore, now that the download handler has access to the crawler instance since #4205.
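
A sketch of what a receiver might look like if the signal carried headers as in the diff above. The `headers` keyword is the proposal from that diff, assumed here to behave like a mapping from bytes header names to bytes values. `StopDownload` is the exception from `scrapy.exceptions`; the fallback class is only a stand-in so the sketch runs without Scrapy installed. The function name `on_bytes_received` is arbitrary.

```python
try:
    from scrapy.exceptions import StopDownload
except ImportError:
    # Stand-in mirroring the real exception, used only when Scrapy is absent.
    class StopDownload(Exception):
        def __init__(self, *, fail=True):
            super().__init__()
            self.fail = fail


def on_bytes_received(data, request, spider, headers=None):
    """Receiver for bytes_received; `headers` is the proposed extra argument."""
    if headers is None:
        return  # signal sent without headers: nothing to decide on
    content_type = headers.get(b"Content-Type") or b""
    if content_type.lower().startswith((b"image/", b"video/", b"audio/")):
        # fail=False routes the partial response to the request callback
        # instead of the errback.
        raise StopDownload(fail=False)
```

In a spider, such a receiver would typically be connected in `from_crawler` with `crawler.signals.connect(..., signal=scrapy.signals.bytes_received)`.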

0 reactions
kmike commented, Aug 21, 2020

//cc @elacuesta - is it fixed by the signals you introduced?
