Progress bar for large downloads
What are your thoughts on adding a progress bar to the Scrapy HTTP handler? I recently wrote a crawler that scrapes a site and throws any files into a FilesPipeline for download. Some of these files were 100+ MB in size, which made the terminal seem to “freeze” while they downloaded in the background. I know Scrapy isn’t really designed to be an efficient file downloader like aria2 or JDownloader, but it’s a handy tool and I was already using it to scrape the file list.
I wrote a proof of concept using the Python library tqdm, and it went even better than expected: tqdm automatically handles multiple progress bars at a time (one per transfer in Scrapy's download queue), so I got a clean section at the bottom of the console showing individual progress for each pending file over 5 MB in size.
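As a quick illustration of that multi-bar behavior (independent of Scrapy; the file names and sizes here are made up), two tqdm instances updated in an interleaved fashion each keep their own line:

```python
import time

from tqdm import tqdm

# Two bars updated in lockstep; tqdm assigns each instance its own
# line automatically, giving a clean stacked display.
file_a = tqdm(total=100 * 1024 ** 2, desc='file-a', unit='B', unit_scale=True)
file_b = tqdm(total=250 * 1024 ** 2, desc='file-b', unit='B', unit_scale=True)
for _ in range(100):
    time.sleep(0.05)
    file_a.update(1024 ** 2)        # simulate a 1 MB chunk arriving
    file_b.update(2.5 * 1024 ** 2)  # simulate a 2.5 MB chunk arriving
file_a.close()
file_b.close()
```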
Since I leaned so heavily on tqdm, the change to the Scrapy source was only ~15 lines of code to fully implement (the POC patch is at the bottom of this post). If this feature is worth including, I'd expect further changes, since I'm sure you don't want Scrapy to take a hard dependency on tqdm, and the progress bar should have some configuration options.
Considerations
- Disable the progress bar in noninteractive mode (does Scrapy have such a mode? how does Scrapinghub behave?)
- Make tqdm an optional dependency (or implement the feature from scratch within Scrapy? this may be a lot of work) — see the sketch after this list.
- Add a configurable minimum size threshold for triggering the progress bar.
- If tqdm is allowed as an optional dependency, the http11 handler should log a warning if devs set the minimum threshold but do not have tqdm installed.
- What to do when `txresponse.length` is `UNKNOWN_LENGTH`? This can happen if the server does not return a `Content-Length` header. Should the progress bar be disabled entirely, or should the handler monitor `_bytes_received` and lazily create a progress bar if it crosses the threshold?
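To make these points concrete, here is a rough sketch of what the handler-side logic could look like. Everything in it is illustrative: `DEFAULT_THRESHOLD`, `maybe_create_progress`, and `LazyProgress` are names I made up, not existing Scrapy APIs, and a real patch would read the threshold from a Scrapy setting.

```python
import logging
import sys

try:
    from tqdm import tqdm  # optional dependency
except ImportError:
    tqdm = None

logger = logging.getLogger(__name__)

DEFAULT_THRESHOLD = 5 * 1024 * 1024  # 5 MB; would come from a setting


def maybe_create_progress(total, threshold=DEFAULT_THRESHOLD):
    """Return a tqdm bar when appropriate, else None.

    total is None in the UNKNOWN_LENGTH case (no Content-Length header);
    tqdm then shows a byte counter and rate, just no percentage.
    """
    if tqdm is None:
        logger.warning("download progress requested but tqdm is not installed")
        return None
    if not sys.stderr.isatty():
        return None  # noninteractive: skip the bar entirely
    if total is not None and total < threshold:
        return None
    return tqdm(total=total, unit='B', unit_scale=True)


class LazyProgress:
    """Lazy variant for UNKNOWN_LENGTH responses: only create the bar
    once the received byte count actually crosses the threshold."""

    def __init__(self, threshold=DEFAULT_THRESHOLD):
        self.threshold = threshold
        self.bytes_received = 0
        self.bar = None
        self.gave_up = False

    def update(self, new_bytes):
        self.bytes_received += new_bytes
        if self.bar is not None:
            self.bar.update(new_bytes)
        elif not self.gave_up and self.bytes_received >= self.threshold:
            self.bar = maybe_create_progress(total=None, threshold=0)
            if self.bar is None:
                self.gave_up = True  # avoid re-warning on every chunk
            else:
                self.bar.update(self.bytes_received)  # catch up

    def close(self):
        if self.bar is not None:
            self.bar.close()
```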
Patch
I am using Scrapy 1.5.0 in my POC, but it looks like the source for http11 in master is unchanged except for the addition of one line disabling `lazy`, so the patch line numbers are mostly off by one.
```diff
--- ~/scrapy-1.5.0/scrapy/core/downloader/handlers/http11.py
+++ ~/.local/lib/python3.6/site-packages/scrapy/core/downloader/handlers/http11.py
@@ -28,6 +28,9 @@
 from scrapy.utils.misc import load_object
 from scrapy.utils.python import to_bytes, to_unicode
 from scrapy import twisted_version
+
+from tqdm import tqdm
+
 
 
 logger = logging.getLogger(__name__)
@@ -432,6 +435,15 @@
         self._reached_warnsize = False
         self._bytes_received = 0
+        self.progress = None
+        try:
+            length = int(txresponse.length)
+            # show progress if > 5MB
+            if length > 5242880:
+                self.progress = tqdm(total=length, unit='B', unit_scale=True)
+        except (ValueError, TypeError):  # TypeError: length is UNKNOWN_LENGTH
+            pass
+
 
     def dataReceived(self, bodyBytes):
         # This maybe called several times after cancel was called with buffered
         # data.
@@ -439,7 +451,10 @@
             return
 
         self._bodybuf.write(bodyBytes)
-        self._bytes_received += len(bodyBytes)
+        new_bytes = len(bodyBytes)
+        self._bytes_received += new_bytes
+        if self.progress is not None:
+            self.progress.update(new_bytes)
 
         if self._maxsize and self._bytes_received > self._maxsize:
             logger.error("Received (%(bytes)s) bytes larger than download "
@@ -460,6 +475,9 @@
                          'request': self._request})
 
     def connectionLost(self, reason):
+        if self.progress is not None:
+            self.progress.close()
+
         if self._finished.called:
             return
 
```
Top GitHub Comments
@elacuesta just wanted to pop in and say your linked pull request works great for me. Thanks to the scrapy devs for adding the signal hooks needed 😃
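For anyone finding this later, here is a minimal sketch of what such a signal-driven progress extension can look like. It assumes Scrapy's `bytes_received` signal (added in 2.2) and `headers_received` signal (added in 2.5); the class name and threshold are my own, and cleanup on failed downloads is omitted:

```python
from scrapy import signals
from tqdm import tqdm


class DownloadProgress:
    """Show a tqdm bar for each in-flight download over 5 MB."""

    def __init__(self):
        self.bars = {}  # request -> tqdm bar

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.headers_received,
                                signal=signals.headers_received)
        crawler.signals.connect(ext.bytes_received,
                                signal=signals.bytes_received)
        return ext

    def headers_received(self, headers, body_length, request, spider):
        # body_length is the expected response size from the headers
        if body_length and body_length > 5 * 1024 * 1024:
            self.bars[request] = tqdm(total=body_length,
                                      unit='B', unit_scale=True)

    def bytes_received(self, data, request, spider):
        bar = self.bars.get(request)
        if bar is not None:
            bar.update(len(data))
            if bar.n >= bar.total:
                bar.close()
                del self.bars[request]
```

It would be enabled with something like `EXTENSIONS = {'myproject.extensions.DownloadProgress': 500}` in settings.py (path hypothetical).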
A workaround can be to direct all the stdout to a log file and use `tqdm` around your own loops for tracking in the terminal.
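Presumably the settings meant here are Scrapy's standard logging options, along these lines:

```python
# settings.py — assuming the stock LOG_FILE / LOG_STDOUT options were meant
LOG_STDOUT = True        # redirect stdout (e.g. print output) into the log
LOG_FILE = 'scrapy.log'  # hypothetical filename; all log output goes here
```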