
Progress bar for large downloads

See original GitHub issue

What are your thoughts on adding a progress bar to the scrapy HTTP handler? I recently wrote a crawler that scrapes a site and throws any files into a FilesPipeline for download. Some of these files were 100+ MB in size, which made the terminal seem to “freeze” while they downloaded in the background. I know scrapy isn’t really designed to be an efficient file downloader like aria2 or JDownloader, but it’s a handy tool and I was already using it to scrape the file list.

I wrote a proof of concept using the Python library tqdm and it went even better than expected - tqdm automatically handles multiple progress bars at a time (scrapy’s queue) so I got a clean section at the bottom of the console showing individual progress for each pending file over 5MB in size.
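As a standalone illustration of that behavior (this is not scrapy code; the file names and chunk sizes are made up), tqdm stacks concurrent bars on its own:

```python
from tqdm import tqdm

# One bar per simulated in-flight download; when several bars are open,
# tqdm stacks them at the bottom of the terminal automatically.
bars = [tqdm(total=100, unit='B', unit_scale=True, desc=f'file{i}')
        for i in range(3)]
for _ in range(10):
    for bar in bars:
        bar.update(10)  # simulate a 10-byte chunk arriving on each stream
totals = [bar.n for bar in bars]
for bar in bars:
    bar.close()
```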

Since I leaned so heavily on tqdm, the change to the scrapy source was only ~15 lines of code to fully implement (the POC patch is at the bottom of this post). If this feature is worth including, I’d expect other changes too since I’m sure you don’t want scrapy to take a hard dependency on tqdm and the progress bar should have some configuration options, too.

[Screenshot from 2019-05-24 15-39-33]

Considerations

  • Disable the progress bar in noninteractive mode (does scrapy have this? how does scrapinghub behave?)
  • Optional dependency on tqdm (or code the feature from scratch within scrapy? - this may be a lot of work)
  • Configurable minimum size threshold for triggering the progress bar.
    • If tqdm is allowed as an optional dependency, the http11 handler should log a warning when the minimum threshold is set but tqdm is not installed
  • What to do when txresponse.length is UNKNOWN_LENGTH? This can happen if the server does not return a Content-Length header. Should it be disabled entirely? Or monitor _bytes_received and lazily create a progress bar if it crosses the threshold?
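The lazy option from the last bullet could be sketched roughly like this (the class and method names here are hypothetical, not part of the patch below). With no total available, tqdm falls back to a running byte count instead of a percentage:

```python
from tqdm import tqdm

THRESHOLD = 5 * 1024 * 1024  # illustrative 5 MB trigger

class LazyProgress:
    """Sketch: create no bar up front (the length is unknown), and only
    open one once the received byte count crosses the threshold."""

    def __init__(self):
        self.bytes_received = 0
        self.bar = None

    def data_received(self, chunk_len):
        self.bytes_received += chunk_len
        if self.bar is None and self.bytes_received > THRESHOLD:
            # No total= given, so tqdm shows a running count, not a %
            self.bar = tqdm(initial=self.bytes_received,
                            unit='B', unit_scale=True)
        elif self.bar is not None:
            self.bar.update(chunk_len)

    def close(self):
        if self.bar is not None:
            self.bar.close()
```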

Patch

I am using scrapy 1.5.0 in my POC but it looks like the source for http11 in master is unchanged except for the addition of one line disabling lazy, so the patch line numbers are mostly off by one.

--- ~/scrapy-1.5.0/scrapy/core/downloader/handlers/http11.py
+++ ~/.local/lib/python3.6/site-packages/scrapy/core/downloader/handlers/http11.py
@@ -28,6 +28,9 @@
 from scrapy.utils.misc import load_object
 from scrapy.utils.python import to_bytes, to_unicode
 from scrapy import twisted_version
+
+from tqdm import tqdm
+
 
 logger = logging.getLogger(__name__)
 
@@ -432,6 +435,15 @@
         self._reached_warnsize = False
         self._bytes_received = 0
 
+        self.progress = None
+        try:
+            length = int(txresponse.length)
+            # show progress if > 5MB
+            if length > 5242880:
+                self.progress = tqdm(total=length, unit='B', unit_scale=True)
+        except ValueError:
+            pass 
+
     def dataReceived(self, bodyBytes):
         # This maybe called several times after cancel was called with buffered
         # data.
@@ -439,7 +451,10 @@
             return
 
         self._bodybuf.write(bodyBytes)
-        self._bytes_received += len(bodyBytes)
+        new_bytes = len(bodyBytes)
+        self._bytes_received += new_bytes
+        if self.progress is not None:
+            self.progress.update(new_bytes)
 
         if self._maxsize and self._bytes_received > self._maxsize:
             logger.error("Received (%(bytes)s) bytes larger than download "
@@ -460,6 +475,9 @@
                             'request': self._request})
 
     def connectionLost(self, reason):
+        if self.progress is not None:
+            self.progress.close()
+
         if self._finished.called:
             return
 

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Reactions: 1
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
nemec commented, May 2, 2022

@elacuesta just wanted to pop in and say your linked pull request works great for me. Thanks to the scrapy devs for adding the signal hooks needed 😃

0 reactions
gndps commented, May 30, 2020

A workaround is to redirect all stdout to a log file using these settings:

LOG_STDOUT = True
LOG_FILE = 'my_spider.log_file'

and then wrap tqdm around your own loops to track progress in the terminal.
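A minimal sketch of that workaround outside a spider (`file_urls` and the processing step are placeholders):

```python
from tqdm import tqdm

# With scrapy's log redirected to a file via LOG_STDOUT/LOG_FILE,
# the terminal is free for a tqdm bar wrapped around your own loop.
file_urls = [f'https://example.com/file{i}.bin' for i in range(5)]
processed = []
for url in tqdm(file_urls, desc='files'):
    processed.append(url)  # in a real spider this would yield a Request
```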

