Progress bar for large downloads
What are your thoughts on adding a progress bar to the Scrapy HTTP handler? I recently wrote a crawler that scrapes a site and throws any files into a FilesPipeline for download. Some of these files were 100+ MB in size, which made the terminal seem to “freeze” while they downloaded in the background. I know Scrapy isn’t really designed to be an efficient file downloader like aria2 or JDownloader, but it’s a handy tool and I was already using it to scrape the file list.
I wrote a proof of concept using the Python library tqdm, and it went even better than expected: tqdm automatically handles multiple progress bars at a time (one per transfer in Scrapy's download queue), so I got a clean section at the bottom of the console showing individual progress for each pending file over 5 MB in size.
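As a quick illustration of that multi-bar behavior (independent of Scrapy; the file names and sizes here are made up), two tqdm instances updated in an interleaved fashion each keep their own line:

```python
import time

from tqdm import tqdm

# Two bars updated in lockstep; tqdm assigns each instance its own
# line automatically, giving a clean stacked display.
file_a = tqdm(total=100 * 1024 ** 2, desc='file-a', unit='B', unit_scale=True)
file_b = tqdm(total=250 * 1024 ** 2, desc='file-b', unit='B', unit_scale=True)
for _ in range(100):
    time.sleep(0.05)
    file_a.update(1024 ** 2)        # simulate a 1 MB chunk arriving
    file_b.update(2.5 * 1024 ** 2)  # simulate a 2.5 MB chunk arriving
file_a.close()
file_b.close()
```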
Since I leaned so heavily on tqdm, the change to the Scrapy source was only ~15 lines of code to fully implement (the POC patch is at the bottom of this post). If this feature is worth including, I'd expect further changes, since I'm sure you don't want Scrapy to take a hard dependency on tqdm, and the progress bar should have some configuration options.
Considerations
- Disable the progress bar in noninteractive mode (does Scrapy have such a mode? how does Scrapinghub behave?)
- Make tqdm an optional dependency (or implement the feature from scratch within Scrapy? this may be a lot of work) — see the sketch after this list.
- Add a configurable minimum size threshold for triggering the progress bar.
- If tqdm is allowed as an optional dependency, the http11 handler should log a warning if devs set the minimum threshold but do not have tqdm installed.
- What to do when `txresponse.length` is `UNKNOWN_LENGTH`? This can happen if the server does not return a `Content-Length` header. Should the progress bar be disabled entirely, or should the handler monitor `_bytes_received` and lazily create a progress bar if it crosses the threshold?
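To make these points concrete, here is a rough sketch of what the handler-side logic could look like. Everything in it is illustrative: `DEFAULT_THRESHOLD`, `maybe_create_progress`, and `LazyProgress` are names I made up, not existing Scrapy APIs, and a real patch would read the threshold from a Scrapy setting.

```python
import logging
import sys

try:
    from tqdm import tqdm  # optional dependency
except ImportError:
    tqdm = None

logger = logging.getLogger(__name__)

DEFAULT_THRESHOLD = 5 * 1024 * 1024  # 5 MB; would come from a setting


def maybe_create_progress(total, threshold=DEFAULT_THRESHOLD):
    """Return a tqdm bar when appropriate, else None.

    total is None in the UNKNOWN_LENGTH case (no Content-Length header);
    tqdm then shows a byte counter and rate, just no percentage.
    """
    if tqdm is None:
        logger.warning("download progress requested but tqdm is not installed")
        return None
    if not sys.stderr.isatty():
        return None  # noninteractive: skip the bar entirely
    if total is not None and total < threshold:
        return None
    return tqdm(total=total, unit='B', unit_scale=True)


class LazyProgress:
    """Lazy variant for UNKNOWN_LENGTH responses: only create the bar
    once the received byte count actually crosses the threshold."""

    def __init__(self, threshold=DEFAULT_THRESHOLD):
        self.threshold = threshold
        self.bytes_received = 0
        self.bar = None
        self.gave_up = False

    def update(self, new_bytes):
        self.bytes_received += new_bytes
        if self.bar is not None:
            self.bar.update(new_bytes)
        elif not self.gave_up and self.bytes_received >= self.threshold:
            self.bar = maybe_create_progress(total=None, threshold=0)
            if self.bar is None:
                self.gave_up = True  # avoid re-warning on every chunk
            else:
                self.bar.update(self.bytes_received)  # catch up

    def close(self):
        if self.bar is not None:
            self.bar.close()
```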
Patch
I am using Scrapy 1.5.0 in my POC, but it looks like the source for http11 in master is unchanged except for the addition of one line disabling `lazy`, so the patch line numbers are mostly off by one.
```diff
--- ~/scrapy-1.5.0/scrapy/core/downloader/handlers/http11.py
+++ ~/.local/lib/python3.6/site-packages/scrapy/core/downloader/handlers/http11.py
@@ -28,6 +28,9 @@
 from scrapy.utils.misc import load_object
 from scrapy.utils.python import to_bytes, to_unicode
 from scrapy import twisted_version
+
+from tqdm import tqdm
+
 
 
 logger = logging.getLogger(__name__)
@@ -432,6 +435,15 @@
         self._reached_warnsize = False
         self._bytes_received = 0
+        self.progress = None
+        try:
+            length = int(txresponse.length)
+            # show progress if > 5MB
+            if length > 5242880:
+                self.progress = tqdm(total=length, unit='B', unit_scale=True)
+        except (ValueError, TypeError):  # TypeError: length is UNKNOWN_LENGTH
+            pass
+
 
     def dataReceived(self, bodyBytes):
         # This maybe called several times after cancel was called with buffered
         # data.
@@ -439,7 +451,10 @@
             return
 
         self._bodybuf.write(bodyBytes)
-        self._bytes_received += len(bodyBytes)
+        new_bytes = len(bodyBytes)
+        self._bytes_received += new_bytes
+        if self.progress is not None:
+            self.progress.update(new_bytes)
 
         if self._maxsize and self._bytes_received > self._maxsize:
             logger.error("Received (%(bytes)s) bytes larger than download "
@@ -460,6 +475,9 @@
                          'request': self._request})
 
     def connectionLost(self, reason):
+        if self.progress is not None:
+            self.progress.close()
+
         if self._finished.called:
             return
 
```
Top GitHub Comments
@elacuesta just wanted to pop in and say your linked pull request works great for me. Thanks to the scrapy devs for adding the signal hooks needed 😃
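For anyone finding this later, here is a minimal sketch of what such a signal-driven progress extension can look like. It assumes Scrapy's `bytes_received` signal (added in 2.2) and `headers_received` signal (added in 2.5); the class name and threshold are my own, and cleanup on failed downloads is omitted:

```python
from scrapy import signals
from tqdm import tqdm


class DownloadProgress:
    """Show a tqdm bar for each in-flight download over 5 MB."""

    def __init__(self):
        self.bars = {}  # request -> tqdm bar

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.headers_received,
                                signal=signals.headers_received)
        crawler.signals.connect(ext.bytes_received,
                                signal=signals.bytes_received)
        return ext

    def headers_received(self, headers, body_length, request, spider):
        # body_length is the expected response size from the headers
        if body_length and body_length > 5 * 1024 * 1024:
            self.bars[request] = tqdm(total=body_length,
                                      unit='B', unit_scale=True)

    def bytes_received(self, data, request, spider):
        bar = self.bars.get(request)
        if bar is not None:
            bar.update(len(data))
            if bar.n >= bar.total:
                bar.close()
                del self.bars[request]
```

It would be enabled with something like `EXTENSIONS = {'myproject.extensions.DownloadProgress': 500}` in settings.py (path hypothetical).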
A workaround can be to direct all the stdout to a log file and use `tqdm` around your own loops for tracking in the terminal.
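Presumably the settings meant here are Scrapy's standard logging options, along these lines:

```python
# settings.py — assuming the stock LOG_FILE / LOG_STDOUT options were meant
LOG_STDOUT = True        # redirect stdout (e.g. print output) into the log
LOG_FILE = 'scrapy.log'  # hypothetical filename; all log output goes here
```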