Feature suggest: add downloaded/uptodate status to information about downloaded media

Currently File/Image pipelines populate files/images fields with dicts containing information about the downloaded files (the downloaded path, the original scraped url, and the file checksum). It would be useful to have downloaded/uptodate status in this dict (motivation).

It goes along with other feature requests, such as also including the width/height of images in the dict output.

Issue Analytics

State:
Created 6 years ago
Reactions: 1
Comments: 5 (4 by maintainers)

Top GitHub Comments

1 reaction

ilias-ant commented, Apr 3, 2020

I think the first approach (in a previous crawl) is sufficient and well-defined, at the right level of abstraction.

The second approach is probably going to face significant definition challenges in Scrapy projects that use more than one media pipeline or depend heavily on a specific file expiration policy. Since the whole point of the file expiration policy is to “avoid downloading files that were downloaded recently”, I think taking the “current crawl” semantics into account would really confuse the end user.

Overall, maintaining consistency with the file_status_count statistics is definitely the way to go.

I hope I did not miss your point on this distinction 😃

1 reaction

ilias-ant commented, Apr 2, 2020

This is a great feature suggestion, general enough to have many applications in projects that use Scrapy!

I hope to find the time to give this a try soon!

Update: I did a bit of research and realized that this enhancement is achievable with minimal source code alterations. Namely:

FilesPipeline.media_to_download#_onsuccess callback should now return: {'url': request.url, 'path': path, 'checksum': checksum, 'status': 'uptodate'}
FilesPipeline.media_downloaded should now return: {'url': request.url, 'path': path, 'checksum': checksum, 'status': status}
the necessary testing and documentation considerations

Any feedback on this will be deeply appreciated, and if this is indeed the case, I will certainly open a pull request (if that's OK with you) 😃
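
To make the proposal in the comment above concrete, here is a minimal, standalone sketch of what the enriched per-file dict could look like and how a consumer might use it. It deliberately does not import Scrapy: `make_file_info` and `fetched_in_this_crawl` are hypothetical helper names introduced only for illustration, and the URLs, paths, and checksums are made-up values. Only the dict keys and the 'downloaded'/'uptodate' status values come from the thread.

```python
from typing import Dict, List

FileInfo = Dict[str, str]


def make_file_info(url: str, path: str, checksum: str, status: str) -> FileInfo:
    """Build the per-file dict with the proposed extra 'status' key.

    Hypothetical helper: it mirrors the dict shape quoted in the comment,
    not an actual Scrapy function.
    """
    return {"url": url, "path": path, "checksum": checksum, "status": status}


def fetched_in_this_crawl(files: List[FileInfo]) -> List[FileInfo]:
    """Keep only entries whose status is 'downloaded' (hypothetical consumer)."""
    return [info for info in files if info.get("status") == "downloaded"]


# Illustrative values only; the URLs, paths, and checksums are made up.
files = [
    make_file_info("https://example.com/a.pdf", "full/a.pdf", "abc123", "downloaded"),
    make_file_info("https://example.com/b.pdf", "full/b.pdf", "def456", "uptodate"),
]
print([info["path"] for info in fetched_in_this_crawl(files)])
```

With the status carried in the dict, an item pipeline or exporter could tell freshly fetched files apart from ones skipped by the expiration policy without consulting the crawl statistics.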
