FilesPipeline with GCS downloading up-to-date files again
Description
It seems that when using Google Cloud Storage, the FilesPipeline does not behave as expected with regard to up-to-date files.
Steps to Reproduce
- Clone this repo:
git clone https://github.com/QYQ323/python.git
- Run the spider:
scrapy crawl examples
- If you run it several times, the FilesPipeline has the right behavior: it does not download up-to-date files again:
2020-02-19 14:41:36 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded file from <GET https://matplotlib.org/examples/animation/basic_example_writer.py> referred in <None>
2020-02-19 14:41:36 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded file from <GET https://matplotlib.org/examples/animation/basic_example.py> referred in <None>
2020-02-19 14:41:36 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded file from <GET https://matplotlib.org/examples/animation/bayes_update.py> referred in <None>
- Now change `FILES_STORE` in settings.py to a GCS bucket (a fuller settings sketch follows below):
FILES_STORE = 'gs://mybucket/'
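For reference, a GCS-backed store needs a couple of settings; a minimal settings.py sketch, where the bucket name and project ID are placeholders:

```python
# settings.py -- minimal sketch; bucket name and project ID are placeholders
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = 'gs://mybucket/'
GCS_PROJECT_ID = 'my-project-id'  # required when FILES_STORE points at GCS
```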
- If you then run the spider several times, the files are downloaded every time:
2020-02-19 14:50:44 [scrapy.pipelines.files] DEBUG: File (downloaded): Downloaded file from <GET https://matplotlib.org/examples/animation/simple_anim.py> referred in <None>
2020-02-19 14:50:44 [urllib3.connectionpool] DEBUG: https://storage.googleapis.com:443 "POST /upload/storage/v1/b/cdcscrapingresults/o?uploadType=multipart HTTP/1.1" 200 843
2020-02-19 14:50:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://matplotlib.org/examples/animation/double_pendulum_animated.py> (referer: None)
2020-02-19 14:50:44 [scrapy.pipelines.files] DEBUG: File (downloaded): Downloaded file from <GET https://matplotlib.org/examples/animation/double_pendulum_animated.py> referred in <None>
2020-02-19 14:50:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://matplotlib.org/examples/api/collections_demo.html>
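For context, whether a file is logged as uptodate or downloaded comes down to the result of the store's `stat_file` method. A simplified sketch of that decision (the real logic lives in `FilesPipeline.media_to_download`; the function name here is illustrative, not Scrapy's actual code):

```python
import time


def should_skip_download(stat_result, expires_days=90):
    """Illustrative sketch of the FilesPipeline "uptodate" decision.

    stat_result is what the store's stat_file() returns:
    {'checksum': ..., 'last_modified': <unix timestamp>}, or {} when the
    blob is missing or its metadata could not be fetched.
    """
    last_modified = stat_result.get('last_modified')
    if not last_modified:
        return False  # no usable metadata -> download the file again
    age_days = (time.time() - last_modified) / (24 * 60 * 60)
    return age_days <= expires_days
```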
Expected behavior: Files should not be downloaded again when the spider runs consecutively. If a file is already on GCS (in the same folder), it should not be downloaded again (provided it was uploaded less than 90 days ago).
Actual behavior: Every time the spider is launched, every file is downloaded again.
Reproduces how often: 100%
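The 90-day window mentioned above corresponds to Scrapy's `FILES_EXPIRES` setting, which defaults to 90 and can be overridden in settings.py:

```python
# settings.py -- days before a stored file is considered expired
FILES_EXPIRES = 90  # Scrapy's default
```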
Versions
- Scrapy: 1.8.0
- lxml: 4.5.0.0
- libxml2: 2.9.10
- cssselect: 1.1.0
- parsel: 1.5.2
- w3lib: 1.21.0
- Twisted: 19.10.0
- Python: 3.8.1 (default, Jan 8 2020, 16:15:59) - [Clang 4.0.1 (tags/RELEASE_401/final)]
- pyOpenSSL: 19.1.0 (OpenSSL 1.1.1d 10 Sep 2019)
- cryptography: 2.8
- Platform: macOS-10.15.3-x86_64-i386-64bit
Top GitHub Comments
I cannot reproduce this bug. @lblanche, are you sure you set up permissions for the bucket correctly? The very first time I tried reproducing it, I got a setup where the service account I used had write permissions, but for some reason calling `get_blob` on the bucket raised a 403, which caused the `stat_file` method in `GCSFilesStore` to fail, and that caused the file to be downloaded every time. After fixing the permissions, everything worked as it should. If that's the case here, I think it would be a good idea to check permissions in `GCSFilesStore`'s `__init__` and display a warning if it's impossible to get a file's metadata from the bucket. (A sketch of such a check follows after these comments.)

Hello, can I work on this issue?
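A minimal sketch of the permission probe proposed in the first comment above, assuming a placeholder project ID, a hypothetical probe blob name, and simplified URI parsing; uploads, `stat_file`, and the rest of Scrapy's actual `GCSFilesStore` are omitted:

```python
import logging

from google.api_core.exceptions import Forbidden
from google.cloud import storage

logger = logging.getLogger(__name__)


class GCSFilesStore:
    # Sketch: only the proposed permission probe in __init__ is shown.
    def __init__(self, uri):
        client = storage.Client(project='my-project-id')  # placeholder
        bucket_name, _, self.prefix = uri[len('gs://'):].partition('/')
        self.bucket = client.bucket(bucket_name)
        try:
            # 'permission-check' is an arbitrary, hypothetical blob name:
            # get_blob() returns None for a missing object but raises
            # Forbidden (403) when metadata reads are not permitted.
            self.bucket.get_blob('permission-check')
        except Forbidden:
            logger.warning(
                "Cannot read object metadata from bucket %r; stat_file() "
                "will fail and every file will be re-downloaded on each "
                "run. Grant the service account the storage.objects.get "
                "permission.", bucket_name,
            )
```

Probing with an arbitrary blob name is enough here, because the failure mode being detected is a metadata-read 403 rather than a missing object.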