FilesPipeline with GCS downloading up-to-date files again
Description
It seems that when using Google Cloud Storage, the FilesPipeline does not behave as expected with regard to up-to-date files.
Steps to Reproduce
- Clone this repo:
git clone https://github.com/QYQ323/python.git
- Run the spider:
scrapy crawl examples
- If you run it several times, the FilesPipeline has the right behavior: it does not download up-to-date files again:
2020-02-19 14:41:36 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded file from <GET https://matplotlib.org/examples/animation/basic_example_writer.py> referred in <None>
2020-02-19 14:41:36 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded file from <GET https://matplotlib.org/examples/animation/basic_example.py> referred in <None>
2020-02-19 14:41:36 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded file from <GET https://matplotlib.org/examples/animation/bayes_update.py> referred in <None>
- Now change `FILES_STORE` in settings.py to a GCS bucket (a fuller settings sketch follows below):
FILES_STORE = 'gs://mybucket/'
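For reference, a GCS-backed store needs a couple of settings; a minimal settings.py sketch, where the bucket name and project ID are placeholders:

```python
# settings.py -- minimal sketch; bucket name and project ID are placeholders
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = 'gs://mybucket/'
GCS_PROJECT_ID = 'my-project-id'  # required when FILES_STORE points at GCS
```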
- If you then run the spider several times, the files are downloaded every time:
2020-02-19 14:50:44 [scrapy.pipelines.files] DEBUG: File (downloaded): Downloaded file from <GET https://matplotlib.org/examples/animation/simple_anim.py> referred in <None>
2020-02-19 14:50:44 [urllib3.connectionpool] DEBUG: https://storage.googleapis.com:443 "POST /upload/storage/v1/b/cdcscrapingresults/o?uploadType=multipart HTTP/1.1" 200 843
2020-02-19 14:50:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://matplotlib.org/examples/animation/double_pendulum_animated.py> (referer: None)
2020-02-19 14:50:44 [scrapy.pipelines.files] DEBUG: File (downloaded): Downloaded file from <GET https://matplotlib.org/examples/animation/double_pendulum_animated.py> referred in <None>
2020-02-19 14:50:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://matplotlib.org/examples/api/collections_demo.html>
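For context, whether a file is logged as uptodate or downloaded comes down to the result of the store's `stat_file` method. A simplified sketch of that decision (the real logic lives in `FilesPipeline.media_to_download`; the function name here is illustrative, not Scrapy's actual code):

```python
import time


def should_skip_download(stat_result, expires_days=90):
    """Illustrative sketch of the FilesPipeline "uptodate" decision.

    stat_result is what the store's stat_file() returns:
    {'checksum': ..., 'last_modified': <unix timestamp>}, or {} when the
    blob is missing or its metadata could not be fetched.
    """
    last_modified = stat_result.get('last_modified')
    if not last_modified:
        return False  # no usable metadata -> download the file again
    age_days = (time.time() - last_modified) / (24 * 60 * 60)
    return age_days <= expires_days
```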
Expected behavior: Files should not be downloaded again when the spider runs consecutively. If a file is already on GCS (in the same folder), it should not be downloaded again (provided it was uploaded less than 90 days ago).
Actual behavior: Every time the spider is launched, every file is downloaded again.
Reproduces how often: 100%
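The 90-day window mentioned above corresponds to Scrapy's `FILES_EXPIRES` setting, which defaults to 90 and can be overridden in settings.py:

```python
# settings.py -- days before a stored file is considered expired
FILES_EXPIRES = 90  # Scrapy's default
```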
Versions
- Scrapy: 1.8.0
- lxml: 4.5.0.0
- libxml2: 2.9.10
- cssselect: 1.1.0
- parsel: 1.5.2
- w3lib: 1.21.0
- Twisted: 19.10.0
- Python: 3.8.1 (default, Jan 8 2020, 16:15:59) - [Clang 4.0.1 (tags/RELEASE_401/final)]
- pyOpenSSL: 19.1.0 (OpenSSL 1.1.1d 10 Sep 2019)
- cryptography: 2.8
- Platform: macOS-10.15.3-x86_64-i386-64bit
Top GitHub Comments
I cannot reproduce this bug. @lblanche, are you sure you set up permissions for the bucket correctly? The very first time I tried reproducing it, I got a setup where the service account I used had write permissions, but for some reason calling `get_blob` on the bucket raised a 403, which caused the `stat_file` method in `GCSFilesStore` to fail, and that caused the file to be downloaded every time. After fixing the permissions, everything worked as it should. If that's the case here, I think it would be a good idea to check permissions in `GCSFilesStore`'s `__init__` and display a warning if it's impossible to get a file's metadata from the bucket. (A sketch of such a check follows after these comments.)

Hello, can I work on this issue?
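A minimal sketch of the permission probe proposed in the first comment above, assuming a placeholder project ID, a hypothetical probe blob name, and simplified URI parsing; uploads, `stat_file`, and the rest of Scrapy's actual `GCSFilesStore` are omitted:

```python
import logging

from google.api_core.exceptions import Forbidden
from google.cloud import storage

logger = logging.getLogger(__name__)


class GCSFilesStore:
    # Sketch: only the proposed permission probe in __init__ is shown.
    def __init__(self, uri):
        client = storage.Client(project='my-project-id')  # placeholder
        bucket_name, _, self.prefix = uri[len('gs://'):].partition('/')
        self.bucket = client.bucket(bucket_name)
        try:
            # 'permission-check' is an arbitrary, hypothetical blob name:
            # get_blob() returns None for a missing object but raises
            # Forbidden (403) when metadata reads are not permitted.
            self.bucket.get_blob('permission-check')
        except Forbidden:
            logger.warning(
                "Cannot read object metadata from bucket %r; stat_file() "
                "will fail and every file will be re-downloaded on each "
                "run. Grant the service account the storage.objects.get "
                "permission.", bucket_name,
            )
```

Probing with an arbitrary blob name is enough here, because the failure mode being detected is a metadata-read 403 rather than a missing object.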