
FilesDownloader with GCS downloading up-to-date files again

See original GitHub issue

Description

It seems that when using Google Cloud Storage, the FilesPipeline does not behave as expected with regard to up-to-date files.

Steps to Reproduce

  1. Clone this repo: git clone https://github.com/QYQ323/python.git
  2. Run the spider: scrapy crawl examples
  3. If you run it several times, the FilesPipeline has the right behavior: it does not re-download up-to-date files:
2020-02-19 14:41:36 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded file from <GET https://matplotlib.org/examples/animation/basic_example_writer.py> referred in <None>
2020-02-19 14:41:36 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded file from <GET https://matplotlib.org/examples/animation/basic_example.py> referred in <None>
2020-02-19 14:41:36 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded file from <GET https://matplotlib.org/examples/animation/bayes_update.py> referred in <None>
  4. Now change FILES_STORE in settings.py to a GCS bucket:

FILES_STORE = 'gs://mybucket/'
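
For reference, a minimal settings.py sketch for pointing the FilesPipeline at GCS (the bucket name and project ID are placeholders; GCS_PROJECT_ID is the setting GCSFilesStore uses to build its google-cloud-storage client, and FILES_EXPIRES is the 90-day freshness window mentioned below):

# settings.py - minimal FilesPipeline + GCS setup (placeholder values)
ITEM_PIPELINES = {
    "scrapy.pipelines.files.FilesPipeline": 1,
}
FILES_STORE = "gs://mybucket/"    # placeholder bucket
GCS_PROJECT_ID = "my-project-id"  # placeholder Google Cloud project
FILES_EXPIRES = 90                # days before a stored file counts as stale (Scrapy default)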

  5. If you then run the spider several times, the files are downloaded every time:
2020-02-19 14:50:44 [scrapy.pipelines.files] DEBUG: File (downloaded): Downloaded file from <GET https://matplotlib.org/examples/animation/simple_anim.py> referred in <None>
2020-02-19 14:50:44 [urllib3.connectionpool] DEBUG: https://storage.googleapis.com:443 "POST /upload/storage/v1/b/cdcscrapingresults/o?uploadType=multipart HTTP/1.1" 200 843
2020-02-19 14:50:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://matplotlib.org/examples/animation/double_pendulum_animated.py> (referer: None)
2020-02-19 14:50:44 [scrapy.pipelines.files] DEBUG: File (downloaded): Downloaded file from <GET https://matplotlib.org/examples/animation/double_pendulum_animated.py> referred in <None>
2020-02-19 14:50:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://matplotlib.org/examples/api/collections_demo.html>

Expected behavior: Files should not be downloaded again when the spider is run consecutively. If a file is already on GCS (in the same folder), it should not be downloaded again (provided it was uploaded less than 90 days ago).
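
(The 90-day window is Scrapy's default FILES_EXPIRES. As a rough sketch, simplified rather than the exact FilesPipeline code, the freshness check amounts to the following; note that if the store's stat_file() fails, no timestamp is available and the pipeline falls back to downloading:)

import time

FILES_EXPIRES = 90  # Scrapy default, in days

def is_up_to_date(last_modified):
    # last_modified is the epoch timestamp reported by the store's stat_file().
    # If stat_file() raises (e.g. GCS returns a 403), freshness cannot be
    # proven and the file is downloaded again.
    age_days = (time.time() - last_modified) / (60 * 60 * 24)
    return age_days < FILES_EXPIRES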

Actual behavior: Every time the spider is launched, every file is downloaded again.

Reproduces how often: 100%

Versions

  • Scrapy: 1.8.0
  • lxml: 4.5.0.0
  • libxml2: 2.9.10
  • cssselect: 1.1.0
  • parsel: 1.5.2
  • w3lib: 1.21.0
  • Twisted: 19.10.0
  • Python: 3.8.1 (default, Jan 8 2020, 16:15:59) - [Clang 4.0.1 (tags/RELEASE_401/final)]
  • pyOpenSSL: 19.1.0 (OpenSSL 1.1.1d 10 Sep 2019)
  • cryptography: 2.8
  • Platform: macOS-10.15.3-x86_64-i386-64bit

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 1
  • Comments: 10 (7 by maintainers)

Top GitHub Comments

2 reactions
michalp2213 commented, Mar 22, 2020

I cannot reproduce this bug. @lblanche, are you sure you set up permissions for the bucket correctly? The very first time I tried reproducing it, I got a setup where the service account I used had write permissions, but for some reason calling get_blob on the bucket raised a 403, which caused the stat_file method in GCSFilesStore to fail, and that caused the file to be downloaded every time. After fixing the permissions, everything worked as it should. If that’s the case here, I think it would be a good idea to check permissions in GCSFilesStore’s __init__ and display a warning if it’s impossible to get a file’s metadata from the bucket.
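
A minimal sketch of that suggestion, assuming the google-cloud-storage client (the helper name warn_if_unreadable is hypothetical; Bucket.test_iam_permissions is a real client method that returns the subset of the requested permissions the caller actually holds):

import logging

from google.cloud import storage

logger = logging.getLogger(__name__)

def warn_if_unreadable(bucket_name, project_id):
    # Hypothetical startup probe for GCSFilesStore.__init__: if the
    # credentials cannot read object metadata, stat_file() will raise
    # and every file will be re-downloaded on every run.
    client = storage.Client(project=project_id)
    bucket = client.bucket(bucket_name)
    granted = bucket.test_iam_permissions(["storage.objects.get"])
    if "storage.objects.get" not in granted:
        logger.warning(
            "Cannot read object metadata in bucket %r; up-to-date checks "
            "will fail and every file will be downloaded again.",
            bucket_name,
        )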

1 reaction
michalp2213 commented, Mar 21, 2020

Hello, can I work on this issue?

Read more comments on GitHub.

Top Results From Across the Web

  • Download objects | Cloud Storage | Google Cloud
  • How to download files from Google Cloud Storage with Python ...
  • Uploading and Downloading Zip Files In GCP Cloud Storage ...
  • Is there a way to load a GCS file instead of first downloading it ...
  • Connect to Google Cloud Storage - Looker Studio Help
