
FilesPipeline.file_path always getting response=None

See original GitHub issue

Description

As stated in the documentation, file_path is a method of scrapy.pipelines.files.FilesPipeline that is called once per downloaded item.

It returns the download path of the file originating from the specified response.

I’m trying to extend this class, and the file_path method always receives response=None. The current default implementation of this method is poor because it relies on the extension being part of the URL instead of the Content-Type header:

def file_path(self, request, response=None, info=None):
    media_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
    media_ext = os.path.splitext(request.url)[1]
    # Handles empty and wild extensions by trying to guess the
    # mime type then extension or default to empty string otherwise
    if media_ext not in mimetypes.types_map:
        media_ext = ''
        media_type = mimetypes.guess_type(request.url)[0]
        if media_type:
            media_ext = mimetypes.guess_extension(media_type)
    return 'full/%s%s' % (media_guid, media_ext)

For example, for this URL the extension is not in the URL but in the headers instead.
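
To illustrate the point (the concrete URL from the report is not reproduced here, so the URL and header value below are made up), the standard-library calls the default implementation relies on find nothing in an extensionless URL, while a Content-Type header still maps to an extension:

import mimetypes
import os

# Hypothetical URL whose path carries no file extension.
url = "https://example.com/download/report"
print(os.path.splitext(url)[1])                      # '' -> default file_path finds no extension

# Hypothetical Content-Type header sent by the server for that URL.
print(mimetypes.guess_extension("application/pdf"))  # '.pdf'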

Steps to Reproduce

  1. Extend the file_path method of the FilesPipeline class (see the minimal sketch below)

Expected behavior: get the actual response

Actual behavior: response=None always

Reproduces how often: 100%
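
A minimal reproduction sketch (the pipeline class name and project layout below are hypothetical, not taken from the original report):

# pipelines.py -- subclass that logs whether a response was passed to file_path
import logging

from scrapy.pipelines.files import FilesPipeline

logger = logging.getLogger(__name__)


class ResponseLoggingFilesPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None):
        # Per this report (Scrapy 2.0.1), this logs True on every call.
        logger.info("file_path called, response is None: %s", response is None)
        return super().file_path(request, response=response, info=info)

Enable it via ITEM_PIPELINES = {'myproject.pipelines.ResponseLoggingFilesPipeline': 1} together with a FILES_STORE setting, and run any spider that yields items with a file_urls field ('myproject' is a placeholder).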

Versions

Scrapy 2.0.1

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 2
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
scratchmex commented on Mar 28, 2020

I was able to work around this by checking whether a response was supplied and returning None if not. This actually worked:

def file_path(self, request, response=None, info=None):
    # file_path is also called before the download (with response=None);
    # skip that call and only build the path once a response is available.
    if not response:
        return None
    ext = mimetypes.guess_extension(
        response.headers.get('Content-Type').decode('utf-8'))
    # Same naming scheme as the default implementation above.
    media_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
    return 'full/%s%s' % (media_guid, ext or '')

Searching the code, this method is actually called in three places in scrapy.pipelines.files.FilesPipeline:

# line 507
def file_downloaded(self, response, request, info):
    path = self.file_path(request, response=response, info=info)
    buf = BytesIO(response.body)
    checksum = md5sum(buf)
    buf.seek(0)
    self.store.persist_file(path, buf, info)
    return checksum

def media_to_download(self, request, info):
    # [...]
    # line 422
    path = self.file_path(request, info=info)
    dfd = defer.maybeDeferred(self.store.stat_file, path, info)

def media_downloaded(self, response, request, info):
    # [...]
    # line 477
    try:
        path = self.file_path(request, response=response, info=info)
        checksum = self.file_downloaded(response, request, info)

So I assumed that file_path is called both before and after the file is downloaded, which led me to return None when no response is supplied.

I don’t know what the call stack for this pipeline looks like, but I think it would be great if the default implementation could derive the extension from the Content-Type header. If you could give me some pointers on how this class works, or suggestions on how to implement this, I’d like to work on it.
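
One possible shape for this, building on the workaround above; this is only a sketch, and both the fallback to the URL when the header is missing and the handling of the pre-download call (where response is None) are assumptions rather than anything the maintainers have confirmed:

import hashlib
import mimetypes
import os

from scrapy.pipelines.files import FilesPipeline
from scrapy.utils.python import to_bytes


class ContentTypeFilesPipeline(FilesPipeline):
    """Sketch: prefer the Content-Type header over the URL for the extension."""

    def file_path(self, request, response=None, info=None):
        media_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()

        media_ext = ''
        if response is not None:
            content_type = response.headers.get('Content-Type')
            if content_type:
                # Strip any "; charset=..." parameters before guessing.
                mime = content_type.decode('utf-8').split(';')[0].strip()
                media_ext = mimetypes.guess_extension(mime) or ''

        if not media_ext:
            # Fall back to the URL, as the current default implementation does
            # (this branch is also what runs on the pre-download call).
            media_ext = os.path.splitext(request.url)[1]
            if media_ext not in mimetypes.types_map:
                media_ext = ''
                media_type = mimetypes.guess_type(request.url)[0]
                if media_type:
                    media_ext = mimetypes.guess_extension(media_type) or ''

        return 'full/%s%s' % (media_guid, media_ext)

Note that because media_to_download calls file_path without a response (line 422 above), the path computed before the download can differ from the one computed afterwards whenever the extension is only recoverable from the header, so the existing-file check may miss files stored under the header-derived name.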

0 reactions
lubobill1990 commented on Jun 6, 2020

When I enable the HTTP cache, on the first run (when there is no cache yet) the response is not None, but on later runs, when the response comes from the cache, response is None.
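
If this refers to Scrapy’s built-in HTTP cache middleware, a minimal settings sketch for reproducing the two runs (these are standard Scrapy settings; the behaviour described above is the commenter’s observation, not something verified here):

# settings.py -- enable the built-in HTTP cache
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = 'httpcache'        # default value
HTTPCACHE_EXPIRATION_SECS = 0      # 0 = cached responses never expire

On the first crawl the cache directory is populated; on later crawls responses are served from it.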

Read more comments on GitHub >

Top Results From Across the Web

  • Scrapy Override file_path from FilesPipeline - python
    Btw, I'm new on python - scrapy. pipelines.py from scrapy.pipelines.files import FilesPipeline class secFilesPipeline(FilesPipeline): ...
  • Downloading and processing files and images
    The ImagesPipeline is an extension of the FilesPipeline, customizing the field names and adding custom behavior for images. file_path(self, request, response ...
  • Custom Files Pipeline in Scrapy never downloads Files even ...
    Coding example for the question: Custom Files Pipeline in Scrapy never downloads Files even though logs show all functions being accessed.
  • how to download and save a file with scrapy
    I could crawl inside the site and get to the form I need and then I find two buttons ... to download files, ...
  • Learn How to Download Files with Scrapy
    If you don't know what web scraping is, you will get a general idea from ... The default implementation of the FilesPipeline does not ...
