
FilesPipeline.file_path always getting response=None

See original GitHub issue

Description

As stated in the documentation, file_path is a method of scrapy.pipelines.files.FilesPipeline that is called once per downloaded item.

It returns the download path of the file originating from the specified response.

I’m trying to extend this class, and the file_path method always receives response=None. The current default implementation of this method is poor because it relies on the extension being part of the URL instead of the Content-Type header:

def file_path(self, request, response=None, info=None):
    media_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
    media_ext = os.path.splitext(request.url)[1]
    # Handles empty and wild extensions by trying to guess the
    # mime type then extension or default to empty string otherwise
    if media_ext not in mimetypes.types_map:
        media_ext = ''
        media_type = mimetypes.guess_type(request.url)[0]
        if media_type:
            media_ext = mimetypes.guess_extension(media_type)
    return 'full/%s%s' % (media_guid, media_ext)

For example, for this URL the extension is not in the URL but in the headers instead.
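
To illustrate the point (the concrete URL from the report is not reproduced here, so the URL and header value below are made up), the standard-library calls the default implementation relies on find nothing in an extensionless URL, while a Content-Type header still maps to an extension:

import mimetypes
import os

# Hypothetical URL whose path carries no file extension.
url = "https://example.com/download/report"
print(os.path.splitext(url)[1])                      # '' -> default file_path finds no extension

# Hypothetical Content-Type header sent by the server for that URL.
print(mimetypes.guess_extension("application/pdf"))  # '.pdf'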

Steps to Reproduce

  1. Extend the file_path method of the FilesPipeline class (see the minimal sketch below)

Expected behavior: get the actual response

Actual behavior: response=None always

Reproduces how often: 100%
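
A minimal reproduction sketch (the pipeline class name and project layout below are hypothetical, not taken from the original report):

# pipelines.py -- subclass that logs whether a response was passed to file_path
import logging

from scrapy.pipelines.files import FilesPipeline

logger = logging.getLogger(__name__)


class ResponseLoggingFilesPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None):
        # Per this report (Scrapy 2.0.1), this logs True on every call.
        logger.info("file_path called, response is None: %s", response is None)
        return super().file_path(request, response=response, info=info)

Enable it via ITEM_PIPELINES = {'myproject.pipelines.ResponseLoggingFilesPipeline': 1} together with a FILES_STORE setting, and run any spider that yields items with a file_urls field ('myproject' is a placeholder).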

Versions

Scrapy 2.0.1

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 2
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
scratchmex commented on Mar 28, 2020

I was able to work around this by checking whether a response was supplied and returning None if not. This actually worked:

def file_path(self, request, response=None, info=None):
    # file_path is also called before the download (with response=None);
    # skip that call and only build the path once a response is available.
    if not response:
        return None
    ext = mimetypes.guess_extension(
        response.headers.get('Content-Type').decode('utf-8'))
    # Same naming scheme as the default implementation above.
    media_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
    return 'full/%s%s' % (media_guid, ext or '')

Searching the code, this method is actually called in three places in scrapy.pipelines.files.FilesPipeline:

# line 507
def file_downloaded(self, response, request, info):
    path = self.file_path(request, response=response, info=info)
    buf = BytesIO(response.body)
    checksum = md5sum(buf)
    buf.seek(0)
    self.store.persist_file(path, buf, info)
    return checksum

def media_to_download(self, request, info):
    # [...]
    # line 422
    path = self.file_path(request, info=info)
    dfd = defer.maybeDeferred(self.store.stat_file, path, info)

def media_downloaded(self, response, request, info):
    # [...]
    # line 477
    try:
        path = self.file_path(request, response=response, info=info)
        checksum = self.file_downloaded(response, request, info)

So I assumed that file_path is called both before and after the file is downloaded, which led me to return None when no response is supplied.

I don’t know what the call stack for this pipeline looks like, but I think it would be great if the default implementation could derive the extension from the Content-Type header. If you could give me some pointers on how this class works, or suggestions on how to implement this, I’d like to work on it.
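
One possible shape for this, building on the workaround above; this is only a sketch, and both the fallback to the URL when the header is missing and the handling of the pre-download call (where response is None) are assumptions rather than anything the maintainers have confirmed:

import hashlib
import mimetypes
import os

from scrapy.pipelines.files import FilesPipeline
from scrapy.utils.python import to_bytes


class ContentTypeFilesPipeline(FilesPipeline):
    """Sketch: prefer the Content-Type header over the URL for the extension."""

    def file_path(self, request, response=None, info=None):
        media_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()

        media_ext = ''
        if response is not None:
            content_type = response.headers.get('Content-Type')
            if content_type:
                # Strip any "; charset=..." parameters before guessing.
                mime = content_type.decode('utf-8').split(';')[0].strip()
                media_ext = mimetypes.guess_extension(mime) or ''

        if not media_ext:
            # Fall back to the URL, as the current default implementation does
            # (this branch is also what runs on the pre-download call).
            media_ext = os.path.splitext(request.url)[1]
            if media_ext not in mimetypes.types_map:
                media_ext = ''
                media_type = mimetypes.guess_type(request.url)[0]
                if media_type:
                    media_ext = mimetypes.guess_extension(media_type) or ''

        return 'full/%s%s' % (media_guid, media_ext)

Note that because media_to_download calls file_path without a response (line 422 above), the path computed before the download can differ from the one computed afterwards whenever the extension is only recoverable from the header, so the existing-file check may miss files stored under the header-derived name.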

0 reactions
lubobill1990 commented on Jun 6, 2020

When I enable the HTTP cache, on the first run (when there is no cache yet) the response is not None, but on later runs, when the response comes from the cache, response is None.
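
If this refers to Scrapy’s built-in HTTP cache middleware, a minimal settings sketch for reproducing the two runs (these are standard Scrapy settings; the behaviour described above is the commenter’s observation, not something verified here):

# settings.py -- enable the built-in HTTP cache
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = 'httpcache'        # default value
HTTPCACHE_EXPIRATION_SECS = 0      # 0 = cached responses never expire

On the first crawl the cache directory is populated; on later crawls responses are served from it.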

Read more comments on GitHub >

Top Results From Across the Web

  • Scrapy Override file_path from FilesPipeline - python
    Btw, I'm new on python - scrapy. pipelines.py from scrapy.pipelines.files import FilesPipeline class secFilesPipeline(FilesPipeline): ...
  • Downloading and processing files and images
    The ImagesPipeline is an extension of the FilesPipeline, customizing the field names and adding custom behavior for images. file_path(self, request, response ...
  • Custom Files Pipeline in Scrapy never downloads Files even ...
    Coding example for the question: Custom Files Pipeline in Scrapy never downloads Files even though logs show all functions being accessed.
  • how to download and save a file with scrapy
    I could crawl inside the site and get to the form I need and then I find two buttons ... to download files, ...
  • Learn How to Download Files with Scrapy
    If you don't know what web scraping is, you will get a general idea from ... The default implementation of the FilesPipeline does not ...
