FilesPipeline.file_path always getting response=None
See original GitHub issue.
Description
As said in the documentation, file_path is a method of scrapy.pipelines.files.FilesPipeline that is called once per downloaded item. It returns the download path of the file originating from the specified response.
I'm trying to extend this class, and the file_path method is always getting response=None. The actual default implementation of this method is poor, because it relies on the extension being in the URL instead of in the Content-Type header:
```python
def file_path(self, request, response=None, info=None):
    media_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
    media_ext = os.path.splitext(request.url)[1]
    # Handles empty and wild extensions by trying to guess the
    # mime type then extension or default to empty string otherwise
    if media_ext not in mimetypes.types_map:
        media_ext = ''
        media_type = mimetypes.guess_type(request.url)[0]
        if media_type:
            media_ext = mimetypes.guess_extension(media_type)
    return 'full/%s%s' % (media_guid, media_ext)
```
For example, for this URL the extension is not in the URL itself but in the response headers instead.
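For comparison, an override that falls back to the Content-Type header when the URL carries no usable extension could look roughly like the sketch below. The class name is made up, and the sketch assumes file_path actually receives a response, which is exactly what this issue says does not happen:

```python
import hashlib
import mimetypes
import os

from scrapy.pipelines.files import FilesPipeline
from scrapy.utils.python import to_bytes


class ContentTypeFilesPipeline(FilesPipeline):
    """Hypothetical pipeline preferring the Content-Type header over the URL."""

    def file_path(self, request, response=None, info=None):
        media_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
        media_ext = os.path.splitext(request.url)[1]
        if media_ext not in mimetypes.types_map:
            media_ext = ''
            if response is not None:
                # Derive the extension from the Content-Type header instead of the URL.
                content_type = response.headers.get('Content-Type', b'').decode().split(';')[0].strip()
                media_ext = mimetypes.guess_extension(content_type) or ''
        return 'full/%s%s' % (media_guid, media_ext)
```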
Steps to Reproduce
- Extend the file_path method of the FilesPipeline class (a minimal sketch is shown below).

Expected behavior: file_path receives the actual response.
Actual behavior: response=None, always.
Reproduces how often: 100%
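To make the failure visible, a minimal subclass along these lines could be dropped into a project's pipelines module (the class name, logging, and module path are placeholders, not from the issue):

```python
import logging

from scrapy.pipelines.files import FilesPipeline

logger = logging.getLogger(__name__)


class DebugFilesPipeline(FilesPipeline):
    """Hypothetical subclass that only reports what file_path receives."""

    def file_path(self, request, response=None, info=None):
        # Per the report, this logs response=None for every downloaded file.
        logger.info('file_path(%s): response=%r', request.url, response)
        return super().file_path(request, response=response, info=info)
```

It would be enabled with something like ITEM_PIPELINES = {'myproject.pipelines.DebugFilesPipeline': 1} and a FILES_STORE path in settings.py, where myproject.pipelines is a placeholder for the actual module.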
Versions
Scrapy 2.0.1
Issue Analytics
- State:
- Created: 3 years ago
- Reactions: 2
- Comments: 5 (2 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I was able to work around this by checking whether there was a response and, if not, returning None. This actually worked.
Searching the code, this function is actually called in 3 places in scrapy.pipelines.files.FilesPipeline, so I assumed that file_path is called both before and after the download of the file, which led me to return None if no response is supplied.
I don't know what the call stack of this pipeline looks like, but I think it would be great if the default implementation could gather the extension from the Content-Type header. If you could give me some references on how this class works, or suggestions on how to implement this, I would like to work on it.

When I enable the HTTP cache, on the first run, where there is no cache, the response is not None; but after that, when there is a cache, the response is None.