MediaPipeline (and ImagesPipeline/FilesPipeline) does not handle HTTP redirections
Basically, my spider is unable to download the files because the URLs provided in file_urls redirect to the final download link. Because of the following code, the redirect downloader middleware is effectively disabled for these requests, which makes the download fail.
    def _check_media_to_download(self, result, request, info):
        if result is not None:
            return result
        if self.download_func:
            # this ugly code was left only to support tests. TODO: remove
            dfd = mustbe_deferred(self.download_func, request, info.spider)
            dfd.addCallbacks(
                callback=self.media_downloaded, callbackArgs=(request, info),
                errback=self.media_failed, errbackArgs=(request, info))
        else:
            request.meta['handle_httpstatus_all'] = True
            dfd = self.crawler.engine.download(request, info.spider)
            dfd.addCallbacks(
                callback=self.media_downloaded, callbackArgs=(request, info),
                errback=self.media_failed, errbackArgs=(request, info))
        return dfd
And here is the check in the redirect middleware that disables it:
    if (request.meta.get('dont_redirect', False) or
            response.status in getattr(spider, 'handle_httpstatus_list', []) or
            response.status in request.meta.get('handle_httpstatus_list', []) or
            request.meta.get('handle_httpstatus_all', False)):
        return response
My question is: what is the point of setting handle_httpstatus_all when it effectively bypasses all of the redirect checks?
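To make the effect of that flag concrete, here is a minimal sketch that mirrors the quoted condition using plain values instead of Scrapy objects. The function name `redirect_is_skipped` and its parameters are made up for illustration: `meta` stands in for `request.meta` and `spider_statuses` for the spider's `handle_httpstatus_list` attribute.

```python
def redirect_is_skipped(status, meta, spider_statuses=()):
    """Return True when the quoted condition would hand the response back
    unchanged instead of following the redirect.

    Illustrative stand-ins only: `meta` plays the role of request.meta and
    `spider_statuses` the role of spider.handle_httpstatus_list.
    """
    return bool(
        meta.get('dont_redirect', False)
        or status in spider_statuses
        or status in meta.get('handle_httpstatus_list', [])
        or meta.get('handle_httpstatus_all', False)
    )


# An ordinary request: the 302 is not skipped, so it would be followed.
print(redirect_is_skipped(302, {}))  # False

# What the media pipeline sends: the flag short-circuits the whole check.
print(redirect_is_skipped(302, {'handle_httpstatus_all': True}))  # True
```

With `handle_httpstatus_all` set, the last clause is true for every status code, so a 3xx response is handed back as-is and the redirect is never followed.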
Issue Analytics
- State:
- Created 7 years ago
- Reactions: 12
- Comments: 11 (4 by maintainers)
Top GitHub Comments
@jbothma To answer your question:
Yes. Since all scheduled requests go through the download middlewares, it will go through OffsiteMiddleware, so it will do the usual match against spider.allowed_domains.
Regarding the issue, I think handle_httpstatus_all should be replaced with a handle_httpstatus_list that excludes some 300 codes. For FilesPipeline, the callback media_downloaded() by default raises an exception on anything that is not 200 or has an empty body, so this pipeline doesn't benefit from this setting at all, other than wrapping the error. Excluding the 300 codes wouldn't break anything other than enabling redirection, which might be unwanted for some reason, but I bet it's the other way around more often than not.
I think implementing a pipeline setting to manage this would be an ideal solution.
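The suggestion above could look roughly like the following sketch. It is not Scrapy code: `media_request_meta`, `allow_redirects`, and `REDIRECT_CODES` are hypothetical names, and the set of redirect status codes is an assumption about which 3xx codes should be left to the redirect middleware.

```python
# Assumed set of 3xx codes that should be left to the redirect middleware.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def media_request_meta(allow_redirects):
    """Build the meta entries for a media request (hypothetical helper).

    With allow_redirects=False this matches the current behaviour quoted in
    the issue; with True it lists every status except the redirect codes,
    so the redirect check no longer short-circuits on a 3xx response.
    """
    if not allow_redirects:
        return {'handle_httpstatus_all': True}
    allowed = [code for code in range(200, 600) if code not in REDIRECT_CODES]
    return {'handle_httpstatus_list': allowed}


meta = media_request_meta(allow_redirects=True)
print(302 in meta['handle_httpstatus_list'])  # False
print(404 in meta['handle_httpstatus_list'])  # True
```

A pipeline setting, as proposed in the comment, would then only need to supply the `allow_redirects` flag, leaving the current behaviour as the default.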
Hi, I ran into this issue today too.
Are there any updates on this bug, or a reason the pull request wasn't merged?