MediaPipeline (and ImagesPipeline/FilesPipeline) does not handle HTTP redirections

My spider is unable to download files because the URLs provided in file_urls redirect to the final download link. The following code sets handle_httpstatus_all on every media request, which effectively disables the redirect downloader middleware and makes the download fail.

def _check_media_to_download(self, result, request, info):
    if result is not None:
        return result
    if self.download_func:
        # this ugly code was left only to support tests. TODO: remove
        dfd = mustbe_deferred(self.download_func, request, info.spider)
        dfd.addCallbacks(
            callback=self.media_downloaded, callbackArgs=(request, info),
            errback=self.media_failed, errbackArgs=(request, info))
    else:
        request.meta['handle_httpstatus_all'] = True
        dfd = self.crawler.engine.download(request, info.spider)
        dfd.addCallbacks(
            callback=self.media_downloaded, callbackArgs=(request, info),
            errback=self.media_failed, errbackArgs=(request, info))
    return dfd

And here is the check in the redirect middleware that disables it:

if (request.meta.get('dont_redirect', False) or
        response.status in getattr(spider, 'handle_httpstatus_list', []) or
        response.status in request.meta.get('handle_httpstatus_list', []) or
        request.meta.get('handle_httpstatus_all', False)):
    return response

My question is: what is the point of setting handle_httpstatus_all when it effectively bypasses redirect handling entirely?
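
The effect of that condition can be reproduced in isolation. The following is a minimal sketch, not Scrapy's code: redirect_is_skipped is a hypothetical helper that takes a plain dict in place of request.meta and a tuple in place of the spider attribute, and mirrors the four-way check quoted above.

```python
def redirect_is_skipped(meta, status, spider_statuses=()):
    """Return True when the quoted condition would hand the response
    back untouched instead of following the redirect.

    `meta` stands in for request.meta and `spider_statuses` for the
    spider's handle_httpstatus_list attribute (both are assumptions
    made for this illustration).
    """
    return bool(
        meta.get("dont_redirect", False)
        or status in spider_statuses
        or status in meta.get("handle_httpstatus_list", [])
        or meta.get("handle_httpstatus_all", False)
    )


# A plain request: a 302 is followed.
plain = redirect_is_skipped({}, 302)

# A media request as set up by _check_media_to_download: the 302 is
# returned as-is, so the pipeline never sees the redirected file.
media = redirect_is_skipped({"handle_httpstatus_all": True}, 302)
```

With these inputs, plain is False and media is True, which is the behaviour described in the report: the flag short-circuits the check regardless of the status code.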

Issue Analytics

State:
Created 7 years ago
Reactions: 12
Comments: 11 (4 by maintainers)

Top GitHub Comments

3 reactions

Granitosaurus commented, Aug 22, 2016

@jbothma To answer your question:

This should all respect the allowed domains, right? Is that automatically taken care of by the downloader?

Yes. Since all scheduled requests go through the downloader middlewares, they will also go through OffsiteMiddleware, so the usual match against spider.allowed_domains applies.

Regarding the issue, I think handle_httpstatus_all should be replaced with a handle_httpstatus_list that excludes some 3xx codes. For FilesPipeline, the media_downloaded() callback by default raises an exception on any response that is not a 200 or that has an empty body, so this pipeline doesn’t benefit from the setting at all, other than wrapping the error. Excluding the 3xx codes wouldn’t break anything; it would only enable redirection, which might be unwanted in some cases, but I bet it is wanted more often than not.

I think implementing a pipeline setting to manage this would be an ideal solution.
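
A rough sketch of that suggestion, under stated assumptions: media_request_meta and its allow_redirects flag are hypothetical names, and the set of redirect codes (301, 302, 303, 307, 308) is an assumption about which 3xx statuses should be left to the redirect middleware. It is not the code from any pull request.

```python
# Assumed set of statuses that should be left to the redirect middleware.
REDIRECT_STATUSES = frozenset({301, 302, 303, 307, 308})


def media_request_meta(allow_redirects):
    """Build the meta entries for a media request (hypothetical helper).

    With allow_redirects=True, every status except the redirect codes is
    handled by the pipeline, so redirects are followed as usual.
    With allow_redirects=False, the current behaviour is kept.
    """
    if allow_redirects:
        handled = [s for s in range(100, 600) if s not in REDIRECT_STATUSES]
        return {"handle_httpstatus_list": handled}
    return {"handle_httpstatus_all": True}


meta_on = media_request_meta(allow_redirects=True)
meta_off = media_request_meta(allow_redirects=False)
```

With the flag on, 302 is absent from handle_httpstatus_list, so the redirect check no longer matches it, while 404 and 500 still reach media_downloaded() and are reported as failures there. Defaulting the flag to off would keep existing pipelines unchanged.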

1 reaction

SeanPollock commented, Feb 23, 2017

Hi, I ran into this issue today too.

Are there any updates on this bug, or a reason the pull request wasn’t merged?