MediaPipeline (and ImagesPipeline/FilesPipeline) does not handle HTTP redirections

My spider cannot download the files because the URLs provided in file_urls redirect to the final download link. Because of the following code in the pipeline, the redirect downloader middleware is effectively disabled for these requests, so the download fails.

def _check_media_to_download(self, result, request, info):
    if result is not None:
        return result
    if self.download_func:
        # this ugly code was left only to support tests. TODO: remove
        dfd = mustbe_deferred(self.download_func, request, info.spider)
        dfd.addCallbacks(
            callback=self.media_downloaded, callbackArgs=(request, info),
            errback=self.media_failed, errbackArgs=(request, info))
    else:
        request.meta['handle_httpstatus_all'] = True
        dfd = self.crawler.engine.download(request, info.spider)
        dfd.addCallbacks(
            callback=self.media_downloaded, callbackArgs=(request, info),
            errback=self.media_failed, errbackArgs=(request, info))
    return dfd

And here is the check in the redirect middleware that disables it:

if (request.meta.get('dont_redirect', False) or
        response.status in getattr(spider, 'handle_httpstatus_list', []) or
        response.status in request.meta.get('handle_httpstatus_list', []) or
        request.meta.get('handle_httpstatus_all', False)):
    return response
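
To make the effect of that condition concrete, here is a minimal standalone sketch that models it with plain dictionaries. The function name redirect_is_skipped and the two sample meta dicts are hypothetical and only for illustration; this is not Scrapy's code, just the same boolean logic applied to a 302 response.

```python
# Simplified stand-in for the condition quoted above; not Scrapy's actual code.
# `redirect_is_skipped` is a hypothetical name used only for this illustration.
def redirect_is_skipped(meta, status, spider_statuses=()):
    """Return True when the response would be passed through without redirecting."""
    return bool(
        meta.get('dont_redirect', False)
        or status in spider_statuses
        or status in meta.get('handle_httpstatus_list', [])
        or meta.get('handle_httpstatus_all', False)
    )


# Meta as set by the pipeline code above: every status goes to the callback.
pipeline_meta = {'handle_httpstatus_all': True}
# Meta of an ordinary request with no overrides.
plain_meta = {}

print(redirect_is_skipped(pipeline_meta, 302))  # the 302 is not followed
print(redirect_is_skipped(plain_meta, 302))     # the 302 would be followed
```

With handle_httpstatus_all set, the last clause is always true, so the status code is never even inspected and the 302 response is handed straight to media_downloaded.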

My question is: what is the point of setting handle_httpstatus_all when it effectively disables all of the checks?

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Reactions: 12
  • Comments: 11 (4 by maintainers)

Top GitHub Comments

Granitosaurus commented, Aug 22, 2016 (3 reactions)

@jbothma To answer your question:

This should all respect the allowed domains, right? Is that automatically taken care of by the downloader?

Yes. Since all scheduled requests go through the download middlewares, they will go through OffsiteMiddleware, which does the usual match against spider.allowed_domains.

Regarding the issue, I think handle_httpstatus_all should be replaced with a handle_httpstatus_list that excludes some 3xx codes. For FilesPipeline, the media_downloaded() callback by default raises an exception on any response that is not 200 or that has an empty body, so this pipeline doesn’t benefit from the setting at all, other than wrapping the error. Excluding the 3xx codes from that list wouldn’t break anything; it would only enable redirection, which might be unwanted in some cases, but I bet it is wanted more often than not.

I think implementing a pipeline setting to manage this would be an ideal solution.
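
The handle_httpstatus_list idea above can be sketched in plain Python. Everything here is an assumption for illustration: the helper names, the choice of which 3xx codes count as redirects, and the 100 to 599 status range are not taken from Scrapy. The check function only models the redirect middleware condition quoted in the issue.

```python
# Hypothetical sketch of the suggestion; names and status choices are assumptions.
REDIRECT_STATUSES = frozenset({301, 302, 303, 307, 308})


def statuses_without_redirects():
    """Build a status list covering everything except the redirect codes."""
    return [s for s in range(100, 600) if s not in REDIRECT_STATUSES]


def redirect_is_skipped(meta, status):
    """Model of the redirect middleware condition, using only request meta."""
    return bool(
        meta.get('dont_redirect', False)
        or status in meta.get('handle_httpstatus_list', [])
        or meta.get('handle_httpstatus_all', False)
    )


# What the pipeline could put in request.meta instead of handle_httpstatus_all.
meta = {'handle_httpstatus_list': statuses_without_redirects()}

print(redirect_is_skipped(meta, 302))  # redirect is left to the middleware
print(redirect_is_skipped(meta, 404))  # error status still reaches the callback
```

Under this arrangement, error statuses such as 404 still reach media_downloaded() as before, while 3xx responses are left for the redirect middleware to follow.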

SeanPollock commented, Feb 23, 2017 (1 reaction)

Hi, I ran into this issue today too.

Any updates on this bug, or a reason the pull request wasn’t merged?
