how to get redirect urls with scrapy-splash

I don’t know how to get the redirect URLs with scrapy-splash. Can you help me? For example, http://xxx.xxx.xxx/1.php redirects to http://xxx.xxx.xxx/index.php. How can I get http://xxx.xxx.xxx/index.php with scrapy-splash? Below is my code, which returns http://xxx.xxx.xxx/1.php instead of http://xxx.xxx.xxx/index.php:

    def parse_get(self, response):
        item = CrawlerItem()
        item['code'] = response.status
        item['current_url'] = response.url
        # The line below prints http://xxx.xxx.xxx/1.php, not the redirect target
        print(response.url)


    self.lua_script = """
    function main(splash, args)
      assert(splash:go{
        splash.args.url,
        http_method=splash.args.http_method,
        body=splash.args.body,
        headers={['Cookie']='%s'},
      })
      assert(splash:wait(0.5))

      splash:on_request(function(request)
        request:set_proxy{
          host = "%s",
          port = %d
        }
      end)

      return {cookies = splash:get_cookies(), html = splash:html()}
    end
    """ % (self.cookie, a[0], a[1])

    url = 'http://xxx.xxx.xxx/1.php'
    SplashRequest(url, self.parse_get, endpoint='execute', magic_response=True, meta={'handle_httpstatus_all': True}, args={'lua_source': self.lua_script})

Issue Analytics

Created 6 years ago
Comments: 15 (6 by maintainers)

Top GitHub Comments

civanescu commented, Jan 22, 2019

So, is there any solution to see the redirected URL (the new one) inside scrapy-splash?

lopuhin commented, Nov 29, 2017

@3xp10it Splash handles redirects by itself, so the result you are getting is from the page it was redirected to. To get its URL, you can add url = splash:url() to the return values (see the example in the README under “Use a Lua script to get an HTML response with cookies, headers, body and method set to correct values”). After that, response.url should be the URL of the redirected page.
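
For illustration, here is a minimal sketch of that suggestion, not the maintainers' exact code. It trims the Lua script from the question down to the essentials (no cookie header or proxy) and adds `url = splash:url()` to the returned table. The names `LUA_SCRIPT`, `build_splash_kwargs`, `make_request`, and `final_url` are hypothetical helpers introduced only for this example. It assumes scrapy-splash is installed and configured, and that `magic_response=True` (the default) fills `response.url` from the `url` key, as the README describes.

```python
# Lua script for Splash's `execute` endpoint. The only change that matters
# for redirects is the extra `url = splash:url()` entry in the result table.
LUA_SCRIPT = """
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(0.5))
  return {
    url = splash:url(),            -- URL of the page after any redirects
    html = splash:html(),
    cookies = splash:get_cookies(),
  }
end
"""


def build_splash_kwargs(lua_source=LUA_SCRIPT):
    """Keyword arguments for SplashRequest (hypothetical helper)."""
    return {
        "endpoint": "execute",
        "magic_response": True,
        "meta": {"handle_httpstatus_all": True},
        "args": {"lua_source": lua_source},
    }


def make_request(url, callback):
    """Build the request; imported lazily so the sketch loads without scrapy-splash."""
    from scrapy_splash import SplashRequest
    return SplashRequest(url, callback, **build_splash_kwargs())


def final_url(response):
    """Return the post-redirect URL from a Splash JSON response.

    Prefers the `url` key of the decoded result (`response.data`) and
    falls back to `response.url` if the key or attribute is absent.
    """
    data = getattr(response, "data", None) or {}
    return data.get("url", response.url)
```

In a spider, `make_request('http://xxx.xxx.xxx/1.php', self.parse_get)` would be yielded from `start_requests`, and `parse_get` could then store `final_url(response)` in `item['current_url']`.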