
scrapy-splash recursive crawl using CrawlSpider not working


Hi!

I have integrated scrapy-splash into my CrawlSpider via process_request in the rules, like this:

    def process_request(self, request):
        # Route the request through Splash and ask for the rendered HTML.
        request.meta['splash'] = {
            'args': {
                'html': 1,
            }
        }
        return request

The problem is that the crawl only renders the URLs at the first depth. I also wonder how I can receive a response even when it has a bad HTTP status code or is a redirect.
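On the second question: by default Scrapy's HttpError and Redirect middlewares keep non-2xx and redirected responses away from your callbacks. A minimal sketch of the request.meta keys that let them through (assuming the default middlewares are enabled; the status codes listed are just examples):

```python
# Meta dict combining the Splash settings with the standard Scrapy keys
# that deliver error/redirect responses to the callback.
request_meta = {
    'splash': {
        'args': {
            'html': 1,  # ask Splash for the rendered HTML
        }
    },
    # HttpErrorMiddleware: let these non-2xx statuses reach the callback.
    'handle_httpstatus_list': [404, 500],
    # RedirectMiddleware: deliver 3xx responses as-is instead of following them.
    'dont_redirect': True,
}
```

You would pass this dict as the `meta` argument when building the request (or merge it into `request.meta` inside `process_request`).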

Thanks in advance,

Issue Analytics

  • State: open
  • Created: 7 years ago
  • Reactions: 2
  • Comments: 36 (2 by maintainers)

Top GitHub Comments

11 reactions
dwj1324 commented, Jun 8, 2017

I also hit the same issue today and found that CrawlSpider performs a response type check in its _requests_to_follow function:

def _requests_to_follow(self, response):
    if not isinstance(response, HtmlResponse):
        return
    ...

However, responses generated by Splash are SplashTextResponse or SplashJsonResponse instances, so that check makes Splash responses yield no requests to follow.
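The mechanism can be shown with a toy class hierarchy (the class names mirror Scrapy's, but these are empty stand-ins, not the real classes): SplashTextResponse subclasses TextResponse, not HtmlResponse, so the isinstance check fails.

```python
# Toy model of the relevant class hierarchy.
class Response: ...
class TextResponse(Response): ...
class HtmlResponse(TextResponse): ...          # what CrawlSpider checks for
class SplashTextResponse(TextResponse): ...    # what Splash actually returns

resp = SplashTextResponse()
print(isinstance(resp, HtmlResponse))  # False -> _requests_to_follow returns early
```

This is why the fix below widens the isinstance check to also accept the Splash response types.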

8 reactions
sp-philippe-oger commented, Apr 26, 2019

@MontaLabidi Your solution worked for me.

This is how my code looks:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.http import HtmlResponse
from scrapy_splash.response import SplashJsonResponse, SplashTextResponse


class MySuperCrawler(CrawlSpider):
    name = 'mysupercrawler'
    allowed_domains = ['example.com']
    start_urls = ['https://www.example.com']

    rules = (
        Rule(LxmlLinkExtractor(
            restrict_xpaths='//div/a'),
            follow=True
        ),
        Rule(LxmlLinkExtractor(
            restrict_xpaths='//div[@class="pages"]/li/a'),
            process_request="use_splash",
            follow=True
        ),
        Rule(LxmlLinkExtractor(
            restrict_xpaths='//a[@class="product"]'),
            callback='parse_item',
            process_request="use_splash"
        )
    )

    def _requests_to_follow(self, response):
        if not isinstance(
                response,
                (HtmlResponse, SplashJsonResponse, SplashTextResponse)):
            return
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = self._build_request(n, link)
                yield rule.process_request(r)

    def use_splash(self, request):
        request.meta.update(splash={
            'args': {
                'wait': 1,
            },
            'endpoint': 'render.html',
        })
        return request

    def parse_item(self, response):
        pass

This works perfectly for me.
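One version caveat: since Scrapy 2.0, Rule.process_request is called with the response as a second argument, so on newer Scrapy the hook needs an extra parameter. A standalone sketch of the updated signature (DummyRequest is a hypothetical stand-in for scrapy.Request, used only to make the example runnable):

```python
class DummyRequest:
    # Hypothetical stand-in for scrapy.Request, just to exercise the hook.
    def __init__(self):
        self.meta = {}

def use_splash(request, response):
    # Scrapy >= 2.0 passes (request, response); older versions pass only request.
    request.meta.update(splash={
        'args': {
            'wait': 1,
        },
        'endpoint': 'render.html',
    })
    return request

req = use_splash(DummyRequest(), response=None)
print(req.meta['splash']['endpoint'])  # render.html
```

If you stay on Scrapy < 2.0, the single-argument form shown in the comment above is what the framework expects.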
