
scrapy-splash recursive crawl using CrawlSpider not working


Hi!

I have integrated scrapy-splash into my CrawlSpider via process_request in the rules, like this:

    def process_request(self, request):
        # Route the request through Splash and ask for the rendered HTML.
        request.meta['splash'] = {
            'args': {
                'html': 1,
            }
        }
        return request

The problem is that the crawl only renders the URLs at the first depth. I also wonder how I can receive a response even when it has a bad HTTP status code or is a redirect.
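On the second question: by default Scrapy's HttpError and Redirect middlewares keep non-2xx and redirected responses away from your callbacks. A minimal sketch of the request.meta keys that let them through (assuming the default middlewares are enabled; the status codes listed are just examples):

```python
# Meta dict combining the Splash settings with the standard Scrapy keys
# that deliver error/redirect responses to the callback.
request_meta = {
    'splash': {
        'args': {
            'html': 1,  # ask Splash for the rendered HTML
        }
    },
    # HttpErrorMiddleware: let these non-2xx statuses reach the callback.
    'handle_httpstatus_list': [404, 500],
    # RedirectMiddleware: deliver 3xx responses as-is instead of following them.
    'dont_redirect': True,
}
```

You would pass this dict as the `meta` argument when building the request (or merge it into `request.meta` inside `process_request`).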

Thanks in advance,

Issue Analytics

  • State: open
  • Created: 7 years ago
  • Reactions: 2
  • Comments: 36 (2 by maintainers)

Top GitHub Comments

11 reactions
dwj1324 commented, Jun 8, 2017

I also hit the same issue today and found that CrawlSpider performs a response type check in its _requests_to_follow function:

def _requests_to_follow(self, response):
    if not isinstance(response, HtmlResponse):
        return
    ...

However, responses generated by Splash are SplashTextResponse or SplashJsonResponse instances, so that check makes Splash responses yield no requests to follow.
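The mechanism can be shown with a toy class hierarchy (the class names mirror Scrapy's, but these are empty stand-ins, not the real classes): SplashTextResponse subclasses TextResponse, not HtmlResponse, so the isinstance check fails.

```python
# Toy model of the relevant class hierarchy.
class Response: ...
class TextResponse(Response): ...
class HtmlResponse(TextResponse): ...          # what CrawlSpider checks for
class SplashTextResponse(TextResponse): ...    # what Splash actually returns

resp = SplashTextResponse()
print(isinstance(resp, HtmlResponse))  # False -> _requests_to_follow returns early
```

This is why the fix below widens the isinstance check to also accept the Splash response types.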

8 reactions
sp-philippe-oger commented, Apr 26, 2019

@MontaLabidi Your solution worked for me.

This is how my code looks:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.http import HtmlResponse
from scrapy_splash.response import SplashJsonResponse, SplashTextResponse


class MySuperCrawler(CrawlSpider):
    name = 'mysupercrawler'
    allowed_domains = ['example.com']
    start_urls = ['https://www.example.com']

    rules = (
        Rule(LxmlLinkExtractor(
            restrict_xpaths='//div/a'),
            follow=True
        ),
        Rule(LxmlLinkExtractor(
            restrict_xpaths='//div[@class="pages"]/li/a'),
            process_request="use_splash",
            follow=True
        ),
        Rule(LxmlLinkExtractor(
            restrict_xpaths='//a[@class="product"]'),
            callback='parse_item',
            process_request="use_splash"
        )
    )

    def _requests_to_follow(self, response):
        if not isinstance(
                response,
                (HtmlResponse, SplashJsonResponse, SplashTextResponse)):
            return
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = self._build_request(n, link)
                yield rule.process_request(r)

    def use_splash(self, request):
        request.meta.update(splash={
            'args': {
                'wait': 1,
            },
            'endpoint': 'render.html',
        })
        return request

    def parse_item(self, response):
        pass

This works perfectly for me.
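One version caveat: since Scrapy 2.0, Rule.process_request is called with the response as a second argument, so on newer Scrapy the hook needs an extra parameter. A standalone sketch of the updated signature (DummyRequest is a hypothetical stand-in for scrapy.Request, used only to make the example runnable):

```python
class DummyRequest:
    # Hypothetical stand-in for scrapy.Request, just to exercise the hook.
    def __init__(self):
        self.meta = {}

def use_splash(request, response):
    # Scrapy >= 2.0 passes (request, response); older versions pass only request.
    request.meta.update(splash={
        'args': {
            'wait': 1,
        },
        'endpoint': 'render.html',
    })
    return request

req = use_splash(DummyRequest(), response=None)
print(req.meta['splash']['endpoint'])  # render.html
```

If you stay on Scrapy < 2.0, the single-argument form shown in the comment above is what the framework expects.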
