
How to use the Rule in CrawlSpider to track the response that Splash returns


I would like to use Rule to follow links from the response that Splash renders, but when I use SplashRequest the rules do not take effect. So I tried the rule's process_request hook instead, rewriting each Request URL so that it points at the Splash HTTP API:

from scrapy import Request
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'innda'

    rules = (
        Rule(LinkExtractor(allow=('node_\d+\.htm',)), process_request='splash_request', follow=True),
        Rule(LinkExtractor(allow=('content_\d+\.htm',)), callback='one_parse'),
    )

    def start_requests(self):
        # start URL omitted in the original question
        yield Request(url)

    def splash_request(self, request):
        # Rewrite the URL so the request goes straight to Splash's render.html
        # endpoint (RENDER_HTML_URL is defined elsewhere in the project)
        request = request.replace(url=RENDER_HTML_URL + request.url)
        return request
But then relative links extracted from the rendered page are resolved against the Splash URL instead of the original page URL.

What is the solution?
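
That behaviour can be reproduced in isolation: LinkExtractor resolves relative hrefs against response.url, and after the rewrite response.url is the Splash endpoint rather than the page Splash rendered. A minimal sketch with made-up URLs and page body:

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

# A rendered page containing a relative link, as Splash might return it
body = b'<html><body><a href="node_12.htm">next</a></body></html>'

# From Scrapy's point of view the request went to the Splash endpoint,
# so that is what response.url contains
resp = HtmlResponse(
    url='http://localhost:8050/render.html?url=http://example.com/index.htm',
    body=body, encoding='utf-8')

links = LinkExtractor(allow=(r'node_\d+\.htm',)).extract_links(resp)
print([link.url for link in links])
# ['http://localhost:8050/node_12.htm'] -- resolved against Splash, not example.com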

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Reactions: 1
  • Comments: 11 (3 by maintainers)

Top GitHub Comments

5 reactions
redapple commented, Jul 11, 2017

What about doing something like this?

def splash_request(self, request):
    return SplashRequest(url=request.url, args={'wait': 10}, meta={'real_url': request.url})

Why are you setting dont_process_response=True?
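
As a rough sketch, here is how that suggestion could slot into the spider from the question, assuming the scrapy-splash middlewares are enabled in settings.py (the start URL and the one_parse body are only illustrative; the rule patterns and names come from the question):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy_splash import SplashRequest

class MySpider(CrawlSpider):
    name = 'innda'
    start_urls = ['https://example.com/']  # illustrative start URL

    rules = (
        # Render listing pages through Splash and keep following links
        Rule(LinkExtractor(allow=(r'node_\d+\.htm',)),
             process_request='splash_request', follow=True),
        # Hand detail pages to the regular callback
        Rule(LinkExtractor(allow=(r'content_\d+\.htm',)), callback='one_parse'),
    )

    def splash_request(self, request, response=None):
        # response is only passed by newer Scrapy versions; the default keeps
        # this compatible either way. Wrap the extracted request so Splash
        # renders it, keeping the original URL in meta for the callbacks.
        return SplashRequest(url=request.url, args={'wait': 10},
                             meta={'real_url': request.url})

    def one_parse(self, response):
        # illustrative callback over the Splash-rendered page
        yield {
            'url': response.meta.get('real_url', response.url),
            'title': response.css('title::text').get(),
        }

One caveat: the stock CrawlSpider only extracts links from HtmlResponse objects, so links on the Splash-rendered pages themselves will not be followed unless the spider also accepts the Splash response types, which is what the _requests_to_follow override in the next comment addresses.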

4 reactions
janwendt commented, Aug 27, 2020

If someone runs into the same problem and needs to use Splash in a CrawlSpider (with Rule and LinkExtractor) both for parse_item and for the initial start_requests, here is my solution:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy_splash import SplashRequest, SplashJsonResponse, SplashTextResponse
from scrapy.http import HtmlResponse

class Abc(scrapy.Item):
    name = scrapy.Field()

class AbcSpider(CrawlSpider):
    name = "abc"
    allowed_domains = ['abc.de']
    start_urls = ['https://www.abc.com/xyz']

    # trailing comma keeps rules a tuple, as CrawlSpider expects an iterable of Rule objects
    rules = (Rule(LinkExtractor(restrict_xpaths='//h2[@class="abc"]'), callback='parse_item', process_request="use_splash"),)

    def start_requests(self):        
        for url in self.start_urls:
            yield SplashRequest(url, args={'wait': 15}, meta={'real_url': url})

    def use_splash(self, request):
        request.meta['splash'] = {
                'endpoint':'render.html',
                'args':{
                    'wait': 15,
                    }
                }
        return request

    def _requests_to_follow(self, response):
        if not isinstance(
                response,
                (HtmlResponse, SplashJsonResponse, SplashTextResponse)):
            return
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = self._build_request(n, link)
                yield rule.process_request(r)

    def parse_item(self, response):
        item = Abc()
        item['name'] = response.xpath('//div[@class="abc-name"]/h1/text()').get()
        return item
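
The _requests_to_follow override above is the key piece: the stock CrawlSpider only extracts links from HtmlResponse objects, and the responses scrapy-splash produces (SplashJsonResponse / SplashTextResponse) are not HtmlResponse subclasses, so without the override the rules never fire on Splash-rendered pages. For completeness, here is a sketch of the project settings this kind of setup assumes, following the scrapy-splash README (the Splash URL is just a local default):

# settings.py -- scrapy-splash wiring assumed by the spiders above
# (values follow the scrapy-splash README; point SPLASH_URL at your instance)

SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'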
