
How to use the Rule in CrawlSpider to track the response that Splash returns


I would like to use Rule to follow links from the response that Splash renders, but when I use SplashRequest the rules do not take effect. So I tried the rule's process_request hook instead, rewriting each Request URL so that it points at the Splash HTTP API:

from scrapy import Request
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'innda'

    rules = (
        Rule(LinkExtractor(allow=('node_\d+\.htm',)), process_request='splash_request', follow=True),
        Rule(LinkExtractor(allow=('content_\d+\.htm',)), callback='one_parse'),
    )

    def start_requests(self):
        # start URL omitted in the original question
        yield Request(url)

    def splash_request(self, request):
        # Rewrite the URL so the request goes straight to Splash's render.html
        # endpoint (RENDER_HTML_URL is defined elsewhere in the project)
        request = request.replace(url=RENDER_HTML_URL + request.url)
        return request
But then relative links extracted from the rendered page are resolved against the Splash URL instead of the original page URL.

What is the solution?
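
That behaviour can be reproduced in isolation: LinkExtractor resolves relative hrefs against response.url, and after the rewrite response.url is the Splash endpoint rather than the page Splash rendered. A minimal sketch with made-up URLs and page body:

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

# A rendered page containing a relative link, as Splash might return it
body = b'<html><body><a href="node_12.htm">next</a></body></html>'

# From Scrapy's point of view the request went to the Splash endpoint,
# so that is what response.url contains
resp = HtmlResponse(
    url='http://localhost:8050/render.html?url=http://example.com/index.htm',
    body=body, encoding='utf-8')

links = LinkExtractor(allow=(r'node_\d+\.htm',)).extract_links(resp)
print([link.url for link in links])
# ['http://localhost:8050/node_12.htm'] -- resolved against Splash, not example.com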

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Reactions: 1
  • Comments: 11 (3 by maintainers)

Top GitHub Comments

5 reactions
redapple commented, Jul 11, 2017

What about doing something like this?

def splash_request(self, request):
    return SplashRequest(url=request.url, args={'wait': 10}, meta={'real_url': request.url})

Why are you setting dont_process_response=True?
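
As a rough sketch, here is how that suggestion could slot into the spider from the question, assuming the scrapy-splash middlewares are enabled in settings.py (the start URL and the one_parse body are only illustrative; the rule patterns and names come from the question):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy_splash import SplashRequest

class MySpider(CrawlSpider):
    name = 'innda'
    start_urls = ['https://example.com/']  # illustrative start URL

    rules = (
        # Render listing pages through Splash and keep following links
        Rule(LinkExtractor(allow=(r'node_\d+\.htm',)),
             process_request='splash_request', follow=True),
        # Hand detail pages to the regular callback
        Rule(LinkExtractor(allow=(r'content_\d+\.htm',)), callback='one_parse'),
    )

    def splash_request(self, request, response=None):
        # response is only passed by newer Scrapy versions; the default keeps
        # this compatible either way. Wrap the extracted request so Splash
        # renders it, keeping the original URL in meta for the callbacks.
        return SplashRequest(url=request.url, args={'wait': 10},
                             meta={'real_url': request.url})

    def one_parse(self, response):
        # illustrative callback over the Splash-rendered page
        yield {
            'url': response.meta.get('real_url', response.url),
            'title': response.css('title::text').get(),
        }

One caveat: the stock CrawlSpider only extracts links from HtmlResponse objects, so links on the Splash-rendered pages themselves will not be followed unless the spider also accepts the Splash response types, which is what the _requests_to_follow override in the next comment addresses.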

4 reactions
janwendt commented, Aug 27, 2020

If someone runs into the same problem and needs to use Splash in a CrawlSpider (with Rule and LinkExtractor) both for parse_item and for the initial start_requests, here is my solution:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy_splash import SplashRequest, SplashJsonResponse, SplashTextResponse
from scrapy.http import HtmlResponse

class Abc(scrapy.Item):
    name = scrapy.Field()

class AbcSpider(CrawlSpider):
    name = "abc"
    allowed_domains = ['abc.de']
    start_urls = ['https://www.abc.com/xyz']

    # trailing comma keeps rules a tuple, as CrawlSpider expects an iterable of Rule objects
    rules = (Rule(LinkExtractor(restrict_xpaths='//h2[@class="abc"]'), callback='parse_item', process_request="use_splash"),)

    def start_requests(self):        
        for url in self.start_urls:
            yield SplashRequest(url, args={'wait': 15}, meta={'real_url': url})

    def use_splash(self, request):
        request.meta['splash'] = {
                'endpoint':'render.html',
                'args':{
                    'wait': 15,
                    }
                }
        return request

    def _requests_to_follow(self, response):
        if not isinstance(
                response,
                (HtmlResponse, SplashJsonResponse, SplashTextResponse)):
            return
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = self._build_request(n, link)
                yield rule.process_request(r)

    def parse_item(self, response):
        item = Abc()
        item['name'] = response.xpath('//div[@class="abc-name"]/h1/text()').get()
        return item
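
The _requests_to_follow override above is the key piece: the stock CrawlSpider only extracts links from HtmlResponse objects, and the responses scrapy-splash produces (SplashJsonResponse / SplashTextResponse) are not HtmlResponse subclasses, so without the override the rules never fire on Splash-rendered pages. For completeness, here is a sketch of the project settings this kind of setup assumes, following the scrapy-splash README (the Splash URL is just a local default):

# settings.py -- scrapy-splash wiring assumed by the spiders above
# (values follow the scrapy-splash README; point SPLASH_URL at your instance)

SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'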
