How to use the Rule in CrawlSpider to track the response that Splash returns
I would like to use a Rule to follow links from the response that Splash renders. But with SplashRequest, the Rule does not take effect. So I used the rule's process_request hook to rewrite the request URL into a Splash HTTP API request:
```python
from scrapy import Request
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'innda'

    rules = (
        Rule(LinkExtractor(allow=('node_\d+\.htm',)), process_request='splash_request', follow=True),
        Rule(LinkExtractor(allow=('content_\d+\.htm',)), callback='one_parse'),
    )

    def start_requests(self):
        yield Request(url)

    def splash_request(self, request):
        # RENDER_HTML_URL points at the Splash render.html endpoint
        request = request.replace(url=RENDER_HTML_URL + request.url)
        return request
```
But relative links in the rendered page are then resolved against the Splash URL instead of the original site's URL.
What is the solution?
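The breakage described above can be reproduced with the standard library alone: once the Splash endpoint is treated as the page's own URL, any relative link in the rendered HTML resolves against Splash's host rather than the crawled site. A minimal sketch, assuming a hypothetical local Splash instance on port 8050:

```python
from urllib.parse import urljoin

# The response appears to come from the Splash endpoint, not the real site
# (hypothetical local Splash instance)
splash_page = "http://localhost:8050/render.html?url=http://example.com/node_1.htm"

# A relative link found in the rendered HTML...
link = "content_2.htm"

# ...resolves against the Splash host instead of example.com
print(urljoin(splash_page, link))  # http://localhost:8050/content_2.htm
```

This is why LinkExtractor stops matching: the extracted URLs no longer live under the crawled site's domain.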
Issue Analytics
- Created: 6 years ago
- Reactions: 1
- Comments: 11 (3 by maintainers)
Top Results From Across the Web
- How to pass splash cookie to Scrapy Rule attribute in ...: "I am trying to maintain a user session while scraping a site using the splash session handling as described on the splash Github..."
- How to pass splash cookie to Scrapy Rule attribute to ... - Reddit: "I am trying to maintain a user session while scraping a site using the splash session handling as described on the splash Github..."
- Is this the right way to use scrapyJs with CrawlSpider?: "Hello. You probably want to use Splash for Requests that CrawlSpider generates from the rules. See the `process_request` argument when defining CrawlSpider Rules."
- Spiders — Scrapy 2.7.1 documentation: "The parse method is in charge of processing the response and returning scraped data and/or more URLs to follow."
- Crawlspider and Splash - Support - Zyte: "Hi there, i coded a normal spider using splash and ur great samples on github ... a similar job like a crawlspider but..."
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
- What about doing something like this?
- Why are you setting `dont_process_response=True`?
- If someone runs into the same problem of needing to use Splash in a CrawlSpider (with Rule and LinkExtractor) BOTH for parse_item and the initial start_requests, that's my solution:
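The comments above point at the usual remedy: route rule-generated requests through scrapy-splash's SplashRequest in process_request, since scrapy-splash's response classes report the original page URL so LinkExtractor keeps working. Independent of that, the snippet in the question has a smaller bug worth noting: the target URL is concatenated onto the Splash API URL without percent-encoding, so any `?`, `&`, or `#` in the target truncates it. A minimal stdlib sketch of building the raw render.html URL correctly (the endpoint address is an assumption, not from the issue):

```python
from urllib.parse import urlencode

# Assumed local Splash instance; adjust to your deployment
SPLASH_URL = "http://localhost:8050/render.html"

def splash_render_url(target_url, wait=0.5):
    """Build a Splash HTTP API URL, percent-encoding the target URL.

    The snippet in the question concatenates RENDER_HTML_URL + request.url
    directly; query characters in the target must be encoded or Splash
    will see a truncated URL.
    """
    return SPLASH_URL + "?" + urlencode({"url": target_url, "wait": wait})

print(splash_render_url("http://example.com/node_1.htm?page=2"))
```

With scrapy-splash installed, the cleaner equivalent is to return `SplashRequest(request.url, ...)` from the rule's process_request instead of rewriting the URL by hand.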