Promote a new CrawlSpider that allows overriding `parse`
I see a lot of StackOverflow questions and problems with CrawlSpider and overridden `parse` methods (e.g. https://stackoverflow.com/questions/23511230).
I’d like to see a new implementation, called CrawlSpider2 or CrawlingSpider or something, that uses an internal method with a name other than `parse`, so that users can define their own `parse` method.
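To illustrate the pitfall, here is a minimal toy model in plain Python. It is not Scrapy's actual code: `ToyCrawlSpider`, `follow_links`, and the dict-shaped `page` are hypothetical stand-ins, and the only assumption carried over from the issue is that the base class keeps its crawling logic inside `parse`.

```python
# Toy model only: every name here is a hypothetical stand-in, not Scrapy's API.

class ToyCrawlSpider:
    """Base class whose link-following logic lives inside parse()."""

    def follow_links(self, page):
        # Stand-in for rule-based link extraction.
        return [f"follow:{link}" for link in page["links"]]

    def parse(self, page):
        # The framework calls parse() for every downloaded page, so the
        # crawling logic is reachable only through this method.
        return self.follow_links(page)


class BrokenSpider(ToyCrawlSpider):
    def parse(self, page):
        # Overriding parse() silently replaces the link-following logic.
        return [f"item:{page['title']}"]


page = {"title": "home", "links": ["/a", "/b"]}
base_result = ToyCrawlSpider().parse(page)
broken_result = BrokenSpider().parse(page)
print(base_result)    # links are followed
print(broken_result)  # only the item; no links are followed any more
```

In this model the subclass produces an item for the first page and then stops, which matches the symptom described in the linked questions.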
Then, the question is whether users will expect this `parse` method to be used by default for each downloaded page (in addition to the page being parsed for links with `Rules`), or whether the reference to `parse` should be explicit.
Thoughts?
Issue Analytics
- State:
- Created 9 years ago
- Comments:9 (9 by maintainers)
Top GitHub Comments
Closing #712 in favor of #712 …nice sleight of hand there. 😃
When you haven’t read the source and don’t understand what’s going on, it’s easy to forget that you should, or even need to, call `super()`. Yes, the warning in the docs is also misleading, but someone reading it will know how to fix it (by not using `parse`).
This issue is so common that it’s becoming a “usability bug”: not a bug in itself, but unintuitive and a source of unnecessary headaches.
I propose having another internal `_parse` method called by the `Scraper` instead of `parse`, and using that in spiders that want internal “pre-processing”, then exposing `parse` as the public, documented, no-baggage method to implement/override from a spider.
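A rough sketch of how that proposal could look, again as a plain-Python toy rather than Scrapy's real classes. `ToySpider`, `ToyCrawlSpider`, `toy_scraper`, and the dict-shaped `page` are hypothetical names, and the sketch assumes the "implicit" option from the issue, where `parse` is called for every downloaded page in addition to link following.

```python
# Toy model of the proposal: all names are hypothetical, not Scrapy's API.

class ToySpider:
    def parse(self, page):
        # Public hook with no baggage: subclasses override it freely.
        return []

    def _parse(self, page):
        # Internal entry point; by default it just delegates to parse().
        return list(self.parse(page))


class ToyCrawlSpider(ToySpider):
    def _parse(self, page):
        # Internal "pre-processing": follow links first, then hand the
        # page to the user's parse(). Assumes the implicit behaviour.
        followed = [f"follow:{link}" for link in page["links"]]
        return followed + list(self.parse(page))


class MySpider(ToyCrawlSpider):
    def parse(self, page):
        # User code overrides parse() without touching the crawl logic.
        return [f"item:{page['title']}"]


def toy_scraper(spider, page):
    # Stand-in for the component that dispatches responses: it calls
    # the internal _parse() instead of the public parse().
    return spider._parse(page)


result = toy_scraper(MySpider(), {"title": "home", "links": ["/a"]})
print(result)
```

Because the dispatcher only ever calls `_parse`, overriding `parse` in `MySpider` adds items without disabling link following, and no `super()` call is needed.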