
Promote a new CrawlSpider that allows overriding `parse`

See original GitHub issue

I see a lot of StackOverflow questions and problems with CrawlSpider and overridden parse methods (e.g. https://stackoverflow.com/questions/23511230).

I’d like to see a new implementation, called CrawlSpider2 or CrawlingSpider or something, that uses an internal method with a name other than parse, so that users can define their own parse method.

Then the question is whether users will expect this parse method to be called by default for each downloaded page (in addition to the page being parsed for links with the Rules), or whether the reference to parse should be explicit.
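The failure mode behind those StackOverflow questions can be sketched without Scrapy itself: in the current CrawlSpider, parse is the internal callback that applies the Rules, so a user override silently replaces rule processing. A minimal stand-in model (the class and method names mirror Scrapy's, but this is an illustration, not the real implementation):

```python
# Stand-in model of Scrapy's dispatch -- NOT real Scrapy code.
# It shows why overriding parse on CrawlSpider breaks link extraction.

class CrawlSpider:
    def parse(self, response):
        # Internal entry point: applies the Rules to extract links to follow.
        return ["followed:" + link for link in response["links"]]

class MySpider(CrawlSpider):
    # The user overrides parse to extract items, as they would on a plain
    # Spider -- the rule-processing logic in CrawlSpider.parse above is
    # now never called, so no links are followed.
    def parse(self, response):
        return ["item:" + response["body"]]

response = {"body": "page1", "links": ["page2", "page3"]}
print(MySpider().parse(response))  # ['item:page1'] -- links silently dropped
```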

Thoughts?

Issue Analytics

  • State: closed
  • Created 9 years ago
  • Comments: 9 (9 by maintainers)

Top GitHub Comments

1 reaction
nyov commented, Sep 18, 2016

Closing #712 in favor of #712 …nice sleight of hand there. 😃

1 reaction
nyov commented, May 24, 2014

When you haven’t read the source and don’t understand what’s going on, it’s easy to forget that you should, or even need to, call super(). Yes, the warning in the docs is also misleading, but someone reading it will know how to fix the problem (by not using parse).

This issue is so common that it’s becoming a “usability bug”: not a bug in itself, but unintuitive and the cause of unnecessary headaches.

I propose having another internal _parse method called by the Scraper instead of parse, using that in spiders that want internal “pre-processing”, and exposing parse as the public, documented, no-baggage method to implement/override in a spider.
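Under that proposal the dispatch might look roughly like this. The names (CrawlSpider2, _parse) are hypothetical sketches of the idea in the comment above, not Scrapy's actual API:

```python
# Sketch of the proposed split: the framework calls the internal _parse,
# which performs the Rule-based pre-processing and then defers to the
# public parse hook. Names are illustrative, not real Scrapy API.

class CrawlSpider2:
    def _parse(self, response):
        # Internal pre-processing: apply the Rules, then hand the
        # response to the user's parse() as well.
        followed = ["followed:" + link for link in response["links"]]
        return followed + list(self.parse(response))

    def parse(self, response):
        # Public, no-baggage hook: safe to override without super().
        return []

class MySpider(CrawlSpider2):
    def parse(self, response):
        return ["item:" + response["body"]]

response = {"body": "page1", "links": ["page2"]}
print(MySpider()._parse(response))  # ['followed:page2', 'item:page1']
```

With this split, an override of parse no longer disables link extraction: the framework-facing entry point and the user-facing hook are separate methods.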

