Fallback parser rules in ItemLoader - discussion for spider maintenance
Related to issue #3771
I’m stoked with the idea of having the ItemLoader support fallback parsers in any API possible, as Scrapy needs to provide convenient ways for developers to keep up with site changes. However, some sites change their layout more often than others, and some of the fallback parser rules become obsolete real fast, posing a problem for the spiders’ long-term maintenance.
With this, the main challenge would be determining if a given fallback css/xpath rule in the parser is safe to remove (meaning that it hasn’t been encountered anywhere during a crawl). We could confirm this by looking at the distribution of how many times each fallback xpath/css rule was used over the full spider job.
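To make the idea concrete, here is a minimal sketch of per-rule hit counting. It is not Scrapy's API: it uses only the standard library, with a plain Counter standing in for the stats collector and ElementTree's limited XPath support standing in for real selectors. The function names, the stat key format (loader/<field>/xpath/<position>) and the sample pages are all made up for illustration.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def extract_with_fallbacks(root, field, xpaths, stats):
    """Return the first non-empty match for `field`, trying `xpaths` in order.

    The position of the rule that matched is counted in `stats`, a plain
    Counter used here as a stand-in for a stats collector (an assumption).
    """
    for position, xpath in enumerate(xpaths, start=1):
        element = root.find(xpath)
        if element is not None and element.text:
            # Hypothetical key format: one counter per field and rule position.
            stats[f"loader/{field}/xpath/{position}"] += 1
            return element.text.strip()
    # No rule matched on this page at all.
    stats[f"loader/{field}/missing"] += 1
    return None


def unused_rules(field, xpaths, stats):
    """List the rules that never matched during the whole job."""
    return [
        xpath
        for position, xpath in enumerate(xpaths, start=1)
        if stats[f"loader/{field}/xpath/{position}"] == 0
    ]


# Hypothetical rules, ordered from the current layout to the oldest one.
PRICE_RULES = [
    ".//span[@class='price']",
    ".//div[@class='cost']",
    ".//p[@id='price']",
]

# Hypothetical pages standing in for the responses seen during a crawl.
PAGES = [
    "<html><body><span class='price'>10</span></body></html>",
    "<html><body><div class='cost'>12</div></body></html>",
    "<html><body><span class='price'>15</span></body></html>",
    "<html><body><h1>No price here</h1></body></html>",
]

stats = Counter()
prices = [
    extract_with_fallbacks(ET.fromstring(page), "price", PRICE_RULES, stats)
    for page in PAGES
]
stale = unused_rules("price", PRICE_RULES, stats)
```

At the end of the job, `stats` holds the hit distribution per rule and `stale` lists the rules that never fired, which are the candidates for removal. A rule with zero hits in a single job is only a hint; it may still be needed for pages that job did not reach.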
I’d like to discuss the idea of:
- how should this information be better presented?
- where might we put this info, via the logs? via stats?
- should this feature be put into the ItemLoader class itself? or should it be subclassed for better backward compatibility (as this might have an effect on performance)?
- and lastly, is this feature even worth implementing in Scrapy itself? or should it be implemented in a separate repo as a Scrapy plugin?
Cheers!
Issue Analytics
- Created 4 years ago
- Reactions:3
- Comments:8 (6 by maintainers)
Top GitHub Comments
Hi everyone! As discussed, we’ll start this one out as a separate Scrapy plugin. I’ve begun the development in https://github.com/BurnzZ/scrapy-loader-upkeep with the bare minimum working components for the Stats API. Cheers!
Cool idea. It would help simplify the cases where several expressions are needed, and also help detect outdated expressions. I’d prefer to have it as a Scrapy plugin, and to use stats rather than logs for the hit count.