
Fallback parser rules in ItemLoader - discussion for spider maintenance


Related to issue #3771

I’m stoked about the idea of having the ItemLoader support fallback parsers, in whatever API form, since Scrapy needs to provide convenient ways for developers to keep up with site changes. However, some sites change their layout more often than others, and some fallback parser rules become obsolete very quickly, which poses a problem for the spiders’ long-term maintenance.

Given this, the main challenge is determining whether a given fallback css/xpath rule in the parser is safe to remove (meaning that it was never matched anywhere during a crawl). We could confirm this by looking at the distribution of how many times each fallback xpath/css rule was used over the full spider job.
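To make the counting idea concrete, here is a minimal sketch, not Scrapy's or any plugin's actual API. It assumes the standard library's `xml.etree.ElementTree` (with its limited XPath support) as a stand-in for Scrapy selectors, and a plain `Counter` as a stand-in for a stats collector. The names `extract_with_fallbacks`, `unused_rules`, `NAME_RULES` and the sample pages are all hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical fallback rules for one field, ordered from preferred to oldest.
NAME_RULES = [
    ".//h1[@class='title']",
    ".//div[@id='name']",
    ".//span[@class='old-title']",
]


def extract_with_fallbacks(root, field, rules, stats):
    """Try each rule in order and count which rule position produced a value."""
    for position, rule in enumerate(rules):
        matches = [el.text for el in root.findall(rule) if el.text]
        if matches:
            stats[f"{field}/{position}"] += 1
            return matches[0]
    # No rule matched on this page; track that separately.
    stats[f"{field}/missing"] += 1
    return None


def unused_rules(field, rules, stats):
    """Return the rules that never matched during the whole run."""
    return [
        rule
        for position, rule in enumerate(rules)
        if stats[f"{field}/{position}"] == 0
    ]


# Made-up pages standing in for responses seen during one crawl.
PAGES = [
    "<html><body><h1 class='title'>A</h1></body></html>",
    "<html><body><h1 class='title'>B</h1></body></html>",
    "<html><body><div id='name'>C</div></body></html>",
]

stats = Counter()
names = [
    extract_with_fallbacks(ET.fromstring(page), "name", NAME_RULES, stats)
    for page in PAGES
]
obsolete = unused_rules("name", NAME_RULES, stats)
```

After a full run, `obsolete` lists the rules that were never hit, which are the candidates for removal. A rule with a zero count in a single job is only a hint; a site may still serve the old layout on pages that job did not visit.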

I’d like to discuss the following:

How should this information best be presented?
Where should we put this info: in the logs, or in the stats?
Should this feature go into the ItemLoader class itself, or into a subclass for better backward compatibility (since it might affect performance)?

And lastly, is this feature even worth implementing in Scrapy itself, or should it live in a separate repo as a Scrapy plugin?

Cheers!

Issue Analytics

Created 4 years ago
Reactions: 3
Comments: 8 (6 by maintainers)

Top GitHub Comments

BurnzZ commented, Jun 19, 2019 (2 reactions)

Hi everyone! As discussed, we’ll start this one out as a separate Scrapy plugin. I’ve begun the development in https://github.com/BurnzZ/scrapy-loader-upkeep with the bare minimum working components for the Stats API. Cheers!

peonone commented, May 27, 2019 (2 reactions)

Cool idea. It will help simplify cases where several expressions are needed, and also help detect outdated expressions. I prefer to have it as a Scrapy plugin, and to use stats rather than logs for the hit count.
