Fallback parser rules in ItemLoader - discussion for spider maintenance


Related to issue #3771


I’m stoked about the idea of having the ItemLoader support fallback parsers, whichever API ends up being chosen, as Scrapy needs to provide convenient ways for developers to keep up with site changes. However, some sites change their layout more often than others, and some fallback parser rules become obsolete quickly, which poses a problem for the spiders’ long-term maintenance.

With this, the main challenge is determining whether a given fallback css/xpath rule in the parser is safe to remove (meaning that it wasn’t matched anywhere during a crawl). We could confirm this by looking at the distribution of how many times each fallback xpath/css rule was used over the full spider job.
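To make the hit-count idea concrete, here is a minimal sketch using only the standard library. It is not ItemLoader code: `extract_with_fallback`, the `parser/<field>/xpath/<position>` key format, and the sample pages are all hypothetical, and a plain `Counter` stands in for whatever stats store a real spider would use (in Scrapy that would presumably be the crawler's stats collector).

```python
import xml.etree.ElementTree as ET
from collections import Counter


def extract_with_fallback(root, field, xpaths, stats):
    """Return the text of the first rule that matches and record which rule hit."""
    for position, xpath in enumerate(xpaths, start=1):
        node = root.find(xpath)
        if node is not None and node.text:
            # Hypothetical key format: field name plus 1-based rule position.
            stats[f"parser/{field}/xpath/{position}"] += 1
            return node.text.strip()
    # No rule matched on this page at all.
    stats[f"parser/{field}/missing"] += 1
    return None


# Made-up pages: two use the current layout, one uses an older layout.
pages = [
    "<html><body><h2 class='title'>New layout A</h2></body></html>",
    "<html><body><h2 class='title'>New layout B</h2></body></html>",
    "<html><body><h1>Old layout</h1></body></html>",
]
# Primary rule first, then fallbacks in order of preference.
title_rules = [".//h2[@class='title']", ".//h1", ".//span[@id='name']"]

stats = Counter()
titles = [
    extract_with_fallback(ET.fromstring(page), "title", title_rules, stats)
    for page in pages
]
```

After this run, the first rule has two hits, the second has one, and the third has none, which would make the third rule the candidate for removal.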

I’d like to discuss the following:

  1. How should this information be presented?
  2. Where should we put this info: in the logs, or in the stats?
  3. Should this feature go into the ItemLoader class itself, or into a subclass for better backward compatibility (as it might affect performance)?

Lastly, is this feature worth implementing in Scrapy itself, or should it live in a separate repo as a Scrapy plugin?

Cheers!

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Reactions: 3
  • Comments: 8 (6 by maintainers)

Top GitHub Comments

2 reactions
BurnzZ commented, Jun 19, 2019

Hi everyone! As discussed, we’ll start this one out as a separate Scrapy plugin. I’ve begun the development in https://github.com/BurnzZ/scrapy-loader-upkeep with the bare minimum working components for the Stats API. Cheers!

2 reactions
peonone commented, May 27, 2019

Cool idea. It will help simplify the case where several expressions are needed, and also help detect outdated expressions. I prefer to have it as a Scrapy plugin, and to use stats rather than logs for the hit count.
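One reason stats suit this better than logs is that the end-of-job counters can be checked mechanically. The sketch below assumes a hypothetical key format, `parser/<field>/xpath/<position>`, and made-up numbers; `unused_rules` is an illustrative helper, not part of Scrapy or of any plugin.

```python
def unused_rules(stats, field, rule_count, prefix="parser"):
    """Return the 1-based positions of rules that recorded no hits in a job."""
    return [
        position
        for position in range(1, rule_count + 1)
        if stats.get(f"{prefix}/{field}/xpath/{position}", 0) == 0
    ]


# Made-up end-of-job stats; rules that never matched have no key at all.
job_stats = {
    "parser/title/xpath/1": 1520,
    "parser/title/xpath/3": 12,
    "parser/price/xpath/1": 1532,
}

stale_title = unused_rules(job_stats, "title", rule_count=3)
stale_price = unused_rules(job_stats, "price", rule_count=2)
```

A rule that shows up as unused in a single job is only a candidate for removal; comparing several jobs before deleting it would be the safer habit.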


