allow spiders to return dicts instead of Items
See original GitHub issue.
In many cases the requirement to define and yield Items from a spider is an unnecessary complication.
An example from Scrapy tutorial:
import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item
It can be made simpler with dicts instead of Items:
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            yield {
                'title': sel.xpath('a/text()').extract(),
                'link': sel.xpath('a/@href').extract(),
                'desc': sel.xpath('text()').extract(),
            }
The version with dicts gives a developer fewer concepts to learn, and it is easier to explain.
When field metadata is not used and data is exported to JSON/XML, yielding Python dicts should be enough. Even when you export to CSV, dicts could be enough: columns can be set explicitly by a user.
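For the CSV case, the column order for dict items could be pinned in the project settings rather than inferred from dict keys. A minimal settings.py sketch, assuming a feed-export setting such as FEED_EXPORT_FIELDS is available for this purpose:

```python
# Hypothetical settings.py fragment: fix the CSV columns (and their order)
# explicitly, so dict items don't need an Item class to define the schema.
FEED_EXPORT_FIELDS = ["title", "link", "desc"]
```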
This should also prevent tickets like https://github.com/scrapy/scrapy/issues/968.
Issue Analytics
- State:
- Created 9 years ago
- Reactions: 3
- Comments: 8 (6 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
👍 It will make it easier for newcomers and make Scrapy more suitable for small projects. This also means anything that acts like a dict would be suitable as an item.
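To illustrate that point with a hypothetical example (BookItem is made up here, not Scrapy API): any mapping-like object would do as an item under this proposal, e.g. a small class built on the standard library's UserDict:

```python
from collections import UserDict

# Hypothetical dict-like item: it behaves like a dict, so under this
# proposal a spider could yield it just like a plain dict.
class BookItem(UserDict):
    def __init__(self, **fields):
        super().__init__(fields)

item = BookItem(title="Learning Python", link="http://example.com")
# The usual dict operations work on it:
assert item["title"] == "Learning Python"
assert dict(item) == {"title": "Learning Python", "link": "http://example.com"}
```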
@nramirezuy - For well defined projects maybe it would make sense to add a pipeline that requires items to be of a certain type, in addition to any other validation you have.
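A rough sketch of such a type-checking pipeline. In a real project the exception would be scrapy.exceptions.DropItem; a stand-in is defined here so the sketch is self-contained, and DmozItem is just a placeholder class:

```python
# Stand-in for scrapy.exceptions.DropItem, so this sketch runs on its own.
class DropItem(Exception):
    pass

# Placeholder for the project's item class.
class DmozItem(dict):
    pass

class RequireItemTypePipeline:
    """Hypothetical pipeline: drop anything that is not a DmozItem."""

    required_type = DmozItem

    def process_item(self, item, spider):
        if not isinstance(item, self.required_type):
            raise DropItem(f"unexpected item type: {type(item).__name__}")
        return item
```

Plain dicts (or any other type) would then be rejected at the pipeline stage, while projects that don't need this strictness simply omit the pipeline.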
+1, will be nice to be able to yield dicts! Less concepts for the newbie user. 😉