allow spiders to return dicts instead of Items
See original GitHub issue.
In many cases the requirement to define and yield Items from a spider is an unnecessary complication.
An example from Scrapy tutorial:
import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item
It can be made simpler with dicts instead of Items:
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            yield {
                'title': sel.xpath('a/text()').extract(),
                'link': sel.xpath('a/@href').extract(),
                'desc': sel.xpath('text()').extract(),
            }
The version with dicts gives a developer fewer concepts to learn, and it is easier to explain.
When field metadata is not used and data is exported to JSON/XML, yielding Python dicts should be enough. Even when you export to CSV, dicts could be enough: columns can be set explicitly by a user.
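For the CSV case, the column order for dict items could be pinned in the project settings rather than inferred from dict keys. A minimal settings.py sketch, assuming a feed-export setting such as FEED_EXPORT_FIELDS is available for this purpose:

```python
# Hypothetical settings.py fragment: fix the CSV columns (and their order)
# explicitly, so dict items don't need an Item class to define the schema.
FEED_EXPORT_FIELDS = ["title", "link", "desc"]
```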
This should also prevent tickets like https://github.com/scrapy/scrapy/issues/968.
Issue Analytics
- State:
- Created 9 years ago
- Reactions: 3
- Comments: 8 (6 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
👍 It will make it easier for newcomers and make Scrapy more suitable for small projects. This also means anything that acts like a dict would be suitable as an item.
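To illustrate that point with a hypothetical example (BookItem is made up here, not Scrapy API): any mapping-like object would do as an item under this proposal, e.g. a small class built on the standard library's UserDict:

```python
from collections import UserDict

# Hypothetical dict-like item: it behaves like a dict, so under this
# proposal a spider could yield it just like a plain dict.
class BookItem(UserDict):
    def __init__(self, **fields):
        super().__init__(fields)

item = BookItem(title="Learning Python", link="http://example.com")
# The usual dict operations work on it:
assert item["title"] == "Learning Python"
assert dict(item) == {"title": "Learning Python", "link": "http://example.com"}
```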
@nramirezuy - For well defined projects maybe it would make sense to add a pipeline that requires items to be of a certain type, in addition to any other validation you have.
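A rough sketch of such a type-checking pipeline. In a real project the exception would be scrapy.exceptions.DropItem; a stand-in is defined here so the sketch is self-contained, and DmozItem is just a placeholder class:

```python
# Stand-in for scrapy.exceptions.DropItem, so this sketch runs on its own.
class DropItem(Exception):
    pass

# Placeholder for the project's item class.
class DmozItem(dict):
    pass

class RequireItemTypePipeline:
    """Hypothetical pipeline: drop anything that is not a DmozItem."""

    required_type = DmozItem

    def process_item(self, item, spider):
        if not isinstance(item, self.required_type):
            raise DropItem(f"unexpected item type: {type(item).__name__}")
        return item
```

Plain dicts (or any other type) would then be rejected at the pipeline stage, while projects that don't need this strictness simply omit the pipeline.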
+1, will be nice to be able to yield dicts! Less concepts for the newbie user. 😉