allow spiders to return dicts instead of Items

In many cases the requirement to define and yield Items from a spider is an unnecessary complication.

An example from Scrapy tutorial:

import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item

It can be made simpler with dicts instead of Items:

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            yield {
                'title': sel.xpath('a/text()').extract(),
                'link': sel.xpath('a/@href').extract(),
                'desc': sel.xpath('text()').extract(),
            }

The version with dicts gives a developer fewer concepts to learn, and it is easier to explain.

When field metadata is not used and data is exported to JSON or XML, yielding Python dicts should be enough. Even when you export to CSV, dicts could be enough: columns can be set explicitly by a user.
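To illustrate the CSV point, here is a minimal sketch that uses only the standard library `csv` module, not Scrapy's own exporter. The helper name `dicts_to_csv` and the sample rows are made up for illustration; the idea is that a user-supplied column list is enough to serialize plain dicts, even when some keys are missing.

```python
import csv
import io


def dicts_to_csv(rows, fieldnames):
    """Serialize dict rows to CSV text using an explicit column order.

    Hypothetical helper: `fieldnames` is the user-chosen list of columns.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=fieldnames,
        extrasaction="ignore",  # silently skip keys not listed in fieldnames
        lineterminator="\n",
    )
    writer.writeheader()
    for row in rows:
        # Keys missing from a row are written as empty cells (restval default).
        writer.writerow(row)
    return buffer.getvalue()


# Sample data (assumed): the second row has no "desc" key.
rows = [
    {"title": "Book A", "link": "http://example.com/a", "desc": "first"},
    {"link": "http://example.com/b", "title": "Book B"},
]
csv_text = dicts_to_csv(rows, ["title", "link", "desc"])
print(csv_text)
```

In current Scrapy releases the `FEED_EXPORT_FIELDS` setting plays a similar role for feed exports; the sketch above only shows the idea in isolation.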

This should also prevent tickets like https://github.com/scrapy/scrapy/issues/968.

Issue Analytics

  • State: closed
  • Created 9 years ago
  • Reactions: 3
  • Comments: 8 (6 by maintainers)

Top GitHub Comments

1 reaction
shaneaevans commented, Mar 11, 2015

👍 It will make it easier for newcomers and make Scrapy more suitable for small projects. This also means anything that acts like a dict would be suitable as an item.

@nramirezuy - For well defined projects maybe it would make sense to add a pipeline that requires items to be of a certain type, in addition to any other validation you have.
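A rough sketch of what such a type-checking pipeline could look like is below. It follows the usual `process_item(self, item, spider)` pipeline shape, but the class names `DmozRecord` and `RequireItemTypePipeline` are hypothetical, and a plain `TypeError` is raised so the snippet runs without Scrapy installed; in a real project you would more likely raise `scrapy.exceptions.DropItem` and enable the class through `ITEM_PIPELINES`.

```python
class DmozRecord(dict):
    """Hypothetical stand-in for a project's declared item class."""


class RequireItemTypePipeline:
    """Hypothetical pipeline: reject items that are not of an allowed type."""

    # Assumption: the project lists the item types it considers valid.
    allowed_types = (DmozRecord,)

    def process_item(self, item, spider):
        if not isinstance(item, self.allowed_types):
            # A real Scrapy pipeline would probably raise DropItem here.
            raise TypeError(
                "unexpected item type: %s" % type(item).__name__
            )
        return item


pipeline = RequireItemTypePipeline()

# An instance of the declared type passes through unchanged.
record = DmozRecord(title="Python Books", link="http://example.com/")
accepted = pipeline.process_item(record, spider=None)

# A bare dict is rejected by this stricter project-level check.
try:
    pipeline.process_item({"title": "Python Books"}, spider=None)
    rejected = False
except TypeError:
    rejected = True
```

Small projects can simply leave such a pipeline out and yield plain dicts, while stricter projects keep their type guarantees.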

1 reaction
eliasdorneles commented, Mar 10, 2015

+1, will be nice to be able to yield dicts! Less concepts for the newbie user. 😉
