Feeds Enhancement: Item Filters
Summary
Scrapy currently offers no convenient way to filter items before they are exported. An ItemChecker class could filter items while still giving the user flexibility.
Motivation/Proposal
Scrapy currently doesn’t have any convenient API for customizing the conditions under which items are exported. Users could use an ItemChecker class to define which items are acceptable for a particular feed.
The ItemChecker class can have 3 main public methods: accepts, accepts_class and accepts_fields. Scrapy will mainly use the accepts method to decide whether an item is acceptable; accepts_class and accepts_fields will have default behaviors that users can override should they want to customize them.
class ItemChecker:
    """
    This will be used by FeedExporter to decide if an item should be allowed
    to be exported to a particular feed.

    :param feed_options: FEEDS dictionary passed from FeedExporter
    :type feed_options: dict
    """

    accepted_items = []  # list of Items user wants to accept

    def __init__(self, feed_options):
        # populate accepted_items with item_classes values from feed_options if present
        ...

    def accepts(self, item):
        """
        Main method to be used by FeedExporter to check if the item is acceptable according
        to defined constraints. This method uses the accepts_class and accepts_fields methods
        to decide if the item is acceptable.

        :param item: scraped item which user wants to check if is acceptable
        :type item: scrapy supported items (dictionaries, Item objects, dataclass objects, and attrs objects)
        :return: `True` if accepted, `False` otherwise
        :rtype: bool
        """

    def accepts_class(self, item):
        """
        Method to check if the item is an instance of a class declared in the accepted_items
        list. Can be overridden by the user if they want to allow certain item classes.
        Default behaviour: if accepted_items is empty then all items will be
        accepted, else only items present in accepted_items will be accepted.

        :param item: scraped item
        :type item: scrapy supported items (dictionaries, Item objects, dataclass objects, and attrs objects)
        :return: `True` if item in accepted_items, `False` otherwise
        :rtype: bool
        """

    def accepts_fields(self, fields):
        """
        Method to check if certain fields of the item pass the filtering
        criteria. Users can override this method to add their own custom
        filters.
        Default behaviour: accepts all fields.

        :param fields: all the fields of the scraped item
        :type fields: dict
        :return: `True` if all the fields pass the filtering criteria, else `False`
        :rtype: bool
        """
Such custom filters can be declared in settings.py. For convenience, item classes can also be declared here without needing to create a custom ItemChecker class.
from myproject.filterfile import MyFilter1
from myproject.items import MyItem1, MyItem2

FEEDS = {
    'items1.json': {
        'format': 'json',
        'item_filter': MyFilter1,
    },
    'items2.xml': {
        'format': 'xml',
        'item_classes': (MyItem1, MyItem2),
    },
}
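As a rough illustration of how such a FEEDS setting could be consumed, the sketch below builds one filter instance per feed and asks each one whether it accepts an item. The `build_filters` and `route_item` helpers are hypothetical names, the `ItemChecker` here is a stripped-down stand-in that only checks classes, and the `MyFilter1` rule (a non-empty "title" field) is invented for the example; none of this is taken from FeedExporter itself.

```python
class ItemChecker:
    """Minimal stand-in for the proposed class (class check only)."""

    def __init__(self, feed_options):
        self.accepted_items = tuple(feed_options.get("item_classes", ()))

    def accepts(self, item):
        # No declared classes means every item is accepted.
        return not self.accepted_items or isinstance(item, self.accepted_items)


class MyItem1(dict):  # hypothetical item class
    pass


class MyItem2(dict):  # hypothetical item class
    pass


class MyFilter1(ItemChecker):
    """Hypothetical custom filter: require a non-empty "title" field."""

    def accepts(self, item):
        return bool(item.get("title"))


FEEDS = {
    "items1.json": {"format": "json", "item_filter": MyFilter1},
    "items2.xml": {"format": "xml", "item_classes": (MyItem1, MyItem2)},
}


def build_filters(feeds):
    """Create one filter instance per feed, falling back to ItemChecker."""
    return {
        uri: options.get("item_filter", ItemChecker)(options)
        for uri, options in feeds.items()
    }


def route_item(item, filters):
    """Return the feed URIs whose filter accepts the item."""
    return [uri for uri, checker in filters.items() if checker.accepts(item)]


filters = build_filters(FEEDS)
print(route_item(MyItem1(title="A"), filters))  # accepted by both feeds
print(route_item({"title": "B"}, filters))      # plain dict: items1.json only
print(route_item(MyItem2(), filters))           # no title: items2.xml only
```

Falling back to the base class when no item_filter is given is what lets a feed declare item_classes alone, as in the second feed.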
Describe alternatives you’ve considered
This feature builds upon #4576.
Additional context
This feature proposal is part of a GSoC project (see #4963). This issue has been created to gather input from the Scrapy community and refine the proposed feature.
Top GitHub Comments
I think accepts_class will do it. I’ll update accordingly.

OK, so you are saying that there is going to be 1 instance of the indicated item-filtering class per feed. Makes sense. I also see you have updated the API to expect feed_options in __init__ accordingly.

My only remaining feedback is that accepts_item sounds a bit ambiguous: I would expect ItemChecker.accepts and ItemChecker.accepts_item, both of which get an item as parameter, to do the same. What about something more in line with accepts_fields, like accepts_class or accepts_type?