Feeds Enhancement: Item Filters

Summary

Currently there is no convenient way to filter items before they are exported. An ItemChecker class can be used to filter items while also providing flexibility to the user.

Motivation/Proposal

Scrapy currently doesn’t have a convenient API for customizing the conditions under which items are exported. Users can use an ItemChecker class to define which items are acceptable for a particular feed.

The ItemChecker class can have three main public methods: accepts, accepts_class and accepts_fields. Scrapy will mainly use the accepts method to decide whether an item is acceptable. accepts_class and accepts_fields will have default behaviors, which users can override should they want to customize them.

from itemadapter import ItemAdapter


class ItemChecker:
    """
    This will be used by FeedExporter to decide if an item should be allowed
    to be exported to a particular feed.
    :param feed_options: options of the feed (from FEEDS) passed from FeedExporter
    :type feed_options: dict
    """
    accepted_items = ()    # item classes the user wants to accept; empty accepts all

    def __init__(self, feed_options):
        # Populate accepted_items with the item_classes value from feed_options, if present.
        if feed_options and feed_options.get('item_classes'):
            self.accepted_items = tuple(feed_options['item_classes'])

    def accepts(self, item):
        """
        Main method to be used by FeedExporter to check if the item is acceptable according
        to the defined constraints. This method uses the accepts_class and accepts_fields
        methods to decide if the item is acceptable.
        :param item: scraped item to check
        :type item: scrapy supported items (dictionaries, Item objects, dataclass objects, and attrs objects)
        :return: `True` if accepted, `False` otherwise
        :rtype: bool
        """
        # Sketch only: ItemAdapter turns any supported item type into a dict of fields.
        return self.accepts_class(item) and self.accepts_fields(ItemAdapter(item).asdict())

    def accepts_class(self, item):
        """
        Method to check if the item is an instance of a class declared in the accepted_items
        list. Can be overridden by users if they want to allow only certain item classes.
        Default behaviour: if accepted_items is empty then all items will be
        accepted, else only instances of the classes in accepted_items will be accepted.
        :param item: scraped item
        :type item: scrapy supported items (dictionaries, Item objects, dataclass objects, and attrs objects)
        :return: `True` if the item's class is in accepted_items, `False` otherwise
        :rtype: bool
        """
        if not self.accepted_items:
            return True
        return isinstance(item, tuple(self.accepted_items))

    def accepts_fields(self, fields):
        """
        Method to check if the fields of the item pass the filtering
        criteria. Users can override this method to add their own custom
        filters.
        Default behaviour: accepts all fields.
        :param fields: all the fields of the scraped item
        :type fields: dict
        :return: `True` if all the fields pass the filtering criteria, else `False`
        :rtype: bool
        """
        return True

Such custom filters can be declared in settings.py. For convenience, item classes can also be declared there directly through item_classes, without needing to create a custom ItemChecker class.

from myproject.filterfile import MyFilter1
from myproject.items import MyItem1, MyItem2

FEEDS = {
    'items1.json': {
        'format': 'json',
        'item_filter': MyFilter1,
    },
    'items2.xml': {
        'format': 'xml',
        'item_classes': (MyItem1, MyItem2),
    },
}
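
To make the intended behaviour of these settings concrete, here is a minimal sketch of how one filter instance per feed could be built and consulted. It assumes that a feed without item_filter falls back to the base ItemChecker. The helpers build_filters and route, the CheapOnly filter, and the Book and Author item classes are all hypothetical, and the ItemChecker shown is a reduced stand-in, not Scrapy's implementation.

```python
# Reduced, hypothetical stand-in for the proposed ItemChecker (dict-like items only).
class ItemChecker:
    accepted_items = ()

    def __init__(self, feed_options):
        item_classes = feed_options.get("item_classes")
        if item_classes:
            self.accepted_items = tuple(item_classes)

    def accepts(self, item):
        return self.accepts_class(item) and self.accepts_fields(dict(item))

    def accepts_class(self, item):
        return not self.accepted_items or isinstance(item, self.accepted_items)

    def accepts_fields(self, fields):
        return True


class Book(dict):
    """Hypothetical item class."""


class Author(dict):
    """Hypothetical item class."""


class CheapOnly(ItemChecker):
    """Hypothetical filter: accept items whose price is below 20."""

    def accepts_fields(self, fields):
        return fields.get("price", 0) < 20


FEEDS = {
    "items1.json": {"format": "json", "item_filter": CheapOnly},
    "items2.xml": {"format": "xml", "item_classes": (Book,)},
}


def build_filters(feeds):
    """Create one filter instance per feed, defaulting to ItemChecker."""
    return {
        uri: options.get("item_filter", ItemChecker)(options)
        for uri, options in feeds.items()
    }


def route(item, filters):
    """Return the feed URIs whose filter accepts the item."""
    return [uri for uri, item_filter in filters.items() if item_filter.accepts(item)]


filters = build_filters(FEEDS)
print(route(Book(title="A", price=10), filters))
print(route(Book(title="C", price=50), filters))
print(route(Author(name="B"), filters))
```

Each feed gets its own filter instance built from its own options, so the same item can be exported to one feed and rejected by another.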

Describe alternatives you’ve considered

This feature builds upon #4576.

Additional context

This feature proposal is part of a GSoC project (see #4963). This issue has been created to get inputs from the Scrapy community to refine the proposed feature.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

1 reaction
drs-11 commented, May 27, 2021

"My only remaining feedback is that accepts_item sounds a bit ambiguous: I would expect ItemChecker.accepts and ItemChecker.accepts_item, both of which get an item as parameter, to do the same. What about something more in line with accepts_fields, like accepts_class or accepts_type?"

I think accepts_class will do it. I’ll update accordingly.

0 reactions
Gallaecio commented, May 27, 2021

OK, so you are saying that there is going to be 1 instance of the indicated item-filtering class per feed. Makes sense. I also see you have updated the API to expect feed_options in __init__ accordingly.

My only remaining feedback is that accepts_item sounds a bit ambiguous: I would expect ItemChecker.accepts and ItemChecker.accepts_item, both of which get an item as parameter, to do the same. What about something more in line with accepts_fields, like accepts_class or accepts_type?
