Feeds Enhancement: Batch Delivery Triggers


Summary

Batch creation was introduced recently, but it is limited to item-count constraints. Batch triggers would make the constraints flexible and give users more control over how batches are created.

Motivation/Proposal

Batch delivery can be simplified and made extensible by introducing a Batch class. It would hold information about the current batch, so it can be used to detect when a given constraint has been exceeded and a new batch needs to be created.

BatchPerXItems, BatchPerXMins, and BatchPerXBytes can be provided as builtins, each inheriting from the parent class Batch. The Batch class stores a parameter value, para_val, which is compared against the declared constraint. For BatchPerXItems, para_val can be the total number of items currently in the batch; for BatchPerXMins, it can be the number of minutes elapsed since the batch was created. Some calls that update para_val may therefore not need any information from slot_file.

Batch class prototype:

class Batch:
    """
    Stores information about the current batch and provides methods to
    check and update that information.
    :param file: file of the current batch
    :param constraint: a constraint or limit to figure out when a new batch
        must be created.
    :param para_val: a parameter value which will be updated and be compared
        against constraint to control batch creation.
    :param batch_id: id number of the current batch.
    """

    def __init__(self, feed_options):
        self.feed_options = feed_options
        self.constraint = self.feed_options.get("constraint")
        self.para_val = 0
        self.batch_id = 1

    def update(self):
        """
        Updates the parameter value according to stats related to the
        parameter value and the constraint.
        """

    def should_trigger(self):
        """
        Checks if para_val has crossed the constraint or not.
        :return: `True` if para_val has crossed constraint, else `False`
        :rtype: bool
        """

    def new_batch(self, file):
        """
        Resets parameter value back to its initial value and increments
        self.batch_id. Will be used before starting a new batch. Assigns
        file to current batch's file attribute.
        """

    def get_batch_state(self):
        """
        Returns a dict containing batch attributes with their current value.
        """
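
To make the prototype more concrete, here is a minimal sketch of how an item-count trigger could fill in those methods. Everything in it is an assumption for illustration: the ItemCountBatch name, the choice to treat "crossed" as "reached or exceeded", and the keys returned by get_batch_state are not part of the proposal.

```python
class Batch:
    """Stand-in for the prototype above, with assumed method bodies."""

    def __init__(self, feed_options):
        self.feed_options = feed_options
        self.constraint = feed_options.get("constraint")
        self.para_val = 0
        self.batch_id = 1
        self.file = None

    def update(self):
        raise NotImplementedError

    def should_trigger(self):
        # Assumption: "crossed" means the value has reached the constraint.
        return self.constraint is not None and self.para_val >= self.constraint

    def new_batch(self, file):
        # Reset the counter, move to the next batch id, remember the new file.
        self.para_val = 0
        self.batch_id += 1
        self.file = file

    def get_batch_state(self):
        return {
            "batch_id": self.batch_id,
            "para_val": self.para_val,
            "constraint": self.constraint,
        }


class ItemCountBatch(Batch):
    """Hypothetical trigger: para_val counts the items in the batch."""

    def update(self):
        self.para_val += 1


# Feed 7 items through a batch limited to 3 items each.
batch = ItemCountBatch({"constraint": 3})
triggered = []
for _ in range(7):
    batch.update()
    if batch.should_trigger():
        triggered.append(batch.batch_id)
        batch.new_batch(file=None)
```

With a limit of 3, the loop closes batches 1 and 2 and leaves one item in batch 3. A time-based trigger would override update differently, which is why update takes no arguments in the prototype.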

The desired Batch class can then be activated in settings.py together with its constraints. To help users build complex triggers, a constraint can be set to any builtin type or to an arbitrary object. If no constraint is set, loading the specified Batch class is pointless. Users can add their own custom Batch class by specifying its class path.

settings.py example:

from myproject.customclassfile import CustomBatch
from scrapy.utils.feedbatch import BatchByXItems     # one of the builtin triggers

{
    'items1.json': {
        'format': 'json',
        'batch': {
             'class': BatchByXItems,
             'item_count': 100,
        },
    },
    'items2.xml': {
        'format': 'xml',
        'batch': {
             'class': CustomBatch,
             'some_constraint': 100,
             'another_constraint': 50,
        },
    },
}
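
As a rough sketch of how the 'batch' entry could be turned into an instance, the helper below accepts either a class object or a dotted class path and passes the remaining keys as keyword arguments. The helper names, the DemoBatch class, and the keyword-argument convention are assumptions, not part of the proposal. Scrapy's own load_object helper in scrapy.utils.misc serves a similar purpose; the standard library is used here only to keep the sketch self-contained.

```python
from importlib import import_module


def resolve_batch_class(value):
    """Return a class given either the class itself or a dotted path."""
    if isinstance(value, str):
        module_path, _, name = value.rpartition(".")
        return getattr(import_module(module_path), name)
    return value


def build_batch(feed_options):
    """Instantiate the batch class named in feed_options['batch'], if any."""
    batch_options = dict(feed_options.get("batch") or {})
    if "class" not in batch_options:
        return None  # no batch trigger configured for this feed
    cls = resolve_batch_class(batch_options.pop("class"))
    # Assumption: every remaining key is a named constraint.
    return cls(**batch_options)


class DemoBatch:
    """Hypothetical batch class that only records its constraints."""

    def __init__(self, **constraints):
        self.constraints = constraints


demo = build_batch({
    "format": "xml",
    "batch": {"class": DemoBatch, "some_constraint": 100, "another_constraint": 50},
})
```

Copying the 'batch' dict before popping 'class' keeps the user's settings untouched, so the same feed options can be reused when a later batch is started.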


A signal can be used to stop the current batch and start a new one from the spider itself. The user would define a method such as self.trigger_batch(feed_uri) that sends the proposed signals.stop_batch signal with the feed’s URI as its argument. FeedExporter can then intercept the signal and call the appropriate method to stop the current batch and start a new one for the specified feed.

Example custom spider code:

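from scrapy import Spider, signals


class MySpider(Spider):
    # ...
    # user spider code
    # ...

    def trigger_batch(self, uri):
        # signals.stop_batch is the signal proposed in this issue; it does
        # not exist in Scrapy yet. Signals are sent through crawler.signals.
        self.crawler.signals.send_catch_log(
            signal=signals.stop_batch,
            uri=uri,
        )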
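
On the receiving side, the handler would need to map the URI back to the right feed. The sketch below shows that lookup without depending on Scrapy; SlotStub, ExporterSketch, and on_stop_batch are hypothetical names, and a real FeedExporter would also have to close and store the current file.

```python
class SlotStub:
    """Hypothetical stand-in for the per-feed state kept by the exporter."""

    def __init__(self, uri):
        self.uri = uri
        self.batch_id = 1
        self.item_count = 0


class ExporterSketch:
    """Keeps one slot per feed URI and rotates batches on request."""

    def __init__(self, uris):
        self.slots = {uri: SlotStub(uri) for uri in uris}

    def on_stop_batch(self, uri):
        # This method is what would be connected to the proposed signal.
        slot = self.slots.get(uri)
        if slot is None:
            return False  # unknown feed URI: nothing to rotate
        slot.batch_id += 1
        slot.item_count = 0
        return True


exporter = ExporterSketch(["items1.json", "items2.xml"])
exporter.slots["items1.json"].item_count = 42
rotated = exporter.on_stop_batch("items1.json")
```

Only the feed whose URI was sent is rotated; the other feed keeps its current batch.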

Describe alternatives you’ve considered

Customizing the current codebase so that users can create their own triggers would be cumbersome and inconvenient.

Additional context

This feature proposal is part of a GSoC project (see #4963). This issue has been created to gather input from the Scrapy community and refine the proposed feature.

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

1 reaction
Gallaecio commented, May 31, 2021

At the moment, we have the following API:

    '…': {
        'batch_item_count': 10,
    },

And there’s also the FEED_EXPORT_BATCH_ITEM_COUNT setting.

I mention these 2 things just to emphasize that, even if we go for a different API now, we need to keep these in mind because we must continue to support them for backward compatibility. In order to do that, the API of batch classes will need access to feed options and settings.
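
One way to picture that compatibility layer is a small function that folds the existing option and setting into the shape a batch class would receive. This is only a sketch: the function name, the use of plain dicts for settings, and the rule that an explicit 'batch' entry wins over the legacy keys are assumptions. The setting is spelled FEED_EXPORT_BATCH_ITEM_COUNT, as in Scrapy's documentation.

```python
def normalize_batch_options(feed_options, settings):
    """Translate legacy item-count configuration into a 'batch' dict."""
    if "batch" in feed_options:
        # Assumption: the new-style entry takes precedence when present.
        return dict(feed_options["batch"])
    # The per-feed option overrides the project-wide setting.
    item_count = feed_options.get(
        "batch_item_count",
        settings.get("FEED_EXPORT_BATCH_ITEM_COUNT", 0),
    )
    if item_count:
        return {
            "class": "scrapy.utils.feedbatch.BatchByXItems",  # proposed path
            "item_count": item_count,
        }
    return None  # batching disabled


legacy = normalize_batch_options(
    {"batch_item_count": 10},
    {"FEED_EXPORT_BATCH_ITEM_COUNT": 500},
)
```

Here the per-feed value of 10 is used rather than the setting's 500, matching how the existing option overrides the setting today.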

Now, for the new API:

I don’t think FEED_BATCH_TRIGGER_BASE is necessary. We should make the API as simple as possible, and having a setting just to define aliases to allow for shortening strings seems unnecessary. In other words, 'batch_constraint': ('scrapy.utils.feedbatch.BatchByXItems', 10) would be OK. And in code, it could be just 'batch_constraint': (BatchByXItems, 10).

Also, I think parameters for batch classes should be arbitrary (you should be able to have any number of parameters) and be named rather than positional. So how about something more like:

    '…': {
        'batch': {
            'class': BatchByXItems,
            'item_count': 10,
        },
    },

This would make it possible, for example, to implement a single batch class that supports all of the suggested criteria (i.e. whichever condition is met first would trigger a new batch):

    '…': {
        'batch': {
            'class': Batch,
            'item_count': 10,
            'file_size': 10,
            'seconds': 10,
        },
    },
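
A single class along those lines could look roughly like the sketch below. The class name, the update signature, and the units (file_size in bytes, seconds as elapsed wall time) are assumptions; the injectable clock is there only so the time limit can be exercised without waiting.

```python
import time


class MultiConstraintBatch:
    """Hypothetical batch that triggers on whichever limit is hit first."""

    def __init__(self, item_count=None, file_size=None, seconds=None,
                 clock=time.monotonic):
        # A limit left as None is simply not checked.
        self.limits = {
            "item_count": item_count,
            "file_size": file_size,
            "seconds": seconds,
        }
        self.clock = clock
        self.batch_id = 1
        self._reset()

    def _reset(self):
        self.items = 0
        self.bytes_written = 0
        self.started = self.clock()

    def update(self, item_size):
        """Record one exported item of item_size bytes."""
        self.items += 1
        self.bytes_written += item_size

    def should_trigger(self):
        current = {
            "item_count": self.items,
            "file_size": self.bytes_written,
            "seconds": self.clock() - self.started,
        }
        return any(
            limit is not None and current[name] >= limit
            for name, limit in self.limits.items()
        )

    def new_batch(self):
        self.batch_id += 1
        self._reset()


# A fake clock makes the time-based limit deterministic in this example.
now = [0.0]
multi = MultiConstraintBatch(item_count=10, file_size=100, seconds=10,
                             clock=lambda: now[0])
multi.update(40)
before = multi.should_trigger()   # 1 item, 40 bytes, 0 s: no limit reached
multi.update(70)
after = multi.should_trigger()    # 110 bytes reaches the file_size limit
```

Keeping the limits in one dict means a new criterion only needs a new key and a matching entry in should_trigger.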

0 reactions
drs-11 commented, Jul 13, 2021

Modified the API a little and added a new method, get_batch_state. This could be helpful in designing the signal trigger as well as internally in FeedExporter.

One more thing I’d like to add is support for human-readable units in batch constraints. For example:

{
    'items1.json': {
        'format': 'json',
        'batch': {
             'class': BatchByXItems,
             'file_size': '100 MB',
        },
    },
    'items2.xml': {
        'format': 'xml',
        'batch': {
             'class': BatchByXDuration,
             'duration': '3 hour',
        },
    },
}

Nothing major, but I thought this could make those classes a little easier to use.
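
Parsing such values could be sketched as follows. The function names, the accepted unit spellings, and the choice of 1024-based multiples for sizes are all assumptions; the proposal does not specify them.

```python
import re

# Assumption: binary multiples (1 KB = 1024 bytes).
SIZE_UNITS = {"B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}
DURATION_UNITS = {"second": 1, "minute": 60, "hour": 3600, "day": 86400}

_VALUE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*$")


def _split(text):
    """Split '100 MB' into (100.0, 'MB'); reject anything else."""
    match = _VALUE.match(text)
    if match is None:
        raise ValueError(f"cannot parse constraint: {text!r}")
    return float(match.group(1)), match.group(2)


def parse_size(text):
    """Return the size in bytes for strings such as '100 MB'."""
    number, unit = _split(text)
    if unit.upper() not in SIZE_UNITS:
        raise ValueError(f"unknown size unit: {unit!r}")
    return int(number * SIZE_UNITS[unit.upper()])


def parse_duration(text):
    """Return the duration in seconds for strings such as '3 hour'."""
    number, unit = _split(text)
    key = unit.lower().rstrip("s")  # accept both 'hour' and 'hours'
    if key not in DURATION_UNITS:
        raise ValueError(f"unknown duration unit: {unit!r}")
    return number * DURATION_UNITS[key]


size_limit = parse_size("100 MB")
time_limit = parse_duration("3 hour")
```

Converting once, when the batch class is created, keeps should_trigger a plain numeric comparison.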
