Feeds Enhancement: Batch Delivery Triggers
Summary
Batch creation is a recently introduced feature, but it is currently limited to an item-count constraint. Batch triggers will make these constraints flexible and give users more control over how batches are created.
Motivation/Proposal
Batch delivery can be simplified and made extensible by introducing a `Batch` class. It will hold information about the current batch, so it can detect when a given constraint has been exceeded and a new batch needs to be created. `BatchPerXItems`, `BatchPerXMins`, and `BatchPerXBytes` can be provided as built-ins, each inheriting from the parent class `Batch`. A parameter value `para_val` is stored in the `Batch` class and compared against the declared `constraint`. For `BatchPerXItems`, `para_val` can be the total number of items currently in the batch; for `BatchPerXMins`, it can be the number of minutes elapsed since the batch was created. As a result, some `update` calls that refresh `para_val` may not require any information from `slot_file`.
`Batch` class prototype:

```python
class Batch:
    """
    Stores information for the current batch and provides suitable
    methods to check and update batch info.

    :param file: file of the current batch
    :param constraint: a constraint or limit used to decide when a new
        batch must be created.
    :param para_val: a parameter value which is updated and compared
        against the constraint to control batch creation.
    :param batch_id: id number of the current batch.
    """

    def __init__(self, feed_options):
        self.feed_options = feed_options
        self.constraint = self.feed_options.get("constraint")
        self.para_val = 0
        self.batch_id = 1

    def update(self):
        """
        Updates the parameter value according to stats related to the
        parameter value and constraint.
        """

    def should_trigger(self):
        """
        Checks whether para_val has crossed the constraint.

        :return: ``True`` if para_val has crossed the constraint, else ``False``
        :rtype: bool
        """

    def new_batch(self, file):
        """
        Resets the parameter value back to its initial value, increments
        self.batch_id, and assigns ``file`` to the current batch's file
        attribute. Will be used before starting a new batch.
        """

    def get_batch_state(self):
        """
        Returns a dict containing batch attributes with their current values.
        """
```
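To make the prototype above concrete, here is a hedged sketch of what a built-in trigger such as `BatchPerXItems` might look like. The method bodies are assumptions, not final implementations; the `item_count` option name is borrowed from the settings example below, and a minimal copy of the proposed base class is included so the sketch runs standalone.

```python
class Batch:
    """Minimal stand-in for the proposed base class (sketch only)."""

    def __init__(self, feed_options):
        self.feed_options = feed_options
        self.constraint = self.feed_options.get("constraint")
        self.para_val = 0
        self.batch_id = 1

    def update(self):
        pass

    def should_trigger(self):
        return False

    def new_batch(self, file):
        # Start a fresh batch: remember its file, reset the counter,
        # and bump the batch id.
        self.file = file
        self.para_val = 0
        self.batch_id += 1

    def get_batch_state(self):
        return {"para_val": self.para_val, "batch_id": self.batch_id}


class BatchPerXItems(Batch):
    """Sketch: triggers a new batch once a fixed number of items is reached."""

    def __init__(self, feed_options):
        super().__init__(feed_options)
        # "item_count" mirrors the option used in the settings example below.
        self.constraint = feed_options.get("item_count")

    def update(self):
        # Called once per exported item; needs no slot_file information.
        self.para_val += 1

    def should_trigger(self):
        return self.constraint is not None and self.para_val >= self.constraint
```

Note that, as suggested above, this trigger tracks everything locally: `update` simply counts items, so the feed exporter only has to call it per item and check `should_trigger` afterwards.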
The desired `Batch` class can then be activated in `settings.py` with a constraint. To help users create complex triggers, the constraint may be any built-in type or arbitrary object that suits their needs. If no constraint is set, there is no point in loading the specified `Batch` class. Users can add their own custom `Batch` class by specifying its class path.
`settings.py` example:

```python
from myproject.customclassfile import CustomBatch
from scrapy.utils.feedbatch import BatchByXItems  # one of the built-in triggers

FEEDS = {
    'items1.json': {
        'format': 'json',
        'batch': {
            'class': BatchByXItems,
            'item_count': 100,
        },
    },
    'items2.xml': {
        'format': 'xml',
        'batch': {
            'class': CustomBatch,
            'some_constraint': 100,
            'another_constraint': 50,
        },
    },
}
```
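Since users may specify their custom class by class path, the exporter would need to resolve that path; in Scrapy this could reuse the existing `scrapy.utils.misc.load_object` helper. As a standalone illustration, here is a minimal equivalent, assuming the `'class'` key may hold either a class object or a dotted-path string (the helper name `load_batch_class` is hypothetical):

```python
from importlib import import_module


def load_batch_class(value):
    """Accept either a class object or a dotted-path string such as
    'scrapy.utils.feedbatch.BatchByXItems' and return the class."""
    if isinstance(value, str):
        module_path, _, name = value.rpartition('.')
        return getattr(import_module(module_path), name)
    return value  # already a class object
```

This keeps the two settings styles shown above interchangeable: the string form for `settings.py` files that avoid imports, the class form for code.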
To stop the current batch and start a new one from the spider itself, a signal can be used. This will require the user to create a method such as `self.trigger_batch(feed_uri)`, which sends the `signals.stop_batch` signal with the feed's URI as argument; `FeedExporter` can then intercept the signal and call a method that stops the current batch and starts a new one for the specified feed.
Example custom spider code:

```python
class MySpider(Spider):
    # ...
    # user spider code
    # ...

    def trigger_batch(self, uri):
        # signals.stop_batch is the new signal proposed above
        self.crawler.signals.send_catch_log(
            signal=signals.stop_batch,
            uri=uri,
        )
```
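On the receiving side, `FeedExporter` would connect a handler to the proposed signal (in Scrapy, via `crawler.signals.connect(...)`). The following is a standalone sketch of that dispatch flow; the stand-in dispatcher replaces Scrapy's `SignalManager`, and the `_rotate_batch` handler name is hypothetical:

```python
stop_batch = object()  # signal token, in the style Scrapy defines signals


class FakeSignalManager:
    """Stand-in for scrapy.signalmanager.SignalManager (sketch only)."""

    def __init__(self):
        self._handlers = {}

    def connect(self, receiver, signal):
        self._handlers.setdefault(signal, []).append(receiver)

    def send_catch_log(self, signal, **kwargs):
        for receiver in self._handlers.get(signal, []):
            receiver(**kwargs)


class FeedExporterSketch:
    def __init__(self, signals):
        self.rotated = []  # record of URIs whose batch was restarted
        signals.connect(self._rotate_batch, signal=stop_batch)

    def _rotate_batch(self, uri):
        # The real exporter would close the current batch file for `uri`
        # here and start a new one via Batch.new_batch().
        self.rotated.append(uri)
```

With this wiring, a spider's `trigger_batch('items1.json')` call reaches the exporter's handler with the feed URI, which is all it needs to rotate that one feed's batch.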
Describe alternatives you’ve considered
Customizing the current codebase so that users can create their own triggers would be very cumbersome and inconvenient.
Additional context
This feature proposal is part of a GSoC project (see #4963). This issue has been created to get inputs from the Scrapy community to refine the proposed feature.
Issue Analytics: created 2 years ago; 7 comments (7 by maintainers).
Top GitHub Comments
At the moment, we have the following API:

And there's also the `FEED_EXPORT_BATCH_ITEM_COUNT` setting. I mention these two things just to emphasize that, even if we go for a different API now, we need to keep them in mind, because we must continue to support them for backward compatibility. In order to do that, the batch class API will need access to feed options and settings.
Now, for the new API:

I don't think `FEED_BATCH_TRIGGER_BASE` is necessary. We should keep the API as simple as possible, and having a setting just to define aliases that allow shortening strings seems unnecessary. In other words, `'batch_constraint': ('scrapy.utils.feedbatch.BatchByXItems', 10)` would be OK, and in code it could be just `'batch_constraint': (BatchByXItems, 10)`.

Also, I think parameters for batch classes should be arbitrary (you should be able to have any number of parameters) and named rather than positional. So how about something more like:
This would allow, for example, implementing a single batch class that supports all the suggested criteria at once (i.e. whichever condition happens first triggers a new batch):
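As an illustration of the named-parameter idea, here is a hedged sketch of such a combined trigger. The class name `CombinedBatchTrigger` and the parameter names (`item_count`, `max_seconds`, `max_bytes`) are assumptions; the point is only that any subset of named limits can be passed, and whichever is crossed first triggers a new batch:

```python
import time


class CombinedBatchTrigger:
    """Sketch: one trigger accepting arbitrary named constraints;
    whichever limit is crossed first starts a new batch."""

    def __init__(self, item_count=None, max_seconds=None, max_bytes=None):
        self.item_count = item_count
        self.max_seconds = max_seconds
        self.max_bytes = max_bytes
        self.items = 0
        self.bytes = 0
        self.started = time.monotonic()

    def update(self, item_bytes=0):
        # Called once per exported item, with the item's serialized size.
        self.items += 1
        self.bytes += item_bytes

    def should_trigger(self):
        # True as soon as any configured limit is reached.
        return any([
            self.item_count is not None and self.items >= self.item_count,
            self.max_bytes is not None and self.bytes >= self.max_bytes,
            self.max_seconds is not None
            and time.monotonic() - self.started >= self.max_seconds,
        ])
```

Unset limits are simply ignored, so `CombinedBatchTrigger(item_count=100)` behaves like a plain item-count trigger.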
Modified the API a little and added a new method, `get_batch_state`. This could be helpful in designing the signal trigger, as well as internally in the feed exporter.

One more addition I'd like to propose is human-readable units for batch constraints. So for example:

Nothing major, but I thought this could make using those classes a little easier.
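For byte-based constraints, the human-readable-units idea could be backed by a small parsing helper. The following sketch is an assumption about how that might look: the helper name `parse_size`, the accepted unit names, and the base-2 multipliers are all illustrative choices, not a settled API:

```python
import re

# Assumed unit table; base-2 multipliers are an illustrative choice.
_SIZE_UNITS = {'b': 1, 'kb': 1024, 'mb': 1024 ** 2, 'gb': 1024 ** 3}


def parse_size(value):
    """Turn a human-readable size such as '100MB' into a byte count.
    Plain integers pass through unchanged."""
    if isinstance(value, int):
        return value
    match = re.fullmatch(r'\s*(\d+)\s*([kmg]?b)\s*', value, re.IGNORECASE)
    if not match:
        raise ValueError(f"unrecognized size: {value!r}")
    number, unit = match.groups()
    return int(number) * _SIZE_UNITS[unit.lower()]
```

A trigger could then accept either `'max_bytes': 104857600` or `'max_bytes': '100MB'` and normalize the value once at construction time. A similar helper could cover time units (`'30min'`, `'2h'`) for time-based triggers.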