Feeds Enhancement: Post-Processing
Summary
A feed post-processing enhancement will enable plugins such as compression, minification, and beautification, which can then be added to the feed exporting workflow.
Motivation/Proposal
A post-processing feature can help add more extensions dedicated to pre-export processing to Scrapy. To achieve extensibility, a PostProcessingManager can be used, which will use "plugin"-like Scrapy components to process the data before writing it to the target files. The PostProcessingManager can act as a wrapper around the slot's storage, so whenever a write event takes place, the data is run through the plugins in a pipeline before being written to the target file. Any number of plugins can be created, but the order in which they are used must be specified, since some cannot process data that has already been processed by another (e.g., minification will not work on compressed data). These plugins will be required to implement a common interface so that the PostProcessingManager can use them without breaking on unidentified components.
A few built-in plugins can be provided, such as compression plugins for gzip, lzma, and bz2.
PostProcessingManager class prototype:
```python
class PostProcessingManager:
    """
    This will manage and use declared plugins to process data in a
    pipeline.

    :param plugins: all the declared plugins for the uri
    :type plugins: list
    :param file: target file whose data will be processed before write
    :type file: file-like object
    """

    def __init__(self, plugins, file):
        # 1) load the plugins here
        # 2) save file as an attribute
        ...

    def write(self, data):
        """
        Uses all the declared plugins to process data first, then writes
        the processed data to the target file.

        :param data: data passed to be written to target file
        :type data: bytes
        :return: returns number of bytes written
        :rtype: int
        """

    def close(self):
        """
        Close the target file along with all the plugins.
        """
```
PostProcessorPlugin class interface:
```python
class PostProcessorPlugin(Interface):
    """
    Interface for plugins that will be used by PostProcessingManager.
    This will provide the necessary processing method.
    """

    def __init__(self, file, feed_options):
        """
        Initialize plugin with the target file to which post-processed
        data will be written and the feed-specific options.
        """

    def write(self, data):
        """
        Exposed method which will take data passed, process it and then
        write it to the target file.

        :param data: data passed to be written to target file
        :type data: bytes
        :return: returns number of bytes written
        :rtype: int
        """

    def close(self):
        """
        Closes this plugin wrapper.
        """

    @staticmethod
    def process(data):
        """
        This will process the data and return it.

        :param data: input data
        :type data: bytes
        :return: processed data
        :rtype: bytes
        """
```
GzipPlugin example:
```python
import gzip

from zope.interface import implementer


@implementer(PostProcessorPlugin)
class GzipPlugin:
    COMPRESS_LEVEL = 9

    def __init__(self, file, feed_options):
        # initialise various parameters for gzipping
        self.file = gzip.GzipFile(fileobj=file, mode=file.mode,
                                  compresslevel=self.COMPRESS_LEVEL)

    def write(self, data):
        return self.file.write(data)

    def close(self):
        self.file.close()

    @staticmethod
    def process(data):
        # A staticmethod has no self; reference the class attribute instead.
        return gzip.compress(data, compresslevel=GzipPlugin.COMPRESS_LEVEL)
```
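As a sanity check of the one-shot `process` path, the standard library's `gzip` helpers round-trip as expected (this mirrors what `GzipPlugin.process` does, independent of the class):

```python
import gzip

COMPRESS_LEVEL = 9  # mirrors GzipPlugin.COMPRESS_LEVEL

payload = b"scrapy feed data " * 100
compressed = gzip.compress(payload, compresslevel=COMPRESS_LEVEL)

# The data survives the round trip, and repetitive data compresses well.
assert gzip.decompress(compressed) == payload
assert len(compressed) < len(payload)
```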
settings.py example:
```python
from myproject.pluginfile import MyPlugin

FEEDS = {
    'item1.json': {
        'format': 'json',
        'post-processing': ['gzip'],
    },
    'item2.xml': {
        'post-processing': [MyPlugin, 'xz'],  # order is important
    },
}

POST_PROC_PLUGINS_BASE = {
    'gzip': 'scrapy.utils.postprocessors.GzipPlugin',
    'xz': 'scrapy.utils.postprocessors.LZMAPlugin',
    'bz2': 'scrapy.utils.postprocessors.Bz2Plugin',
}
```
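Presumably the short names in `post-processing` would be resolved through `POST_PROC_PLUGINS_BASE` (Scrapy already has `scrapy.utils.misc.load_object` for dotted paths). A plain-stdlib sketch of that lookup, with `gzip.GzipFile` standing in for the proposed plugin class:

```python
import importlib

# Registry of short names to dotted paths; gzip.GzipFile stands in for
# the proposed scrapy.utils.postprocessors.GzipPlugin.
PLUGINS_BASE = {
    'gzip': 'gzip.GzipFile',
}


def load_plugin(spec):
    """Resolve a short name or dotted path to a class; pass classes through."""
    if not isinstance(spec, str):
        return spec  # already a class, e.g. MyPlugin in the FEEDS example
    path = PLUGINS_BASE.get(spec, spec)
    module_path, _, attr = path.rpartition('.')
    return getattr(importlib.import_module(module_path), attr)


assert load_plugin('gzip').__name__ == 'GzipFile'
assert load_plugin('io.BytesIO').__name__ == 'BytesIO'
```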
Describe alternatives you’ve considered
This feature idea is actually an expansion of compression support (see #2174). Item Pipelines could be used for compression as well, but implementing this feature instead gives users more post-processing options while making it easier to activate those post-processing components for specific feeds.
Additional context
This feature proposal is part of a GSoC project (see #4963). This issue has been created to get inputs from the Scrapy community to refine the proposed feature.
Issue Analytics
- Created 2 years ago
- Comments: 12 (11 by maintainers)
Top GitHub Comments
While namespacing plugin settings into a dictionary sounds reasonable in either of the approaches you suggest, I think the simplest approach, and the one most in line with existing code, is to let plugin-specific options be defined directly among feed_options, with the namespace built into each setting name. This goes in line with the approach of Scrapy settings, where namespaces are built into the setting names. It also makes it easy for one setting to be interpreted by more than one plugin, which is not something I would generally recommend, but I believe use cases for that could exist.
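The example in this comment was lost in extraction; a hedged sketch of what such namespaced options might look like (the option name `gzip_compresslevel` is an assumption for illustration, not confirmed by this thread):

```python
# Illustrative sketch: plugin-specific options live directly in the
# feed's options dict, namespaced by the plugin name in the key itself.
# 'gzip_compresslevel' is a hypothetical option name.
FEEDS = {
    'items.json.gz': {
        'format': 'json',
        'post-processing': ['gzip'],
        'gzip_compresslevel': 5,  # read by the gzip plugin only
    },
}

# A plugin receiving feed_options could then read its own namespace:
compresslevel = FEEDS['items.json.gz'].get('gzip_compresslevel', 9)
assert compresslevel == 5
```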
This comment is exclusively about the plugin interface, which I think is the most important one (the manager will be an internal component, but users will write their own plugins).
I don’t think the processing method should be static. Plugins should be Scrapy components (i.e. it should be possible to add a from_crawler method to them to initialize them with access to settings), and hence they could read custom settings that condition their behavior. I think we need a regular method that can access object variables.
And I think we should make the plugin interface similar to, for example, zipfile.ZipFile, so that plugins can be chained by wrapping each one around the next.
Of course, with an arbitrary number of plugins, the manager code would have to actually call the open and close methods of those plugins, instead of using with, but I hope this helps visualize the main idea: plugins should get write calls, and in those calls they should be able to write into the next plugin (or the final file, in the case of the last plugin), but not required to (e.g. some plugins may store the input data internally and write it all at once into the next plugin when the input data stops [when the close method gets called]).
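The chaining example from this comment was also lost in extraction; a hedged reconstruction of the idea, where a hypothetical BufferingPlugin illustrates the "store internally, write everything on close" behavior described above, and gzip.GzipFile plays the role of an inner compression plugin:

```python
import gzip
import io


class BufferingPlugin:
    """Hypothetical plugin: buffers input and writes it all on close."""

    def __init__(self, file):
        self.file = file
        self.chunks = []

    def write(self, data):
        # Not required to write through immediately: store internally.
        self.chunks.append(data)
        return len(data)

    def close(self):
        # Flush everything into the next plugin, then close it.
        self.file.write(b"".join(self.chunks))
        self.file.close()


raw = io.BytesIO()
# Closing a GzipFile finalizes the stream without closing the
# underlying fileobj, so raw remains readable afterwards.
gz = gzip.GzipFile(fileobj=raw, mode='wb')
chain = BufferingPlugin(gz)  # outer plugin writes into the inner one
chain.write(b"feed ")
chain.write(b"data")
chain.close()

assert gzip.decompress(raw.getvalue()) == b"feed data"
```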