
Feeds Enhancement: Post-Processing

Summary

A feed post-processing enhancement will enable plugins such as compression, minification, and beautification, which can then be added to the feed exporting workflow.

Motivation/Proposal

A post-processing feature would let Scrapy support extensions dedicated to processing feed data before export. To achieve extensibility, a PostProcessingManager can be used, which will use "plugin"-like Scrapy components to process the data before writing it to the target files.

The PostProcessingManager can act as a wrapper around the slot's storage, so that whenever a write event takes place, the data is run through the plugins in a pipeline-like way before being written to the target file.

A number of plugins can be created, but the order in which they are applied will need to be specified, as some won't be able to process data that has already been processed by another (e.g. minifying won't work on a compressed file). These plugins will be required to implement a common interface so that the PostProcessingManager can use them without breaking on unidentified components.

A few built-in plugins can be provided, such as compression plugins for gzip, lzma, and bz2.
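All three compression formats are already covered by the standard library, which is presumably what the built-in plugins would wrap. A quick round trip through each:

```python
import bz2
import gzip
import lzma

# Round-trip a sample payload through each stdlib compressor to show
# the primitives the built-in compression plugins could wrap.
raw = b"example feed bytes"
assert gzip.decompress(gzip.compress(raw)) == raw
assert lzma.decompress(lzma.compress(raw)) == raw
assert bz2.decompress(bz2.compress(raw)) == raw
```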

PostProcessingManager class prototype:

class PostProcessingManager:
    """
    This will manage and use declared plugins to process data in a
    pipeline.
    :param plugins: all the declared plugins for the uri
    :type plugins: list
    :param file: target file whose data will be processed before write
    :type file: file like object
    """

    def __init__(self, plugins, file):
        # 1) load the plugins here
        # 2) save file as an attribute
        ...

    def write(self, data):
        """
        Uses all the declared plugins to process data first, then writes
        the processed data to target file. 
        :param data: data passed to be written to target file
        :type data: bytes
        :return: returns number of bytes written
        :rtype: int
        """

    def close(self):
        """
        Close the target file along with all the plugins.
        """

PostProcessorPlugin class interface:

class PostProcessorPlugin(Interface):
    """
    Interface for plugins that will be used by PostProcessingManager. This will
    provide necessary processing method.
    """

    def __init__(self, file, feed_options):
        """
        Initialize plugin with target file to which post-processed
        data will be written and the feed-specific options.
        """

    def write(self, data):
        """
        Exposed method which will take data passed, process it and then
        write it to target file.
        :param data: data passed to be written to target file
        :type data: bytes
        :return: returns number of bytes written
        :rtype: int
        """

    def close(self):
        """
        Closes this plugin wrapper.
        """

    @staticmethod
    def process(data):
        """
        This will process the data and return it.
        :param data: input data
        :type data: bytes
        :return: processed data
        :rtype: bytes
        """

GzipPlugin example:

@implementer(PostProcessorPlugin)
class GzipPlugin:
    COMPRESS_LEVEL = 9

    def __init__(self, file, feed_options):
        # initialise various parameters for gzipping
        self.file = gzip.GzipFile(fileobj=file, mode=file.mode,
                                  compresslevel=self.COMPRESS_LEVEL)

    def write(self, data):
        return self.file.write(data)

    def close(self):
        self.file.close()

    @staticmethod
    def process(data):
        # self is not available in a static method; use the class attribute
        return gzip.compress(data, compresslevel=GzipPlugin.COMPRESS_LEVEL)
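For illustration, a standalone variant of the plugin above can be exercised directly against an in-memory buffer (the mode is fixed to "wb" here, since io.BytesIO has no .mode attribute, which the example's mode=file.mode would rely on):

```python
import gzip
import io


class GzipPlugin:
    """Standalone sketch of the plugin above; mode fixed to 'wb'."""

    COMPRESS_LEVEL = 9

    def __init__(self, file):
        self.file = gzip.GzipFile(fileobj=file, mode="wb",
                                  compresslevel=self.COMPRESS_LEVEL)

    def write(self, data):
        return self.file.write(data)

    def close(self):
        # Flushes the gzip trailer; GzipFile does not close the fileobj.
        self.file.close()


buf = io.BytesIO()
plugin = GzipPlugin(buf)
plugin.write(b"some scraped items")
plugin.close()
assert gzip.decompress(buf.getvalue()) == b"some scraped items"
```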

settings.py example:

from myproject.pluginfile import MyPlugin

FEEDS = {
    'item1.json': {
        'format': 'json',
        'post-processing': ['gzip'],
    },
    'item2.xml': {
        'post-processing': [MyPlugin, 'xz'],    # order is important
    },
}


POST_PROC_PLUGINS_BASE = {
    'gzip': 'scrapy.utils.postprocessors.GzipPlugin',
    'xz': 'scrapy.utils.postprocessors.LZMAPlugin',
    'bz2': 'scrapy.utils.postprocessors.Bz2Plugin',
}
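A hypothetical resolution step could map each 'post-processing' entry to either a built-in alias or an import path (the dotted paths below come from the proposal and do not exist in Scrapy yet; in practice Scrapy would import the resulting path with scrapy.utils.misc.load_object):

```python
POST_PROC_PLUGINS_BASE = {
    'gzip': 'scrapy.utils.postprocessors.GzipPlugin',
    'xz': 'scrapy.utils.postprocessors.LZMAPlugin',
    'bz2': 'scrapy.utils.postprocessors.Bz2Plugin',
}


def resolve_plugins(declared, base=POST_PROC_PLUGINS_BASE):
    """Turn each declared entry into an import path or class object."""
    resolved = []
    for entry in declared:
        if isinstance(entry, str):
            # Known alias, or a full dotted path given as-is.
            resolved.append(base.get(entry, entry))
        else:
            resolved.append(entry)  # already a class object
    return resolved


class MyPlugin:
    pass


assert resolve_plugins([MyPlugin, 'xz']) == [
    MyPlugin, 'scrapy.utils.postprocessors.LZMAPlugin']
```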

Describe alternatives you’ve considered

This feature idea is actually an expansion of compression support (see #2174). Item pipelines could be used for compression as well, but implementing this feature instead gives users more post-processing options while making it easier to enable those components for specific feeds.

Additional context

This feature proposal is part of a GSoC project (see #4963). This issue has been created to get inputs from the Scrapy community to refine the proposed feature.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 12 (11 by maintainers)

Top GitHub Comments

2 reactions
Gallaecio commented, Jun 24, 2021

While namespacing plugin settings into a dictionary, in either of the approaches you suggest, sounds reasonable, I think the simplest approach, and the one most in line with existing code, is to let plugin-specific options be defined directly among feed_options, with the namespace built into each setting name. For example:

{
    'items.json': {
        'format': 'json',
        'postprocessing': ["scrapy.extensions.feedexport.GZipProcessor"],
        'gzip_compression_level': 1,
    },
}

This goes in line with the approach of Scrapy settings, where namespaces are built into the setting names. It also makes it easy for one setting to be interpreted by more than one plugin, which is not something I would generally recommend, but I believe use cases for that could exist.
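Under this approach, a plugin would simply look up its own namespaced key in feed_options, falling back to a default. A minimal sketch (class and option names are illustrative, taken from the example above):

```python
class GZipProcessor:
    """Hypothetical plugin reading its namespaced option from feed_options."""

    DEFAULT_COMPRESSION_LEVEL = 9

    def __init__(self, file, feed_options):
        self.file = file
        # The namespace lives in the key name itself, as with Scrapy settings.
        self.compression_level = feed_options.get(
            'gzip_compression_level', self.DEFAULT_COMPRESSION_LEVEL)


proc = GZipProcessor(None, {'format': 'json', 'gzip_compression_level': 1})
assert proc.compression_level == 1
```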

1 reaction
Gallaecio commented, May 31, 2021

This comment is exclusively about the plugin interface, which I think is the most important one (the manager will be an internal component, but users will write their own plugins).

I don’t think the processing method should be static. Plugins should be Scrapy components (i.e. it should be possible to add a from_crawler method to them to initialize them with access to settings), and hence they could read custom settings that condition their behavior. I think we need a regular method that can access object variables.

And I think we should make the plugin interface similar to, for example, zipfile.ZipFile:

with Plugin(output_file_object) as plugin:
    plugin.write(data)

So that plugin chaining could look like this:

with Plugin2(output_file_object) as plugin2:
    with Plugin1(plugin2) as plugin1:
        plugin1.write(data)

Of course, with an arbitrary number of plugins, the manager code would have to actually call the open and close methods of those plugins instead of using with, but I hope this example helps visualize the main idea: plugins should get write calls, and in those calls they should be able to write into the next plugin (or the final file, in the case of the last plugin), but not be required to (e.g. some plugins may store the input data internally and write it all at once into the next plugin when the input data stops, i.e. when the close method gets called).
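The chaining idea above can be sketched for an arbitrary number of plugins: wrap the target file in the last plugin, then wrap each earlier plugin around the next, so write() calls flow down the chain. The two toy plugins here exist only to make the composition order visible:

```python
import io


def build_chain(plugin_classes, output_file):
    """Compose plugins so the first declared plugin sees writes first."""
    target = output_file
    for cls in reversed(plugin_classes):
        target = cls(target)  # each plugin writes into the next
    return target


class Prefix:
    """Toy plugin: prepends '>' to every chunk before passing it on."""

    def __init__(self, target):
        self.target = target

    def write(self, data):
        return self.target.write(b">" + data)


class Upper:
    """Toy plugin: uppercases each chunk before passing it on."""

    def __init__(self, target):
        self.target = target

    def write(self, data):
        return self.target.write(data.upper())


buf = io.BytesIO()
chain = build_chain([Prefix, Upper], buf)  # Prefix runs first, then Upper
chain.write(b"ok")
assert buf.getvalue() == b">OK"
```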
