GSoC 2021: Feeds enhancements

This is a single issue to discuss feeds enhancements as a project for GSoC 2021.

My idea is a single project that covers the 3 (or more) improvements detailed below.

1 - Filter items on export based on custom rules

Relevant issues:

There is already a PR for this one (note my last comment there): https://github.com/scrapy/scrapy/pull/4576

However, if the author doesn’t reply in time, we can continue from that branch and just finish the feature.
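
To make the idea concrete, here is a minimal sketch of a rule-based export filter. It does not use Scrapy's API: `PriceFilter`, its `accepts()` method, and `export_items()` are hypothetical names chosen for illustration, and the actual design is whatever the PR above settles on.

```python
from typing import Any, Dict, Iterable, List


class PriceFilter:
    """Hypothetical export filter: accepts only items at or above a minimum price."""

    def __init__(self, min_price: float) -> None:
        self.min_price = min_price

    def accepts(self, item: Dict[str, Any]) -> bool:
        # Items without a price are rejected instead of raising an error.
        price = item.get("price")
        return price is not None and price >= self.min_price


def export_items(
    items: Iterable[Dict[str, Any]], item_filter: PriceFilter
) -> List[Dict[str, Any]]:
    # Stand-in for a feed exporter: keep only the items the filter accepts.
    return [item for item in items if item_filter.accepts(item)]


items = [{"name": "a", "price": 5.0}, {"name": "b", "price": 20.0}, {"name": "c"}]
exported = export_items(items, PriceFilter(min_price=10.0))
```

Keeping the rule in a small object with a single decision method would let each feed have its own filter.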

2 - Compress feeds

This is an old feature request, and there’s an issue for it here: https://github.com/scrapy/scrapy/issues/2174. The API has changed a bit since then, but I think it’d be something like:

FEEDS = {
    "myfile.jl": {
        "compression": "gzip"
    }
}

I think gzip is a good starting point, but we should put some effort into designing an API that is extensible and allows different formats.
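
One possible shape for an extensible design is a registry that maps the value of the `"compression"` option to a compressor function. This is only a sketch built on the standard library: `COMPRESSORS` and `compress_feed()` are hypothetical names, not part of Scrapy, and a real implementation would likely compress a stream rather than a whole byte string.

```python
import bz2
import gzip
import lzma
from typing import Callable, Dict

# Hypothetical registry: supporting a new format means adding one entry.
COMPRESSORS: Dict[str, Callable[[bytes], bytes]] = {
    "gzip": gzip.compress,
    "bz2": bz2.compress,
    "xz": lzma.compress,
}


def compress_feed(data: bytes, feed_options: dict) -> bytes:
    """Compress feed output according to the feed's "compression" option."""
    name = feed_options.get("compression")
    if name is None:
        # No compression requested: pass the data through unchanged.
        return data
    try:
        compressor = COMPRESSORS[name]
    except KeyError:
        raise ValueError(f"Unsupported compression format: {name!r}") from None
    return compressor(data)


payload = b'{"id": 1}\n{"id": 2}\n'
compressed = compress_feed(payload, {"compression": "gzip"})
```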

3 - Spider open/close a batch

Recently we added support for batch delivery in Scrapy: every X items, we deliver a file and open a new one. Sometimes we don’t know the threshold upfront, or it depends on an external signal. In that case, we should be able to trigger a batch delivery from the spider. I have two possible ideas for it:

  • Create a new signal: scrapy.signals.close_batch
  • Add a method that can be called from the spider: spider.feed_exporter.close_batch()

Note that this can be tricky because we allow multiple feeds, so it may require an argument specifying which feed’s batch to close.
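
The multiple-feeds concern can be sketched with a small stand-in class. Everything here is hypothetical: `BatchingExporter` is not Scrapy's feed exporter, and the optional `feed_uri` argument is just one way the "which feed" question could be answered.

```python
from typing import Any, Dict, List, Optional


class BatchingExporter:
    """Hypothetical stand-in for a feed exporter with one open batch per feed."""

    def __init__(self, feed_uris: List[str]) -> None:
        self.open_batches: Dict[str, List[Any]] = {uri: [] for uri in feed_uris}
        self.delivered: Dict[str, List[List[Any]]] = {uri: [] for uri in feed_uris}

    def export_item(self, item: Any) -> None:
        # Every item goes to the currently open batch of every feed.
        for batch in self.open_batches.values():
            batch.append(item)

    def close_batch(self, feed_uri: Optional[str] = None) -> None:
        # With no argument, close the open batch of every feed;
        # otherwise close only the batch of the named feed.
        targets = [feed_uri] if feed_uri is not None else list(self.open_batches)
        for uri in targets:
            batch = self.open_batches[uri]
            if batch:  # Skip empty batches so no empty file is delivered.
                self.delivered[uri].append(batch)
                self.open_batches[uri] = []


exporter = BatchingExporter(["items.jl", "items.csv"])
exporter.export_item({"id": 1})
exporter.export_item({"id": 2})
exporter.close_batch("items.jl")  # Only the .jl feed starts a new batch.
exporter.export_item({"id": 3})
exporter.close_batch()  # Close whatever is still open in every feed.
```

The same targeting question applies to the signal-based option, where the feed would have to be passed as a signal argument.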

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 35 (35 by maintainers)

Top GitHub Comments

2 reactions
Gallaecio commented, Apr 13, 2021

Can you please provide a little more input on what you are expecting from this patch?

I would summarize it as rewriting S3FeedStorage using boto3’s upload_fileobj method, which automatically uses multi-part support for big files.
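
As a rough illustration of that summary, the storage step could reduce to a single `upload_fileobj` call on a boto3 S3 client. The `store_feed()` helper and the bucket and key names are assumptions made for this sketch, not the actual S3FeedStorage code. The client is passed in, so the function itself does not import boto3.

```python
from typing import BinaryIO


def store_feed(s3_client, fileobj: BinaryIO, bucket: str, key: str) -> None:
    """Upload a finished feed file using boto3's managed transfer.

    `s3_client` is expected to be a boto3 S3 client, for example the
    result of boto3.client("s3"). Its upload_fileobj method switches to
    multipart uploads for large files on its own.
    """
    # The feed was just written, so rewind before reading it back.
    fileobj.seek(0)
    s3_client.upload_fileobj(fileobj, bucket, key)


# Usage (requires boto3 and valid AWS credentials; names are placeholders):
#
#     import boto3
#     with open("items.jl", "rb") as f:
#         store_feed(boto3.client("s3"), f, "my-bucket", "feeds/items.jl")
```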

2 reactions
Gallaecio commented, Apr 5, 2021

Let’s see if I can address your questions:

  • First off, the 3 ideas here are suggestions. Just as you can add more improvements to your proposal, you can exclude some of these improvements from your proposal.

  • About flexible batching, it’s not clear to me either what the best way to implement it would be. If we could come up with a way to allow for maximum flexibility here (e.g. imagine you need to call an HTTP API to determine whether to start a new batch), that would be awesome. However, splitting batches by bytes or at specific time intervals should be more straightforward to implement, and I think those are the most common scenarios, so you could go for that instead in your proposal.

  • As for compression and batch delivery, I would go for compressing each file separately. While compressing the whole output may be a valid use case, I believe the main point of batch delivery is to get access early to part of the output of a spider, and compressing all files into a single compressed archive would go against that.
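
The byte-based and time-based splitting mentioned above could be sketched as a small policy object that the exporter consults after each write. `BatchPolicy` and its parameters are hypothetical names for illustration, and the clock is injectable only so the behaviour can be checked without waiting.

```python
import time
from typing import Callable, Optional


class BatchPolicy:
    """Hypothetical rule for starting a new batch by size or elapsed time."""

    def __init__(
        self,
        max_bytes: Optional[int] = None,
        max_seconds: Optional[float] = None,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self._clock = clock
        self.reset()

    def reset(self) -> None:
        # Called whenever a new batch file is opened.
        self.bytes_written = 0
        self.started_at = self._clock()

    def record(self, num_bytes: int) -> None:
        # Called after each item is serialized into the current batch.
        self.bytes_written += num_bytes

    def should_rotate(self) -> bool:
        # Rotate as soon as either configured limit is reached.
        if self.max_bytes is not None and self.bytes_written >= self.max_bytes:
            return True
        if self.max_seconds is not None:
            return self._clock() - self.started_at >= self.max_seconds
        return False
```

A fully flexible variant, such as one that asks an HTTP API, could expose the same `should_rotate()` method, which would keep the exporter unaware of how the decision is made.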

Sorry for the delay on the feedback. From now until the proposal deadline I hope to be available on a daily basis (Mon-Fri).
