GSoC 2021: Feeds enhancements
This is a single issue to discuss feeds enhancements as a project for GSoC 2021.
My idea is to build a project around the 3 (or more) improvements detailed below.
1 - Filter items on export based on custom rules
Relevant issues:
- https://github.com/scrapy/scrapy/issues/4607
- https://github.com/scrapy/scrapy/issues/4575
- https://github.com/scrapy/scrapy/issues/4786
- https://github.com/scrapy/scrapy/issues/3193
There is already a PR for this one (note my last comment there): https://github.com/scrapy/scrapy/pull/4576
However, if the author doesn't respond in time, we can continue the work from that branch and finish the feature ourselves.
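To make the idea concrete, here is a hypothetical sketch of what a per-feed filtering API could look like; the `item_filter` option and the `PriceFilter` class are illustrative, not an existing Scrapy API:

```python
# Hypothetical sketch only: "item_filter" is not an existing feed
# option here; the names below are illustrative.
class PriceFilter:
    def accepts(self, item):
        # Export only items whose price is above zero.
        return item.get("price", 0) > 0

FEEDS = {
    "myfile.jl": {
        "item_filter": PriceFilter,
    },
}
```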
2 - Compress feeds
This is an old feature request, and there's an issue for it here: https://github.com/scrapy/scrapy/issues/2174. The API has changed a bit since then, but I think it'd be something like:
```python
FEEDS = {
    "myfile.jl": {
        "compression": "gzip",
    },
}
```
I think gzip is a good starting point, but we should put some effort into designing an API that is extensible and allows different formats.
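As a starting point, one extensible design could map compression names to callables that wrap the storage's writable file object. This is a minimal sketch, assuming the feed storage exposes a writable binary file object; the names `COMPRESSORS` and `open_compressed` are mine, not Scrapy's:

```python
import gzip

# Hypothetical registry mapping a compression name to a callable that
# wraps a writable binary file object; supporting a new format would
# mean registering a new entry here.
COMPRESSORS = {
    "gzip": lambda f: gzip.GzipFile(fileobj=f, mode="wb"),
}

def open_compressed(file, compression):
    """Wrap `file` with the compressor selected in the feed options."""
    try:
        return COMPRESSORS[compression](file)
    except KeyError:
        raise ValueError(f"unsupported compression: {compression!r}")
```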
3 - Spider open/close a batch
Recently we added support for batch delivery in Scrapy: say, every X items, we deliver a file and open a new one. Sometimes we don't know the threshold upfront, or it may depend on an external signal. In those cases, we should be able to trigger a batch delivery from the spider. I have two possible ideas for it:
- Create a new signal: `scrapy.signals.close_batch`
- Add a method that can be called from the spider: `spider.feed_exporter.close_batch()`
Note that this can be tricky, as we allow multiple feeds (so it may require an argument specifying which feed's batch to close).
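A minimal sketch of the second idea, assuming the spider gets a reference to the feed-exporter extension; neither `feed_exporter` nor `close_batch` exists today, both names are illustrative:

```python
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://example.com/feed"]

    def parse(self, response):
        yield {"url": response.url}
        if b"end-of-day" in response.body:
            # Hypothetical API: close the current batch of one specific
            # feed; the argument is needed because multiple feeds may
            # be configured.
            self.feed_exporter.close_batch("myfile.jl")
```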
Top GitHub Comments
I would summarize it as rewriting `S3FeedStorage` using boto3's `upload_fileobj` method, which automatically uses multipart upload for big files.
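A minimal sketch of that rewrite, assuming the bucket and key have already been parsed from the feed URI; credentials and error handling are omitted, and the class name is illustrative, not Scrapy's actual implementation:

```python
import boto3

class S3FeedStorageSketch:
    """Illustrative only; not Scrapy's actual S3FeedStorage."""

    def __init__(self, bucket, key):
        self.bucket = bucket
        self.key = key
        self.client = boto3.client("s3")

    def store(self, file):
        file.seek(0)
        # upload_fileobj transparently switches to multipart upload
        # for large files.
        self.client.upload_fileobj(file, self.bucket, self.key)
```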
Let's see if I can address your questions:
First off, the 3 ideas here are suggestions: just as you can add more improvements to your proposal, you can also exclude some of these.
About flexible batching, it's not clear to me either what the best way to implement it would be. If we could come up with a design that allows maximum flexibility (e.g. imagine you need to call an HTTP API to determine whether to start a new batch), that would be awesome. However, splitting batches by size in bytes or at specific time intervals should be more straightforward to implement, and those are, I think, the most common scenarios, so you could go for that in your proposal instead.
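For instance, size- and time-based splitting could be exposed as per-feed options alongside the existing `batch_item_count`; the two extra option names below are hypothetical:

```python
FEEDS = {
    "items-%(batch_id)d.jl": {
        "format": "jsonlines",
        "batch_item_count": 10_000,          # existing option
        "batch_size_bytes": 50 * 1024 ** 2,  # hypothetical
        "batch_interval_seconds": 3600,      # hypothetical
    },
}
```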
As for compression and batch delivery, I would go for compressing each file separately. While compressing the whole output may be a valid use case, I believe the main point of batch delivery is to get early access to part of a spider's output, and compressing all files into a single archive would go against that.
Sorry for the delay in the feedback. From now until the proposal deadline, I hope to be available on a daily basis (Mon-Fri).