Batch deliveries for long-running crawlers
See original GitHub issue.

Summary
Add a new setting, FEED_STORAGE_BATCH, that delivers a feed file whenever item_scraped_count reaches a multiple of that number.
Motivation
For long-running jobs (say, a spider consuming inputs from a work queue), we may want partial results instead of waiting for a long batch to finish.
Describe alternatives you’ve considered
We could, of course, stop and restart the spider every now and then. A simpler approach, however, is to keep it running as long as required while delivering partial results along the way.
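The proposed counting and flushing logic can be sketched as a small standalone class (FEED_STORAGE_BATCH is the setting name proposed above; the class and file names here are illustrative, not part of any Scrapy API):

```python
import json
import os


class BatchFeedWriter:
    """Sketch of the proposed behavior: write out a partial feed file
    every `batch_size` scraped items, instead of one file at the end.
    """

    def __init__(self, directory, batch_size):
        self.directory = directory
        self.batch_size = batch_size  # the proposed FEED_STORAGE_BATCH value
        self.item_scraped_count = 0
        self.batch_id = 0
        self.buffer = []

    def item_scraped(self, item):
        """Buffer an item; deliver a batch whenever the count reaches
        a multiple of batch_size."""
        self.buffer.append(item)
        self.item_scraped_count += 1
        if self.item_scraped_count % self.batch_size == 0:
            self.deliver()

    def deliver(self):
        """Flush buffered items to a numbered JSON-lines file."""
        if not self.buffer:
            return None
        self.batch_id += 1
        path = os.path.join(self.directory, f"items-batch-{self.batch_id}.jl")
        with open(path, "w") as f:
            for item in self.buffer:
                f.write(json.dumps(item) + "\n")
        self.buffer.clear()
        return path

    def close(self):
        """Deliver any remaining items when the spider finishes."""
        self.deliver()
```

For reference, Scrapy later shipped this capability as the FEED_EXPORT_BATCH_ITEM_COUNT setting (Scrapy 2.3), which rotates feed files using placeholders such as %(batch_id)d in the feed URI.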
Issue Analytics
- State:
- Created 4 years ago
- Comments: 17 (13 by maintainers)
Top GitHub Comments
@dipiana I don’t know exactly, but when I finish I will let you know in this thread.
@dipiana it has not been merged yet. Before merging, we always update the project documentation with the proper details. Maybe @BroodingKangaroo can include a simple tutorial once it is finished.