
Batch deliveries for long-running crawlers

See original GitHub issue

Summary

Add a new setting FEED_STORAGE_BATCH that will deliver a file whenever item_scraped_count reaches a multiple of that number.
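
For illustration, here is a minimal sketch of how such a setting could be wired up as a Scrapy extension. Only the FEED_STORAGE_BATCH name comes from this proposal; the BatchDeliveryExtension class and its deliver_batch() hook are hypothetical and not the implementation that ended up in Scrapy.

from scrapy import signals
from scrapy.exceptions import NotConfigured


class BatchDeliveryExtension:
    """Counts scraped items and triggers a partial delivery every N items."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.items_seen = 0

    @classmethod
    def from_crawler(cls, crawler):
        batch_size = crawler.settings.getint("FEED_STORAGE_BATCH")
        if not batch_size:
            raise NotConfigured
        ext = cls(batch_size)
        # item_scraped fires once per item that passes the pipelines
        crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)
        return ext

    def item_scraped(self, item, response, spider):
        self.items_seen += 1
        if self.items_seen % self.batch_size == 0:
            self.deliver_batch(spider)

    def deliver_batch(self, spider):
        # Hypothetical hook: close the current feed file and start a new one,
        # so partial results become available without stopping the spider.
        spider.logger.info("Delivering batch after %d items", self.items_seen)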

Motivation

For long-running jobs (say, when we are consuming inputs from a work queue), we may want partial results instead of waiting for a long batch to finish.

Describe alternatives you’ve considered

Of course, we could stop and restart the spider every now and then. However, a simpler approach is to keep it running as long as required while delivering partial results.
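
As a rough usage illustration, a long-running spider could then stay up while exposing partial results with settings along these lines (the extension path and the batch size are assumptions carried over from the sketch above; FEED_STORAGE_BATCH was only a proposed name at the time):

# settings.py -- hypothetical configuration for the proposed behaviour
EXTENSIONS = {
    "myproject.extensions.BatchDeliveryExtension": 500,
}
FEED_STORAGE_BATCH = 10000  # deliver a partial feed file every 10,000 scraped items

For reference, Scrapy has since shipped similar batching support as the FEED_EXPORT_BATCH_ITEM_COUNT setting, paired with a %(batch_id)d placeholder in the feed URI.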

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 17 (13 by maintainers)

Top GitHub Comments

2 reactions
BroodingKangaroo commented, Mar 19, 2020

@dipiana I don’t know exactly, but when I finish I will let you know in this thread.

2 reactions
ejulio commented, Mar 18, 2020

@dipiana It is not merged yet, but before merging we always update the project documentation with the proper details. Maybe @BroodingKangaroo can include a simple tutorial once it is finished.


Top Results From Across the Web

  • batch-get-crawlers — AWS CLI 1.27.28 Command Reference
    Sets the number of files in each leaf folder to be crawled when crawling sample files in a dataset. If not set, all...
  • Monitor your crawler's progress with JEF Monitor – Norconex Inc
    Keeping track of individual crawler execution can be challenging. How many are currently running? For how long? Any of them failed?
  • AWS Batch & Amazon EC2 Spot Instances – YouTube
    Complex analytics, such as log scanning or simulations, typically performed as batch jobs, can be completed cost-effectively with Amazon EC2 ...
  • Setup Kinesis Firehose, S3 and Athena – AWS Workshop Studio
    Before you run your first query, you need to set up a query result location in ... It can capture, transform, and deliver...
  • Demystifying the ways of creating partitions in Glue Catalog ...
    When you have a lot of data in S3, running the crawlers too frequently ... -creation-of-Athena-partitions-for-Firehose-delivery-streams.html ...
