Incremental PartitionedDataset saves
Description
PartitionedDatasets require returning a full dictionary of (partition name, data) pairs, which then gets saved all at once after node execution. This is frustrating when the partitions are large, or when a long-running task fails partway through.
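For contrast, a minimal sketch of the pattern the current API forces (the partition names and dummy frames are illustrative, not from my real pipeline):

    import pandas as pd
    from typing import Dict

    def partition_dataset_writer_today() -> Dict[str, pd.DataFrame]:
        # Every partition must be held in memory until the node returns;
        # the PartitionedDataset then writes the whole dict in one pass.
        results: Dict[str, pd.DataFrame] = {}
        for i in range(10):
            results[f"part_{i}"] = pd.DataFrame({"model_id": [i]})
        return results

If the loop raises on iteration 9, nothing at all gets written to disk.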
Context
- I am creating a deep ensemble by running inference with many models. If I hit a runtime error, I lose all the cached inference results from the models that have already run. This happened to me two days into an inference job, because my cluster’s SSH connection timed out.
- I am doing an ablation study for this ensemble. The number of partitions in one of my PartitionedDatasets increases exponentially with the maximum allowable ensemble size, so I am forced to run this pipeline on a memory-optimized EC2 instance when it could otherwise run on my laptop.
Possible Implementation
Allow nodes writing to a PartitionedDataset to yield results one at a time, e.g.
    import pandas as pd
    from typing import Dict, Iterator

    def partition_dataset_writer() -> Iterator[Dict[str, pd.DataFrame]]:
        for i in range(10):
            # yield each partition as soon as it is ready, so it can be saved
            part = {f"part_{i}": pd.DataFrame(...)}
            yield part
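Note: if I read the docs right, PartitionedDataset's "lazy saving" (returning a dict of zero-argument callables) already gets part of the way there: each partition is materialised only when the dataset writes it, so peak memory stays at roughly one partition, and partitions written before a failure remain on disk. It does not help when the failure happens before the node returns, though. A minimal sketch, with illustrative names:

    from typing import Callable, Dict
    import pandas as pd

    def partition_dataset_writer_lazy() -> Dict[str, Callable[[], pd.DataFrame]]:
        def make_part(i: int) -> Callable[[], pd.DataFrame]:
            # computed lazily at save time, one partition at a time
            return lambda: pd.DataFrame({"model_id": [i]})
        return {f"part_{i}": make_part(i) for i in range(10)}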

Same here, this design limitation is very frustrating. Imagine preprocessing 10K images.

I am facing a similar need; it would be great to have an incremental save option.