Incremental PartitionedDataset saves

Description

PartitionedDatasets require a node to return a full dictionary of (partition name, data) pairs, which then gets saved all at once after node execution. This is frustrating when you have large partitions, or when a long-running task fails.

Context

  1. I am creating a deep ensemble by running inference using many models. If I get a runtime error, I lose all the cached inference results from the already-run models. This happened to me 2 days into an inference job, because my cluster’s ssh connection timed out.
  2. I am doing an ablation study for this ensemble. The number of partitions in one of my PartitionedDatasets increases exponentially with the maximum allowable ensemble size. So, I am forced to run this pipeline on a memory-optimized EC2 instance when it could otherwise run on my laptop.

Possible Implementation

Allow nodes writing to a PartitionedDataset to yield results one at a time, e.g.

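from typing import Dict, Iterator
import pandas as pd

def partition_dataset_writer() -> Iterator[Dict[str, pd.DataFrame]]:
    for _ in range(10):
        # Yield one {partition name: data} pair at a time
        part = {"part_name": pd.DataFrame(...)}
        yield part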
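
To make the proposal concrete, here is a minimal, standard-library-only sketch of how a consumer could persist each yielded partition immediately. It is not Kedro's API: `partition_writer`, `save_incrementally`, the `fail_at` parameter, and the CSV layout are all hypothetical names and choices made for illustration, and plain lists of dicts stand in for DataFrames.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict, Iterator, List

Rows = List[Dict[str, float]]  # stand-in for a DataFrame (assumption)


def partition_writer(n_parts: int, fail_at: int = -1) -> Iterator[Dict[str, Rows]]:
    """Hypothetical node: yields one {partition name: data} pair at a time."""
    for i in range(n_parts):
        if i == fail_at:
            # Simulate a crash partway through a long-running job
            raise RuntimeError(f"simulated failure at partition {i}")
        yield {f"part_{i}": [{"model": i, "score": i * 0.5}]}


def save_incrementally(parts: Iterator[Dict[str, Rows]], out_dir: Path) -> List[str]:
    """Hypothetical saver: writes each partition as soon as it is yielded."""
    out_dir.mkdir(parents=True, exist_ok=True)
    saved: List[str] = []
    for part in parts:
        for name, rows in part.items():
            with open(out_dir / f"{name}.csv", "w", newline="") as f:
                writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
                writer.writeheader()
                writer.writerows(rows)
            saved.append(name)
    return saved


out_dir = Path(tempfile.mkdtemp())
try:
    save_incrementally(partition_writer(5, fail_at=3), out_dir)
except RuntimeError:
    pass  # partitions written before the failure remain on disk

surviving = sorted(p.stem for p in out_dir.glob("*.csv"))
```

In this sketch the simulated failure at the fourth partition leaves `part_0`, `part_1`, and `part_2` on disk, and only one partition is held in memory at a time, which addresses both situations described under Context.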

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions: 10
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

2 reactions
elephantum commented, Dec 23, 2020

Same here, this design limitation is very frustrating.

Imagine preprocessing 10K images.

1 reaction
t00rgore commented, Dec 22, 2020

I am facing a similar need; it would be great to have an incremental save option.
