Incremental PartitionedDataset saves

Description

A node that writes to a PartitionedDataset must return a full dictionary of (partition name, data) pairs, which are then saved all at once after the node finishes executing. This is frustrating when the partitions are large, or when a long-running task fails partway through.
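
To make the limitation concrete, here is a minimal sketch of the all-at-once pattern in plain Python, without Kedro. The names `run_model`, `inference_node`, and `save_all` are hypothetical stand-ins for a node and for the dataset's save step, not Kedro APIs, and the in-memory `saved` dict stands in for files on disk. The only point is that nothing is written until the node has returned every partition.

```python
from typing import Callable, Dict, List

saved: Dict[str, List[int]] = {}  # stands in for partitions written to disk


def run_model(i: int) -> List[int]:
    """Hypothetical inference step; fails on the fourth model."""
    if i == 3:
        raise RuntimeError("connection timed out")
    return [i, i * 2]


def inference_node(n_models: int) -> Dict[str, List[int]]:
    """Builds the whole {partition name: data} dict before returning."""
    return {f"model_{i}": run_model(i) for i in range(n_models)}


def save_all(node: Callable[[], Dict[str, List[int]]]) -> None:
    """Saves only after the node has returned every partition."""
    saved.update(node())


try:
    save_all(lambda: inference_node(5))
except RuntimeError:
    pass

print(len(saved))  # 0: the three finished partitions were never saved
```

The first three models complete, but their results live only in the dictionary being built, so the failure on the fourth discards them.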

Context

I am creating a deep ensemble by running inference with many models. If I get a runtime error, I lose all the cached inference results from the models that have already run. This happened to me 2 days into an inference job, because my cluster's SSH connection timed out.
I am doing an ablation study for this ensemble. The number of partitions in one of my PartitionedDatasets increases exponentially with the maximum allowable ensemble size. So I am forced to run this pipeline on a memory-optimized EC2 instance when it could otherwise run on my laptop.

Possible Implementation

Allow nodes writing to a PartitionedDataset to yield results one at a time, e.g.

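from typing import Dict, Iterator
import pandas as pd

def partition_dataset_writer() -> Iterator[Dict[str, pd.DataFrame]]:
    # Proposed behavior: yield one {partition name: data} dict at a time.
    for i in range(10):
        part = {f"part_{i}": pd.DataFrame(...)}  # distinct name per partition
        yield part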
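
As a rough illustration of what consuming such a generator could look like, the sketch below writes each yielded partition to disk as soon as it arrives. It uses only the standard library and is not Kedro's implementation: `partition_writer`, `save_incrementally`, the `fail_at` parameter, and the CSV layout are all assumptions made for this example, and plain lists of rows stand in for DataFrames.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict, Iterator, List

Rows = List[List[int]]


def partition_writer(n_parts: int, fail_at: int) -> Iterator[Dict[str, Rows]]:
    """Yields one {partition name: rows} dict at a time; fails at `fail_at`."""
    for i in range(n_parts):
        if i == fail_at:
            raise RuntimeError("simulated failure")
        yield {f"part_{i}": [[i, i * i]]}


def save_incrementally(parts: Iterator[Dict[str, Rows]], out_dir: Path) -> None:
    """Writes each yielded partition to its own CSV file immediately."""
    for part in parts:
        for name, rows in part.items():
            with open(out_dir / f"{name}.csv", "w", newline="") as f:
                csv.writer(f).writerows(rows)


out_dir = Path(tempfile.mkdtemp())
try:
    save_incrementally(partition_writer(n_parts=5, fail_at=3), out_dir)
except RuntimeError:
    pass

written = sorted(p.name for p in out_dir.iterdir())
print(written)  # ['part_0.csv', 'part_1.csv', 'part_2.csv']
```

Because each partition is written before the next one is requested, the partitions produced before the failure survive it, and only one partition needs to be held in memory at a time.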

Issue Analytics

State:
Created 3 years ago
Reactions: 10
Comments: 9 (5 by maintainers)

Top GitHub Comments

2 reactions

elephantum commented, Dec 23, 2020

Same here, this design limitation is very frustrating.

Imagine preprocessing 10K images.

1 reaction

t00rgore commented, Dec 22, 2020

I am facing a similar need; it would be great to have an incremental save option.
