Versioned datasets don't work with Prefect

Description

Versioned datasets used as node outputs cause the node to fail when runs are scheduled with Prefect.

Context

I’m trying to run a Kedro pipeline in Prefect. Because some of the output datasets used by nodes in the pipeline are versioned, the nodes that write them fail.

Steps to Reproduce

  1. Follow the Prefect deployment guide to set up Prefect and register the pipeline with Prefect. As a side note, the script on the current page doesn’t work for me and I had to update one of the imports.
  2. Running the pipeline in Prefect works once. Running it a second time or scheduling runs fails with a DataSetError.

Expected Result

The pipeline should run successfully every time it is triggered, not only the first time.

Actual Result

Execution fails with the following error:

kedro.io.core.DataSetError: Save path `/a/b/c/2022-03-29T14.20.46.529Z/xyz` 
for ParquetDataSet(filepath=/a/b/c/xyz, load_args={}, protocol=file, save_args={}, 
version=Version(load=None, save='2022-03-29T14.20.46.529Z')) 
must not exist if versioning is enabled. 

It seems the dataset is reusing the timestamp from when register_prefect_flow.py was executed instead of the actual run time of the pipeline. In this case, I ran the registration script at 14:20 and triggered the pipeline at 14:23, but the timestamp in the error message above corresponds to the script run time, not the trigger time.

I looked around in the Kedro code a bit and it seems the function generating the timestamp is cached, but I’m not sure if that’s all there is to it.
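To make the suspicion above concrete, here is a minimal toy model of the failure mode. It is not Kedro's or Prefect's code: `ToyVersionedDataSet`, `generate_timestamp`, and `simulate_runs` are hypothetical names invented for this sketch, and a Python set stands in for the filesystem. It only assumes that the save version is computed once per dataset object and then cached, and that saving to an already existing versioned path is an error.

```python
import functools
import itertools

# Stand-in for a clock: every call produces a new, distinct "timestamp".
_ticks = itertools.count(1)


def generate_timestamp() -> str:
    """Hypothetical timestamp generator (an assumption, not Kedro's implementation)."""
    return f"ts-{next(_ticks)}"


class ToyVersionedDataSet:
    """Toy versioned dataset that refuses to overwrite an existing version."""

    def __init__(self, filepath: str, existing_paths: set) -> None:
        self.filepath = filepath
        self.existing_paths = existing_paths  # stands in for the filesystem

    @functools.lru_cache(maxsize=None)
    def resolve_save_version(self) -> str:
        # Cached: computed on first use, then reused for every later save
        # made through the same dataset object.
        return generate_timestamp()

    def save(self) -> str:
        path = f"{self.filepath}/{self.resolve_save_version()}"
        if path in self.existing_paths:
            raise FileExistsError(
                f"Save path `{path}` must not exist if versioning is enabled."
            )
        self.existing_paths.add(path)
        return path


def simulate_runs(reuse_dataset: bool, runs: int = 2) -> list:
    """Simulate several flow runs, either reusing one dataset object
    (built once, at 'registration' time) or building a new one per run."""
    existing_paths = set()
    shared = ToyVersionedDataSet("/a/b/c/xyz", existing_paths)
    outcomes = []
    for _ in range(runs):
        dataset = shared if reuse_dataset else ToyVersionedDataSet(
            "/a/b/c/xyz", existing_paths
        )
        try:
            dataset.save()
            outcomes.append("ok")
        except FileExistsError:
            outcomes.append("failed")
    return outcomes
```

In this model, `simulate_runs(reuse_dataset=True)` succeeds on the first run and fails on the second, which matches the reported symptom, while `simulate_runs(reuse_dataset=False)` succeeds both times. If the real cause is similar, building the catalog (or resolving the save version) inside each flow run rather than once at registration would avoid the collision, but that is an inference from the symptom, not a confirmed diagnosis.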

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

  • Kedro version used (pip show kedro or kedro -V): 0.17.7
  • Python version used (python -V): 3.8.12
  • Operating system and version: macOS 11.6.4
  • Prefect version: 1.1.0

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 1
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

1 reaction
avan-sh commented, Apr 19, 2022

This might still be relevant. I think the PR only updates the documentation to the latest Prefect version and doesn’t fix the issue with versioned datasets.

0 reactions
merelcht commented, Aug 11, 2022

@ardoi and @avan-sh, there’s now a PR open that addresses this issue: https://github.com/kedro-org/kedro/pull/1775. It would be awesome to get your review on that!
