Consistent nodes execution order with `SequentialRunner`


Description

It is frustrating that executing an identical pipeline can yield different results. The root cause is that several valid node orderings can exist for the same pipeline. Kedro resolves the DAG by finding one possible ordering, but that ordering is not guaranteed to be the same on every run.

One workaround is to specify the inputs/outputs of nodes so that only one ordering is possible, but this is not ideal because users have to maintain arbitrary dummy variables.

What is missing in the DAGs?

The seed of the random number generator. Consider a simple pipeline with 3 nodes:

A --\
     \
      C
     /
B --/

In this pipeline there are 2 possible execution orders with SequentialRunner: (1) A->B->C and (2) B->A->C. Although there is no strong reason to prefer one over the other, it is better to always stick with one of them, because the output can change with the order.
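To illustrate why the order matters, here is a minimal sketch that does not use Kedro at all. It assumes two hypothetical node functions, `node_a` and `node_b`, that both draw from Python's shared global random number generator, and a hypothetical `run` helper that stands in for a sequential runner.

```python
import random


def node_a():
    # Hypothetical node: draws one value from the shared global RNG.
    return random.random()


def node_b():
    # Hypothetical node: also draws from the same global RNG.
    return random.random()


def run(order, seed=42):
    # Stand-in for a sequential runner: seed once, then run nodes in order.
    random.seed(seed)
    return {fn.__name__: fn() for fn in order}


first = run([node_a, node_b])   # order 1: A -> B
second = run([node_b, node_a])  # order 2: B -> A
again = run([node_a, node_b])   # order 1 repeated

print(first == again)   # same seed and same order give the same outputs
print(first == second)  # same seed but different order gives different outputs
```

With the same seed, each node receives a different draw depending on whether it runs first or second, so a fixed seed alone is not enough for reproducibility unless the execution order is also fixed.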

Context

In data science and machine learning pipelines, setting a seed to ensure reproducible results is very common, and currently there is no easy way to achieve this.

Possible Implementation

Ensure the resolved nodes are sorted so that they always run in the same order with SequentialRunner.
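A rough sketch of what "sorted" could mean here, written in plain Python rather than against Kedro's internals: resolve the graph layer by layer and sort the node names within each layer. The function name `layered_order` and the dictionary format (node name mapped to the set of names it depends on) are assumptions made for this illustration.

```python
def layered_order(dependencies):
    """Return node names in a deterministic topological order.

    `dependencies` maps a node name to the set of node names it depends on.
    """
    remaining = {name: set(deps) for name, deps in dependencies.items()}
    # Include nodes that only appear as dependencies of other nodes.
    for deps in dependencies.values():
        for dep in deps:
            remaining.setdefault(dep, set())

    order = []
    while remaining:
        # All nodes with no unmet dependencies, sorted to break ties by name.
        ready = sorted(name for name, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("circular dependency detected")
        order.extend(ready)
        for name in ready:
            del remaining[name]
        for deps in remaining.values():
            deps.difference_update(ready)
    return order


print(layered_order({"C": {"A", "B"}, "B": set(), "A": set()}))
```

For the three-node pipeline above this always yields A, B, C, regardless of the order in which the nodes were declared. Sorting by name is only one possible tie-breaker; any stable key would do.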


Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments:12 (11 by maintainers)

Top GitHub Comments

datajoely commented, Mar 16, 2022 (2 reactions)

I’ll just add that users ask for this maybe once every two months. I’ve even seen people introduce fake nodes to force ordering.

datajoely commented, Mar 15, 2022 (2 reactions)

So on this: we use the external toposort~=1.5 library. Since Kedro was released, a stdlib module, graphlib (https://docs.python.org/3/library/graphlib.html), has been added that does the same thing. If we do look into making this deterministic, it might be a good opportunity to adopt it too.
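For reference, a minimal sketch of how a deterministic order could be obtained with the stdlib `graphlib` module (Python 3.9+). This is not how Kedro does it; the helper name `deterministic_order` and the sort-by-name tie-break are assumptions for illustration.

```python
from graphlib import TopologicalSorter


def deterministic_order(dependencies):
    """`dependencies` maps a node name to the names it depends on."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    order = []
    while sorter.is_active():
        # get_ready() returns every node whose predecessors are done;
        # sorting that batch removes the ambiguity between A and B.
        ready = sorted(sorter.get_ready())
        order.extend(ready)
        sorter.done(*ready)
    return order


print(deterministic_order({"C": {"A", "B"}}))
```

Calling `static_order()` directly would be shorter, but it gives no control over how ties are broken, so the sketch sorts each ready batch by hand.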


