Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Should we change the output of `session.run`?

See original GitHub issue

Background

What’s the output of session.run()? Currently, this is not clear as you think and it isn’t documented anywhere. The logic is defined in runner.py, this can be counter-intuitive in some cases, is there a good reason why we want to do this?

https://github.com/kedro-org/kedro/blob/f4914201d7f6f38318c2c1f074fcdf802b3e1e0d/kedro/runner/runner.py#L78-L91

kedro has improved a lot in terms of how to run the pipeline with packaging & KedroSession as a standalone application, #1423 documents different ways to do it. Personally, I think it is still not easy enough to integrate with kedro for someone who is inexperienced with kedro. In #1423, It mentioned how a pipeline can be called programmatically. Even though the pipeline itself is a function call, it doesn’t behave like a function, i.e. you can’t really define an input as an argument easily (it has to be a Catalog entry), the output of the pipeline is also very restricted.

Motivation

Kedro works really well within the kedro world, but it also mean that kedro works very differently from the rest of the Python world.

This issue mainly focuses on the output side, this will improve the experience to integrate the kedro pipeline as an upstream. In a over-simplified world, this should be straight forward to do. Currently I think we a strong assumption that people work with “Kedro Project”, but if we are moving towards a kedro package, i.e. using from kedro_package import main, it should behave just like a Python function, I think this is a reasonable expectation.

1. df = get_some_data()
2. model = my_kedro_pipeline(input={'my_pipeline_input_df': df})
3. app = PredictionWebService(model)

Questions

What should be return with session.run?

Things to consider

How can any Python developer integrate with the kedro pipeline easily? Can it behave just like a function?
In an interactive workflow, it may make sense to keep all intermediate output in the resulting dict
Is there a known reason why the output is defined as it is?

Related Issue:

It would enable a better interactive workflow #1721
#1423 Is trying to improve/simplifies how we run kedro pipeline as a standalone package
#795 Discussion of how to use kedro as upstream/downstream

#1832

Issue Analytics

State:
Created a year ago
Comments:6 (6 by maintainers)

Top GitHub Comments

1reaction

noklamcommented, Oct 6, 2022

I just give it a go to see what would it takes to make the initial idea works, partly because I want to test how the nbdev system works. See DebugRunner

https://noklam.github.io/kedro-debug-runner/core.html

1reaction

noklamcommented, Oct 5, 2022

Supplement on the above comments to address @AntonyMilneQB question:

i.e. we could have free_output = pipeline.outputs()

The answer to that is there is a catalog.load call at the end, it’s an expensive call and potentially memory hungry. So persisted datasets are deleted from memory as long as they are not needed. For MemoryDataSet, it’s loaded in memory already, so there is no harm to return it.