Should we change the output of `session.run`?
Background
What’s the output of `session.run()`? Currently, this is not as clear as you might think, and it isn’t documented anywhere. The logic is defined in `runner.py`, and it can be counter-intuitive in some cases. Is there a good reason why we want to do this?
`kedro` has improved a lot in how a pipeline can be run as a standalone application with packaging and `KedroSession`; #1423 documents the different ways to do it. Personally, I think it is still not easy enough for someone inexperienced with `kedro` to integrate with it. #1423 mentions how a pipeline can be called programmatically. Even though the pipeline call is itself a function call, it doesn’t behave like a function: you can’t easily pass an input as an argument (it has to be a Catalog entry), and the output of the pipeline is also very restricted.
Motivation
Kedro works really well within the kedro world, but this also means that kedro works very differently from the rest of the Python world.
This issue mainly focuses on the output side; changing it would improve the experience of integrating a `kedro` pipeline as an upstream. In an over-simplified world, this should be straightforward to do. Currently I think we make a strong assumption that people work with a “Kedro Project”, but if we are moving towards a kedro package, i.e. using `from kedro_package import main`, it should behave just like a Python function. I think this is a reasonable expectation.
1. `df = get_some_data()`
2. `model = my_kedro_pipeline(input={'my_pipeline_input_df': df})`
3. `app = PredictionWebService(model)`
Questions
- What should `session.run` return?
Things to consider
- How can any Python developer integrate with the kedro pipeline easily? Can it behave just like a function?
- In an interactive workflow, it may make sense to keep all intermediate output in the resulting dict
- Is there a known reason why the output is defined as it is?
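To make the first two considerations concrete, here is a minimal sketch of what a "function-like" pipeline call could look like. It is plain Python and does not use Kedro at all: `call_pipeline`, `Step`, and the `keep_intermediate` flag are hypothetical names invented for illustration, and the two lambdas are toy stand-ins for real nodes.

```python
from typing import Any, Callable, Dict, List, Tuple

# (function, input names, output name): a toy stand-in for a pipeline node.
Step = Tuple[Callable[..., Any], List[str], str]


def call_pipeline(
    steps: List[Step],
    inputs: Dict[str, Any],
    keep_intermediate: bool = False,
) -> Dict[str, Any]:
    """Run the steps in order, taking inputs as a dict and returning outputs as a dict."""
    data = dict(inputs)
    for func, input_names, output_name in steps:
        data[output_name] = func(*(data[name] for name in input_names))

    produced = [output_name for _, _, output_name in steps]
    if keep_intermediate:
        # Interactive use: return everything the pipeline produced.
        return {name: data[name] for name in produced}
    # Default: return only outputs that no later step consumes.
    consumed = {name for _, input_names, _ in steps for name in input_names}
    return {name: data[name] for name in produced if name not in consumed}


steps: List[Step] = [
    (lambda df: [row for row in df if row is not None], ["my_pipeline_input_df"], "clean_df"),
    (lambda df: sum(df) / len(df), ["clean_df"], "model"),
]

raw = {"my_pipeline_input_df": [1.0, None, 3.0]}
outputs = call_pipeline(steps, raw)
everything = call_pipeline(steps, raw, keep_intermediate=True)
```

With this shape, the caller gets only `model` by default and can opt in to `clean_df` as well when exploring interactively. Whether a flag like this belongs on `session.run` is exactly the open question.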
Related Issue:
Top GitHub Comments
I just gave it a go to see what it would take to make the initial idea work, partly because I wanted to test how the `nbdev` system works. See DebugRunner: https://noklam.github.io/kedro-debug-runner/core.html
A supplement to the above comments, to address @AntonyMilneQB’s question:
The answer is that there is a `catalog.load` call at the end, which is expensive and potentially memory hungry. So persisted datasets are released from memory as soon as they are no longer needed. A `MemoryDataSet` is already loaded in memory, so there is no harm in returning it.
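The behaviour described above can be illustrated with a small toy model. This is not the code in `runner.py`; it is a simplified sketch in plain Python, and `run_toy_pipeline`, `Node`, and the `persisted` set are hypothetical names standing in for the runner, Kedro nodes, and the catalog's persisted datasets.

```python
from typing import Any, Callable, Dict, List, Set, Tuple

# (function, input names, output name): a toy stand-in for a Kedro node.
Node = Tuple[Callable[..., Any], List[str], str]


def run_toy_pipeline(
    nodes: List[Node],
    inputs: Dict[str, Any],
    persisted: Set[str],
) -> Dict[str, Any]:
    """Run nodes in order and return only the final outputs that are not persisted."""
    memory: Dict[str, Any] = dict(inputs)  # plays the role of in-memory datasets
    storage: Dict[str, Any] = {}           # plays the role of datasets saved to disk

    for func, input_names, output_name in nodes:
        args = [memory[n] if n in memory else storage[n] for n in input_names]
        result = func(*args)
        if output_name in persisted:
            storage[output_name] = result  # "saved", not kept in memory
        else:
            memory[output_name] = result

    consumed = {n for _, input_names, _ in nodes for n in input_names}
    produced = {output_name for _, _, output_name in nodes}
    final_outputs = produced - consumed
    # Returning a persisted output would require an extra (possibly expensive) load,
    # so only outputs that are already in memory are returned.
    return {n: memory[n] for n in final_outputs if n not in persisted}


nodes: List[Node] = [
    (lambda xs: [x * 2 for x in xs], ["raw"], "doubled"),
    (lambda xs: sum(xs), ["doubled"], "total"),
    (lambda xs: max(xs), ["doubled"], "largest"),
]

result = run_toy_pipeline(nodes, {"raw": [1, 2, 3]}, persisted={"largest"})
result_all_in_memory = run_toy_pipeline(nodes, {"raw": [1, 2, 3]}, persisted=set())
```

In the first call `largest` is persisted, so only `total` comes back. In the second call nothing is persisted, so both final outputs are returned. The same pipeline returning different things depending on catalog configuration is the part that can feel counter-intuitive.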