Workflow of debugging Kedro pipeline in notebook
Background
Kedro’s philosophy is largely that you should use notebooks sparingly and keep your code in Python modules. But there are situations where you have to debug in a notebook, because the data infrastructure is tied to the platform.
What are the pain points of debugging a Kedro pipeline?
1. You have to use a notebook - for example on notebook-based platforms like Databricks, or when the debugger slows down massively because large datasets are loaded in memory, so you fall back to a notebook as your debug session. In this case, debugging a Python module is annoying because the source code doesn’t live with the notebook, and some copy & paste or monkey-patching seems unavoidable.
2. Scheduled/distributed cluster jobs - for example deployment, or leveraging cluster computing for large-scale ML experiments. Here you can’t attach a debugger, and I don’t know any workaround. If it’s just one remote server, you can attach a remote debugger, which VS Code & PyCharm support.
3. Kedro-specific API isn’t friendly enough - `MemoryDataSet`/`CachedDataSet` and `KedroSession` do not have the most user-friendly interface for an interactive environment like a notebook.

I think 3 is something Kedro should solve, and I would love more feedback about this. 1 is not a Kedro-specific problem, but it’s more common for Kedro users due to data science/ML workflows, and we may try to make it easier. I don’t have any workaround for 2.
My opinion is:
1. Not a Kedro-specific problem, but it’s more common for Kedro users because of the nature of ML/data science pipelines; we may figure out a smoother workflow.
2. Not a Kedro problem - this is true for any Python program, and I don’t see anything Kedro could do about it (yet).
3. A Kedro problem that we should improve.
I talked to Tom earlier and tried to understand the debugging process he uses.
Steps to debug a Kedro pipeline in a notebook
- Read the stack trace - find the line of code that produces the error.
- Find which node this function belongs to.
- Try to rerun the pipeline just before this node.
- If the node’s input is not a persisted dataset, you need to change it in `catalog.yml` and re-run the pipeline; the error is thrown again. A `session` has already been used once, so calling `session.run` again throws an error (so he had a wrapper function that recreates the `session` and does something similar to `session.run`).
- Create a new session, or is `%reload_kedro` enough?
- Now `catalog.load` the persisted dataset, i.e. `func(catalog.load("some_data"))`.
- Copy the source code of `func` into the notebook. This works if the function itself is the node function, but if it is some function buried deep down, that’s a lot more copy-pasting and maybe changes to imports.
- Change the source code and make it work in the notebook.
- Rerun the pipeline to ensure everything works.
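The load-then-edit core of these steps can be sketched as notebook cells. This is a stand-in, not Kedro’s API: a plain dict replaces `catalog.load`, and `func` and `"some_data"` are the placeholder names from the steps above.

```python
# Stand-in for the notebook debugging loop: load the persisted input,
# paste the node function into the notebook, and iterate on it.
# A plain dict replaces the Kedro DataCatalog; "some_data" and func
# are placeholders, not real project objects.

catalog = {"some_data": [1, 2, 3]}  # stand-in for catalog.load("some_data")

def func(data):
    # Source copied from the project module into the notebook;
    # edit here until it no longer raises.
    return [x * 2 for x in data]

# Rerun just the failing node function on its loaded input:
result = func(catalog["some_data"])
```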
Note that in a local development environment, all you would do is set a breakpoint. With a notebook, you have to touch a few files, i.e.:
- the notebook cell
- the source code of the function that causes the problem
- `catalog.yml`
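For reference, persisting an intermediate dataset means adding an entry like the following to `catalog.yml`. The dataset name and filepath are placeholders; `pandas.CSVDataSet` is the dataset-type spelling current at the time of this issue.

```yaml
# Hypothetical catalog.yml entry: without it, "intermediate_data"
# stays a MemoryDataSet and is lost when the session ends.
intermediate_data:
  type: pandas.CSVDataSet
  filepath: data/02_intermediate/intermediate_data.csv
```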
Problems
- `KedroSession` cannot be re-run; users will call `session.run` multiple times for debugging purposes.
- Sometimes `session.run` doesn’t give the correct output; #1802 tries to address this problem.
- Errors happen in the 50th node of a 100-node pipeline - how can we remove some steps so that less copy & paste is needed?
- Not all nodes write data to disk - this means they can’t be recovered easily. It makes sense to keep most things in memory, but can we make it easier for debug sessions, letting users change this behavior instead of editing every entry in the catalog?
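The first problem is what Tom worked around with a wrapper that recreates the session on every call. A minimal sketch of that pattern, using a hypothetical stand-in class to simulate the run-once behavior rather than Kedro’s real `KedroSession` API:

```python
class OneShotSession:
    """Stand-in for KedroSession: it can only be run once."""

    def __init__(self):
        self._used = False

    def run(self, to_nodes=None):
        if self._used:
            raise RuntimeError("A session can only be run once.")
        self._used = True
        return {"to_nodes": to_nodes}


def rerun(to_nodes=None):
    """Workaround: recreate a fresh session before every run,
    so repeated calls in a notebook don't raise."""
    session = OneShotSession()
    return session.run(to_nodes=to_nodes)


first = rerun(to_nodes=["my_node"])
second = rerun(to_nodes=["my_node"])  # works, unlike calling session.run twice
```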
Why is this less of a problem with tools like Airflow?
- All datasets are persisted - each node is self-contained, so you only need to rerun the node you are interested in.
- The UI shows clearly which node fails.
Proposal
We are definitely not trying to re-create the debugger experience. Ideally, Kedro would just pop out the correct context at the exact line of code (similar to putting a breakpoint right before the error happens).
- An easier way to re-use a session? Or is `%reload_kedro` enough? If we want to keep things in memory, then `%reload_kedro` does not fit well.
- The `%load_node` proposal mentioned in #1721, which should address Step 1-Step 7.
- Some debug mode for `session` where you can do `session.run(dataset=["a","b","c"])` and keep the specific datasets you are interested in.

Some of this could reuse the backtracking logic we have in #1795, so we don’t have to rerun the entire pipeline.
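The backtracking idea can be sketched as a graph walk: start from the datasets of interest and move upstream, stopping at persisted datasets, so only the necessary nodes are rerun. The data structures below are hypothetical illustrations, not Kedro’s pipeline API:

```python
def nodes_to_rerun(targets, producers, inputs, persisted):
    """Walk upstream from the target datasets and collect the nodes
    that must be rerun, stopping at persisted datasets.

    producers: dataset name -> node that produces it
    inputs:    node name -> list of its input datasets
    persisted: set of dataset names that exist on disk
    """
    needed, stack = set(), list(targets)
    while stack:
        dataset = stack.pop()
        if dataset in persisted or dataset not in producers:
            continue  # persisted (or free) inputs can be loaded directly
        node = producers[dataset]
        if node in needed:
            continue  # already scheduled
        needed.add(node)
        stack.extend(inputs[node])  # keep backtracking upstream
    return needed
```

With a tiny pipeline `a -> n1 -> b -> n2 -> c`, persisting `b` shrinks the rerun set from both nodes to just `n2`.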
Issue Analytics
- State:
- Created: a year ago
- Reactions: 3
- Comments: 5 (5 by maintainers)
Top GitHub Comments
Discussed in Technical Design
The general agreement is that we need to improve the debugging workflow of Kedro in notebooks. Concrete actions to achieve this:
- The `%load_node` idea described above. We might want to call this magic `%debug_node` instead, to make it clear it does more than just loading the node and is meant for debugging.
Potentially useful IPython magic
- `get_ipython().set_next_input(s)`
- `%debug`
- `%load`
- `from inspect import getsource`
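For example, `inspect.getsource` can fetch a node function’s source, and `set_next_input` (available only inside IPython, so commented out here) would paste it into the next notebook cell. `example_node` is a placeholder for a real project function:

```python
from inspect import getsource

def example_node(df):
    """Placeholder node function; in practice, import it from your project."""
    return df

# Fetch the source so it can be edited in the notebook:
src = getsource(example_node)

# Inside IPython/Jupyter, this would pre-fill the next cell with it:
# get_ipython().set_next_input(src)
```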