Workflow of debugging Kedro pipeline in notebook
Background
Kedro’s philosophy is largely that you should use notebooks sparingly and keep your code in Python modules. But there are situations where you have to debug in a notebook, because the data infrastructure is tied to the platform.
What are the pain points of debugging a Kedro pipeline?
1. You have to use a notebook - for example on notebook-based platforms like Databricks, or when the debugger slows down massively because large datasets are loaded in memory, so you fall back to a notebook as your debug session. In this case, debugging a Python module is annoying because the source code doesn’t live with the notebook, and some copy & paste or monkey-patching seems unavoidable.
2. Scheduled/distributed cluster jobs - for example deployment, or leveraging cluster computing for large-scale ML experiments. Here you can’t attach a debugger, and I don’t know any workaround. If it’s just one remote server, you can attach a remote debugger, which VS Code & PyCharm support.
3. Kedro-specific API isn’t friendly enough - `MemoryDataSet`/`CachedDataSet` and `KedroSession` do not have the most user-friendly interface for an interactive environment like a notebook.

I think 3 is something Kedro should solve, and I would love more feedback about this. 1 is not a Kedro-specific problem, but it’s more common for Kedro users due to data science/ML workflows, and we may try to make it easier. I don’t have any workaround for 2.
My opinion is:
1. Not a Kedro-specific problem, but it’s more common for Kedro users because of the nature of ML/data science pipelines; we may figure out a smoother workflow.
2. Not a Kedro problem - this is true for any Python program, and I don’t see anything Kedro could do about it (yet).
3. A Kedro problem that we should improve.
I talked to Tom earlier and tried to understand the debugging process he uses.
Steps to debug a Kedro pipeline in a notebook
- Read the stack trace - find the line of code that produces the error.
- Find which node this function belongs to.
- Try to rerun the pipeline just before this node.
- If the node’s input is not a persisted dataset, you need to change it in `catalog.yml` and re-run the pipeline; the error is thrown again. A `session` has already been used once, so calling `session.run` again throws an error (so he had a wrapper function that recreates the `session` and does something similar to `session.run`).
- Create a new session, or is `%reload_kedro` enough?
- Now `catalog.load` the persisted dataset, i.e. `func(catalog.load("some_data"))`.
- Copy the source code of `func` into the notebook. This works if the function itself is the node function, but if it is some function buried deep down, that’s a lot more copy-pasting and maybe changes to imports.
- Change the source code and make it work in the notebook.
- Rerun the pipeline to ensure everything works.
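The load-then-edit core of these steps can be sketched as notebook cells. This is a stand-in, not Kedro’s API: a plain dict replaces `catalog.load`, and `func` and `"some_data"` are the placeholder names from the steps above.

```python
# Stand-in for the notebook debugging loop: load the persisted input,
# paste the node function into the notebook, and iterate on it.
# A plain dict replaces the Kedro DataCatalog; "some_data" and func
# are placeholders, not real project objects.

catalog = {"some_data": [1, 2, 3]}  # stand-in for catalog.load("some_data")

def func(data):
    # Source copied from the project module into the notebook;
    # edit here until it no longer raises.
    return [x * 2 for x in data]

# Rerun just the failing node function on its loaded input:
result = func(catalog["some_data"])
```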
Note that in a local development environment, all you would do is set a breakpoint. With a notebook, you have to touch a few files, i.e.:
- the notebook cell
- the source code of the function that causes the problem
- `catalog.yml`
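For reference, persisting an intermediate dataset means adding an entry like the following to `catalog.yml`. The dataset name and filepath are placeholders; `pandas.CSVDataSet` is the dataset-type spelling current at the time of this issue.

```yaml
# Hypothetical catalog.yml entry: without it, "intermediate_data"
# stays a MemoryDataSet and is lost when the session ends.
intermediate_data:
  type: pandas.CSVDataSet
  filepath: data/02_intermediate/intermediate_data.csv
```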
Problems
- `KedroSession` cannot be re-run; users will call `session.run` multiple times for debugging purposes.
- Sometimes `session.run` doesn’t give the correct output; #1802 tries to address this problem.
- Errors happen in the 50th node of a 100-node pipeline - how can we remove some steps so that less copy & paste is needed?
- Not all nodes write data to disk - this means they can’t be recovered easily. It makes sense to keep most things in memory, but can we make it easier for debug sessions, letting users change this behavior instead of editing every entry in the catalog?
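The first problem is what Tom worked around with a wrapper that recreates the session on every call. A minimal sketch of that pattern, using a hypothetical stand-in class to simulate the run-once behavior rather than Kedro’s real `KedroSession` API:

```python
class OneShotSession:
    """Stand-in for KedroSession: it can only be run once."""

    def __init__(self):
        self._used = False

    def run(self, to_nodes=None):
        if self._used:
            raise RuntimeError("A session can only be run once.")
        self._used = True
        return {"to_nodes": to_nodes}


def rerun(to_nodes=None):
    """Workaround: recreate a fresh session before every run,
    so repeated calls in a notebook don't raise."""
    session = OneShotSession()
    return session.run(to_nodes=to_nodes)


first = rerun(to_nodes=["my_node"])
second = rerun(to_nodes=["my_node"])  # works, unlike calling session.run twice
```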
Why is this less of a problem with tools like Airflow?
- All datasets are persisted - each node is self-contained, so you only need to rerun the node you are interested in.
- The UI shows clearly which node fails.
Proposal
We are definitely not trying to re-create the debugger experience. Ideally, Kedro would just pop out the correct context at the exact line of code (similar to putting a breakpoint right before the error happens).
- An easier way to re-use a session? Or is `%reload_kedro` enough? If we want to keep things in memory, then `%reload_kedro` does not fit well.
- The `%load_node` proposal mentioned in #1721, which should address Step 1-Step 7.
- Some debug mode for `session` where you can do `session.run(dataset=["a","b","c"])` and keep the specific datasets you are interested in.

Some of this could reuse the backtracking logic we have in #1795, so we don’t have to rerun the entire pipeline.
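The backtracking idea can be sketched as a graph walk: start from the datasets of interest and move upstream, stopping at persisted datasets, so only the necessary nodes are rerun. The data structures below are hypothetical illustrations, not Kedro’s pipeline API:

```python
def nodes_to_rerun(targets, producers, inputs, persisted):
    """Walk upstream from the target datasets and collect the nodes
    that must be rerun, stopping at persisted datasets.

    producers: dataset name -> node that produces it
    inputs:    node name -> list of its input datasets
    persisted: set of dataset names that exist on disk
    """
    needed, stack = set(), list(targets)
    while stack:
        dataset = stack.pop()
        if dataset in persisted or dataset not in producers:
            continue  # persisted (or free) inputs can be loaded directly
        node = producers[dataset]
        if node in needed:
            continue  # already scheduled
        needed.add(node)
        stack.extend(inputs[node])  # keep backtracking upstream
    return needed
```

With a tiny pipeline `a -> n1 -> b -> n2 -> c`, persisting `b` shrinks the rerun set from both nodes to just `n2`.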
Issue Analytics
- State:
- Created: a year ago
- Reactions: 3
- Comments: 5 (5 by maintainers)
Top GitHub Comments
Discussed in Technical Design
The general agreement is that we need to improve the debugging workflow of Kedro in notebooks. Concrete actions to achieve this:
- The `%load_node` idea described above. We might want to call this magic `%debug_node` instead, to make it clear it does more than just loading the node and is meant for debugging.
Potentially useful IPython magic
- `get_ipython().set_next_input(s)`
- `%debug`
- `%load`
- `from inspect import getsource`
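For example, `inspect.getsource` can fetch a node function’s source, and `set_next_input` (available only inside IPython, so commented out here) would paste it into the next notebook cell. `example_node` is a placeholder for a real project function:

```python
from inspect import getsource

def example_node(df):
    """Placeholder node function; in practice, import it from your project."""
    return df

# Fetch the source so it can be edited in the notebook:
src = getsource(example_node)

# Inside IPython/Jupyter, this would pre-fill the next cell with it:
# get_ipython().set_next_input(src)
```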