
Workflow of debugging Kedro pipeline in notebook

See original GitHub issue

Background

Kedro’s philosophy is that you should use notebooks sparingly and keep your code in Python modules. But there are situations where you have to debug in a notebook because the data infrastructure is tied to the platform.

What are the pain points with debugging Kedro Pipeline?

  1. Situations where you have to use a notebook - for example, notebook-based platforms like Databricks, or cases where the debugger slows down massively once large datasets are loaded in memory, so you fall back to a notebook as your debug session instead. Here, debugging a Python module is annoying because the source code doesn’t live in the notebook, so some copy & paste or monkey-patching seems unavoidable.
  2. Scheduled/distributed cluster jobs - for example, deployment, or leveraging cluster computing for large-scale ML experiments. You can’t attach a debugger here and I don’t know of any workaround. If it’s just one remote server, you can attach a remote debugger, which VS Code & PyCharm support.
  3. Kedro-specific APIs aren’t friendly enough - MemoryDataSet / CacheDataSet and KedroSession do not have the most user-friendly interface for an interactive environment like a notebook. I would love more feedback about this.

My opinion is:

  1. Not a Kedro-specific problem, but a more common one for Kedro users because of the nature of ML/data science pipelines; we may be able to figure out a smoother workflow.
  2. Not a Kedro problem; this is true for any Python program, and I don’t see anything Kedro could do about it (yet).
  3. A Kedro problem that we should improve.

I talked to Tom earlier to understand the debugging process he uses.

Steps to debug Kedro pipeline in a notebook

  1. Read the stack trace - find the line of code that produces the error
  2. Find which node this function belongs to
  3. Try to rerun the pipeline just before this node
  4. If the input isn’t a persisted dataset, you need to change it in catalog.yml and re-run the pipeline until the error is thrown again
  5. The session has already been used once, so calling session.run again throws an error (Tom had a wrapper function that recreates the session and does something similar to session.run)
  6. Create a new session, or %reload_kedro?
  7. Now catalog.load that persisted dataset, i.e. func(catalog.load("some_data"))
  8. Copy the source code of func into the notebook - this works if the function itself is the node function, but if it is some function buried deep down, that means a lot more copy-pasting and possibly changing imports
  9. Change the source code until it works in the notebook
  10. Rerun the pipeline to ensure everything works
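The copy-pasting in step 8 can be partly automated with the standard library’s `inspect.getsource`, which pulls a function’s source into the notebook programmatically. A minimal sketch (`process` is a hypothetical stand-in for a node function defined in a module):

```python
import inspect

def process(df):
    """A stand-in for a node function normally defined in a Python module."""
    return [x * 2 for x in df]

# Instead of copy-pasting from the editor, fetch the source directly:
src = inspect.getsource(process)
print(src)

# In an IPython/Jupyter session, the source can be dropped into the
# next cell for editing (uncomment inside a notebook):
# get_ipython().set_next_input(src)
```

This avoids hunting through modules, though imports used by the function still need to be brought into the notebook namespace manually.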

Note that in a local development environment, all you would do is set a breakpoint. With a notebook, you have to touch several files instead, i.e.

  • Notebook cell
  • Source code of the function that causes the problem
  • catalog.yml

Problems

  • KedroSession cannot be re-run; users end up calling session.run multiple times for debugging purposes.
  • Sometimes session.run doesn’t give the correct output; #1802 tries to address this problem.
  • An error happens in the 50th node of a 100-node pipeline - how can we skip the earlier steps so less copy & paste is needed?
  • Not all nodes write data to disk - this means they can’t be recovered easily. It makes sense to keep most things in memory, but can we make it easier for debug sessions, so users can change this behavior without editing every entry in the catalog?
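The first problem can be illustrated with a toy model (not the real KedroSession API): a session that refuses a second run, and the kind of wrapper users end up writing to recreate it each time.

```python
class ToySession:
    """Mimics a session that can only be run once (illustrative only)."""
    def __init__(self):
        self._used = False

    def run(self):
        if self._used:
            raise RuntimeError("Session has already been used.")
        self._used = True
        return "pipeline output"

def rerun():
    """The workaround: recreate the session on every call."""
    session = ToySession()
    return session.run()

first = rerun()
second = rerun()  # works, because a fresh session is created each time
```

A debug-friendly session would make this wrapper unnecessary.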

Why is this less of a problem with orchestrators like Airflow?

  • All datasets are persisted - each node is self-contained, so you only need to rerun the node of interest
  • The UI shows you clearly which node fails

Proposal

We are definitely not trying to re-create the debugger experience. Ideally, Kedro could just pop open the correct context at the exact line of code (similar to putting a breakpoint right before the error happens).

  • An easier way to re-use a session? Or is %reload_kedro enough? If we want to keep things in memory, then reload_kedro does not fit well.
  • The %load_node proposal mentioned in #1721 - which should address Steps 1-7
  • Some debug mode for the session where you can do session.run(dataset=["a","b","c"]) and keep the specific datasets you are interested in, or even
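The hypothetical `dataset=` keyword could behave like this sketch, where a run returns only the requested intermediate datasets instead of discarding everything held in memory (the function name and keyword are assumptions, not an existing Kedro API):

```python
def run_pipeline(keep=()):
    """Run a toy three-node pipeline; return only the datasets named in `keep`."""
    memory = {}
    memory["a"] = [1, 2, 3]                     # node 1: load raw data
    memory["b"] = [x + 1 for x in memory["a"]]  # node 2: transform
    memory["c"] = sum(memory["b"])              # node 3: aggregate
    # Keep only the datasets the user asked to inspect:
    return {name: memory[name] for name in keep}

kept = run_pipeline(keep=["b", "c"])
# kept holds the intermediate "b" and final "c" for inspection in the notebook
```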

Some of this could reuse the backtracking logic we have in #1795, so we don’t have to rerun the entire pipeline.
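The backtracking idea can be sketched as a graph walk: starting from the failing node, walk upstream until every required input is a persisted dataset; only the nodes visited need rerunning. This is a toy model, and #1795’s actual logic may differ:

```python
def nodes_to_rerun(inputs, persisted, producers, target):
    """
    inputs:    node -> list of dataset names it consumes
    persisted: set of dataset names that exist on disk
    producers: dataset name -> node that creates it
    target:    the failing node
    Returns the set of nodes that must be rerun.
    """
    needed, stack = set(), [target]
    while stack:
        node = stack.pop()
        if node in needed:
            continue
        needed.add(node)
        for ds in inputs[node]:
            if ds not in persisted:      # in-memory dataset: rerun its producer
                stack.append(producers[ds])
    return needed

inputs = {"n1": ["raw"], "n2": ["a"], "n3": ["b"]}
producers = {"a": "n1", "b": "n2"}
persisted = {"raw", "a"}                 # "b" only ever lived in memory
rerun = nodes_to_rerun(inputs, persisted, producers, "n3")
# rerun contains n3 and n2; n1 is skipped because "a" is already on disk
```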

Issue Analytics

  • State: open
  • Created: a year ago
  • Reactions: 3
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
merelcht commented, Nov 9, 2022

Discussed in Technical Design

The general agreement is that we need to improve the debugging workflow of Kedro in notebooks. Concrete actions to achieve this:

0 reactions
noklam commented, Nov 9, 2022

Potentially useful IPython magics and utilities

  • get_ipython().set_next_input(s)
  • %debug
  • %load
  • from inspect import getsource