Minor improvements in the IPython and Jupyter Notebook workflows

Context

From our experience supporting our users, as well as from simply reading our guide on the integration with IPython and Jupyter, we know that there are a number of challenges for users working with Kedro from notebooks.

  • There are many ways to do the same thing
  • The .ipython/ folder in our projects makes our templates more cluttered and incomprehensible
  • It is harder to maintain backwards compatibility when our IPython/Jupyter workflow relies on template code under .ipython/
  • Our kedro ipython, kedro jupyter lab/notebook helpers don’t work for managed Jupyter instances
  • For managed Jupyter instances, our users need to manually add extra scripts like ipython_loader.py
  • Our users have reportedly made custom scripts to cater for common workflows like preloading all dataset inputs for a specific node
  • Converting Jupyter Notebook code to Kedro nodes is still primarily done manually, despite our kedro jupyter convert CLI command

These challenges are not exhaustive, but they arguably present a significant barrier for Jupyter Notebook users interacting with Kedro and make for an unpleasant experience.

Proposal

To improve the experience without major changes in Kedro, we recently started developing a Kedro IPython extension meant to replace the startup script in the .ipython/ directory. The extension already has full feature parity with the startup script for IPython sessions, and after https://github.com/quantumblacklabs/kedro/commit/7613deccdf6391501e243c91512711ac00d1c78f it will be the primary way our IPython/Jupyter users interact with Kedro.

As next steps, I suggest that we aim for the following unified workflow based entirely on our IPython extension:

IPython

If the user can start the session themselves:

cd <kedro-project-root>/
ipython --ext="kedro.extras.extensions.ipython"

If the user is in an existing IPython session that they cannot or do not want to restart:

In [1]: %load_ext kedro.extras.extensions.ipython
In [2]: %reload_kedro <path_to_project_root>

Jupyter

For Jupyter, there will be only one way to load the extension, and it will happen per notebook:

In [1]: %load_ext kedro.extras.extensions.ipython
In [2]: %reload_kedro <path_to_project_root>

This should work for both local Jupyter setups and managed Jupyter instances.
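
Once the extension is loaded and %reload_kedro has run, the project objects it injects (catalog, context, session and pipelines, as discussed further below) can be used straight away. The dataset name here is purely illustrative:

In [3]: catalog.list()
In [4]: df = catalog.load("example_iris_data")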

IPython and Jupyter with preloaded Kedro extension

A new Kedro command should be created, meant to be run once, which enables Kedro’s extension in the user’s ~/.ipython/ folder. All Jupyter and IPython sessions started after this will have the Kedro IPython extension preloaded.

kedro ipython-init

The command will be a top-level command, without the need for an existing Kedro project. The name of the command is up for debate.

After this, Kedro projects will no longer need to have an .ipython/ folder in them.
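
A minimal sketch of what kedro ipython-init could do, assuming it simply drops a startup script into the user’s default IPython profile; the file name, helper name and exact mechanism here are hypothetical and up for debate:

from pathlib import Path

STARTUP_SCRIPT = '''\
# Auto-generated by `kedro ipython-init`.
from IPython import get_ipython

try:
    get_ipython().run_line_magic("load_ext", "kedro.extras.extensions.ipython")
except ImportError:
    pass  # Kedro is not installed in this environment, so do nothing.
'''

def ipython_init() -> Path:
    """Write a startup script into the default IPython profile so that every
    new IPython/Jupyter session preloads the Kedro extension."""
    startup_dir = Path.home() / ".ipython" / "profile_default" / "startup"
    startup_dir.mkdir(parents=True, exist_ok=True)
    target = startup_dir / "00-kedro-init.py"
    target.write_text(STARTUP_SCRIPT)
    return target

IPython executes every .py file in the profile’s startup/ folder at the beginning of each session, and IPython kernels do the same, so this should also cover managed Jupyter instances as long as they use the user’s default profile.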

Future

Once we have successfully migrated the community away from the old way of interacting with Kedro from IPython and Jupyter, we can continue developing the extension and add the following capabilities:

Running an IPython session with preloaded datasets for a node

After running this in a Kedro project

kedro ipython --node example_node

we can preload the datasets which are inputs to this node, thus allowing the user to debug their pipeline at that particular node. This functionality is already in use by internal teams, although they currently rely on their own scripts to facilitate it.
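
The CLI flag above does not exist yet, so purely as an illustration of the mechanics (the helper name is hypothetical), preloading could amount to looking the node up in the default pipeline and loading each of its inputs from the catalog, using the catalog and pipelines variables the extension already provides:

def preload_node_inputs(node_name, pipeline, catalog):
    """Return {dataset_name: loaded_data} for every input of the given node."""
    node = next(n for n in pipeline.nodes if n.name == node_name)
    return {name: catalog.load(name) for name in node.inputs}

# Example, inside a session started in a Kedro project:
# inputs = preload_node_inputs("example_node", pipelines["__default__"], catalog)
# get_ipython().push(inputs)  # expose each loaded dataset as a session variable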

Jupyter extension to allow node editing

Jupyter provides an API for custom content loading. We can use this API and develop a Kedro Jupyter Notebook Server extension, which will allow us to edit nodes from Jupyter and browse them through their Kedro node name rather than their filename. This is what will enable us to integrate Jupyter notebooks in Kedro Lab.

This extension is contingent on the existence of the IPython session with preloaded datasets for a node, which together will make for a seamless experience.
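
For illustration only, such a server extension could start as a thin subclass of Jupyter Server’s file-based contents manager; the class name is hypothetical, and the hard part (mapping Kedro node names to their source code) is deliberately left as a comment:

from jupyter_server.services.contents.largefilemanager import LargeFileManager

class KedroContentsManager(LargeFileManager):
    """Present Kedro pipeline namespaces and node names instead of folders and files."""

    def get(self, path, content=True, type=None, format=None):
        # A real implementation would translate a "<pipeline_namespace>/<node_name>"
        # path into the Python source defining that node before delegating here.
        return super().get(path, content=content, type=type, format=format)

Jupyter would then be pointed at it through its usual configuration, e.g. c.ServerApp.contents_manager_class = "my_package.KedroContentsManager" in jupyter_server_config.py.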

Comments

idanov commented on Feb 11, 2022

Here’s what I meant by creating a custom ContentsManager in Jupyter. Here’s what we can do currently in Kedro Viz:

https://user-images.githubusercontent.com/3482916/153586609-f2b74441-f38e-435b-bc95-6c4d14fec230.mov

And here’s what we could have if we create a custom ContentsManager and start Jupyter Lab in a Kedro project:

https://user-images.githubusercontent.com/3482916/153586668-05c7d25e-c683-4b99-b6eb-04b97905e2d5.mov

As you can see, instead of folders, we can show pipeline namespaces, and instead of files, we can show node names and edit them directly. Making this work will get us very close to enabling the same directly in Kedro Viz, which will be a very nice addition and make Data Science workflows much easier than they are currently.

idanov commented on Mar 10, 2022

As a way to progress forward on this one, we should look into the following steps:

  • Drop kedro activate-nbstripout as suggested by @yetudada
  • Drop kedro jupyter as it is rarely used and no longer needed or relevant
  • Drop kedro ipython as it is rarely used and no longer needed or relevant

Those commands provide very little to the user anyway, since they are wrappers around calling ipython or jupyter and are not relevant for managed instances of Jupyter, which is probably the most common way our users use Jupyter (think of Databricks and other managed solutions).

Instead of having those commands, we should make sure that loading the Kedro extension is the only widely known alternative and that it requires only a very small number of steps. So a set of other tasks needs to be completed:

  • Alias kedro.ipython to kedro.extras.extensions.ipython
  • Replace kedro ipython everywhere in our docs with ipython --ext="kedro.ipython"
  • Document how to load the Kedro extension in an existing IPython session, started without --ext="kedro.ipython"
  • Design a kedro jupyter-setup command (name is up for debate) which will create a Jupyter kernel that automatically loads the Kedro extension (see the sketch after this list)
    • Use the jupyter client kernelspec API to do that from Python
    • Make the command available only if jupyter is installed, else show a warning that Jupyter is not installed
    • Add the Kedro icon to the KernelSpec
  • Delete the ipython_loader.py for 0.18
  • Remove the .ipython/ folder from all starters
  • Devise a mechanism in kedro.ipython to surface loading errors in the IPython session when it starts (addressing this issue)
  • Refactor the current kedro.ipython to simplify it
  • Reassess what variables need to be available in the IPython session
    • Currently we provide catalog, context, session and pipelines
    • @AntonyMilneQB made a good point that the catalog is the only useful one since running pipelines from notebooks is not a common pattern and probably shouldn’t be done after we close this one
    • We might need to reintroduce parameters or params
  • Devise a mechanism (maybe a line magic) to set the IPython session in a node debug mode
    • No variables should be available except all datasets/parameters consumed by the node with the data already loaded
    • If possible, the variable names should be the same as the node’s function parameters
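
As a sketch of the kedro jupyter-setup idea from the list above: register a user-level “Kedro” kernel whose spec preloads the extension. The kernel name, display name and the assumption that ipykernel accepts --ext to preload an extension are illustrative, not an agreed design:

import json
import sys
import tempfile
from pathlib import Path

from jupyter_client.kernelspec import KernelSpecManager

KERNEL_SPEC = {
    "argv": [
        sys.executable, "-m", "ipykernel_launcher",
        "-f", "{connection_file}",
        "--ext", "kedro.ipython",  # preload the Kedro extension in every session
    ],
    "display_name": "Kedro",
    "language": "python",
}

def install_kedro_kernel():
    """Write kernel.json into a temporary folder and install it for the current user."""
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "kernel.json").write_text(json.dumps(KERNEL_SPEC, indent=2))
        return KernelSpecManager().install_kernel_spec(tmp, kernel_name="kedro", user=True)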

Some of those changes will be breaking changes, so it is probably worth trying to implement them for Kedro 0.18 (to be discussed, since that might require us to add deprecation warnings in a small 0.17.8 release, which is not ideal).

@AntonyMilneQB and I will turn those steps into issues and put them on our backlog. Once we complete them all, we’ll revisit this discussion and see how we can build on that to provide an even better Jupyter experience.
