Minor improvements in the IPython and Jupyter Notebook workflows
Context
From our experience supporting our users, as well as from simply reading our guide on the integration with IPython and Jupyter, we know that users face a number of challenges when working with Kedro from notebooks.
- There are many ways to do the same thing
- The .ipython/ folder in our projects makes our templates more cluttered and incomprehensible
- It is harder to maintain backwards compatibility when our IPython/Jupyter workflow relies on template code under .ipython/
- Our kedro ipython and kedro jupyter lab/notebook helpers don’t work for managed Jupyter instances
- For managed Jupyter instances, our users need to manually add extra scripts like ipython_loader.py
- Our users have reportedly made custom scripts to cater for common workflows like preloading all dataset inputs for a specific node
- Converting Jupyter Notebook code to Kedro nodes is still primarily done manually despite our kedro jupyter notebook convert CLI command
This list of challenges is not exhaustive, but they arguably present a significant barrier for Jupyter Notebook users interacting with Kedro and make for an unpleasant experience.
Proposal
In order to improve the experience without major changes in Kedro, we recently started developing a Kedro IPython extension meant to replace the startup script in the .ipython/ directory. The extension already has full feature parity with the startup script for IPython sessions, and after https://github.com/quantumblacklabs/kedro/commit/7613deccdf6391501e243c91512711ac00d1c78f it will be the primary way our IPython/Jupyter users interact with Kedro.
As next steps, I suggest that we aim for the following unified workflow based entirely on our IPython extension:
IPython
If the user can start the session themselves:
cd <kedro-project-root>/
ipython --ext="kedro.extras.extensions.ipython"
If the user is in an existing IPython session they cannot or do not want to restart:
In [1]: %load_ext kedro.extras.extensions.ipython
In [2]: %reload_kedro <path_to_project_root>
Jupyter
For Jupyter, there will be only one way to load the extension and that will happen per notebook:
In [1]: %load_ext kedro.extras.extensions.ipython
In [2]: %reload_kedro <path_to_project_root>
This should work for both local Jupyter setups and managed Jupyter instances.
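For readers unfamiliar with what %load_ext does: IPython imports the named module and calls its load_ipython_extension(ipython) function, which is where a line magic such as %reload_kedro can be registered. The sketch below illustrates that mechanism only. It is not Kedro's implementation, and reload_project and its body are hypothetical stand-ins.

```python
from pathlib import Path


def reload_project(line: str = "") -> Path:
    """Hypothetical body for a %reload_kedro-style line magic.

    IPython passes everything after the magic name as a single string, so
    `%reload_kedro <path_to_project_root>` arrives here as `line`.
    """
    # An empty argument falls back to the current working directory.
    project_path = Path(line.strip() or ".").expanduser().resolve()
    # A real extension would (re)create its session objects at this point.
    return project_path


def load_ipython_extension(ipython) -> None:
    """Hook that IPython calls on `%load_ext <module>` or `--ext=<module>`."""
    ipython.register_magic_function(
        reload_project, magic_kind="line", magic_name="reload_kedro"
    )
```

Because the hook only needs the shell object it is handed, the same module works whether it is loaded from the command line, from a running IPython session, or from a notebook cell.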
IPython and Jupyter with preloaded Kedro extension
A new Kedro command should be created that is run once to enable Kedro’s extension in the user’s ~/.ipython/ folder. All Jupyter and IPython sessions started after this will have the Kedro IPython extension preloaded.
kedro ipython-init
The command will be a top-level command that does not require an existing Kedro project. The name of the command is up for debate.
After this, Kedro projects will no longer need to have an .ipython/ folder in them.
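One way such a command could work is to append the extension to the default profile's ipython_config.py through the InteractiveShellApp.extensions setting. The following is a minimal sketch under that assumption: the function name enable_extension is hypothetical, and it assumes the default profile directory profile_default is in use.

```python
from pathlib import Path

EXTENSION = "kedro.extras.extensions.ipython"


def enable_extension(ipython_dir: Path, extension: str = EXTENSION) -> Path:
    """Append an extension to the default profile's ipython_config.py."""
    config_file = ipython_dir / "profile_default" / "ipython_config.py"
    config_file.parent.mkdir(parents=True, exist_ok=True)
    line = f'c.InteractiveShellApp.extensions.append("{extension}")'
    existing = config_file.read_text() if config_file.exists() else ""
    if line not in existing:  # keeps the command safe to run more than once
        with config_file.open("a") as handle:
            if existing and not existing.endswith("\n"):
                handle.write("\n")
            handle.write(line + "\n")
    return config_file
```

A call such as enable_extension(Path.home() / ".ipython") would target the usual location. Users who set a custom IPython directory would need a different path.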
Future
Once we have successfully migrated the community away from the old way of interacting with Kedro from IPython and Jupyter, we can continue the development of the plugin and add the following capabilities:
Running an IPython session with preloaded datasets for a node
After running this in a Kedro project
kedro ipython --node example_node
we can preload the datasets which are inputs to this node, thus allowing the user to debug their pipeline at a particular node. This functionality is something already in use by internal teams, although they have their own scripts to facilitate it.
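The preloading step itself is small. Here is a minimal sketch, assuming the pipeline object exposes nodes, each node exposes name and inputs, and the catalog exposes load(name), as Kedro's Pipeline, Node and DataCatalog do. The helper name preload_node_inputs is hypothetical and is not one of the internal scripts mentioned above.

```python
from typing import Any, Dict


def preload_node_inputs(pipeline: Any, catalog: Any, node_name: str) -> Dict[str, Any]:
    """Load every input dataset of one node, keyed by dataset name."""
    for node in pipeline.nodes:
        if node.name == node_name:
            return {name: catalog.load(name) for name in node.inputs}
    raise ValueError(f"No node named {node_name!r} in the pipeline")
```

In a session where the extension has already exposed catalog and pipelines, a call could look like preload_node_inputs(pipelines["__default__"], catalog, "example_node").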
Jupyter extension to allow node editing
Jupyter provides an API for custom content loading. We can use this API and develop a Kedro Jupyter Notebook Server extension, which will allow us to edit nodes from Jupyter and browse them through their Kedro node name rather than their filename. This is what will enable us to integrate Jupyter notebooks in Kedro Lab.
This extension is contingent on the existence of the IPython session with preloaded datasets for a node, which will make for a seamless experience.
Issue Analytics
- Created 2 years ago
- Comments:11 (10 by maintainers)
Here’s what I meant by creating a custom ContentsManager in Jupyter. Here’s what we can do currently in Kedro Viz: https://user-images.githubusercontent.com/3482916/153586609-f2b74441-f38e-435b-bc95-6c4d14fec230.mov
And here’s what we could have if we create a custom ContentsManager and start Jupyter Lab in a Kedro project: https://user-images.githubusercontent.com/3482916/153586668-05c7d25e-c683-4b99-b6eb-04b97905e2d5.mov
As you can see, instead of folders, we can show pipeline namespaces, and instead of files, we can show node names and directly edit them. Making this work will get us very close to enabling the same directly in Kedro Viz, which will be a very nice addition and make Data Science workflows much easier than what we have currently.
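To make the folder and file analogy concrete, here is a rough sketch of the kind of model dictionaries a custom ContentsManager could return from its get() method. The field names follow the Jupyter contents API model as I understand it and should be checked against the Jupyter Server documentation. The helper names and the namespace/node path scheme are assumptions for illustration.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List


def _base_model(name: str, path: str, kind: str) -> Dict[str, Any]:
    """Build a contents model without content, as used in directory listings."""
    now = datetime.now(timezone.utc)
    return {
        "name": name,
        "path": path,
        "type": kind,
        "writable": True,
        "created": now,
        "last_modified": now,
        "mimetype": None,
        "content": None,
        "format": None,
    }


def namespace_model(namespace: str, node_names: List[str]) -> Dict[str, Any]:
    """Present a pipeline namespace as a directory whose entries are nodes."""
    model = _base_model(namespace, namespace, "directory")
    model["format"] = "json"
    # Each node appears as a file entry; its source would be loaded on demand.
    model["content"] = [
        _base_model(node, f"{namespace}/{node}", "file") for node in node_names
    ]
    return model
```

A real implementation would also need to map each node entry back to the function source it edits, which this sketch leaves out.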
As a way to progress forward on this one, we should look into the following steps:
- kedro activate-nbstripout as suggested by @yetudada
- kedro jupyter as it is rarely used and no longer needed or relevant
- kedro ipython as it is rarely used and no longer needed or relevant
Those commands provide very little to the user anyway, since they are wrappers around calling ipython or jupyter and are not relevant for managed instances of Jupyter, which is probably the most common way our users use Jupyter (think of Databricks and other managed solutions).
Instead of having those commands, we should make sure that loading the Kedro extension is the only widely known alternative, as well as provide a very small number of steps for this to happen. So a set of other tasks needs to be completed:
- kedro.ipython to kedro.extras.extensions.ipython
- kedro ipython everywhere in our docs with ipython --ext="kedro.ipython"
- --ext="kedro.ipython"
- kedro jupyter-setup command (name is up for debate) which will create a Jupyter Kernel to automatically load the Kedro extension
- jupyter is installed, else show a warning Jupyter is not installed
- ipython_loader.py for 0.18
- .ipython/ folder from all starters
- kedro.ipython to show errors on loading in the IPython session when started (addressing this issue)
- kedro.ipython to simplify it
- catalog, context, session and pipelines
- catalog is the only useful one since running pipelines from notebooks is not a common pattern and probably shouldn’t be done after we close this one
- parameters or params
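As an illustration of the kedro jupyter-setup idea in the list above, a Jupyter kernel can preload an extension by passing --ext to ipykernel_launcher in its kernel.json. This is a sketch under assumptions: the function names and the display name are hypothetical, and kedro.ipython is the module name proposed in the list rather than something guaranteed to exist.

```python
import json
import shutil
import sys
from pathlib import Path


def jupyter_available() -> bool:
    """Return True when a `jupyter` executable is on PATH."""
    return shutil.which("jupyter") is not None


def write_kernel_spec(kernel_dir: Path, extension: str = "kedro.ipython") -> Path:
    """Write a kernel.json that starts ipykernel with an extension preloaded."""
    spec = {
        "argv": [
            sys.executable, "-m", "ipykernel_launcher",
            "-f", "{connection_file}",  # placeholder filled in by Jupyter
            "--ext", extension,
        ],
        "display_name": "Kedro (hypothetical name)",
        "language": "python",
    }
    kernel_dir.mkdir(parents=True, exist_ok=True)
    spec_file = kernel_dir / "kernel.json"
    spec_file.write_text(json.dumps(spec, indent=2))
    return spec_file
```

The command could call jupyter_available() first and print the "Jupyter is not installed" warning when it returns False, then write the spec into one of the directories Jupyter searches for kernels.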
Some of those changes will be breaking changes, so it is probably worth trying to implement them for Kedro 0.18 (to be discussed, since that might require us to add deprecation warnings in a small 0.17.8 release, which is not ideal).
@AntonyMilneQB and I will turn those steps into issues and put them on our backlog. Once we complete them all, we’ll revisit this discussion and see how we can build on that to provide an even better Jupyter experience.