Evaluating the Kedro and Databricks workflow
Introduction
We’ve entered the battleground of ML development workflows: a notebook-driven approach versus one written primarily in an IDE such as VS Code or PyCharm. ⚔️
Why should ML developers use an IDE instead of a notebook to develop their data and ML pipelines?
The notebook-driven approach struggles when the goal is a code base that has to be maintained. With notebooks, it is hard to write tests and documentation, leverage version control systems, sort out a pipeline’s running order and collaborate with others.
You don’t need to take our word for it; consider the perspective of the Databricks Labs team, who are aware of this problem:
“As projects on Databricks grow larger, Databricks users may struggle to keep up with the numerous notebooks containing ETL, data science experimentation, dashboards etc. While there are various short-term workarounds, such as using the `%run` command to call other notebooks from within your current notebook, it’s useful to follow traditional software engineering best practices of separating reusable code from pipelines calling that code. Additionally, building tests around your pipelines to verify that the pipelines are also working is another important step toward production-grade development processes.”
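The separation described in the quote can be sketched in plain Python. This is a minimal illustration, not Kedro or Databricks code: every name in it (`drop_missing`, `add_ratio`, `run_pipeline`, `PIPELINE`, and the `distance_km`, `hours` and `speed_kmh` fields) is hypothetical and chosen only for the example.

```python
from functools import partial
from typing import Callable, Dict, List, Optional

# Hypothetical record type for this sketch: one row of numeric fields.
Record = Dict[str, Optional[float]]
Step = Callable[[List[Record]], List[Record]]


# --- Reusable code: plain functions with no notebook or cluster dependency ---
def drop_missing(records: List[Record], key: str) -> List[Record]:
    """Keep only the records where `key` is present and not None."""
    return [r for r in records if r.get(key) is not None]


def add_ratio(
    records: List[Record], numerator: str, denominator: str, out: str
) -> List[Record]:
    """Return copies of the records with `out` set to numerator / denominator."""
    return [{**r, out: r[numerator] / r[denominator]} for r in records]


# --- Pipeline: only wires the reusable functions together, in order ---
def run_pipeline(records: List[Record], steps: List[Step]) -> List[Record]:
    for step in steps:
        records = step(records)
    return records


PIPELINE: List[Step] = [
    partial(drop_missing, key="distance_km"),
    partial(add_ratio, numerator="distance_km", denominator="hours", out="speed_kmh"),
]


# --- Test: verifies the pipeline end to end, without opening a notebook ---
def test_pipeline_computes_speed() -> None:
    raw = [
        {"distance_km": 100.0, "hours": 2.0},
        {"distance_km": None, "hours": 1.0},
    ]
    expected = [{"distance_km": 100.0, "hours": 2.0, "speed_kmh": 50.0}]
    assert run_pipeline(raw, PIPELINE) == expected
```

Because the functions live in an importable module rather than in notebook cells, a test runner such as pytest can pick up `test_pipeline_computes_speed` in CI, and a notebook can still import and call `run_pipeline` for exploration.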
The Databricks Labs team have piloted multiple projects to allow users to leverage an IDE-based workflow, including CI/CD templates, Databricks Connect, dbx, Databricks Repos and the Databricks CLI.
Why does this affect Kedro?
Kedro suggests an IDE-based workflow and proposes that notebooks are suitable for prototyping, not the final code base. This tension often reveals itself when non-Kedro users primarily rely on Jupyter notebooks and when Kedro users interact with notebook-driven platforms like Databricks.
Why should we care?
We have a growing category of Kedro users who rely on Databricks to scale their data and machine-learning pipelines. We have also seen 341 queries related to Databricks on our Q&A forums; by comparison, there are 37 queries about AWS SageMaker. These counts come from a term search across our internal Slack channel and open Discord forum.
Our Databricks deployment documentation is also the most viewed in the deployment series. We also have some qualitative evidence to suggest that the current development and deployment experience is a barrier to adopting Kedro in organisations that rely on Databricks.
What is the scope of our work?
This exercise aims to define a seamless development and deployment experience for Kedro users on Databricks. We will target ML developers who prefer an IDE-based workflow, use Databricks to support their PySpark workflows and leverage Kedro to author their data and ML software. Users who rely solely on a notebook-based approach are out of scope; we are exploring ways to help that second group through our improvements to the IPython and Jupyter Notebook workflows.
The first part of our work will focus on a research assignment to understand the landscape of our users’ problems related to the IDE workflow on Databricks and leveraging Kedro on Databricks. We’ll use interviews to get an initial lay-of-the-land, a survey to source quantitative data and observation (screen recordings) or role-play studies to reproduce workflow errors.
What are we hoping to understand?
At the end of this research study:
- We should have a prioritised list of pain points in the following categories:
  - The IDE workflow in Databricks,
  - Kedro on Databricks,
  - And potentially even just Kedro issues;
- We should also understand our users’ key workflows,
- The workarounds they may have created,
- And be able to communicate the value of using Kedro and Databricks together.
This work will feed into the Kedro backlog and a hackathon that we will plan with the Databricks team.
Who will we be speaking to?
- Maria Olivia Lihn
- Debanjan Banerjee
- Diana Montanes
- Roman Drapeko
- Nishant Kumar-NKC
- Saravanakumar Subramaniam
- Benjamin Levy
- Eduardo Coronado
- Ingo Walz
- Danny Farah
- Poornima Ponthagani
- Avaneesh Yembadi
- Anil Chouldary
- Logan Rupert
- @marioFeynman
- @Malaguth
- @WolVez
Top GitHub Comments
Interview questions
Introduction
Workflow on the last project that you used an IDE, Kedro and Databricks together
Databricks
Use of Kedro on Databricks
Workflow
Conclusion
Hello Kedro people, has this evolved somewhere or somehow? I’m very interested in this topic. At the company I work at, we’ve been using kedro-connect with fair success, but it seems Databricks won’t continue it. Has anyone tried dbx?