
Evaluating the Kedro and Databricks workflow

See original GitHub issue

Introduction

We’ve entered the battleground of ML development workflows: a notebook-driven approach versus one built primarily in an IDE such as VS Code or PyCharm. ⚔️

Why should ML developers use an IDE instead of a notebook to develop their data and ML pipelines?

The notebook-driven approach struggles to produce a code base that can be maintained. It is also challenging to write tests and documentation, leverage version control, sort out a pipeline’s running order, and collaborate with others when working in notebooks.

You don’t need to take our word for it; consider the perspective of the Databricks Labs team, who are well aware of this problem:

“As projects on Databricks grow larger, Databricks users may struggle to keep up with the numerous notebooks containing ETL, data science experimentation, dashboards etc. While there are various short-term workarounds, such as using the %run command to call other notebooks from within your current notebook, it’s useful to follow traditional software engineering best practices of separating reusable code from pipelines calling that code. Additionally, building tests around your pipelines to verify that the pipelines are also working is another important step toward production-grade development processes.”

The Databricks Labs team have piloted multiple projects to allow users to leverage an IDE-based workflow, including CI/CD templates, Databricks Connect, dbx, Databricks Repos and the Databricks CLI.
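The best practice the quote describes — separating reusable code from the pipelines that call it, and building tests around it — can be sketched in plain Python. The module and function names below are hypothetical, for illustration only; in a real project the reusable module would live in a packaged library imported by notebooks or pipelines, rather than in a notebook called via `%run`:

```python
# transforms.py -- reusable logic, importable from a notebook or a pipeline
# (hypothetical module; not from any specific Kedro or Databricks project)

def clean_prices(rows):
    """Drop rows with a missing price and coerce prices to float."""
    return [
        {**row, "price": float(row["price"])}
        for row in rows
        if row.get("price") not in (None, "")
    ]


# test_transforms.py -- a test that can run in CI, independent of Databricks
def test_clean_prices_drops_missing_and_coerces():
    rows = [{"sku": "a", "price": "3.5"}, {"sku": "b", "price": None}]
    assert clean_prices(rows) == [{"sku": "a", "price": 3.5}]


test_clean_prices_drops_missing_and_coerces()
```

Because the logic lives in an importable module with its own test, it can be versioned, reviewed, and verified outside any notebook — which is exactly the gap the workarounds like `%run` leave open.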

Why does this affect Kedro?

Kedro suggests an IDE-based workflow and proposes that notebooks are suitable for prototyping, not the final code base. This tension often reveals itself when non-Kedro users primarily rely on Jupyter notebooks and when Kedro users interact with notebook-driven platforms like Databricks.

Why should we care?

We have a growing category of Kedro users who rely on Databricks to scale their data and machine-learning pipelines. We have seen 341 queries related to Databricks on our Q&A forums; comparatively, there are 37 queries about AWS SageMaker. This data comes from a term search across our internal Slack channel and open Discord forum.

Our Databricks deployment documentation is also the most viewed in the deployment series. We also have some qualitative evidence to suggest that the current development and deployment experience is a barrier to adopting Kedro in organisations that rely on Databricks.

What is the scope of our work?

This exercise aims to define a seamless development and deployment experience for Kedro users on Databricks. We will target ML developers who prefer an IDE-based workflow, use Databricks to support their PySpark workloads and leverage Kedro to author their data and ML software; out of scope are users who rely solely on a notebook-based approach. We are exploring ways to help this second group through our improvements to the IPython and Jupyter Notebook workflows.

The first part of our work will be a research assignment to understand the landscape of our users’ problems with the IDE workflow on Databricks and with leveraging Kedro on Databricks. We’ll use interviews to get an initial lay of the land, a survey to source quantitative data, and observation (screen recordings) or role-play studies to reproduce workflow errors.

What are we hoping to understand?

At the end of this research study:

  • We should have a prioritised list of pain points according to the following categories:
    • The IDE workflow in Databricks,
    • Kedro on Databricks,
    • And potentially even just Kedro issues;
  • We should also understand our users’ key workflows,
    • Workarounds they may have created,
    • And be able to communicate the value of using Kedro and Databricks together.

This work will feed into the Kedro backlog and a hackathon that will be planned with the Databricks team.

Who will we be speaking to?

  • Maria Olivia Lihn
  • Debanjan Banerjee
  • Diana Montanes
  • Roman Drapeko
  • Nishant Kumar-NKC
  • Saravanakumar Subramaniam
  • Benjamin Levy
  • Eduardo Coronado
  • Ingo Walz
  • Danny Farah
  • Poornima Ponthagani
  • Avaneesh Yembadi
  • Anil Chouldary
  • Logan Rupert
  • @marioFeynman
  • @Malaguth
  • @WolVez

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 13
  • Comments: 13 (7 by maintainers)

Top GitHub Comments

12 reactions
yetudada commented, Jul 11, 2022

Interview questions

Introduction

  1. Who are you, and what do you do at your company?
  2. Can you tell me about the last three pieces of work you have been involved in?
    • Have you used Databricks in your previous three projects?
    • Have you used Kedro in your previous three projects?
  3. Can you describe the last project you used Kedro on Databricks?
  4. What version of Kedro did you use?
  5. [Bonus] Would you be in a position to show us this project so that we can walk through it together?

I might verify if Q5 is possible before the interview.

Workflow on the last project that you used an IDE, Kedro and Databricks together

Databricks

  1. What is your workflow with an IDE on Databricks?
  2. Which parts of Databricks did you use on this project, and why?
  3. Did you use Databricks with any other cloud platform tooling, e.g. Azure or AWS, on this project? And if “yes”, what other parts did you use?

Use of Kedro on Databricks

  1. Why did you use Kedro and Databricks together on this project?
  2. What is your overall experience using Kedro on Databricks on this project?
    • Did you run into any errors or challenges using Kedro on Databricks for this project?
    • How did you solve these problems?
  3. What should we do to improve the Kedro/Databricks workflow, and why?
  4. If we improve this Kedro/Databricks workflow based on your recommendation, would you use Kedro on Databricks in the future?
  5. Have you tried to use Kedro-Viz on this project?

Workflow

  1. Can you describe what steps you took to set up your Kedro project on Databricks for this project?
  2. Can you describe the steps you took when you changed your code base?
  3. Can you describe what steps you took when you wanted to release a new version of your code?

Conclusion

  1. What other challenges have you encountered with the Kedro and Databricks workflow?
  2. Is there anything else you want to mention?
6 reactions
vitoravancini commented, Oct 25, 2022

Hello Kedro people, has this evolved somewhere/somehow? I’m very interested in this topic. At the company I work at, we’ve been using kedro-connect with fair success, but it seems Databricks won’t continue it. Has anyone tried dbx?

