How do you feel about the Kedro project template?


Introduction

The joy of open-sourcing Kedro is that we’ve been exposed to diverse opinions and use cases that we didn’t think Kedro covered. We’re going to proactively ask for feedback, and you will see a lot more “How do you feel about …” GitHub issues raised by the Kedro maintainers as we try to capture your thoughts on specific topics.

First in the series was a question around introducing telemetry into Kedro and this one is about the project template.

Context

The Kedro project template is derived from CookieCutter Data Science. Some of our open-source users immediately picked up on this relationship and have said that Kedro is a version of CookieCutter Data Science that thought about a pipeline framework, data abstraction, and versioning.

The project template is core to us being able to help you create reusable analytics code, according to the CookieCutter Data Science philosophy, but we’ve had feedback that the template is considered overwhelming for new users because they’re not sure why we create so many directories. We’ve also observed users not using all of the template, or even removing generated folders in their templates.

Examples of directory removal are present here:

Possible Implementation

We’re considering removing non-essential folders and creating the remaining optional directories only when certain actions are taken.

We’re proposing the following categorization:

  1. core directories are essential for Kedro
  2. nice to have directories are linked to functionality that extends Kedro
  3. non-essential directories can be removed and do not extend functionality in Kedro

| Folder | Description | Category | Proposed Action |
|---|---|---|---|
| conf | The conf directory is the place where all your project configuration is located. Using conf encourages a clear and strict separation between project code and configuration. | Core | Keep |
| data | A place to store local project data according to a suggested Data Engineering Convention. For production workloads we do not recommend storing data locally, but rather utilizing cloud storage (AWS S3, Azure Blob Storage), distributed file storage or database interfaces through Kedro’s Data Catalog. | Nice to have | Keep but remove sub-directories that indicate Data Engineering Convention |
| docs | docs is where your auto-generated project documentation is saved | Nice to have | Create this directory when kedro build-docs is run |
| logs | A directory for your Kedro pipeline execution logs | Nice to have | Create this directory when kedro run is run |
| notebooks | Kedro supports a Jupyter workflow that allows you to experiment and iterate quickly on your models. notebooks is the folder where you can store your Jupyter Notebooks | Nice to have | Keep |
| references | Auxiliary folders for project references and standalone results like model artifacts, plots, papers, and statistics | Non-essential | Remove |
| results | Auxiliary folders for project references and standalone results like model artifacts, plots, papers, and statistics | Non-essential | Remove |
| src | Source directory that contains all your pipeline code | Core | Keep |

This would create the following template when you run kedro new:

conf/
data/
notebooks/
src/
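
To make the "create on demand" idea concrete, here is a minimal sketch of how the four kept directories and the two lazily created ones could be handled. It is only an illustration of the proposal, not Kedro's implementation: the function names `create_project_skeleton` and `ensure_dir_for_command`, and the use of command names as dictionary keys, are assumptions made for this example.

```python
from pathlib import Path
from typing import Dict, List, Optional

# Directories the proposal keeps in the initial template.
INITIAL_DIRS: List[str] = ["conf", "data", "notebooks", "src"]

# Directories the proposal creates lazily, keyed by the command that needs them.
# The keys are illustrative labels, not a real Kedro lookup table.
ON_DEMAND_DIRS: Dict[str, str] = {
    "build-docs": "docs",
    "run": "logs",
}


def create_project_skeleton(project_root: Path) -> None:
    """Create only the directories that the slimmed-down template keeps."""
    for name in INITIAL_DIRS:
        (project_root / name).mkdir(parents=True, exist_ok=True)


def ensure_dir_for_command(project_root: Path, command: str) -> Optional[Path]:
    """Create the directory a command needs, if it needs one.

    Returns the directory path, or None when the command has no
    on-demand directory associated with it.
    """
    name = ON_DEMAND_DIRS.get(command)
    if name is None:
        return None
    path = project_root / name
    path.mkdir(parents=True, exist_ok=True)  # safe to call on every invocation
    return path
```

With this shape, a fresh project contains only `conf`, `data`, `notebooks` and `src`, and `logs` or `docs` appear the first time the corresponding command is invoked.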

Questions for you

Note: These are “yes” or “no” questions, but we would like each answer to come with the reasoning behind it.

We need your help in answering the following:

  • Are our assumptions around priority for directories correct?
  • Do you agree with the proposed actions? Yes, no and why?
  • Do you think that this change would help make Kedro less intimidating for new users?
  • Do you have any other thoughts we should consider for the project template?

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 22 (16 by maintainers)

Top GitHub Comments

Minyus commented, Jul 4, 2020 (5 reactions)

As discussed in https://github.com/quantumblacklabs/kedro/issues/397 and in the Zoom meeting with Lais and @921kiyo yesterday, I prepared a simplified and enhanced Kedro project template suitable for both beginners and experts at:

https://github.com/Minyus/kedro_template

The major change is that the src directory is restructured into 4 folders (pipelines, nodes, hooks, and catalogs), with main.py at the top level, like this:

├── conf
├── data
├── kedro_cli.py
├── logs
├── main.py
└── src
    ├── __init__.py
    ├── catalogs
    │   ├── __init__.py
    │   └── catalog.py
    ├── hooks
    │   ├── __init__.py
    │   └── add_catalog_dict.py
    ├── nodes
    │   ├── __init__.py
    │   └── my_module.py
    └── pipelines
        ├── __init__.py
        └── pipeline.py

I would suggest this Kedro project template because users can:

  • run the project by either:
    • python main.py (or, for example, /opt/conda/bin/python main.py or /usr/bin/python main.py to use a non-default Python environment):
      • This can allow users to use debugging features of IDEs (VS Code, PyCharm, etc.)
    • kedro run
  • declare datasets in either:
    • catalog.py
    • catalog.yml
  • add hooks easily
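
As a rough illustration of declaring datasets in either file, the sketch below shows catalog entries written as a Python dictionary in the same shape a catalog.yml entry would have, plus a small helper that combines both sources. The entry names, the file paths and the `merge_catalogs` helper are hypothetical; how the linked template actually combines catalog.py and catalog.yml is not described in this comment.

```python
from typing import Any, Dict

CatalogConfig = Dict[str, Dict[str, Any]]

# Hypothetical contents of catalog.py: each entry mirrors the keys that the
# same dataset would have in catalog.yml (a type and its arguments).
CATALOG_FROM_PY: CatalogConfig = {
    "example_table": {
        "type": "pandas.CSVDataSet",
        "filepath": "data/example_table.csv",
    },
}

# Stand-in for entries that would have been parsed from catalog.yml.
CATALOG_FROM_YML: CatalogConfig = {
    "example_model": {
        "type": "pickle.PickleDataSet",
        "filepath": "data/example_model.pkl",
    },
}


def merge_catalogs(from_yml: CatalogConfig, from_py: CatalogConfig) -> CatalogConfig:
    """Combine both sources, refusing to silently override a dataset."""
    duplicates = sorted(set(from_yml) & set(from_py))
    if duplicates:
        raise ValueError(f"Datasets declared in both places: {duplicates}")
    return {**from_yml, **from_py}
```

Rejecting duplicates is one possible design choice; a template could equally decide that one source takes precedence over the other.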

WaylonWalker commented, Feb 24, 2020 (4 reactions)

> And how to create starters as suggested by @WaylonWalker, we really enjoyed the idea of being able to do something like kedro new <internal-template-url>

Cookiecutter is typically run as cookiecutter <url-to-git-repo>, so these templates could simply be separate git repos. To offer a set of officially suggested ones, you could embed aliases inside Kedro, so that running kedro new simple would call cookiecutter https://github.com/quantumblack/simple

I think that it would be really cool to have various projects like the spaceflights tutorial already completed as examples that could easily be accessed with kedro new spaceflights. Then folks can use that as a template, or as a place to start trying out what a kedro pipeline feels like when you are interacting with it.
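
A minimal sketch of the alias idea, assuming a hard-coded lookup table: a known alias resolves to its repository URL, and anything else is passed through unchanged so that a full URL still works. The names `TEMPLATE_ALIASES`, `resolve_template` and `build_cookiecutter_command` are made up for this example, and the only alias listed is the one mentioned above; this is not how Kedro implements kedro new.

```python
from typing import Dict, List, Optional

# Hypothetical alias table. Only the "simple" mapping comes from the comment
# above; a real table would be maintained by the Kedro project.
TEMPLATE_ALIASES: Dict[str, str] = {
    "simple": "https://github.com/quantumblack/simple",
}


def resolve_template(name_or_url: str, aliases: Optional[Dict[str, str]] = None) -> str:
    """Map a known alias to its repo URL; pass anything else through as-is."""
    table = TEMPLATE_ALIASES if aliases is None else aliases
    return table.get(name_or_url, name_or_url)


def build_cookiecutter_command(name_or_url: str) -> List[str]:
    """Build the argument list for the `cookiecutter <url-to-git-repo>` call."""
    return ["cookiecutter", resolve_template(name_or_url)]
```

The returned list could be handed to `subprocess.run`, so `kedro new simple` and `kedro new <url-to-git-repo>` would share one code path.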

> A way for you to call src by your project name e.g. <proj_name>/<proj_name>. This is a known issue with the way Cookiecutter Data Science creates templates, see here. If you have any ideas then please brainstorm this with us.

There are quite a number of other cookiecutter templates out there. If memory serves me right, many of them use the <proj_name>/<proj_name> format. The first one I pulled up was cookiecutter-flask, and it used that format.
