
Synthesis of user research when using configuration in Kedro


Summary

Configuration overhead is an issue that has arisen time and time again in user feedback, particularly as Kedro projects scale in complexity. From user interviews, the three most-used areas of configuration were kedro run, the Data Catalog and parameters; the remaining options were seen as "set up once and forgotten" for the remainder of the project. Overall, configuration in Kedro is well received, and users appreciate the approach Kedro has taken so far.

During this research, it became clear that configuration scaling hurts most in a small set of use cases that combine multiple environments (e.g. dev, staging and prod) with multiple use cases – for example, running the same or a similar pipeline across different products or countries. To gather deeper insights, participants were shown two existing options for the Data Catalog and two possible solutions: pattern matching and Jinja templating (with users favouring the former). Participants were also asked how they felt about moving the Data Catalog entirely to Python; they were universally against the idea, as it would fundamentally go against the principles of Kedro.

Table of Contents

  1. Introduction
  2. Background
  3. Research Approach
  4. User Interview Matrix
  5. Configuration Synthesis
  6. GitHub Analysis
  7. Data Catalog Generator
  8. Solution Criteria

1. Introduction

Configuration overhead is an issue that has arisen time and time again in user feedback, particularly as Kedro projects scale in complexity. It is also an issue for new users who have never been exposed to the concept before, e.g. data scientists applying software engineering principles for the first time. This research aims to understand the key pain points users face when using configuration, test possible solutions for the Data Catalog, and develop a set of criteria that any solution should meet.

2. Background

Kedro's approach to configuration is influenced by the 12 Factor App, but this results in a lot of duplicated configuration. Users have told us that YAML files can become unwieldy, with each entry written manually, making them error-prone. Users also want to apply runtime parameters and to parameterise runs in complex ways that Kedro doesn't currently support.

As a result, some teams have tried to solve this independently – most notably by using Jinja2 templating through the TemplatedConfigLoader – though this has not become widespread across other teams. However, as we continue to grow, it is likely that more users will encounter similar issues and will need a Kedro-native solution to support that growth.
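
To make the duplication concrete: in a vanilla catalog, every entry repeats shared values by hand, whereas the TemplatedConfigLoader substitutes ${} placeholders from a globals file at load time. A minimal sketch, with dataset names and paths invented for illustration:

    # catalog.yml (vanilla) - bucket and environment repeated in every entry
    cars_raw:
      type: pandas.CSVDataSet
      filepath: s3://my-bucket/dev/cars.csv

    planes_raw:
      type: pandas.CSVDataSet
      filepath: s3://my-bucket/dev/planes.csv

    # catalog.yml (templated) - assuming globals.yml defines
    # `base_path: s3://my-bucket` and `env: dev`
    cars_raw:
      type: pandas.CSVDataSet
      filepath: ${base_path}/${env}/cars.csv

    planes_raw:
      type: pandas.CSVDataSet
      filepath: ${base_path}/${env}/planes.csv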

Finally, this is not a problem unique to Kedro. Google SREs have faced a similar issue in the past and have outlined their thoughts and experiences here.

3. Research Approach

To develop a holistic overview of configuration in Kedro, a journalistic approach was used, looking to answer the following questions:

  • Who is using configuration in Kedro?
  • What are they configuring in Kedro?
  • When are they using configuration in Kedro?
  • Why are they using Kedro configuration?
  • Where are they configuring Kedro?
  • How are they configuring Kedro?

Note: There is some overlap in the last two questions.

Research Scope

To help keep things manageable, the primary focus of this research was the Data Catalog and how users interact with it; elements like parameters, credentials, etc. were therefore not explicitly user-tested. Nonetheless, pain points for other forms of configuration in Kedro were also captured and are discussed later. Custom solutions created by teams may be referenced but were not considered in the overall solution, as they are not Kedro-native features.

4. User Interview Matrix

In total, 19 interviews (lasting 1 hour each) across personas and experience levels were conducted to capture a spectrum of views. The user matrix breakdown is shown below.

                 Data Sci.   Data Eng.   Verticals   External   Total
  Beginner           2           0           1          0         3
  Intermediate       3           1           0          1         5
  Advanced           3           3           3          2        11
  Total              8           4           4          3        19

Note: External users were sourced from Kedro Discord

5. Configuration Synthesis

The synthesis below covers seven configuration areas: Kedro (project setup), kedro run, the Template Config Loader, Credentials, Config Environments, Parameters and the Data Catalog.

Technology

What technology is currently used to support this configuration?

  • Kedro: Python
  • kedro run: YAML
  • Template Config Loader: Python, Jinja, YAML
  • Credentials: YAML
  • Config Environments: YAML
  • Parameters: YAML
  • Data Catalog: YAML

Touchpoint

Where in the Kedro project can the user make this configuration?

  • Kedro: src/<project-package>/settings.py, pyproject.toml
  • kedro run: kedro run --config **.yml, export KEDRO_ENV=xyz
  • Template Config Loader: src/<project-name>/hooks.py
  • Credentials: conf/**/credentials.yml
  • Config Environments: conf/base/**.yml, conf/local/**.yml, conf/**/**.yml, export KEDRO_ENV=**
  • Parameters: conf/**/parameters.yml, kedro run --params param_key1:value1,param_key2:2.0, kedro run --config **.yml
  • Data Catalog: conf/**/catalog.yml

Ownership

Who is the lead user responsible for this configuration?

  • Kedro: DE (50%) – TD (50%)
  • kedro run: DE (50%) – DS (50%)
  • Template Config Loader: DE (80%) – DS (20%)
  • Credentials: DE (80%) – DS (20%)
  • Config Environments: DS (100%)
  • Parameters: DE (20%) – DS (80%)
  • Data Catalog: DE (50%) – DS (50%)

User Sentiment

How does the user feel about this approach?

  • Kedro: 😀
  • kedro run: 😐
  • Template Config Loader: 🙂 - 😐
  • Credentials: 😀
  • Config Environments: 🙂 - 😐
  • Parameters: 😀
  • Data Catalog: 😀

Benefits

What do users like about this approach? Across the seven areas, users highlighted:

  • It's open source
  • Standard project structure
  • Easy to collaborate with others
  • Provides great defaults out of the box
  • Easy to ramp up a Kedro project
  • Single point of entry to run code
  • Can use the --pipeline flag to run specific branches of code
  • Can git commit a config.yml file to reduce run errors
  • Easy to set up
  • Overall, one of the easiest things to work with
  • Enables automation and scaling of Kedro
  • Easy to collaborate with others
  • A properly written hook can save lots of time
  • Enforces best practices around managing credentials
  • Works as it should and is seamless
  • Can handle a variety of credentials out of the box
  • Each person can have their own setup to access data
  • Fairly simple to use
  • Enables a structured approach to dev/qa/prod
  • globals.yml can be different for each environment
  • Decouples code and config
  • Helps teams test and prototype in environments in a risk-free way
  • Creates a structured way of working
  • Easy and straightforward to use
  • Easy to read and maintain
  • Like the "params:" prefix to quickly identify parameters in code
  • Viewed as the best feature of Kedro
  • Declarative syntax makes it easy to use, read and debug
  • Simplification of I/O
  • Decouples code and I/O
  • Already has many data connectors built in
  • Transcoding datasets

Pain Points

What are the pain points of this configuration? Across the seven areas, users reported:

  • Breaking changes between 0.16 and 0.17
  • Running into issues with kedro install on Windows
  • Changes to hooks and pipeline registry between versions
  • Can be difficult to run a single node
  • Arguments in the terminal are not version controlled
  • --nodes on the CLI is node_names in the yml file (see the sketch after this list)
  • Depending on what you are using it for, can mix code and config to an extent and lose traceability
  • You need some knowledge to set up – not easy for beginners
  • Can reduce transparency of code
  • Users might have the idea, but they don't always find it easy to implement
  • Jinja was not well received by clients
  • Cannot inject credentials at runtime
  • For beginners, it can be a little hard to grasp why credentials are separated from the Data Catalog or code
  • Feels misaligned with CI/CD tooling
  • Can be easily abused by teams for other purposes
  • The inheritance pattern of local / custom / base can be hard for new users to pick up
  • Only top-level keys are supported
  • Parameters do not inherit base keys, so you need to overwrite the entire entry
  • Repetition and duplication of files
  • Can grow into large files, leading to a very nested dictionary
  • Cannot have ranges or step increments
  • Little IDE support means you need to follow the logic yourself
  • Repetition of entries
  • Duplication of files
  • Minor changes to entries need to be applied everywhere – can be difficult to sync
  • Not easy to write a custom class for unsupported datasets
  • For some teams, YAML anchors are beyond their skillset
  • Very long catalog files

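On the node_names mismatch flagged above: kedro run --config expects a YAML file whose keys mirror the CLI options but do not always share their names. A minimal sketch of such a file, assuming the Kedro 0.17-era format, with pipeline and node names invented for illustration:

    # config.yml - passed as `kedro run --config config.yml`
    run:
      pipeline: data_science   # same name as --pipeline
      env: staging             # same name as --env
      node_names: train_model  # but --nodes becomes node_names
      tags: quick              # and --tag becomes tags
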
Feature Requests

What new features are users requesting to support their work? Across the seven areas, users requested:

  • Include CI/CD defaults out of the box
  • More documentation for migrations with breaking changes
  • Common hooks templated by default
  • Hooks for when a model starts and ends
  • Nested dependencies in globals.yml
  • An easy way to sync these with environment variables
  • Flexible inheritance across environments
  • A greater understanding of where config ends and environments begin
  • Provision to separate use cases and environments
  • A parameter.load similar to the catalog
  • Namespaces for parameters
  • More dynamic entries, e.g. ranges
  • A default YAML that includes a link to the docs clearly showing which datasets are supported
  • Address the repetition and duplication of catalogs
  • More guidance on picking the best datatype for an entry
  • Support for more upcoming datasets, e.g. TensorFlow

Overall, configuration in Kedro is well received and liked by users. No configuration area drew a particularly negative response, and users largely understood and appreciated the approach Kedro has taken so far. During this exercise, it became clear that configuration scaling impacts a small set of use cases, summarised in the matrix below.

                       Single Environment                     Multiple Environments
                       Single Country   Multiple Countries    Single Country   Multiple Countries
  Single Use Case
  Multiple Use Cases

This would indicate that large configuration files are mostly seen internally, often on large analytics projects. This stems from Kedro not supporting multiple use cases in a monorepo, forcing users to adopt Config Environments as a stop-gap solution – which in turn prevents teams from using environments for their intended purpose of separating development environments.
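
For context on the stop-gap: Config Environments work by letting an entry in conf/<env>/ override the entry with the same top-level key in conf/base/. A minimal sketch, with entries invented for illustration:

    # conf/base/catalog.yml - shared defaults for every environment
    cars:
      type: pandas.CSVDataSet
      filepath: data/01_raw/cars.csv

    # conf/prod/catalog.yml - selected with `kedro run --env prod` or
    # `export KEDRO_ENV=prod`; its `cars` key replaces the base entry
    cars:
      type: pandas.CSVDataSet
      filepath: s3://prod-bucket/cars.csv

Once a team dedicates environments to use cases instead (e.g. conf/germany/, conf/france/), the same mechanism is no longer available for separating dev, staging and prod.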

6. GitHub Analysis

To support the qualitative insights from user research, a custom GitHub query was created to gather quantitative data on the Data Catalog.

At the time of running (18 Aug 2021), this returned 411 results, of which 138 were real Kedro Data Catalog files; empty Data Catalogs, spaceflights or iris examples, and non-Kedro projects were manually filtered out. This query assumes that these files are representative of open-source users and that Data Catalogs follow the /conf/ folder structure. Furthermore, it is impossible to determine whether these are the finished files of completed projects or files still under development.

From this, it was found that only 9% of users were using YAML anchors and only 2% were using globals.yml. However, 89% of users were using some type of namespacing in their catalog entries. The number of Data Catalog entries per file was also counted; from the histogram below, Data Catalog entries peak around 10 per file.

[Figure: GH Catalog Search – histogram of Data Catalog entries per file]
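
The issue does not include the query or analysis code, but counts like these are straightforward to reproduce. A minimal Python sketch, assuming the matched catalog files have already been downloaded locally; the namespace heuristic is an invented approximation, not the original method:

    # analyse_catalogs.py - illustrative sketch, not the original analysis
    import re
    from pathlib import Path

    import yaml  # PyYAML; Jinja-templated catalogs would need rendering first

    LAYER_PREFIXES = {"raw", "int", "prm", "fea", "mod"}  # assumed convention

    def analyse(path: Path) -> dict:
        text = path.read_text(encoding="utf-8")
        entries = yaml.safe_load(text) or {}  # top-level keys are the entries
        return {
            "entries": len(entries),
            "uses_anchors": bool(re.search(r"&\w+", text)),  # `&name` declares an anchor
            "uses_namespacing": any(
                "." in key or key.split("_")[0] in LAYER_PREFIXES for key in entries
            ),
        }

    for catalog in Path("catalogs").rglob("catalog*.yml"):
        print(catalog, analyse(catalog))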

7. Data Catalog Generator

To better understand what users need from the Data Catalog, participants were presented with prototype code covering two existing options (vanilla Kedro and YAML anchors) and two possible solutions: pattern matching and Jinja templating (with users favouring the former). Users were also asked how they felt about moving the Data Catalog entirely to Python; participants were universally against the idea, as it would fundamentally go against the principles of Kedro.

Positives

  Vanilla Kedro:
  • Viewed as the best feature of Kedro
  • Declarative syntax makes it easy to use, read and debug
  • Simplification of I/O
  • Decouples code and I/O
  • Already has many data connectors built in
  • Transcoding datasets

  YAML Anchors:
  • Reduces the level of repetition in a file
  • Still easy to read and debug
  • Built-in YAML feature, so used in other tools that use YAML

  Pattern Matching:
  • Fairly easy to understand compared to Jinja and YAML anchors
  • Still somewhat declarative
  • Drastically reduces the number of lines
  • Viewed as beginner friendly
  • Takes away the additional step of having to declare new files in the Data Catalog

  Jinja Templating:
  • Can see the actual entries through the syntax
  • Somewhat established in the Python world, so users may have already used it elsewhere
  • Reduces the number of lines, but not as much as pattern matching

  Python:
  • Greater control between memory and file datasets
  • Access to StackOverflow to help debug issues

Negatives

  Vanilla Kedro:
  • Repetition of entries
  • Duplication of files
  • Minor changes to entries need to be applied everywhere – can be difficult to sync
  • Not easy to write a custom class for unsupported datasets
  • For some teams, YAML anchors are beyond their skillset
  • Very long Data Catalog files

  YAML Anchors:
  • Users were using it without knowing they were using it
  • Getting accustomed to the notation can take a while to learn and fully understand
  • Sub-keys are declared elsewhere, which impacts readability
  • Masks the true number of datasets
  • Concern about the order of operations

  Pattern Matching:
  • Doesn't work for raw datasets
  • Breaks when the files have different schema definitions in the Data Catalog entries
  • Concern about unintended consequences
  • Doesn't solve the file duplication problem
  • The same naming structure doesn't mean files have the same structure
  • Multiple points of failure, which also makes it difficult to debug

  Jinja Templating:
  • Doesn't work for raw datasets
  • User experience suggests beginners struggle to use and understand it – some teams have even removed it completely from their work
  • Can overcomplicate the Data Catalog with logic
  • Breaks when the files have different schema definitions in the Data Catalog entries
  • Doesn't solve the file duplication problem
  • Bigger learning curve compared to previous options
  • Whitespace control can be difficult to manage

  Python:
  • Users were universally very against the idea of moving the Data Catalog to Python
  • Mixes code and I/O, which goes against Kedro principles
  • Considered very unfriendly, especially for non-technical users
  • Huge concerns about giving too much freedom to users who might abuse this flexibility
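
To ground the comparison, the fragments below sketch the same family of entries in each style. The YAML-anchor and Jinja forms reflect real mechanisms; the pattern-matching syntax is only a guess at the prototype participants saw, since no such feature existed in Kedro at the time, and all dataset names are invented:

    # 1) YAML anchors: declare defaults once (&), merge them in (<<: *)
    _csv: &csv
      type: pandas.CSVDataSet

    cars_raw:
      <<: *csv
      filepath: data/01_raw/cars.csv

    planes_raw:
      <<: *csv
      filepath: data/01_raw/planes.csv

    # 2) Pattern matching (hypothetical prototype syntax): one rule matches
    # any dataset name ending in `_raw`, so concrete entries disappear
    "{name}_raw":
      type: pandas.CSVDataSet
      filepath: data/01_raw/{name}.csv

    # 3) Jinja templating: a loop renders one entry per name before loading,
    # so the entries stay visible but logic now lives in the file
    {% for name in ["cars", "planes"] %}
    {{ name }}_raw:
      type: pandas.CSVDataSet
      filepath: data/01_raw/{{ name }}.csv
    {% endfor %}

These fragments make the trade-offs above concrete: anchors shorten entries but declare sub-keys elsewhere, the pattern masks the true number of datasets, and Jinja keeps entries visible at the cost of logic and whitespace handling in the file.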

8. Solution Criteria

While it was important to test the ideas, it was even more important to understand the criteria a successful solution must meet to improve the experience of using the Data Catalog. Users identified the following seven components:

  1. Readability
  2. Declarative Syntax
  3. Beginner Friendly
  4. Client Friendly
  5. Reduce Repetition
  6. Reduce Duplication
  7. Backwards Compatibility


Top GitHub Comments

Isy89 commented, Sep 15, 2021 (3 reactions)

Interesting analysis! I personally like the Jinja2 template system. The only problem I saw with it regards readability. I generally have to run the same pipeline with different inputs and store results in different locations, so data catalogs and config files became an important tool for me to keep track of which data was analyzed and where it was stored. This is a problem if the only thing I have is the template, because this information is lost. Furthermore, an explicit catalog like the vanilla one (or the one using YAML anchors) becomes really important for reproducibility, as in the case where I have to regenerate the same results again. In my case, I use the template system on one side to make the process of using different inputs and outputs easier, and on the other side I store the generated catalog and the config files together with the data to ensure readability and reproducibility. It would also be nice if there were a way to programmatically run the pipeline by passing the dictionaries of variables to the TemplatedConfigLoader through the CLI, instead of having to manually set them in the hooks or having to override the run command and subclass the KedroSession to achieve it.
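
For readers unfamiliar with the manual wiring the comment describes, here is a minimal sketch (Kedro 0.17-era API; the environment-variable fallback and variable names are invented, not an official CLI mechanism) of setting TemplatedConfigLoader variables in hooks.py:

    # hooks.py - illustrative sketch of injecting template variables
    import os

    from kedro.config import TemplatedConfigLoader
    from kedro.framework.hooks import hook_impl

    class ProjectHooks:
        @hook_impl
        def register_config_loader(self, conf_paths):
            return TemplatedConfigLoader(
                conf_paths,
                globals_pattern="*globals.yml",  # project-wide defaults
                globals_dict={
                    # hypothetical workaround: values set from the shell,
                    # since the CLI cannot pass them directly (the gap the
                    # comment above is asking Kedro to close)
                    "input_path": os.environ.get("INPUT_PATH", "data/01_raw"),
                    "output_path": os.environ.get("OUTPUT_PATH", "data/07_model_output"),
                },
            )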

deepyaman commented, Sep 14, 2021 (3 reactions)

Not a fan of pattern matching; will copy my comments to @hamzaoza here:

  • Too much “magic”: With pattern matching, your catalog is defined by the pipeline catalog entries; you can’t just look at the catalog and know what the different catalog entries are.
  • Custom DSL bad: Jinja is standard, and familiar to people who’ve done stuff like web dev. Even if most Kedro users don’t come from that background, a custom DSL isn’t necessarily better in that regard. Jinja is new to unfamiliar users, just like a DSL would be, but at least you can find plenty of other resources on Jinja.
  • Pattern matching in other languages isn’t a good analog: To be perfectly honest, I wasn’t really familiar with pattern matching in Scala or Haskell. Haskell’s looks more similar to what I saw in the demo for Kedro. However, for both of these, pattern matching works more like an else statement, which is not really how I think about the default state of my catalog entry.

Sidebar: Just give me autoformatting with prettier on YAML templated using Jinja, and I’m happy. That’s my main gripe with templating with Jinja (that prettier no longer works).

