Define pipelines in YAML files

First of all, thank you very much for the great framework. I haven’t worked in notebooks for weeks since I switched to Kedro. Now I can debug everything and still have interactive access to my data and models. You’ve changed the quality of my professional life, and I am deeply grateful for this.

Description

Now, my proposal. I think that, in addition to the existing Python-based API for pipeline definition, it would be nice to be able to describe pipelines in YAML files. Moreover, the more I think about this, the more I’m convinced that this should be the default way to define pipelines.

Context

I’ve found that in 95% of cases my pipeline.py files are “static”, meaning that I don’t need to define pipeline nodes and their IO dynamically. Also (my subjective opinion), most “dynamic” use cases can be handled by config templates, and we already have this feature.

By putting pipeline definitions into conf, we’ll make modular pipelines easier to adjust and configure, since any pipeline consumer will have a clear picture of the inputs and outputs used. This is extremely important because dataset naming is “global”. We can also mix nodes from different pipelines at the consumer level (see the possible implementation below).

Another reason to use YAML-first pipeline definitions is that it could be much easier to integrate pipeline consistency checks into modern IDEs, since we only need to parse YAML files (parameters, pipelines and catalog) and check that the corresponding paths to Python modules are correct.
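As a rough illustration of such a consistency check, here is a minimal sketch in Python. It assumes the pipeline definition has already been parsed into a dict with a `nodes` list whose entries carry `handler`, `inputs` and `outputs` keys (the shape proposed in this issue, not an existing Kedro format). The function names and the rule that every input must be either a catalog entry or another node’s output are hypothetical choices made for the example.

```python
import importlib
from typing import Any, Dict, List, Set


def handler_exists(dotted_path: str) -> bool:
    """Return True if `package.module.function` can be imported and is callable."""
    module_name, _, attr = dotted_path.rpartition(".")
    if not module_name:
        return False
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return callable(getattr(module, attr, None))


def as_names(spec: Any) -> List[str]:
    """Normalise a string, list, or mapping of dataset names to a flat list."""
    if spec is None:
        return []
    if isinstance(spec, str):
        return [spec]
    if isinstance(spec, dict):
        return list(spec.values())
    return list(spec)


def check_pipeline(pipeline: Dict[str, Any], catalog: Set[str]) -> List[str]:
    """Collect human-readable problems; an empty list means the checks passed."""
    problems = []
    # Datasets produced by any node are valid inputs for other nodes.
    produced = {
        name for n in pipeline["nodes"] for name in as_names(n.get("outputs"))
    }
    for n in pipeline["nodes"]:
        if not handler_exists(n["handler"]):
            problems.append(f"cannot import handler: {n['handler']}")
        for name in as_names(n.get("inputs")):
            if name not in catalog and name not in produced:
                problems.append(f"unknown input dataset: {name}")
    return problems


# Standard-library functions stand in for project node functions (assumption).
example = {
    "nodes": [
        {"handler": "os.path.basename", "inputs": "raw", "outputs": "clean"},
        {
            "handler": "os.path.no_such_function",
            "inputs": ["clean", "missing"],
            "outputs": "report",
        },
    ]
}
problems = check_pipeline(example, catalog={"raw"})
print(problems)
```

An IDE plugin or pre-commit hook could run a check of this kind over the parsed files without executing any pipeline code.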

Possible Implementation

We can use the same structure as in the current pipeline.py and hooks.py created by starters, and the same conventions as in catalog.yml.

Current API

Let’s say I have the following pipeline in src/my_package/pipelines/analysis/pipeline.py:

from kedro.pipeline import Pipeline, node

from .nodes import select_important_features, visualize

def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(
                select_important_features,
                inputs=dict(
                    dataset='05_model_input/inliers.parquet',
                    cluster_labels='07_model_output/cluster-labels.parquet',
                ),
                outputs='08_reporting/important-features.yml',
            ),
            node(
                visualize,
                inputs=dict(
                    dataset='05_model_input/inliers.parquet',
                    important_features='08_reporting/important-features.yml',
                ),
                outputs=dict(
                    charts='08_reporting/important-features-charts.html',
                    histograms='08_reporting/important-features-histograms.html',
                ),
                tags=['html_reports']
            ),
        ],
        tags=['cluster_analysis'],
    )

I also have to register it in src/my_package/hooks.py:

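from typing import Dict

from kedro.framework.hooks import hook_impl
from kedro.pipeline import Pipeline

from my_package.pipelines import analysis


class ProjectHooks:
    @hook_impl
    def register_pipelines(self) -> Dict[str, Pipeline]:
        # Build each modular pipeline and register it under a name.
        analysis_pipeline = analysis.create_pipeline()

        # ...

        return {
            'analysis': analysis_pipeline,
            # ...
        }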

Proposed YAML API

With a YAML API we could do the same in conf/base/pipelines/analysis/pipelines.yml:

analysis:
  nodes:
    # or even `- handler: select_important_features` since the pipeline is modular and we have agreement that node handlers live in `nodes.py`
    - handler: my_package.pipelines.analysis.nodes.select_important_features
      inputs:
        dataset: 05_model_input/inliers.parquet
        cluster_labels: 07_model_output/cluster-labels.parquet
      outputs: 08_reporting/important-features.yml
    - handler: my_package.pipelines.analysis.nodes.visualize
      inputs:
        dataset: 05_model_input/inliers.parquet
        important_features: 08_reporting/important-features.yml
      outputs:
        charts: 08_reporting/important-features-charts.html
        histograms: 08_reporting/important-features-histograms.html
      tags:
        - html_reports
  tags:
    - cluster_analysis

From this we can automatically infer the name under which the pipeline should be registered, so no annoying boilerplate in hooks.py is required. It will also be much easier to mix nodes from several modular pipelines. For example, in the presented use case we might have a global reports_publishing pipeline that can be used to publish HTML reports.
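To make the idea concrete, the sketch below shows how such a file could be turned into registered pipelines once parsed. It is only an illustration under stated assumptions: `NodeSpec`, `resolve_handler` and `build_pipelines` are hypothetical names, not Kedro APIs; the input is the dict a YAML parser would produce; and standard-library functions stand in for the project’s node functions so that the example runs on its own.

```python
import importlib
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Union

IO = Union[None, str, List[str], Dict[str, str]]


@dataclass
class NodeSpec:
    """Hypothetical stand-in for the arguments passed to kedro.pipeline.node."""

    func: Callable[..., Any]
    inputs: IO
    outputs: IO
    tags: List[str] = field(default_factory=list)


def resolve_handler(dotted_path: str) -> Callable[..., Any]:
    """Import `package.module.function` and return the function object."""
    module_name, _, attr = dotted_path.rpartition(".")
    return getattr(importlib.import_module(module_name), attr)


def build_pipelines(config: Dict[str, Any]) -> Dict[str, List[NodeSpec]]:
    """Map each top-level key (the pipeline name) to its list of node specs."""
    pipelines = {}
    for name, definition in config.items():
        shared_tags = definition.get("tags", [])
        pipelines[name] = [
            NodeSpec(
                func=resolve_handler(n["handler"]),
                inputs=n.get("inputs"),
                outputs=n.get("outputs"),
                # Pipeline-level tags are added to every node's own tags.
                tags=sorted(set(n.get("tags", [])) | set(shared_tags)),
            )
            for n in definition["nodes"]
        ]
    return pipelines


# Same shape as the YAML above after parsing; json.dumps and json.loads are
# placeholders for the real node functions (assumption for this sketch).
config = {
    "analysis": {
        "nodes": [
            {
                "handler": "json.dumps",
                "inputs": {"obj": "05_model_input/inliers.parquet"},
                "outputs": "08_reporting/important-features.yml",
            },
            {
                "handler": "json.loads",
                "inputs": {"s": "08_reporting/important-features.yml"},
                "outputs": "08_reporting/important-features-charts.html",
                "tags": ["html_reports"],
            },
        ],
        "tags": ["cluster_analysis"],
    }
}
registered = build_pipelines(config)
print(list(registered))
```

In a real project, each `NodeSpec` would presumably be passed to `node(spec.func, spec.inputs, spec.outputs, tags=spec.tags)` and the resulting list wrapped in `Pipeline(...)`, so that the returned dict could serve as the result of `register_pipelines`.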

We can also use internal YAML templating to create an alias for 05_model_input/inliers.parquet, which is the most frequent reason I introduce any “dynamics” into my pipeline definitions.
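The aliasing idea can be sketched in a few lines. The example below is an assumption about how it might work, not the mechanism the proposal or Kedro actually uses: it substitutes `${alias}` placeholders in an already parsed node definition with Python’s standard `string.Template`. The `render` helper and the `inliers` alias are hypothetical names.

```python
from string import Template
from typing import Any, Dict


def render(value: Any, aliases: Dict[str, str]) -> Any:
    """Recursively substitute ${alias} placeholders in strings, lists and dicts."""
    if isinstance(value, str):
        # substitute() raises KeyError for an undefined alias, which surfaces typos.
        return Template(value).substitute(aliases)
    if isinstance(value, list):
        return [render(v, aliases) for v in value]
    if isinstance(value, dict):
        return {k: render(v, aliases) for k, v in value.items()}
    return value


aliases = {"inliers": "05_model_input/inliers.parquet"}

node_definition = {
    "handler": "my_package.pipelines.analysis.nodes.visualize",
    "inputs": {
        "dataset": "${inliers}",
        "important_features": "08_reporting/important-features.yml",
    },
    "tags": ["html_reports"],
}

rendered = render(node_definition, aliases)
print(rendered["inputs"]["dataset"])
```

With this approach the long dataset name is written once and referenced by alias wherever a node needs it. A literal `$` in a dataset name would have to be written as `$$`.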

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

idanov commented, Feb 2, 2021 (3 reactions)

@Sitin Thank you for your kind words, we’re really happy Kedro changed your way of working for the better and made you happier as a result! Thank you for being part of the community as well by opening this issue.

We do not currently plan to add native support for YAML pipeline definitions to Kedro; we’d rather leave that to plugins for people who are interested. We are aware that some of our users (even internally at QuantumBlack) prefer to define their pipelines in YAML, but for now we favour Python as the pipeline definition language.

Just to give more context why, I will list some of our main considerations:

  • as you mentioned, pipelines are static and do not change after you deploy your code
    • catalog.yml, parameters.yml or logging.yml can and should be defined by the DevOps/MLOps person deploying the application, thus they are config
    • pipelines are only ever defined by the developer, thus they are code
  • each node in a pipeline references a Python function
    • defining it in Python helps developers leverage IDE support like syntax highlighting, autocompletion and go-to-definition functionality
  • when you package a project as a Python package, we need to include the pipeline itself too
    • your pipeline is an integral part of your application
    • it needs to be packaged together with the rest of your code
    • packaging is much easier with a Python file

Defining pipelines in Python has some drawbacks as well:

  • Python is too expressive to be a good pipeline definition language
    • not a declarative or markup language (unlike yaml, xml or json)
    • people are tempted to create pipelines dynamically which makes it hard to reason about their application
  • Python is more verbose than a specialised pipeline definition language or a declarative language
    • harder to read and understand what is happening from a first glance

To summarise, Python is too powerful as a pipeline definition language, but on the flip side it has excellent tooling support. YAML, on the other hand, is more concise and closer to being declarative, but its tooling support is lacking and, if used for pipelines, it can easily be mistaken for config.

Minyus commented, Dec 22, 2020 (1 reaction)

Hi @Sitin

Yes, I also hope Kedro will natively support a YAML interface for pipelines.

Meanwhile, I have prepared Kedro starter templates (based on the pandas-iris starter) that work with Kedro 0.17.0 at: https://github.com/Minyus/kedro-starters-sklearn

The YAML pipeline is at https://github.com/Minyus/kedro-starters-sklearn/blob/master/sklearn-mlflow-yamlholic-iris/{{ cookiecutter.repo_name }}/conf/base/parameters.yml#L34-L50

To use the YAML interface for the pipeline and run config, run:

kedro new --starter https://github.com/Minyus/kedro-starters-sklearn.git --directory sklearn-mlflow-yamlholic-iris

Hooks for MLflow tracking are included, but it should work as is even if MLflow is not installed.
