
[Feature Request] Support for Hydra in Kedro


Description

Hydra is a framework for elegantly configuring complex applications. It builds a hierarchical configuration by splitting it across multiple YAML files, which makes the configuration easier to organise. Project description: https://github.com/facebookresearch/hydra

When trying to use Hydra by applying the hydra.main() decorator to register_pipelines(), an error occurs.

Context

Having Kedro and Hydra working together would make it easier to maintain complex pipelines.

Reproducing the issue

  • python version: 3.8.12
  • kedro version: 0.17.7
  • hydra version: 1.1.1

The bug appears when applying the hydra.main() decorator to register_pipelines(). This decorator is used to build an OmegaConf config from the /conf directory. Steps to reproduce:

  • Set up the iris_dataset toy project
  • Add the files required by Hydra in the conf folder (config.yaml and base/master.yaml):
    • conf/config.yaml:
      defaults:
        - base: master
      
    • conf/base/master.yaml:
       defaults:
         - ./catalog
         - ./logging
         - ./parameters
      
  • Rename the existing config files' extension (yml -> yaml)
  • Add the hydra.main decorator in src/[package_name]/pipeline_registry.py:

from typing import Dict

import hydra
from omegaconf import DictConfig

from kedro.pipeline import Pipeline

# Project-local pipeline modules (as laid out by the iris starter)
from get_started.pipelines import data_engineering as de
from get_started.pipelines import data_science as ds


@hydra.main(config_path="../../conf", config_name="config")
def register_pipelines(cfg: DictConfig) -> Dict[str, Pipeline]:
    """Register the project's pipelines.

    Returns:
        A mapping from a pipeline name to a ``Pipeline`` object.

    """
    data_engineering_pipeline = de.create_pipeline()
    data_science_pipeline = ds.create_pipeline()

    return {
        "de": data_engineering_pipeline,
        "ds": data_science_pipeline,
        "__default__": data_engineering_pipeline + data_science_pipeline,
    }

This will result in the following error:

Primary config module 'get_started.conf' not found.
Check that it's correct and contains an __init__.py file

Note: get_started is the name of the package in src/.

Cause of the issue

After some digging, it appears that the configuration path resolved by hydra.main does not exist. The following info was obtained by running in debug mode with a breakpoint on the first line of ConfigLoaderImpl.ensure_main_config_source_available(), in hydra/_internal/config_loader_impl.py.

  • When the bug appears, calling self.get_sources() inside ConfigLoaderImpl.ensure_main_config_source_available returns: [provider=hydra, path=pkg://hydra.conf, provider=main, path=pkg://conf, provider=schema, path=structured://]
  • It should actually be: [provider=hydra, path=pkg://hydra.conf, provider=main, path=file:///PATH_TO_PROJECT//conf, provider=schema, path=structured://]. It appears that Hydra doesn't know how to resolve file:///PATH_TO_PROJECT, and replaces it with pkg:// (see the sketch below).
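The same diagnosis can be reproduced without a debugger by asking Hydra which config sources it resolved. A minimal sketch, assuming Hydra 1.1 and relying on the GlobalHydra singleton (an internal API, so potentially unstable across versions):

from hydra import initialize
from hydra.core.global_hydra import GlobalHydra

# Inspect the sources Hydra resolved for the relative config_path. When
# resolution works, the "main" provider is a file:// source; the bug
# report shows pkg://conf instead under `kedro run`.
with initialize(config_path="../../conf"):
    print(GlobalHydra.instance().config_loader().get_sources())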

Possible Implementation

I'm not really sure how to solve this, or which of the two libraries should be adapted to correct the bug, so I wrote a similar post on Hydra's issue tracker. Hydra expects the script to be launched manually from the terminal; I don't know what happens when executing kedro run, but I guess it comes from somewhere here.

Possible Alternatives

Right now I'm using a workaround: generating the config via initialize() and compose():

from typing import Dict

from hydra import compose, initialize

from kedro.pipeline import Pipeline

# Project-local pipeline modules (as laid out by the iris starter)
from get_started.pipelines import data_engineering as de
from get_started.pipelines import data_science as ds


def register_pipelines() -> Dict[str, Pipeline]:
    """Register the project's pipelines.

    Returns:
        A mapping from a pipeline name to a ``Pipeline`` object.

    """
    # Build the config manually with the compose API instead of @hydra.main
    initialize(config_path="../../conf")
    cfg = compose(config_name="config")

    data_engineering_pipeline = de.create_pipeline()
    data_science_pipeline = ds.create_pipeline()

    return {
        "de": data_engineering_pipeline,
        "ds": data_science_pipeline,
        "__default__": data_engineering_pipeline + data_science_pipeline,
    }
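One caveat with this workaround: initialize() sets global state and raises if Hydra is already initialized, so the context-manager form is safer for repeated calls. And since the root cause above is the relative-path resolution, initialize_config_dir(), which takes an absolute directory, should sidestep it entirely. A hedged sketch (the path arithmetic assumes pipeline_registry.py sits two levels below the project root):

from pathlib import Path

from hydra import compose, initialize_config_dir

# Absolute path to the project's conf/ directory (assumption: this file
# lives at src/<package>/pipeline_registry.py)
CONF_DIR = str(Path(__file__).resolve().parents[2] / "conf")

with initialize_config_dir(config_dir=CONF_DIR):
    cfg = compose(config_name="config")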

Follow-up question

This also raises the question of how to do config overrides from the command line, a Hydra feature that is available when the user invokes the script directly. I guess it would be possible via the --config argument of kedro run, but I haven't tested it yet.
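As a partial answer, the compose API already accepts an overrides list using the same syntax as Hydra's CLI, so overrides could at least be forwarded manually. A small sketch (the override keys below are made up for illustration):

from hydra import compose, initialize

with initialize(config_path="../../conf"):
    # "base=master" selects a config group option; dotted keys override
    # individual values (both hypothetical here)
    cfg = compose(
        config_name="config",
        overrides=["base=master", "parameters.example_test_data_ratio=0.3"],
    )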

tl;dr: hydra.main() is called in an unusual way, making it impossible for Hydra to find the config folder.

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Reactions: 4
  • Comments: 11 (5 by maintainers)

Top GitHub Comments

3 reactions
neltacigreb commented, Apr 29, 2022

Hi @noklam, sorry for the delay.

hydra is heavily based on its CLI and the compose API only has a subset of features

This is true: using Hydra via the CLI allows overriding parameters at runtime, or launching it in multirun mode (one command fires all configs). The CLI mode also creates a new output folder for each run, which proves useful in multirun. The compose API can only be used to create the config from the YAML files. In both cases, the config is accessible directly in the code, for example from the register_pipelines() function. Since hydra does not have direct access to the CLI here, I made some adapters (https://github.com/neltacigreb/kedroXhydra) to be able to test the two packages together.
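A minimal sketch of what such an adapter could look like (the decorator name is hypothetical; it composes the config and hands it to the wrapped function, so Kedro can still call register_pipelines() with no arguments):

import functools

from hydra import compose, initialize


def with_hydra_config(config_path: str, config_name: str):
    """Hypothetical adapter: builds the config via the compose API and
    passes it to the wrapped function as its first argument."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            with initialize(config_path=config_path):
                cfg = compose(config_name=config_name)
            return func(cfg, *args, **kwargs)
        return wrapper
    return decorator

One wrinkle to note: initialize() resolves relative paths against the calling file, which here would be the adapter module rather than pipeline_registry.py, so a real adapter would likely need initialize_config_dir() with an absolute path instead.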

Hydra probably has a different way of looking at how configurations should be structured, and this could be related to https://github.com/kedro-org/kedro/issues/891.

Correct me if needed, but I feel the main difference is that kedro aims for simplicity in the config directory, while hydra encourages more complex config folder structures in order to make use of the override mechanism. They're similar on some subjects too (multirun, dynamic pipelines, overrides), some of which are already provided in the kedro config.

As I continue using the two packages, I'll focus on a few features that could be a match, in my opinion:

  • multirun/sweep with optuna
    • run multiple pipelines in parallel and search for optimal parameters
  • function/class instantiation from the config (see the sketch after this list)
    • Using any kind of dataloader as provided
    • Building pipelines from the config
  • Access to the config in the register_pipelines function:
    • Dynamic building of pipelines
  • Config overrides from the CLI:
    • mixing a general project config with run-specific configs
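For the instantiation point, Hydra builds objects from a _target_ key via hydra.utils.instantiate. A hedged sketch of creating a Kedro dataset that way (the dataset class path and filepath are illustrative):

from hydra.utils import instantiate
from omegaconf import OmegaConf

# A config fragment shaped the way instantiate() expects, with a _target_
# key naming the class to construct (values illustrative)
cfg = OmegaConf.create(
    {
        "iris": {
            "_target_": "kedro.extras.datasets.pandas.CSVDataSet",
            "filepath": "data/01_raw/iris.csv",
        }
    }
)

dataset = instantiate(cfg.iris)  # a ready-to-use CSVDataSet instance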

When I find some time I'll package my findings in a plugin 😃 Until then, if you think of some features that could be used in kedro, I'd be glad to try them as well.

1 reaction
neltacigreb commented, Apr 29, 2022

For the compose API, that's exactly it. In the repo I mentioned, there are 2 Hydra decorator adapters:

  • One uses only the compose API and makes the config available in register_pipelines()
  • The other wraps the hydra.main decorator and allows config overrides from the CLI, but I have never tested the multirun part, and it will certainly not work due to the way kedro handles the running of pipelines.

My plan to make multirun usable is to generate many kedro pipelines with different configurations, namespace them, and assemble them into one big final pipeline (see the sketch below).
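A rough sketch of that assembly idea, using Kedro's modular-pipeline helper and its namespace argument (the helper lives in kedro.pipeline; the function below and the run count are assumptions):

from kedro.pipeline import Pipeline, pipeline


def assemble_multirun(base: Pipeline, n_runs: int) -> Pipeline:
    # One namespaced copy of the base pipeline per configuration, summed
    # into a single pipeline so Kedro executes all runs together
    combined = Pipeline([])
    for i in range(n_runs):
        combined += pipeline(base, namespace=f"run_{i}")
    return combined

In practice, inputs shared between the runs would need to be mapped back to their un-namespaced catalog entries via the helper's inputs argument.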

I didn't know about the multirun hooks; I'll look into that first to see if it fits my app.
