Design auto-registration of pipelines

Following a discussion in backlog grooming, the idea of auto-registering pipelines met with general approval so this is a ticket to design how to do it. See https://github.com/kedro-org/kedro/issues/1078 for original context and motivation.

The end goal

When I do kedro pipeline create it creates the following structure:

pipelines
├── __init__.py
├── a
│   ├── README.md
│   ├── __init__.py      # exposes __all__ = ["create_pipeline"]
│   ├── nodes.py
│   └── pipeline.py      # contains def create_pipeline
├── b
│   ├── ...
└── c
    ├── ...

Assuming they’re following the above structure, a user should be able to run kedro run --pipeline=a without needing to edit pipeline_registry.py at all. kedro run should run all pipelines, i.e. we have __default__ = a + b + c. It should be possible for a user to overwrite these automatic registrations if they want to by editing pipeline_registry.py as they can now.

Ultimately the above structure should result in a pipeline_registry.py that acts like the following (but does not actually have this code):

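from typing import Dict

from kedro.pipeline import Pipeline

from spaceflights.pipelines import a, b, c


def register_pipelines() -> Dict[str, Pipeline]:
    # Use distinct local names so the imported modules a, b, c are not shadowed.
    pipeline_a = a.create_pipeline()
    pipeline_b = b.create_pipeline()
    pipeline_c = c.create_pipeline()

    return {
        "__default__": pipeline_a + pipeline_b + pipeline_c,
        "a": pipeline_a,
        "b": pipeline_b,
        "c": pipeline_c,
    }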

Proposed implementation

Something that is very roughly like this:

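import importlib
from pathlib import Path

def get_default_registered_pipelines():
    registered_pipelines = {}
    for pipeline_dir in Path("pipelines").iterdir():
        if not pipeline_dir.is_dir() or pipeline_dir.name.startswith("_"):
            continue    # skip __init__.py, __pycache__ and similar
        # assumes the project package is called spaceflights, as in the example above
        module = importlib.import_module(f"spaceflights.pipelines.{pipeline_dir.name}")
        if hasattr(module, "create_pipeline"):
            registered_pipelines[pipeline_dir.name] = module.create_pipeline()
    registered_pipelines["__default__"] = sum(registered_pipelines.values())    # it's cool we can do this now
    return registered_pipelines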


# pipeline_registry.py
def register_pipelines() -> Dict[str, Pipeline]:
    return get_default_registered_pipelines()
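
For anyone who wants to try the discovery step without a Kedro project, here is a minimal sketch of the same idea. It is not the eventual Kedro implementation: the function name discover_pipelines is hypothetical, it uses pkgutil instead of a relative Path so it does not depend on the working directory, and it returns whatever create_pipeline() returns rather than requiring Pipeline objects.

```python
import importlib
import pkgutil
from typing import Any, Dict


def discover_pipelines(pipelines_package: str) -> Dict[str, Any]:
    """Call create_pipeline() in every subpackage of `pipelines_package` that defines it.

    `discover_pipelines` is a hypothetical name used for illustration only.
    """
    package = importlib.import_module(pipelines_package)
    found: Dict[str, Any] = {}
    for module_info in pkgutil.iter_modules(package.__path__):
        if not module_info.ispkg:
            # Only directories such as pipelines/a/ count, matching the tree above.
            continue
        module = importlib.import_module(f"{pipelines_package}.{module_info.name}")
        create = getattr(module, "create_pipeline", None)
        if callable(create):
            # Subpackages without create_pipeline are silently ignored.
            found[module_info.name] = create()
    return found
```

In a real project the argument would be something like "spaceflights.pipelines", and the "__default__" entry could then be built by summing the returned values, as in the proposal above.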

Then, if wanted, a user could change the default behaviour like this:

def register_pipelines():
    defaults = get_default_registered_pipelines()
    defaults["a"] = a1.create_pipeline() + a2.create_pipeline()
    my_other_pipeline_definitions = {"d": a.create_pipeline() + b.create_pipeline()}
    return {**defaults, **my_other_pipeline_definitions}

Questions: Where should get_default_registered_pipelines go? The Zen of Kedro says "A sprinkle of magic is better than a spoonful of it", which suggests it might go in pipeline_registry.py itself. But it might be confusing for a user to have this odd-looking code in such a core user-facing file (as hooks.py seemed to me when I first saw it). So maybe it is better to define it framework-side and then bring it in with import kedro.pipeline... instead?

Alternative implementations

  • Something clever in kedro.project._ProjectPipelines that automatically registers pipelines. Sounds a bit too magical to me though - I prefer the explicitness of the above.
  • Actually edit the code in pipeline_registry.py to rewrite the Python dictionary when you do kedro pipeline create. Sounds totally horrible though.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

2 reactions
AntonyMilneQB commented, Jun 29, 2022

Notes from technical design on 29 June:

  • concerns that if we didn’t make autoregister_pipelines the default then beginner users wouldn’t be aware of its existence and would still have to alter code in settings.py to activate it
  • this is in tension with setting PIPELINE_REGISTRY_FUNCTION = autoregister_pipelines by default, which would be a breaking change
  • @noklam questioned how people would be able to do creative things in their register_pipelines function, e.g. dynamically generate them. I think we should consider whether register_pipelines takes some argument like extra_params (as it used to with hooks) or whatever happens with “metaparameters” as part of config reworking. I have several examples where this would be useful.
  • @MerelTheisenQB thought removing pipeline_registry.py file might be a good idea

Conclusion:

  • forget about settings.py for now and just implement this through a framework-side autoregister_pipelines function + editing pipeline_registry.py
  • hence the current behaviour of kedro using pipeline_registry.register_pipelines remains, with no new options in settings.py
  • what changes is that the default project template register_pipelines function will, instead of being empty, import and use autoregister_pipelines
  • this supports my idea of composition above (i.e. call autoregister_pipelines and then modify the dictionary it returns)
  • new kedro projects will then use autoregister_pipelines by default (modify all starters to use it); existing kedro projects will remain unchanged but people can modify their pipeline_registry to use autoregister_pipelines if they like
  • this is completely non-breaking and we get the feature in users’ hands more quickly
  • nice and explicit, and neither beginner nor advanced users need to touch settings.py for now

Questions: in the future would we still add the settings.py option and/or remove pipeline_registry.py?

1 reaction
idanov commented, Apr 20, 2022

Suggestion number 2 in the previous comment seems most useful, although I would make the default function be the current one in pipeline_registry.py, and you could turn autoregistering the pipelines on or off by changing it to a built-in function for autoregistry. I don’t think it is a good idea to remove pipeline_registry.py by default, since it’s the entrypoint to the application and most users will look for it, unlike cli.py which only advanced users change. So my preferred behaviour would be:

  1. Add a PIPELINE_REGISTRY_FUNCTION, which is <package>.pipeline_registry.register_pipelines by default (i.e. the same as the current behaviour)
  2. Provide a helper function, called autoregister_pipelines, which could be imported and used in settings.py to be set as PIPELINE_REGISTRY_FUNCTION, and which will do what your get_default_registered_pipelines is doing and call <package>.pipeline_registry.register_pipelines at the end or…
  3. Alternatively, PIPELINE_REGISTRY_FUNCTION can take an array of functions and will merge their result at the end (with clear overriding order), e.g. people can set it to PIPELINE_REGISTRY_FUNCTION = [ autoregister_pipelines, register_pipelines ]

Number 3 seems very powerful and very simple to implement.
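
To make the "clear overriding order" in option 3 concrete, here is a rough sketch of how a list of registry functions could be merged. This is only an illustration under assumptions: merge_registries is a hypothetical helper, the two registry functions are stand-ins that return strings instead of Pipeline objects, and nothing here reflects how Kedro actually reads settings.py.

```python
from typing import Any, Callable, Dict, Iterable

RegistryFunction = Callable[[], Dict[str, Any]]


def merge_registries(functions: Iterable[RegistryFunction]) -> Dict[str, Any]:
    """Call each registry function in order; later functions override earlier keys."""
    merged: Dict[str, Any] = {}
    for function in functions:
        merged.update(function())
    return merged


# Hypothetical stand-ins for the two functions named in option 3.
def autoregister_pipelines() -> Dict[str, Any]:
    return {"a": "auto a", "b": "auto b"}


def register_pipelines() -> Dict[str, Any]:
    return {"a": "custom a", "d": "custom d"}


PIPELINE_REGISTRY_FUNCTION = [autoregister_pipelines, register_pipelines]
pipelines = merge_registries(PIPELINE_REGISTRY_FUNCTION)
```

With this ordering the user's register_pipelines wins for "a", while "b" still comes from autoregistration and "d" is added on top. Reversing the list would give autoregistration the last word instead.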
