Design auto-registration of pipelines
Following a discussion in backlog grooming, the idea of auto-registering pipelines met with general approval, so this is a ticket to design how to do it. See https://github.com/kedro-org/kedro/issues/1078 for original context and motivation.
The end goal
When I do kedro pipeline create it creates the following structure:
pipelines
├── __init__.py
├── a
│   ├── README.md
│   ├── __init__.py  # exposes __all__ = ["create_pipeline"]
│   ├── nodes.py
│   └── pipeline.py  # contains def create_pipeline
├── b
│   ├── ...
└── c
    ├── ...
Assuming they’re following the above structure, a user should be able to run kedro run --pipeline=a without needing to edit pipeline_registry.py at all. kedro run should run all pipelines, i.e. we have __default__ = a + b + c. It should be possible for a user to overwrite these automatic registrations if they want to by editing pipeline_registry.py as they can now.
Ultimately the above structure should result in a pipeline_registry.py that acts like the following (but does not actually have this code):
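from typing import Dict
from kedro.pipeline import Pipeline
from spaceflights.pipelines import a, b, c
def register_pipelines() -> Dict[str, Pipeline]:
    # Local names differ from the module names so the imports are not shadowed.
    pipeline_a = a.create_pipeline()
    pipeline_b = b.create_pipeline()
    pipeline_c = c.create_pipeline()
    return {
        "__default__": pipeline_a + pipeline_b + pipeline_c,
        "a": pipeline_a,
        "b": pipeline_b,
        "c": pipeline_c,
    }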
Proposed implementation
Something that is very roughly like this:
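import importlib
from pathlib import Path
def get_default_registered_pipelines():
    registered_pipelines = {}
    # Assumes the directory containing pipelines/ is importable (e.g. it is the working directory).
    for pipeline_dir in Path("pipelines").iterdir():
        if not (pipeline_dir / "__init__.py").is_file():
            continue  # skip plain files and directories that are not packages
        module = importlib.import_module(f"pipelines.{pipeline_dir.name}")
        if hasattr(module, "create_pipeline"):
            registered_pipelines[pipeline_dir.name] = module.create_pipeline()
    registered_pipelines["__default__"] = sum(registered_pipelines.values())  # it's cool we can do this now
    return registered_pipelines
# pipeline_registry.py
def register_pipelines() -> Dict[str, Pipeline]:
    return get_default_registered_pipelines()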
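As a slightly more concrete sketch of the discovery step, the same idea can be written with pkgutil so that it works from a dotted package name rather than a relative path. The function name discover_pipelines is hypothetical and used only for illustration; the sketch makes no assumption about what create_pipeline returns, and it leaves out the "__default__" entry.

```python
import importlib
import pkgutil
from typing import Any, Dict


def discover_pipelines(pipelines_package: str) -> Dict[str, Any]:
    """Collect create_pipeline() results from each subpackage of pipelines_package.

    discover_pipelines is a hypothetical name used for illustration only.
    """
    package = importlib.import_module(pipelines_package)
    registered: Dict[str, Any] = {}
    # iter_modules lists the direct children of the package; ispkg is True for
    # directories that contain an __init__.py.
    for module_info in pkgutil.iter_modules(package.__path__):
        if not module_info.ispkg:
            continue  # ignore plain modules that sit next to the pipeline packages
        module = importlib.import_module(f"{pipelines_package}.{module_info.name}")
        factory = getattr(module, "create_pipeline", None)
        if callable(factory):
            registered[module_info.name] = factory()
    return registered
```

With the structure from "The end goal", a call such as discover_pipelines("spaceflights.pipelines") would return a dictionary with the keys "a", "b" and "c", provided each subpackage exposes create_pipeline.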
Then, if wanted, a user could change the default behaviour like this:
def register_pipelines():
    defaults = get_default_registered_pipelines()
    defaults["a"] = a1.create_pipeline() + a2.create_pipeline()
    my_other_pipeline_definitions = {"d": a.create_pipeline() + b.create_pipeline()}
    return {**defaults, **my_other_pipeline_definitions}
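One subtlety of this override pattern: "__default__" was computed before defaults["a"] was replaced, so it still contains the old "a". If the override should also be reflected in "__default__", it has to be recomputed. A minimal sketch, using frozensets of node names as stand-ins for Pipeline objects (an assumption made so the example runs without Kedro) and a hypothetical helper named rebuild_default:

```python
from functools import reduce
from typing import Dict, FrozenSet

# Stand-in for a Pipeline: a frozenset of node names, combined with | instead of +.
FakePipeline = FrozenSet[str]


def rebuild_default(registry: Dict[str, FakePipeline]) -> Dict[str, FakePipeline]:
    """Return a copy of registry with "__default__" recomputed from the other entries."""
    named = {name: p for name, p in registry.items() if name != "__default__"}
    combined = reduce(lambda left, right: left | right, named.values(), frozenset())
    return {**named, "__default__": combined}


defaults = {
    "a": frozenset({"clean"}),
    "b": frozenset({"train"}),
    "__default__": frozenset({"clean", "train"}),
}

# Overriding "a" does not touch the "__default__" that was computed earlier.
defaults["a"] = frozenset({"clean_v2"})
stale = defaults["__default__"]

# Recomputing picks up the overridden "a".
refreshed = rebuild_default(defaults)
```

Whether "__default__" should follow user overrides automatically is a design choice that this ticket leaves open; the sketch only shows that it does not happen by itself.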
Questions:
Where should get_default_registered_pipelines go? The Zen of Kedro says "A sprinkle of magic is better than a spoonful of it", which suggests maybe it goes in pipeline_registry.py itself. But maybe it’s confusing for a user to have this weird-looking code in such a core user-facing file (like hooks.py seemed to me when I first saw it)? So maybe better to have it defined on framework-side and then done as import kedro.pipeline... instead?
Alternative implementations
- Something clever in kedro.project._ProjectPipelines that automatically registers pipelines. Sounds a bit too magical to me though - I prefer the explicitness of the above.
- Actually edit the code in pipeline_registry.py to rewrite the Python dictionary when you do kedro pipeline create. Sounds totally horrible though.
Notes from technical design on 29 June:
- […] autoregister_pipelines the default then beginner users wouldn’t be aware of its existence and would still have to alter code in settings.py to activate it
- […] PIPELINE_REGISTRY_FUNCTION = autoregister_pipelines by default, which would be a breaking change
- […] register_pipelines function, e.g. dynamically generate them. I think we should consider whether register_pipelines takes some argument like extra_params (as it used to with hooks) or whatever happens with “metaparameters” as part of config reworking. I have several examples where this would be useful.
Conclusion:
- […] autoregister_pipelines function + editing pipeline_registry.py
- pipeline_registry.register_pipeline remains, no new options in settings.py
- register_pipeline function will, instead of being empty, import and use autoregister_pipelines
- […] autoregister_pipelines and then modify the dictionary it returns)
- […] autoregister_pipelines by default (modify all starters to use it); existing kedro projects will remain unchanged but people can modify their pipeline_registry to use autoregister_pipelines if they like
Questions: in the future would we still add the settings.py option and/or remove pipeline_registry.py?
Suggestion number 2 in the previous comment seems most useful, although I would make the default function be the current one in pipeline_registry.py, but you could turn on/off autoregistering the pipelines by changing it to a built-in function for autoregistry. I don’t think it is a good idea to remove pipeline_registry.py by default, since it’s the entrypoint to the application and most users will look for it, unlike cli.py which only advanced users change. So my preferred behaviour would be:
1. […] PIPELINE_REGISTRY_FUNCTION, which is <package>.pipeline_registry.register_pipelines by default (i.e. the same as the current behaviour)
2. […] autoregister_pipelines, which could be imported and used in settings.py to be set as PIPELINE_REGISTRY_FUNCTION, and which will do what your get_default_registered_pipelines is doing and call <package>.pipeline_registry.register_pipelines at the end or…
3. PIPELINE_REGISTRY_FUNCTION can take an array of functions and will merge their result at the end (with clear overriding order), e.g. people can set it to PIPELINE_REGISTRY_FUNCTION = [ autoregister_pipelines, register_pipelines ]
Number 3 seems very powerful and very simple to implement.
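To make the "array of functions" idea concrete, here is a minimal sketch of how such a list could be merged with a clear overriding order: functions are called left to right, and a later function wins whenever two of them return the same key. The helper merge_registries is hypothetical, and the two registry functions are stand-ins that return strings rather than real pipelines; none of this is an existing Kedro API.

```python
from typing import Any, Callable, Dict, Iterable

RegistryFunction = Callable[[], Dict[str, Any]]


def merge_registries(functions: Iterable[RegistryFunction]) -> Dict[str, Any]:
    """Call each registry function in order; later functions override earlier keys."""
    merged: Dict[str, Any] = {}
    for function in functions:
        merged.update(function())
    return merged


# Stand-ins for the two functions named in the comment above.
def autoregister_pipelines() -> Dict[str, Any]:
    return {"a": "auto a", "b": "auto b", "__default__": "auto a + auto b"}


def register_pipelines() -> Dict[str, Any]:
    return {"a": "custom a", "d": "custom d"}


PIPELINE_REGISTRY_FUNCTION = [autoregister_pipelines, register_pipelines]
pipelines = merge_registries(PIPELINE_REGISTRY_FUNCTION)
```

Here "a" comes from register_pipelines because it is listed last, while "b" and "__default__" come from autoregister_pipelines. Note that "__default__" is not recomputed after the merge, so it would still describe the auto-registered "a".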