Cannot override packed `*args` or `**kwargs` to an instance of a modular pipeline using the `pipeline()` wrapper
Description
Passing a list of dataframes as an input to a modular pipeline fails with `TypeError: unhashable type: 'list'`.
The same setup works fine when not using modular pipelines, i.e. when a pipeline calls the node directly instead of going through the `pipeline()` wrapper.
Context
We have a function to which we would like to pass a dynamic number of dataframes together with parameters, which we currently cannot do because of this bug.
Steps to Reproduce
- Have a function:

```python
def f(params, *dfs):
    # combine dataframes
    return combined_df
```

- Create a modular pipeline:

```python
def create_pipeline():
    return Pipeline(
        [node(func=f, inputs=["params:xyz", "df_inputs"], outputs="xyz_df")]
    )
```

- Call the modular pipeline using:

```python
final_pipeline = pipeline(
    modular_pipeline,
    inputs={"df_inputs": ["df_1", "df_2"]},
    outputs={"xyz_df": "combined_df"},
    parameters={"params:xyz": "params:df_combine_params"},
)
```
Expected Result
A combined PySpark dataframe should be returned.
Actual Result
`TypeError: unhashable type: 'list'`
Extra information
The above setup works just fine if your function doesn't need params and can work with just the list of dataframes; calling that function via a modular pipeline works without issue. So with a function like:

```python
def f(*dfs):
    # combine dataframes
    return combined_df
```
Your Environment
Include as many relevant details about the environment in which you experienced the bug:
- Kedro version used (`pip show kedro` or `kedro -V`): kedro, version 0.17.6
- Python version used (`python -V`): Python 3.8.12
- Operating system and version: macOS Big Sur
With that in mind, I'm not sure it's a 'bug'; it sort of falls into the 'not supported' category, but I would like to see this feature.
As per @datajoely, I'm not sure this is exactly a bug; it's more a feature that doesn't exist. But I don't understand this:
… since I don't think it works without params either. Please could you give a concrete example where this does work as you would like it to?
I’m going to use a simpler toy example, since there’s nothing specific to dataframes here; it’s just about node and function definitions.
This works fine, showing you can have a pipeline with a variable number of arguments:
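For illustration, a minimal toy pipeline along those lines might look like the sketch below; the `combine` function and the dataset names `a`, `b`, `c` and `combined` are placeholders, not taken from the original comment.

```python
from kedro.pipeline import Pipeline, node


def combine(*args):
    # accept however many inputs the node wires in
    return list(args)


p1 = Pipeline(
    [node(func=combine, inputs=["a", "b", "c"], outputs="combined")]
)
```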
The question is what we should do if you want to transform `p1` using `pipeline`. The fundamental problem is that `inputs` (and `outputs` and `parameters`) are used for a one-to-one mapping of dataset name to dataset name. As I understand it, what you want to do is map from one to many. This is quite a change, since transforming a pipeline is no longer just a question of renaming inputs/outputs but actually changes the pipeline structure to some extent. I do see when this would be useful, but I think it needs some careful thought.

For now, I actually think there are two ways to achieve something similar already.
1. Use placeholder datasets and then transform the ones you want
Change the above to:
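A sketch of what that change might look like, using illustrative placeholder dataset names; note the placeholders would still need to resolve to something at runtime (e.g. catalog entries that load `None`), and the node logic skips them.

```python
from kedro.pipeline import Pipeline, node


def combine(*args):
    # ignore placeholder inputs that were never mapped to real datasets
    return [arg for arg in args if arg is not None]


p1 = Pipeline(
    [
        node(
            func=combine,
            inputs=["placeholder_1", "placeholder_2", "placeholder_3"],
            outputs="combined",
        )
    ]
)
```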
You can then do this:
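Roughly like the following, again with illustrative dataset names; only the placeholders you actually need are remapped to real datasets.

```python
from kedro.pipeline import pipeline

p2 = pipeline(
    p1,
    inputs={"placeholder_1": "df_1", "placeholder_2": "df_2"},
    outputs={"combined": "combined_df"},
)
```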
It's pretty hacky, but if you just keep `p1` as a "template" pipeline that you don't use by itself then I think this will work pretty well. You need to make sure there are always enough inputs in `p1` to cater for the n inputs of `p2`, and make sure there's logic in the node that distinguishes between the placeholder datasets and the real ones (here I just check that the arg `is not None`).

2. Don't use `pipeline` to do the transformation

Instead you can achieve something similar by using a parametrised `create_pipeline` function:
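A sketch of such a parametrised factory, reusing the issue's `f(params, *dfs)` function; the argument name `input_datasets` is an assumption for illustration.

```python
from kedro.pipeline import Pipeline, node


def f(params, *dfs):
    # combine dataframes, as in the issue
    ...


def create_pipeline(input_datasets):
    # build the node's inputs from whatever dataset names the caller supplies;
    # the first input maps to `params`, the rest are unpacked into `*dfs`
    return Pipeline(
        [
            node(
                func=f,
                inputs=["params:df_combine_params", *input_datasets],
                outputs="combined_df",
            )
        ]
    )


final_pipeline = create_pipeline(["df_1", "df_2"])
```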
Note you can still namespace this pipeline as `pipeline(create_pipeline(), namespace=...)`.