Cannot override packed `*args` or `**kwargs` to an instance of a modular pipeline using the `pipeline()` wrapper

Description

Passing a list of dataframes as an input to a modular pipeline raises TypeError: unhashable type: 'list'. The same setup works fine when not using modular pipelines, i.e. when a pipeline calls the node directly without going through the pipeline() wrapper.

Context

We have a function to which we would like to pass a dynamic number of dataframes together with parameters, which we are currently unable to do because of this bug.

Steps to Reproduce

  1. Have a function:
def f(params, *dfs):
    # combine the dataframes passed via *dfs
    return combined_df
  2. Create a modular pipeline as:
def create_pipeline():
    return Pipeline(
        [node(func=f, inputs=["params:xyz", "df_inputs"], outputs="xyz_df")]
    )
  3. Call the modular pipeline using:
modular_pipeline = create_pipeline()  # the pipeline created in step 2
final_pipeline = pipeline(
    modular_pipeline,
    inputs={"df_inputs": ["df_1", "df_2"]},
    outputs={"xyz_df": "combined_df"},
    parameters={"params:xyz": "params:df_combine_params"},
)

Expected Result

The combined PySpark DataFrame should be returned.

Actual Result

TypeError: unhashable type: 'list'

Extra Information

The above setup works just fine if the function doesn’t need params and can work with just a list of dataframes; calling such a function via modular pipelines works without issue.

So with a function like:

def f(*dfs):
   # combine dataframes
   return combined_df
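
One reading of this params-free case is a node that lists each dataframe individually, so the pipeline() wrapper only ever renames datasets one-to-one. A hypothetical sketch of that wiring (the dataset names df_1, df_2, raw_df_1, raw_df_2 and xyz_combined_df are illustrative, not taken from the issue):

from kedro.pipeline import Pipeline, node, pipeline

def create_pipeline():
    # Each dataframe is a separate, explicitly named input.
    return Pipeline(
        [node(func=f, inputs=["df_1", "df_2"], outputs="combined_df")]
    )

# Every dataset is renamed one-to-one, which is why this variant works.
final_pipeline = pipeline(
    create_pipeline(),
    inputs={"df_1": "raw_df_1", "df_2": "raw_df_2"},
    outputs={"combined_df": "xyz_combined_df"},
)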

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

  • Kedro version used (pip show kedro or kedro -V): kedro, version 0.17.6
  • Python version used (python -V): Python 3.8.12
  • Operating system and version: macOS Big Sur

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 7 (6 by maintainers)

Top GitHub Comments

1 reaction
datajoely commented, Feb 9, 2022

With that in mind, I’m not sure it’s a ‘bug’; it sort of falls into the ‘not supported’ category, but I would like to see this feature.

0 reactions
AntonyMilneQB commented, Feb 9, 2022

As per @datajoely, I’m not sure this is exactly a bug, more a feature that doesn’t exist. But I don’t understand this:

The above setup works just fine if the function doesn’t need params and can work with just a list of dataframes; calling such a function via modular pipelines works without issue.

… since I don’t think it works without params either. Please could you give a concrete example where this does work as you would like it to?

I’m going to use a simpler toy example, since there’s nothing specific to dataframes here; it’s just about node and function definitions.

This works fine, showing you can have a pipeline whose node takes a variable number of arguments:

import string
from kedro.pipeline import pipeline, Pipeline, node
from kedro.runner import SequentialRunner
from kedro.io import DataCatalog, MemoryDataSet

# Create datasets a, b, c, ..., z containing each letter.
alphabet_lower = {letter: MemoryDataSet(letter) for letter in string.ascii_lowercase}

def f(*args):
    return "".join(args)

n = node(f, list(alphabet_lower), "output")
p1 = Pipeline([n])

io = DataCatalog(alphabet_lower)
print(SequentialRunner().run(p1, io)["output"])
# Outputs abc...z

The question is what we should do if you want to transform p1 using pipeline. The fundamental problem is that inputs (and outputs and parameters) are used for a one-to-one mapping of dataset name to dataset name. As I understand it, what you want to do is map one to many. This is quite a change, since transforming a pipeline is no longer a question of changing the names of inputs/outputs but actually changes the pipeline structure to some extent. I do see when this would be useful, but I think it needs some careful thought.
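
To make this concrete, here is a minimal sketch (reusing p1 and the pipeline import from above) of where the reported error presumably comes from; this is an assumption about the internals, but the mapping values are expected to be single, hashable dataset names:

# One-to-one renaming: each value is a single dataset name, so this is fine.
pipeline(p1, inputs={"a": "A"})

# One-to-many: a list is not a hashable dataset name, so this presumably
# fails with something like TypeError: unhashable type: 'list'.
pipeline(p1, inputs={"a": ["A", "B"]})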

For now, I actually think there are two ways to achieve something similar already.

1. Use placeholder datasets and then transform the ones you want

Change the above to:

alphabet_lower = {letter: MemoryDataSet(None) for letter in string.ascii_lowercase}
alphabet_upper = {letter: MemoryDataSet(letter) for letter in string.ascii_uppercase}
io = DataCatalog({**alphabet_lower, **alphabet_upper})

def f(*args):
    return "".join(arg for arg in args if arg is not None)

You can then do this:

p2 = pipeline(p1, inputs={"a": "A", "b": "B", "c": "C"})
print(SequentialRunner().run(p2, io)["output"])
# Outputs ABC

It’s pretty hacky, but if you just keep p1 as a “template” pipeline that you don’t use by itself then I think this will work pretty well. You need to make sure there are always enough inputs in p1 to cater for the n inputs of p2, and make sure there’s logic in the node that will distinguish between the placeholder datasets and the real ones (here I just check if the arg is not None).

2. Don’t use pipeline to do the transformation

Instead you can achieve something similar by using a parametrised create_pipeline function:

def create_pipeline(args):
    return Pipeline([node(f, args, "output")])

p3 = create_pipeline(["A", "B", "C"])
print(SequentialRunner().run(p3, io)["output"])
# Outputs ABC

Note you can still namespace this pipeline with pipeline(create_pipeline(...), namespace=...).
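
A hypothetical sketch of that, reusing create_pipeline, io and SequentialRunner from above with an illustrative namespace name "combine" (the free inputs are listed so they are not prefixed, while the output becomes combine.output):

p4 = pipeline(
    create_pipeline(["A", "B", "C"]),
    inputs={"A", "B", "C"},  # keep these dataset names un-prefixed
    namespace="combine",
)
print(SequentialRunner().run(p4, io)["combine.output"])
# Outputs ABC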
