
[KED-933] Passing node parameters from `create_pipeline`

See original GitHub issue

Description

I have been through the tutorial and the docs and have found something which I think could be included as a new feature. It’s possible that there may be an alternative way to do this so please enlighten me if there is.

I wanted to define a reusable node that could be configured through function arguments and used in multiple stages of the pipeline. To my understanding, the parameters config doesn’t fit what I am trying to do because there will be multiple instances of the same node in the pipeline but with different arguments. The example I have at hand is a node which vectorizes some text based on an input column. I want to be able to define a node like:

def vectorize(data: pd.DataFrame, vectorizer: TfidfVectorizer, column: str):
    documents = data[column]
    return vectorizer.transform(documents)

so that in create_pipeline I can do:

    pipeline = Pipeline(
        [
            ...
            node(vectorize,
                 ["question_pairs", "vectorizer", "question1"],
                 "matrix1"),
            node(vectorize,
                 ["question_pairs", "vectorizer", "question2"],
                 "matrix2"),
            ...
        ]
    )

where the third item in the input list is the column I wish to vectorize.

Context

Allowing for such parameterisation of nodes would mean that nodes can be reused throughout the pipeline.

Possible Implementation

If I were to run this code, it would fail, saying that “question1” or “question2” cannot be found in the DataCatalog. A possible implementation would be to allow any surplus parameters that are not defined in the DataCatalog to be passed in as function arguments. This would allow for the type of behaviour I am looking to implement with a reusable node.

Possible Alternatives

A possible alternative is to wrap the vectorize function with functools.partial and specify the column parameter to create a new partial object, but this fails because __name__ is undefined for a functools partial object.
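The `__name__` problem has workarounds outside the thread's scope. A minimal sketch (illustrative names, not part of the issue): either use a factory function that returns a named closure, or graft the metadata onto the partial object with `functools.update_wrapper`:

```python
import functools

def make_vectorize(column):
    # Factory that returns a node function bound to one column. Unlike a
    # bare functools.partial object, the returned function has a __name__.
    def vectorize(data, vectorizer):
        documents = data[column]
        return vectorizer.transform(documents)
    vectorize.__name__ = f"vectorize_{column}"
    return vectorize

# Alternative: keep the partial, but copy __name__ / __doc__ onto it.
def vectorize(data, vectorizer, column):
    return vectorizer.transform(data[column])

vectorize_q1 = functools.partial(vectorize, column="question1")
functools.update_wrapper(vectorize_q1, vectorize)
```

Either callable can then be passed to `node(...)` in place of the raw function, since both now expose a `__name__`.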

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 3
  • Comments: 13 (9 by maintainers)

Top GitHub Comments

5 reactions
zacernst commented, Jun 24, 2019

I would also like to see the ability to pass parameters directly to nodes. I agree that this would encourage code reuse. The resulting pipeline might look like this:

pipeline = Pipeline(
    [
        node(
            my_node,
            {'kwarg_1': 'value_1', 'kwarg_2': 'value_2', ...},
            'data_catalog_entry_1',
            'data_catalog_entry_2',
        ),
    ]
)

where the my_node function has a signature like:

def my_node(df, kwarg_1=None, kwarg_2=None):
    ...something...

(I’d be happy to implement this myself and open a PR)
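The semantics proposed above could be sketched without framework changes. In this illustration, `apply_node_spec` is a hypothetical helper (not a Kedro API): catalog entries become positional inputs, and the dict supplies the extra keyword arguments:

```python
def apply_node_spec(func, kwargs, *datasets):
    # Hypothetical sketch of the proposed semantics: datasets loaded
    # from the catalog are positional arguments, and the kwargs dict
    # supplies the node-specific configuration.
    return func(*datasets, **kwargs)

def my_node(df, kwarg_1=None, kwarg_2=None):
    return {"df": df, "kwarg_1": kwarg_1, "kwarg_2": kwarg_2}

result = apply_node_spec(
    my_node,
    {"kwarg_1": "value_1", "kwarg_2": "value_2"},
    [1, 2, 3],  # stands in for the dataset behind 'data_catalog_entry_1'
)
```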

3 reactions
OFranke commented, Oct 10, 2019

@uwaisiqbal Have you found a way of passing node parameters? I am trying to make what @tsanikgr posted here work, but I don’t really get it. I am just finding my way into Python…

I’d need to put that somehow in my run.py to update the catalog with the parameters before a pipeline is run, but I am not sure how to approach that. I’d be glad if you could point me in the right direction here 😃

edit: I found the answer here:

I can just prefix the parameters like so:

node(
    split_data,
    ["master_table", "params:test_size", "params:random_state"],
    ["X_train", "X_test", "y_train", "y_test"],
)

and add the following in parameters.yml:

test_size: 5
random_state: 1
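Conceptually, the `params:` prefix tells the runner to look an input up in the parameters file rather than the DataCatalog. A rough illustrative sketch of that resolution step (not Kedro internals, just the idea):

```python
# Parameters as they would be loaded from parameters.yml.
PARAMETERS = {"test_size": 5, "random_state": 1}

def resolve_inputs(inputs, catalog, parameters):
    # For each declared node input: "params:<name>" entries are taken
    # from the parameters dict, everything else from the catalog.
    resolved = []
    for name in inputs:
        if name.startswith("params:"):
            resolved.append(parameters[name[len("params:"):]])
        else:
            resolved.append(catalog[name])
    return resolved

catalog = {"master_table": [[1, 2], [3, 4]]}
args = resolve_inputs(
    ["master_table", "params:test_size", "params:random_state"],
    catalog,
    PARAMETERS,
)
# args now holds the dataset followed by the two parameter values,
# ready to be passed positionally to split_data.
```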