
[KED-933] Passing node parameters from `create_pipeline`

See original GitHub issue

Description

I have been through the tutorial and the docs and have found something which I think could be included as a new feature. It’s possible that there may be an alternative way to do this so please enlighten me if there is.

I wanted to define a reusable node that could be configured through function arguments and used in multiple stages of the pipeline. To my understanding, the parameters config doesn’t fit what I am trying to do because there will be multiple instances of the same node in the pipeline but with different arguments. The example I have at hand is a node which vectorizes some text based on an input column. I want to be able to define a node like:

def vectorize(data: pd.DataFrame, vectorizer: TfidfVectorizer, column: str):
    documents = data[column]
    return vectorizer.transform(documents)

so that in create_pipeline I can do:

    pipeline = Pipeline(
        [
            ...
            node(vectorize,
                 ["question_pairs", "vectorizer", "question1"],
                 "matrix1"),
            node(vectorize,
                 ["question_pairs", "vectorizer", "question2"],
                 "matrix2"),
            ...
        ]
    )

where the third item in the input list is the column I wish to vectorize.

Context

Allowing for such parameterisation of nodes would mean that nodes can be reused throughout the pipeline.

Possible Implementation

If I were to run this code, it would fail, saying that “question1” or “question2” cannot be found in the DataCatalog. A possible implementation would be to allow any surplus parameters that are not defined in the DataCatalog to be passed in as function arguments. This would allow for the type of behaviour I am looking to implement with a reusable node.

Possible Alternatives

A possible alternative is to wrap the vectorize function with functools.partial and specify the column parameter to create a new partial object, but this fails because __name__ is undefined for a functools partial object.
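The `__name__` problem has workarounds outside the thread's scope. A minimal sketch (illustrative names, not part of the issue): either use a factory function that returns a named closure, or graft the metadata onto the partial object with `functools.update_wrapper`:

```python
import functools

def make_vectorize(column):
    # Factory that returns a node function bound to one column. Unlike a
    # bare functools.partial object, the returned function has a __name__.
    def vectorize(data, vectorizer):
        documents = data[column]
        return vectorizer.transform(documents)
    vectorize.__name__ = f"vectorize_{column}"
    return vectorize

# Alternative: keep the partial, but copy __name__ / __doc__ onto it.
def vectorize(data, vectorizer, column):
    return vectorizer.transform(data[column])

vectorize_q1 = functools.partial(vectorize, column="question1")
functools.update_wrapper(vectorize_q1, vectorize)
```

Either callable can then be passed to `node(...)` in place of the raw function, since both now expose a `__name__`.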

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 3
  • Comments: 13 (9 by maintainers)

Top GitHub Comments

5 reactions
zacernst commented, Jun 24, 2019

I would also like to see the ability to pass parameters directly to nodes. I agree that this would encourage code reuse. The resulting pipeline might look like this:

pipeline = Pipeline(
    [
        node(
            my_node,
            {'kwarg_1': 'value_1', 'kwarg_2': 'value_2', ...},
            'data_catalog_entry_1',
            'data_catalog_entry_2',
        ),
    ]
)

where the my_node function has a signature like:

def my_node(df, kwarg_1=None, kwarg_2=None):
    ...something...

(I’d be happy to implement this myself and open a PR)
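The semantics proposed above could be sketched without framework changes. In this illustration, `apply_node_spec` is a hypothetical helper (not a Kedro API): catalog entries become positional inputs, and the dict supplies the extra keyword arguments:

```python
def apply_node_spec(func, kwargs, *datasets):
    # Hypothetical sketch of the proposed semantics: datasets loaded
    # from the catalog are positional arguments, and the kwargs dict
    # supplies the node-specific configuration.
    return func(*datasets, **kwargs)

def my_node(df, kwarg_1=None, kwarg_2=None):
    return {"df": df, "kwarg_1": kwarg_1, "kwarg_2": kwarg_2}

result = apply_node_spec(
    my_node,
    {"kwarg_1": "value_1", "kwarg_2": "value_2"},
    [1, 2, 3],  # stands in for the dataset behind 'data_catalog_entry_1'
)
```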

3 reactions
OFranke commented, Oct 10, 2019

@uwaisiqbal Have you found a way of passing node parameters? I am trying to make what @tsanikgr posted here work, but I don’t really get it. I am just finding my way into Python…

I’d need to put that somehow in my run.py to update the catalog with the parameters before a pipeline is run, but I am not sure how to approach that. I’d be glad if you could point me in the right direction here 😃

edit: I found the answer here:

I can just prefix the parameters like so:

node(
    split_data,
    ["master_table", "params:test_size", "params:random_state"],
    ["X_train", "X_test", "y_train", "y_test"],
)

and add the following in parameters.yml:

test_size: 5
random_state: 1
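Conceptually, the `params:` prefix tells the runner to look an input up in the parameters file rather than the DataCatalog. A rough illustrative sketch of that resolution step (not Kedro internals, just the idea):

```python
# Parameters as they would be loaded from parameters.yml.
PARAMETERS = {"test_size": 5, "random_state": 1}

def resolve_inputs(inputs, catalog, parameters):
    # For each declared node input: "params:<name>" entries are taken
    # from the parameters dict, everything else from the catalog.
    resolved = []
    for name in inputs:
        if name.startswith("params:"):
            resolved.append(parameters[name[len("params:"):]])
        else:
            resolved.append(catalog[name])
    return resolved

catalog = {"master_table": [[1, 2], [3, 4]]}
args = resolve_inputs(
    ["master_table", "params:test_size", "params:random_state"],
    catalog,
    PARAMETERS,
)
# args now holds the dataset followed by the two parameter values,
# ready to be passed positionally to split_data.
```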