[KED-933] Passing node parameters from `create_pipeline`
Description
I have been through the tutorial and the docs and have found something which I think could be included as a new feature. It’s possible that there may be an alternative way to do this so please enlighten me if there is.
I wanted to define a reusable node that could be configured through function arguments and used in multiple stages of the pipeline. To my understanding, the parameters config doesn’t fit what I am trying to do because there will be multiple instances of the same node in the pipeline but with different arguments. The example I have at hand is a node which vectorizes some text based on an input column. I want to be able to define a node like:
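import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

def vectorize(data: pd.DataFrame, vectorizer: TfidfVectorizer, column: str):
    documents = data[column]  # select the text column to vectorize
    return vectorizer.transform(documents)  # transform with the already-fitted vectorizer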
so that in `create_pipeline` I can do:
pipeline = Pipeline(
    [
        ...
        node(vectorize,
             ["question_pairs", "vectorizer", "question1"],
             "matrix1"),
        node(vectorize,
             ["question_pairs", "vectorizer", "question2"],
             "matrix2"),
        ...
    ]
)
where the third item in each input list is the column I wish to vectorize.
Context
Allowing this kind of parameterisation would mean that a single node function can be reused at several points in the pipeline.
Possible Implementation
If I were to run this code, it would fail saying that “question1” or “question2” cannot be found in the DataCatalog. A possible implementation would be to allow any surplus inputs that are not defined in the DataCatalog to be passed in as function arguments. This would allow for the type of behaviour I am looking to implement with a reusable node.
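To make the proposal concrete, here is a toy sketch of the fallback behaviour described above. It does not use Kedro at all: `run_node`, `select_column`, and the dict standing in for the DataCatalog are hypothetical names invented for this illustration.

```python
from typing import Any, Callable, Dict, List


def run_node(func: Callable[..., Any], inputs: List[str], catalog: Dict[str, Any]) -> Any:
    """Toy runner (hypothetical): load names found in the catalog,
    pass every other name through unchanged as a literal argument."""
    args = [catalog[name] if name in catalog else name for name in inputs]
    return func(*args)


def select_column(data: Dict[str, List[str]], column: str) -> List[str]:
    """Stand-in for a reusable node that is configured by a column name."""
    return data[column]


# A plain dict stands in for the DataCatalog in this sketch.
catalog = {"question_pairs": {"question1": ["a?", "b?"], "question2": ["c?", "d?"]}}

# "question1" and "question2" are not in the catalog, so they arrive as strings.
first = run_node(select_column, ["question_pairs", "question1"], catalog)
second = run_node(select_column, ["question_pairs", "question2"], catalog)
```

One trade-off with this approach is that a misspelled dataset name would be passed along silently as a string instead of raising a "not found" error.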
Possible Alternatives
A possible alternative is to wrap the `vectorize` function with `functools.partial` and specify the column parameter to create a new partial object, but this fails because `__name__` is undefined for a `functools.partial` object.
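One way around the missing `__name__` might be `functools.update_wrapper`, which copies `__name__`, `__doc__`, and related attributes from the original function onto the partial object. The sketch below uses only the standard library; `bind_column` is a hypothetical helper, the vectorizer is replaced by a plain callable, and whether a given Kedro version accepts the result as a node function is an assumption that would need checking.

```python
from functools import partial, update_wrapper
from typing import Any, Callable, Dict, List


def vectorize(data: Dict[str, List[str]], vectorizer: Callable[[List[str]], Any], column: str) -> Any:
    """Simplified stand-in for the node in the issue: the vectorizer is any callable."""
    documents = data[column]
    return vectorizer(documents)


def bind_column(column: str) -> Callable[..., Any]:
    """Hypothetical helper: fix `column` and copy the function metadata onto the partial."""
    bound = partial(vectorize, column=column)
    update_wrapper(bound, vectorize)  # sets __name__, __doc__, __module__, __wrapped__, ...
    return bound


vectorize_q1 = bind_column("question1")
vectorize_q2 = bind_column("question2")

# The bound functions now take only the two remaining inputs.
data = {"question1": ["ab", "cde"], "question2": ["f"]}
lengths_q1 = vectorize_q1(data, lambda docs: [len(d) for d in docs])
```

Both bound functions report the same `__name__` (`vectorize`), so if distinct names matter they would have to be set by hand, for example `bound.__name__ = f"vectorize_{column}"`.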
Issue Analytics
- State:
- Created 4 years ago
- Reactions: 3
- Comments: 13 (9 by maintainers)
Top GitHub Comments
I would also like to see the ability to pass parameters directly to nodes. I agree that this would encourage code reuse. The resulting pipeline might look like this:
where the `my_node` function has a signature like:
(I’d be happy to implement this myself and open a PR)
@uwaisiqbal Have you found a way of passing node parameters? I am trying to get what @tsanikgr posted here to work, but I don’t really get it. I am just finding my way into Python…
I’d need to put that somehow in my `run.py` to update the catalog with the parameters before a pipeline is run, but I am not sure how to approach that. I would be glad if you could point me in the right direction here 😃
Edit: I found the answer here:
I can just prefix the parameters like so:
and add the following to `parameters.yml`:
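The snippets for both steps are not preserved in this copy of the thread. As a rough illustration of the idea only, the toy sketch below assumes the prefix in question is Kedro's `params:` convention for node inputs, and mimics the lookup with plain dicts. `resolve_inputs` and the keys `column1` and `column2` are hypothetical and are not the commenter's actual code or Kedro's implementation.

```python
from typing import Any, Dict, List

PARAMS_PREFIX = "params:"


def resolve_inputs(names: List[str], catalog: Dict[str, Any], parameters: Dict[str, Any]) -> List[Any]:
    """Toy resolver (hypothetical): names starting with "params:" are looked up
    in the parameters dict, all other names in the catalog."""
    resolved = []
    for name in names:
        if name.startswith(PARAMS_PREFIX):
            resolved.append(parameters[name[len(PARAMS_PREFIX):]])
        else:
            resolved.append(catalog[name])
    return resolved


# Stand-in for the contents of parameters.yml (hypothetical keys).
parameters = {"column1": "question1", "column2": "question2"}
# Stand-in for the DataCatalog.
catalog = {"question_pairs": {"question1": ["a?"], "question2": ["b?"]}}

args_first = resolve_inputs(["question_pairs", "params:column1"], catalog, parameters)
args_second = resolve_inputs(["question_pairs", "params:column2"], catalog, parameters)
```

With this convention the same node function can appear twice in a pipeline, each time receiving a different column name from the parameters file.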