Passing Extra Parameters to Custom Dataset
Description
Hello there. I created a custom dataset to handle our Spark Delta tables. The problem is that the custom dataset needs a replace-where string defining which partition should be overwritten after the data is generated inside the node. Catalog definition:

```yaml
revenue:
  type: path.DeltaTableDataSet
  namespace: test
  table: revenue
  save_args:
    -
```
I can’t use the parameters inside the `save_args` key for the custom dataset because the replace values are also calculated during execution, depending on other pipeline parameters like DATE_START and LOOKBACK.
I tried to create a class to be the interface between the nodes and the custom dataset; this class holds the Spark DataFrame and the extra values, but Kedro fails when trying to convert it to a pickle. Node return:

```python
return SparkPlan(
    df=revenue,
    replace_where=[
        f"date >= '{from_date}' and date <= '{date_end}'"
    ]
)
```
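The `SparkPlan` class itself isn't shown in the issue; a minimal sketch of such a wrapper, assuming it only needs to carry the DataFrame and the runtime replace-where clauses, could be a plain dataclass:

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class SparkPlan:
    """Hypothetical wrapper pairing a Spark DataFrame with runtime save options."""

    df: Any  # the Spark DataFrame produced by the node
    replace_where: List[str] = field(default_factory=list)
```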
Custom dataset save method:

```python
def _save(self, plan: SparkPlan) -> None:
    """Saves data to the specified filepath."""
    logger = logging.getLogger(self._table_name)
    logger.info(plan.replace_where)
```
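On the save side, the clauses would eventually have to be combined into a single predicate for Delta Lake's `replaceWhere` write option. A small helper could do this (the function name and parenthesisation here are my own, not from the issue):

```python
from typing import List


def build_replace_where(clauses: List[str]) -> str:
    """Join partition predicates into one Delta `replaceWhere` condition string."""
    return " and ".join(f"({c})" for c in clauses)


# The result would then be handed to the writer, e.g.:
# plan.df.write.format("delta").mode("overwrite") \
#     .option("replaceWhere", build_replace_where(plan.replace_where)) \
#     .save(table_path)
```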
Error received:

```
kedro.io.core.DataSetError: Failed while saving data to data set MemoryDataSet().
cannot pickle '_thread.RLock' object
```
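For context on the error: Kedro's `MemoryDataSet` deep-copies data by default, and `copy.deepcopy` falls back to pickling for objects such as locks, which Spark objects hold internally. A stdlib-only reproduction of the same `TypeError` (the `Holder` class is just a stand-in):

```python
import copy
import threading


class Holder:
    """Stand-in for an object that carries a non-picklable lock,
    as Spark sessions/DataFrames do internally."""

    def __init__(self) -> None:
        self.lock = threading.RLock()


h = Holder()
try:
    copy.deepcopy(h)  # deepcopy falls back to pickle for the RLock
    deepcopy_failed = False
except TypeError:  # "cannot pickle '_thread.RLock' object"
    deepcopy_failed = True

assigned = h  # what copy mode "assign" does: pass the same reference through
```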
Questions:
- Is there a way to provide runtime values to the dataset together with the data?
- Could I put these values in the context and retrieve them inside the custom dataset?
- I found a method to load the current Kedro context, but that method has since been removed.
Edit 1 - 2022-07-25:
The error above was happening because I had typed the wrong dataset name in the node outputs, so Kedro tried to save the result as a MemoryDataSet.
I solved the problem of sending extra parameters by using this SparkPlan wrapper around every save and load in my custom dataset.
Issue Analytics
- Created a year ago
- Comments: 16 (9 by maintainers)
Top GitHub Comments
Hi @brendalf, I’ve just realised this is possibly resolved by tweaking the `copy_mode` of the memory dataset when it is passed into the next node: https://kedro.readthedocs.io/en/latest/_modules/kedro/io/memory_dataset.html
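Concretely, that would mean declaring the intermediate dataset explicitly in the catalog instead of letting it default, and pinning the copy mode (the dataset name is assumed; `copy_mode` accepts `deepcopy`, `copy`, or `assign`):

```yaml
revenue_plan:
  type: MemoryDataSet
  copy_mode: assign
```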
Sorry - `MemoryDataSet` is used to dynamically pass data between nodes automatically; if you look at the implementation, we already do this for native Spark DataFrames. So you can do this by explicitly declaring `MemoryDataSet`s in the catalog. I also think that if you were to subclass our `spark.SparkDataSet` or `spark.DeltaTableDataSet`, you would benefit from this too.
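A sketch of that subclassing idea, with a stand-in base class so the example stays self-contained (in a real project you would inherit from Kedro's `spark.DeltaTableDataSet` and keep its constructor and `_load` behaviour; all names below are illustrative):

```python
from typing import Any


class DeltaTableDataSet:
    """Stand-in for Kedro's spark.DeltaTableDataSet, for illustration only."""

    def __init__(self, table_path: str) -> None:
        self._table_path = table_path


class SparkPlanDeltaDataSet(DeltaTableDataSet):
    """Illustrative subclass whose _save unwraps a (df, replace_where) pair."""

    def _save(self, plan: Any) -> None:
        condition = " and ".join(f"({c})" for c in plan.replace_where)
        # A real implementation would overwrite only the matching partitions:
        # plan.df.write.format("delta").mode("overwrite") \
        #     .option("replaceWhere", condition).save(self._table_path)
        self.last_condition = condition  # stored only so the sketch is checkable
```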