Passing Extra Parameters to Custom Dataset
Description
Hello there. I created a custom dataset to handle our Spark Delta tables. The problem is that the custom dataset needs a replace-where string defining which partition should be overwritten after the data is generated inside the node. Catalog definition:

```yaml
revenue:
  type: path.DeltaTableDataSet
  namespace: test
  table: revenue
  save_args:
    -
```
I can’t use the parameters inside the `save_args` key for the custom dataset because the replace values are also calculated during execution, depending on other pipeline parameters like DATE_START and LOOKBACK.
I tried to create a class to be the interface between the nodes and the custom dataset; this class holds the Spark DataFrame and the extra values, but Kedro fails when trying to convert it to a pickle. Node return:

```python
return SparkPlan(
    df=revenue,
    replace_where=[
        f"date >= '{from_date}' and date <= '{date_end}'"
    ]
)
```
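The `SparkPlan` class itself isn't shown in the issue; a minimal sketch of such a wrapper, assuming it only needs to carry the DataFrame and the runtime replace-where clauses, could be a plain dataclass:

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class SparkPlan:
    """Hypothetical wrapper pairing a Spark DataFrame with runtime save options."""

    df: Any  # the Spark DataFrame produced by the node
    replace_where: List[str] = field(default_factory=list)
```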
Custom dataset save method:

```python
def _save(self, plan: SparkPlan) -> None:
    """Saves data to the specified filepath."""
    logger = logging.getLogger(self._table_name)
    logger.info(plan.replace_where)
```
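On the save side, the clauses would eventually have to be combined into a single predicate for Delta Lake's `replaceWhere` write option. A small helper could do this (the function name and parenthesisation here are my own, not from the issue):

```python
from typing import List


def build_replace_where(clauses: List[str]) -> str:
    """Join partition predicates into one Delta `replaceWhere` condition string."""
    return " and ".join(f"({c})" for c in clauses)


# The result would then be handed to the writer, e.g.:
# plan.df.write.format("delta").mode("overwrite") \
#     .option("replaceWhere", build_replace_where(plan.replace_where)) \
#     .save(table_path)
```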
Error received:

```
kedro.io.core.DataSetError: Failed while saving data to data set MemoryDataSet().
cannot pickle '_thread.RLock' object
```
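For context on the error: Kedro's `MemoryDataSet` deep-copies data by default, and `copy.deepcopy` falls back to pickling for objects such as locks, which Spark objects hold internally. A stdlib-only reproduction of the same `TypeError` (the `Holder` class is just a stand-in):

```python
import copy
import threading


class Holder:
    """Stand-in for an object that carries a non-picklable lock,
    as Spark sessions/DataFrames do internally."""

    def __init__(self) -> None:
        self.lock = threading.RLock()


h = Holder()
try:
    copy.deepcopy(h)  # deepcopy falls back to pickle for the RLock
    deepcopy_failed = False
except TypeError:  # "cannot pickle '_thread.RLock' object"
    deepcopy_failed = True

assigned = h  # what copy mode "assign" does: pass the same reference through
```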
Questions:
- Is there a way to provide runtime values to the dataset together with the data?
- Could I put these values in the context and retrieve them inside the custom dataset?
- I found a method to load the current Kedro context, but that method has since been removed.
Edit 1 - 2022-07-25:
The error above was happening because I had typed the wrong dataset name in the node outputs, so Kedro tried to save the result as a MemoryDataSet.
I solved the problem of sending extra parameters by using this SparkPlan wrapper around every save and load in my custom dataset.
Issue Analytics
- Created a year ago
- Comments: 16 (9 by maintainers)
Top GitHub Comments
Hi @brendalf, I’ve just realised this is possibly resolved by tweaking the `copy_mode` of the memory dataset when it is passed into the next node: https://kedro.readthedocs.io/en/latest/_modules/kedro/io/memory_dataset.html
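Concretely, that would mean declaring the intermediate dataset explicitly in the catalog instead of letting it default, and pinning the copy mode (the dataset name is assumed; `copy_mode` accepts `deepcopy`, `copy`, or `assign`):

```yaml
revenue_plan:
  type: MemoryDataSet
  copy_mode: assign
```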
Sorry - `MemoryDataSet` is used to dynamically pass data between nodes automatically; if you look at the implementation, we already do this for native Spark DataFrames. So you can do this by explicitly declaring `MemoryDataSet`s in the catalog. I also think that if you were to subclass our `spark.SparkDataSet` or `spark.DeltaTableDataSet`, you would benefit from this too.
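A sketch of that subclassing idea, with a stand-in base class so the example stays self-contained (in a real project you would inherit from Kedro's `spark.DeltaTableDataSet` and keep its constructor and `_load` behaviour; all names below are illustrative):

```python
from typing import Any


class DeltaTableDataSet:
    """Stand-in for Kedro's spark.DeltaTableDataSet, for illustration only."""

    def __init__(self, table_path: str) -> None:
        self._table_path = table_path


class SparkPlanDeltaDataSet(DeltaTableDataSet):
    """Illustrative subclass whose _save unwraps a (df, replace_where) pair."""

    def _save(self, plan: Any) -> None:
        condition = " and ".join(f"({c})" for c in plan.replace_where)
        # A real implementation would overwrite only the matching partitions:
        # plan.df.write.format("delta").mode("overwrite") \
        #     .option("replaceWhere", condition).save(self._table_path)
        self.last_condition = condition  # stored only so the sketch is checkable
```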