
Passing Extra Parameters to Custom Dataset

Description

Hello there. I created a custom dataset to handle our Spark Delta tables. The problem is that the custom dataset needs a replaceWhere string defining which partitions should be overwritten after the data is generated inside the node. Catalog definition:

revenue:
  type: path.DeltaTableDataSet
  namespace: test
  table: revenue
  save_args:
    -

I can’t use the parameters inside the save_args key for the custom dataset because the replaceWhere values are also calculated during execution, depending on other pipeline parameters like DATE_START and LOOKBACK.
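
For reference, the write that the dataset ultimately has to issue looks roughly like the sketch below, where revenue is the DataFrame produced by the node, and path, from_date and date_end stand in for the table location and the runtime-computed values:

# Sketch of the Delta partition overwrite the dataset needs to perform;
# path, from_date and date_end are placeholders for the table location
# and the values computed at runtime from DATE_START and LOOKBACK.
(
    revenue.write.format("delta")
    .mode("overwrite")
    .option("replaceWhere", f"date >= '{from_date}' and date <= '{date_end}'")
    .save(path)
)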

I tried to create a class to act as the interface between the nodes and the custom dataset. This class holds the Spark DataFrame and the extra values, but Kedro fails when trying to pickle it. Node return:

return SparkPlan(
    df=revenue,
    replace_where=[ 
        f"date >= '{from_date}' and date <= '{date_end}'"
    ]
)
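
The SparkPlan wrapper itself is not shown in the issue; a minimal sketch of such a class, assuming it only needs to carry the DataFrame and its predicates, might be:

from dataclasses import dataclass, field
from typing import List

from pyspark.sql import DataFrame


@dataclass
class SparkPlan:
    """Pair a Spark DataFrame with the runtime save options to write it with."""
    df: DataFrame
    replace_where: List[str] = field(default_factory=list)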

Custom Dataset save method:

import logging  # module-level import required by the logger below

def _save(self, plan: SparkPlan) -> None:
    """Saves data to the specified filepath."""
    logger = logging.getLogger(self._table_name)
    logger.info(plan.replace_where)

Error received:

kedro.io.core.DataSetError: Failed while saving data to data set MemoryDataSet().
cannot pickle '_thread.RLock' object
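
For context: MemoryDataSet falls back to deep-copying data of types it does not recognise, and the SparkPlan object holds a Spark DataFrame whose JVM gateway contains thread locks that cannot be copied that way. The underlying failure is easy to reproduce without Kedro or Spark:

import copy
import threading

# Deep-copying any object graph that contains a lock raises the same
# error seen in the traceback above.
copy.deepcopy(threading.RLock())
# TypeError: cannot pickle '_thread.RLock' object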

Questions:

  1. Is there a way to provide runtime values to the dataset together with the data?
  2. Could I put these values in the context and retrieve them inside the custom dataset?
    • I saw a method to load the current Kedro context, but that method has since been removed.

Edit 1 - 2022-07-25:

The error above was happening because I typed the wrong dataset name in the node outputs, so Kedro tried to save the result as a MemoryDataSet. I solved the problem of sending extra parameters by using the SparkPlan wrapper around every save and load in my custom dataset, as sketched below.
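
For illustration, the resulting load/save pair might look like this sketch; _filepath is an assumed attribute holding the Delta table location, and SparkPlan is the wrapper shown earlier:

from pyspark.sql import SparkSession


def _load(self) -> SparkPlan:
    """Wrap the loaded DataFrame so nodes always receive a SparkPlan."""
    spark = SparkSession.builder.getOrCreate()
    return SparkPlan(df=spark.read.format("delta").load(self._filepath))


def _save(self, plan: SparkPlan) -> None:
    """Unwrap the plan and overwrite only the partitions it names."""
    writer = plan.df.write.format("delta").mode("overwrite")
    if plan.replace_where:
        # Combine the runtime predicates computed in the node into a
        # single replaceWhere clause for the partition overwrite.
        writer = writer.option("replaceWhere", " and ".join(plan.replace_where))
    writer.save(self._filepath)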

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 16 (9 by maintainers)

Top GitHub Comments

2 reactions
datajoely commented, Jul 25, 2022

Hi @brendalf, I’ve just realised this is possibly resolved by tweaking the copy_mode of the memory dataset when it is passed into the next node:

https://kedro.readthedocs.io/en/latest/_modules/kedro/io/memory_dataset.html

1 reaction
datajoely commented, Jul 25, 2022

Sorry - MemoryDataSet is used to pass data between nodes automatically; if you look at the implementation, we do this for native Spark dataframes:

[screenshot of the MemoryDataSet implementation: native Spark DataFrames are passed with the assign copy mode rather than deep-copied]

So you can do this by explicitly declaring MemoryDataSets in the catalog.
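
For illustration, such an explicit entry, following the pattern in Kedro's PySpark documentation, pins copy_mode to assign so the object is handed to the next node by reference instead of being deep-copied (the dataset name revenue_plan is hypothetical):

revenue_plan:
  type: MemoryDataSet
  copy_mode: assign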

I also think if you were to subclass our spark.SparkDataSet or spark.DeltaTableDataSet you would benefit from this too.
