
Modularize default argument handling for datasets


Description

Near-identical code to handle default arguments is replicated in almost every dataset implementation. Worse still, the functionality across these datasets is the same, but the implementations are inconsistent.

Context

When I want to implement a new dataset, I look at existing datasets as a baseline for my own. However, there are inconsistencies between them, ranging from the minor (save_args handled after load_args in some datasets), to the more significant (special casing where there are no default arguments in some datasets but not others), to the worst (one case where arguments are evaluated for truthiness instead of with is not None; see https://github.com/quantumblacklabs/kedro/blob/0.14.1/kedro/contrib/io/azure/csv_blob.py#L109-L113 for an example exhibiting several of the above). I don’t know which one to follow to maintain consistency across the codebase.
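To illustrate why the truthiness check is the worst of these inconsistencies, here is a minimal sketch (the function names are made up for illustration) showing how the two idioms diverge when a user explicitly passes an empty dict:

```python
DEFAULTS = {"sep": ","}

def init_truthy(load_args=None):
    # Truthiness check: {} is falsy, so an explicitly passed empty
    # dict is silently replaced by the defaults.
    return load_args if load_args else dict(DEFAULTS)

def init_is_none(load_args=None):
    # `is not None` check: an explicitly passed empty dict is respected.
    return load_args if load_args is not None else dict(DEFAULTS)

print(init_truthy({}))   # {'sep': ','} -- user's empty dict ignored
print(init_is_none({}))  # {} -- user's empty dict preserved
```

Both behave identically when the argument is omitted; they only disagree on falsy-but-present values, which is exactly the kind of subtle divergence that creeps in when the logic is copy-pasted per dataset.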

Possible Implementation

#15

By having DEFAULT_LOAD_ARGS/DEFAULT_SAVE_ARGS attributes, users can also inspect the defaults programmatically (with the caveat that this is a drawback in the few cases where such arguments don’t apply, like save on SqlQueryDataSet, or LambdaDataSet/MemoryDataSet in general).
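A minimal sketch of this proposal (class and attribute names are illustrative, not the actual Kedro API): subclasses declare their defaults as class attributes, and the base __init__ merges user-supplied arguments over them once, in one place.

```python
from copy import deepcopy

class AbstractDataSet:
    # Subclasses override these to declare their defaults.
    DEFAULT_LOAD_ARGS = {}
    DEFAULT_SAVE_ARGS = {}

    def __init__(self, load_args=None, save_args=None):
        # Deep-copy so instances never mutate the shared class-level dicts.
        self._load_args = deepcopy(self.DEFAULT_LOAD_ARGS)
        if load_args is not None:
            self._load_args.update(load_args)
        self._save_args = deepcopy(self.DEFAULT_SAVE_ARGS)
        if save_args is not None:
            self._save_args.update(save_args)

class CSVDataSet(AbstractDataSet):
    DEFAULT_SAVE_ARGS = {"index": False}

ds = CSVDataSet(save_args={"sep": ";"})
print(ds._save_args)                  # {'index': False, 'sep': ';'}
print(CSVDataSet.DEFAULT_SAVE_ARGS)   # defaults visible without an instance
```

The class attributes are what make the defaults discoverable programmatically, since they can be read off the class itself without constructing a dataset.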

Possible Alternatives

  • Create an intermediate abstract dataset class (or mixin?) so that AbstractDataSet is left unmodified and the behaviour only applies to datasets with load_args/save_args
  • Move default argument handling into a utility function and call it from each individual __init__ method (not preferred)
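The second alternative could look like the following sketch (the helper name is hypothetical), which centralises the merge logic but still requires every dataset’s __init__ to remember to call it:

```python
def combine_with_defaults(args, defaults):
    # Merge user-supplied args over defaults, treating None as
    # "use defaults unchanged" rather than testing truthiness.
    merged = dict(defaults)
    if args is not None:
        merged.update(args)
    return merged

# Each dataset __init__ would then contain:
#     self._load_args = combine_with_defaults(load_args, {"sep": ","})
#     self._save_args = combine_with_defaults(save_args, {"index": False})
```

The drawback, and the reason it is not preferred, is that the boilerplate call is still replicated per dataset, so the consistency problem is reduced but not eliminated.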

Checklist

Include labels so that we can categorise your issue:

  • Add a “Component” label to the issue
  • Add a “Priority” label to the issue

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 1
  • Comments: 9 (9 by maintainers)

Top GitHub Comments

tsanikgr commented, Jun 10, 2019 (2 reactions)

Hi @deepyaman, thanks a lot for pointing this out. It’s something I’ve thought of proposing to fix quite a few times, but it was never a priority.

Another solution to consider would be to have default_save_args and default_load_args as class variables (as this is what they really are). (Edit: this is actually what you did as well, sorry, somehow I missed that 🤦‍♂️)

class Base:
    default_save_args = {}
    default_load_args = {}

    @property
    def _load_args(self):
        return ({**self.default_load_args, **self._load_args_}
                if hasattr(self, '_load_args_') else {**self.default_load_args})

    @property
    def _save_args(self):
        return ({**self.default_save_args, **self._save_args_}
                if hasattr(self, '_save_args_') else {**self.default_save_args})

    @_load_args.setter
    def _load_args(self, load_args):
        self._load_args_ = load_args if load_args is not None else {}

    @_save_args.setter
    def _save_args(self, save_args):
        self._save_args_ = save_args if save_args is not None else {}


class Child(Base):
    default_save_args = {'index': False}

    def __init__(self, load_args=None, save_args=None):
        self._load_args = load_args
        self._save_args = save_args

So that:

In [7]: c = Child({'hi': 'there'}, {'extra': 1})

In [8]: c._load_args
Out[8]: {'hi': 'there'}

In [9]: c._save_args
Out[9]: {'index': False, 'extra': 1}

In [10]: c.default_save_args
Out[10]: {'index': False}

This would avoid the __init__ on the parent class, remove the pylint: disable=super-init-not-called, and simplify the code for classes that don’t make use of defaults.

I also like your proposition of making the default_*_args a public attribute; it’s kind of hidden in the constructor at the moment, so this is less “magic” for our users.

By the way, I’m not sure if what I suggested is the right way. Maybe wait and see what others say? @idanov @tolomea

yetudada commented, Jun 10, 2019 (1 reaction)

Thank you so much for this @deepyaman! We’ll await feedback from @idanov on this and will get back to you.
