
Modularize default argument handling for datasets


Description

Near-identical code to handle default arguments is replicated in almost every dataset implementation. Worse still, the functionality across these datasets is the same, but the implementations are inconsistent.

Context

When I want to implement a new dataset, I look at existing datasets as a baseline for my own. However, there are inconsistencies between them, ranging from the minor (save_args handled after load_args in some datasets), to the more significant (special casing where there are no default arguments in some datasets but not others), to the worst (one case where arguments are evaluated for truthiness instead of with is not None; see https://github.com/quantumblacklabs/kedro/blob/0.14.1/kedro/contrib/io/azure/csv_blob.py#L109-L113 for an example exhibiting several of the above). I don’t know which one to follow to maintain consistency across the codebase.
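To illustrate why the truthiness check is the worst of these inconsistencies, here is a minimal sketch (the function names are made up for illustration) showing how the two idioms diverge when a user explicitly passes an empty dict:

```python
DEFAULTS = {"sep": ","}

def init_truthy(load_args=None):
    # Truthiness check: {} is falsy, so an explicitly passed empty
    # dict is silently replaced by the defaults.
    return load_args if load_args else dict(DEFAULTS)

def init_is_none(load_args=None):
    # `is not None` check: an explicitly passed empty dict is respected.
    return load_args if load_args is not None else dict(DEFAULTS)

print(init_truthy({}))   # {'sep': ','} -- user's empty dict ignored
print(init_is_none({}))  # {} -- user's empty dict preserved
```

Both behave identically when the argument is omitted; they only disagree on falsy-but-present values, which is exactly the kind of subtle divergence that creeps in when the logic is copy-pasted per dataset.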

Possible Implementation

#15

By having DEFAULT_LOAD_ARGS/DEFAULT_SAVE_ARGS attributes, users can also inspect the defaults programmatically (with the caveat that this is a drawback in the few cases where such arguments don’t apply, like save on SqlQueryDataSet, or LambdaDataSet/MemoryDataSet in general).
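A minimal sketch of this proposal (class and attribute names are illustrative, not the actual Kedro API): subclasses declare their defaults as class attributes, and the base __init__ merges user-supplied arguments over them once, in one place.

```python
from copy import deepcopy

class AbstractDataSet:
    # Subclasses override these to declare their defaults.
    DEFAULT_LOAD_ARGS = {}
    DEFAULT_SAVE_ARGS = {}

    def __init__(self, load_args=None, save_args=None):
        # Deep-copy so instances never mutate the shared class-level dicts.
        self._load_args = deepcopy(self.DEFAULT_LOAD_ARGS)
        if load_args is not None:
            self._load_args.update(load_args)
        self._save_args = deepcopy(self.DEFAULT_SAVE_ARGS)
        if save_args is not None:
            self._save_args.update(save_args)

class CSVDataSet(AbstractDataSet):
    DEFAULT_SAVE_ARGS = {"index": False}

ds = CSVDataSet(save_args={"sep": ";"})
print(ds._save_args)                  # {'index': False, 'sep': ';'}
print(CSVDataSet.DEFAULT_SAVE_ARGS)   # defaults visible without an instance
```

The class attributes are what make the defaults discoverable programmatically, since they can be read off the class itself without constructing a dataset.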

Possible Alternatives

  • Create an intermediate abstract dataset class (or mixin?) so that AbstractDataSet is left unmodified and the behaviour only applies to datasets with load_args/save_args
  • Move default argument handling into a utility function and call it from each individual __init__ method (not preferred)
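The second alternative could look like the following sketch (the helper name is hypothetical), which centralises the merge logic but still requires every dataset’s __init__ to remember to call it:

```python
def combine_with_defaults(args, defaults):
    # Merge user-supplied args over defaults, treating None as
    # "use defaults unchanged" rather than testing truthiness.
    merged = dict(defaults)
    if args is not None:
        merged.update(args)
    return merged

# Each dataset __init__ would then contain:
#     self._load_args = combine_with_defaults(load_args, {"sep": ","})
#     self._save_args = combine_with_defaults(save_args, {"index": False})
```

The drawback, and the reason it is not preferred, is that the boilerplate call is still replicated per dataset, so the consistency problem is reduced but not eliminated.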

Checklist

Include labels so that we can categorise your issue:

  • Add a “Component” label to the issue
  • Add a “Priority” label to the issue

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 1
  • Comments: 9 (9 by maintainers)

Top GitHub Comments

tsanikgr commented, Jun 10, 2019 (2 reactions)

Hi @deepyaman, thanks a lot for pointing this out. It’s something I’ve thought of proposing to fix quite a few times, but it was never a priority.

Another solution to consider would be to have default_save_args and default_load_args as class variables (as this is what they really are). (Edit: this is actually what you did as well, sorry, somehow I missed that 🤦‍♂️)

class Base:
    default_save_args = {}
    default_load_args = {}

    @property
    def _load_args(self):
        return ({**self.default_load_args, **self._load_args_}
                if hasattr(self, '_load_args_') else {**self.default_load_args})

    @property
    def _save_args(self):
        return ({**self.default_save_args, **self._save_args_}
                if hasattr(self, '_save_args_') else {**self.default_save_args})

    @_load_args.setter
    def _load_args(self, load_args):
        self._load_args_ = load_args if load_args is not None else {}

    @_save_args.setter
    def _save_args(self, save_args):
        self._save_args_ = save_args if save_args is not None else {}


class Child(Base):
    default_save_args = {'index': False}

    def __init__(self, load_args=None, save_args=None):
        self._load_args = load_args
        self._save_args = save_args

So that:

In [7]: c = Child({'hi': 'there'}, {'extra': 1})

In [8]: c._load_args
Out[8]: {'hi': 'there'}

In [9]: c._save_args
Out[9]: {'index': False, 'extra': 1}

In [10]: c.default_save_args
Out[10]: {'index': False}

This would avoid the __init__ on the parent class, remove the pylint: disable=super-init-not-called, and simplify the code for classes that don’t make use of defaults.

I also like your proposition of making the default_*_args a public attribute; it’s kind of hidden in the constructor at the moment, so this is less “magic” for our users.

By the way, I’m not sure if what I suggested is the right way. Maybe wait and see what others say? @idanov @tolomea

yetudada commented, Jun 10, 2019 (1 reaction)

Thank you so much for this @deepyaman! We’ll await feedback from @idanov on this and will get back to you.
