
Re-design io.core and io.data_catalog


Spun out of https://github.com/kedro-org/kedro/issues/1691#issuecomment-1176679924… Let’s collect ideas here on what the current problems with io are. To me it feels like we’ve neglected it and it’s ripe for a re-design.

Note. Like the configuration overhaul, some of this would be non-breaking (behind-the-scenes implementation) and some would be breaking (changes to public API). In theory we’re free to do as we please with any function starting with _, i.e. we can remove, rename, or change arguments at will. In practice, however, some of these might be very commonly used (e.g. self._get_load_path() is used in lots of datasets), so changes to them should not be regarded as non-breaking.


#1691 and #1580 are actually just symptoms of a more fundamental underlying issue: the API and underlying workings of io.core and io.data_catalog are very confusing and should be rethought in general. These are very old components in kedro and maybe some of the decisions that were originally made about their design should be revised. I think there’s also very likely to be old bits of code there that could now be removed or renamed (e.g. who would guess that something named add_feed_dict is used to add parameters to the catalog?). It feels like tech debt rather than intentional design currently.

I don’t think they’re massively wrong as it stands, but I think it would be a good exercise to go through them and work out exactly what functionality we should expose in the API and how we might like to rework them. e.g. in the case raised here there is quite a bit of confusion about how to get the filepath:

  • catalog.datasets is presumably the “official” route to get a dataset rather than _get_dataset, but catalog.datasets wouldn’t allow namespaced datasets to be accessed without doing getattr. There are some very subtle and non-obvious differences between datasets and _get_dataset, and then there’s also catalog._data_sets (which I think might just be a historical leftover… but not sure). In https://github.com/kedro-org/kedro/pull/1795 @jmholzer used vars(catalog.datasets)[dataset_name].
  • It also seems at a glance that _filepath is only defined for versioned datasets (? seems weird).
  • To actually get the correct versioned filepath it’s even harder: in our datasets we do get_filepath_str(self._get_load_path(), self._protocol), which is pretty obscure. Similar to #1654.

So I think we should look holistically at the structures involved here and work out what the API should look like so there’s one, clear way to access the things that people need to access. I actually don’t think this is such a huge task. Then we can tell much more easily whether we need any new functionality in these structures (like a catalog.dumps) or whether it’s just a case of making what we already have better organised, documented and clearer.
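To make the access-route confusion concrete, here is a minimal, self-contained sketch of the pattern being criticised. The class below is a toy stand-in for the object catalog.datasets returns, not kedro’s actual _FrozenDatasets; the name-mangling of "." to "__" mirrors what the issue’s examples imply, but the details here are illustrative assumptions.

```python
class FrozenDatasets:
    """Toy stand-in for the object returned by catalog.datasets."""

    def __init__(self, names):
        for name in names:
            # Dots are not valid in Python attribute names, so namespaced
            # dataset names have to be mangled before setattr.
            setattr(self, name.replace(".", "__"), f"<dataset {name}>")


datasets = FrozenDatasets(["cars", "namespace.cars"])

# Plain names work as attribute access...
print(datasets.cars)

# ...but namespaced names force users into getattr or vars():
print(getattr(datasets, "namespace__cars"))
print(vars(datasets)["namespace__cars"])
```

The fact that three different spellings are needed depending on the dataset name is exactly the kind of non-obvious difference the issue describes.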

Issue Analytics

  • State: open
  • Created: a year ago
  • Reactions: 2
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

3 reactions
jmholzer commented, Sep 5, 2022

A really important issue IMO and a great write-up.

catalog.datasets is presumably the “official” route to get a dataset rather than _get_dataset, but catalog.datasets wouldn’t allow a namespaced datasets to be accessed without doing getattr. There’s some very subtle and non-obvious differences between datasets and _get_dataset, and then there’s also catalog._data_sets (which I think might just be a historical leftover… but not sure).

I think the class (_FrozenDatasets) whose instances catalog.datasets returns is a good candidate for refactoring:

  1. It has no simple interface for the datasets it contains. Currently, the only linter-friendly ways are to use vars(catalog.datasets)[dataset_name] or catalog.datasets.__dict__[dataset_name]. I don’t feel this level of (read) encapsulation is merited for an object assigned to a public attribute.
  2. The class is poorly documented; it would be good to have docstrings as the purpose of this class is not easy to grok.
  3. There is too much going on inside __init__; delegating most of it to a few new, well-documented methods would also make this class much easier to understand.
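One possible direction for point 1 — a sketch only, not a concrete proposal — is to give the frozen container a read-only Mapping interface, so that linters and users can look datasets up by their original (possibly namespaced) names without vars() or getattr. All names here are hypothetical:

```python
from collections.abc import Mapping


class FrozenDatasets(Mapping):
    """Read-only, dict-like view over named datasets (toy sketch)."""

    def __init__(self, datasets):
        # Copy into a private dict; Mapping provides no mutators,
        # so the view stays effectively frozen.
        self._datasets = dict(datasets)

    def __getitem__(self, name):
        return self._datasets[name]

    def __iter__(self):
        return iter(self._datasets)

    def __len__(self):
        return len(self._datasets)


view = FrozenDatasets({"cars": "<dataset cars>", "namespace.cars": "<dataset ns>"})

# Namespaced names now work directly, with no attribute-name mangling:
print(view["namespace.cars"])
print(list(view))
```

Inheriting from collections.abc.Mapping also gives keys(), items(), in-checks, and iteration for free, which would address the "no simple interface" complaint without exposing any write access.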
3 reactions
AntonyMilneQB commented, Aug 10, 2022

@noklam also commented that we should consider what actually belongs to AbstractDataSet and what belongs to the implementations. Just to bring @noklam’s comment to life a bit more, since it’s something I’ve often thought about in the past too. We have the following bit of code repeated 20 times throughout our datasets:

    def _release(self) -> None:
        super()._release()
        self._invalidate_cache()

    def _invalidate_cache(self) -> None:
        """Invalidate underlying filesystem caches."""
        filepath = get_filepath_str(self._filepath, self._protocol)
        self._fs.invalidate_cache(filepath)

and the following is repeated 37 times:

    load_path = get_filepath_str(self._get_load_path(), self._protocol)

_release is not an abc.abstractmethod so doesn’t have to be supplied. Why does it exist separately in so many datasets? Why do we need to access so many protected members (e.g. self._fs, self._get_load_path(), etc.)?

Is there anything we can do to make it easier to define a custom dataset? e.g. why is _describe a required method? Overall it feels to me like we have more boilerplate in dataset implementations than we really need.
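A hedged sketch of the kind of de-duplication being suggested: hoist the repeated _release/_invalidate_cache pair into the base class, so concrete datasets no longer need to override anything. The class names loosely mimic kedro’s, but this is a toy illustration with a fake filesystem, not the real AbstractDataSet or fsspec:

```python
class AbstractFileDataSet:
    """Toy base class: cache invalidation handled once, here."""

    def __init__(self, fs, filepath):
        self._fs = fs
        self._filepath = filepath

    def _release(self):
        # The base class invalidates the cache itself, so the
        # 20-times-repeated override in subclasses disappears.
        self._invalidate_cache()

    def _invalidate_cache(self):
        self._fs.invalidate_cache(self._filepath)


class FakeFS:
    """Stand-in for an fsspec filesystem, recording invalidations."""

    def __init__(self):
        self.invalidated = []

    def invalidate_cache(self, path):
        self.invalidated.append(path)


class CSVDataSet(AbstractFileDataSet):
    pass  # no _release boilerplate needed any more


fs = FakeFS()
ds = CSVDataSet(fs, "data/cars.csv")
ds._release()
print(fs.invalidated)
```

The same move could plausibly absorb the 37-times-repeated get_filepath_str(self._get_load_path(), self._protocol) line into a single base-class helper, leaving subclasses to access one documented method instead of several protected members.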
