Re-design io.core and io.data_catalog
Spun out of https://github.com/kedro-org/kedro/issues/1691#issuecomment-1176679924… Let’s collect ideas here on what the current problems are with `io`. To me it feels like we’ve neglected it and it’s ripe for a re-design.
Note. Like the configuration overhaul, some of this would be non-breaking (behind-the-scenes implementation) and some would be breaking (changes to public API). In theory we’re free to do as we please with any function starting with `_`, i.e. we can remove, rename, or change arguments at will. In practice, however, some of these might be very commonly used (e.g. `self._get_load_path()` is used in lots of datasets), so changes to them should not be regarded as non-breaking.
#1691 and #1580 are actually just symptoms of a more fundamental underlying issue: the API and underlying workings of `io.core` and `io.data_catalog` are very confusing and should be rethought in general. These are very old components in kedro and maybe some of the decisions that were originally made about their design should be revised. I think there’s also very likely to be old bits of code there that could now be removed or renamed (e.g. who would guess that something named `add_feed_dict` is used to add parameters to the catalog?). It feels like tech debt rather than intentional design currently.
I don’t think they’re massively wrong as it stands, but I think it would be a good exercise to go through them and work out exactly what functionality we should expose in the API and how we might like to rework them. e.g. in the case raised here there is quite a bit of confusion about how to get the filepath:
- `catalog.datasets` is presumably the “official” route to get a dataset rather than `_get_dataset`, but `catalog.datasets` wouldn’t allow a namespaced dataset to be accessed without doing `getattr`. There are some very subtle and non-obvious differences between `datasets` and `_get_dataset`, and then there’s also `catalog._data_sets` (which I think might just be a historical leftover… but not sure). In https://github.com/kedro-org/kedro/pull/1795 @jmholzer used `vars(catalog.datasets)[dataset_name]`.
- it also seems at a glance that `_filepath` is only defined for versioned datasets (? seems weird)
- to actually get the correct versioned filepath it’s even harder: in our datasets we do `get_filepath_str(self._get_load_path(), self._protocol)`, which is pretty obscure. Similar to #1654
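To make the first point concrete, here is a minimal sketch of why attribute-style access gets awkward for some names. `FrozenNamespace` is a simplified stand-in written for illustration, not Kedro's actual class, and the dataset names are made up; it only assumes that entries are stored as instance attributes.

```python
class FrozenNamespace:
    """Simplified stand-in (not Kedro's class): each entry is stored as an attribute."""

    def __init__(self, **entries):
        # Write straight into __dict__ so the blocked __setattr__ below is bypassed.
        self.__dict__.update(entries)

    def __setattr__(self, key, value):
        raise AttributeError("entries are read-only; attribute assignment is disabled")


# Plain strings stand in for dataset objects; the names are hypothetical.
datasets = FrozenNamespace(**{"cars": "cars dataset", "sales.orders": "orders dataset"})

# A name that is a valid Python identifier works with dot access.
print(datasets.cars)

# "sales.orders" is not a valid identifier, so dot access cannot reach it;
# the lookup has to go through getattr or the instance __dict__.
print(getattr(datasets, "sales.orders"))
print(vars(datasets)["sales.orders"])
```

Under this assumption, any caller that only has the dataset name as a string ends up reaching for `getattr` or `vars`, which is the pattern described above.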
So I think we should look holistically at the structures involved here and work out what the API should look like so there’s one, clear way to access the things that people need to access. I actually don’t think this is such a huge task. Then we can tell much more easily whether we need any new functionality in these structures (like a `catalog.dumps`) or whether it’s just a case of making what we already have better organised, documented and clearer.
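As one possible shape for “one, clear way” to look things up, here is a sketch of a read-only, mapping-style view keyed by dataset name. `DatasetView` and the dataset names are hypothetical and not part of Kedro; this illustrates the idea rather than proposing a concrete API.

```python
from collections.abc import Mapping
from types import MappingProxyType


class DatasetView(Mapping):
    """Hypothetical read-only view of datasets, keyed by name."""

    def __init__(self, datasets):
        # Copy the input, then wrap it in a read-only proxy.
        self._datasets = MappingProxyType(dict(datasets))

    def __getitem__(self, name):
        return self._datasets[name]

    def __iter__(self):
        return iter(self._datasets)

    def __len__(self):
        return len(self._datasets)


# Plain strings stand in for dataset objects.
view = DatasetView({"cars": "cars dataset", "sales.orders": "orders dataset"})

# Any string works as a key, including names containing dots.
print(view["sales.orders"])
print(sorted(view))
```

Because `Mapping` supplies `keys()`, `items()`, `get()` and `in` for free, a single lookup route covers most read access, while the lack of `__setitem__` keeps the view read-only.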
Top GitHub Comments
A really important issue IMO and a great write-up.
I think the class (`_FrozenDatasets`) that `catalog.datasets` returns an object of is a good candidate for refactoring:
- Accessing a dataset by name requires `vars(catalog.datasets)[dataset_name]` or `catalog.datasets.__dict__[dataset_name]`. I don’t feel this level of (read) encapsulation is merited for an object assigned to a public attribute.
- A lot happens in `__init__`; delegating most of this to a few new, well-documented methods would also make this class much easier to understand.
@noklam also commented that we should consider what actually belongs to `AbstractDataSet` and what belongs to the implementations.
Just to bring @noklam’s comment to life a bit more, since it’s something I’ve often thought about in the past too. We have one bit of code repeated 20 times throughout our datasets, and another repeated 37 times.
`_release` is not an `abc.abstractmethod` so doesn’t have to be supplied. Why does it exist separately in so many datasets? Why do we need to access so many protected members (e.g. `self._fs`, `self._get_load_path()`, etc.)?
Is there anything we can do to make it easier to define a custom dataset? e.g. why is `_describe` a required method? Overall it feels to me like we have more boilerplate in dataset implementations than we really need.
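A rough sketch of how that boilerplate could move into the base class is below. `BaseDataset`, `InMemoryTextDataset` and the `release` method are hypothetical stand-ins rather than Kedro's `AbstractDataSet`; the only assumption is that the base class supplies default implementations and hooks, so a subclass only writes `_load` and `_save`.

```python
from abc import ABC, abstractmethod


class BaseDataset(ABC):
    """Hypothetical base class; not Kedro's AbstractDataSet."""

    def __init__(self, filepath):
        self._filepath = filepath

    @abstractmethod
    def _load(self):
        """Subclasses must say how to read the data."""

    @abstractmethod
    def _save(self, data):
        """Subclasses must say how to write the data."""

    def _describe(self):
        # Default description, so subclasses are not forced to write one.
        return {"filepath": self._filepath}

    def _invalidate_cache(self):
        # Hook that does nothing by default; override only when a cache exists.
        pass

    def release(self):
        # Shared release logic lives here once instead of in every subclass.
        self._invalidate_cache()


class InMemoryTextDataset(BaseDataset):
    """Toy implementation: only _load and _save are written by hand."""

    def __init__(self, filepath):
        super().__init__(filepath)
        self._store = {}

    def _load(self):
        return self._store[self._filepath]

    def _save(self, data):
        self._store[self._filepath] = data


dataset = InMemoryTextDataset("notes.txt")
dataset._save("hello")
print(dataset._load())
print(dataset._describe())
dataset.release()
```

The trade-off is that defaults in the base class can hide behaviour from implementers, so which members become hooks and which stay required is exactly the design question raised above.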