Make kedro-datasets a dependency of kedro?
My latest proposal in the ongoing discussion of kedro-datasets. See https://github.com/kedro-org/kedro/issues/1758 for context.
Proposal
- We continue with `kedro-datasets` becoming a separate namespace package as we are currently in https://github.com/kedro-org/kedro-plugins/pull/49
- We move almost everything (see below for what) in `kedro.io` to `kedro-datasets` also. This would be another namespace package, so import paths would remain the same as `kedro.io`
- `kedro-datasets` becomes a core dependency of `kedro`, i.e. `pip install kedro` also does `pip install kedro-datasets`
- Extra dependencies like `pandas.CSVDataSet` are defined now. We implement `extras_require` in kedro so that `pip install kedro[pandas.CSVDataSet]` also does `pip install kedro-datasets[pandas.CSVDataSet]`
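The last point could be sketched roughly as follows. This is only an illustration of forwarding extras from one package to another, not kedro's actual `setup.py`: the version pin, the list of dataset names and the helper `forward_extras` are all assumptions made for the example.

```python
# Hypothetical values, for illustration only.
DATASETS_PACKAGE = "kedro-datasets"
DATASETS_VERSION_SPEC = ">=1.0,<2.0"  # assumed "officially supported" range
DATASET_EXTRAS = ["pandas.CSVDataSet", "pandas.ParquetDataSet", "spark.SparkDataSet"]


def forward_extras(extra_names):
    """Map each kedro extra to the same-named extra of kedro-datasets."""
    return {
        extra: [f"{DATASETS_PACKAGE}[{extra}]{DATASETS_VERSION_SPEC}"]
        for extra in extra_names
    }


# What would be passed to setuptools.setup(...) in kedro under this proposal.
install_requires = [f"{DATASETS_PACKAGE}{DATASETS_VERSION_SPEC}"]
extras_require = forward_extras(DATASET_EXTRAS)

print(install_requires)
print(extras_require["pandas.CSVDataSet"])
```

With a mapping like this, installing `kedro[pandas.CSVDataSet]` would resolve to `kedro-datasets[pandas.CSVDataSet]` within the pinned range, and the extra's real third-party requirements would only need to be declared once, in `kedro-datasets`.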
To be clear, the structure of `kedro-datasets` would be:
kedro
├── datasets
│ ├── __init__.py
│ └── all the stuff that's there now in https://github.com/kedro-org/kedro-plugins/pull/49
└── io
  ├── __init__.py
  ├── cached_dataset.py
  ├── core.py
  └── ...
# no __init__.py at this level!
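The reason the top level has no `__init__.py` is that Python then treats it as an implicit namespace package (PEP 420), so several installed distributions can each contribute subpackages under the same top-level name. A minimal sketch of that mechanism, using a made-up namespace `demo_ns` and two temporary directories standing in for two installed distributions (all names here are hypothetical, chosen so the example does not clash with a real kedro install):

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_distribution(root: Path, subpackage: str, source: str) -> None:
    """Create root/demo_ns/<subpackage>/__init__.py with no demo_ns/__init__.py."""
    pkg_dir = root / "demo_ns" / subpackage
    pkg_dir.mkdir(parents=True)
    (pkg_dir / "__init__.py").write_text(source)


workdir = Path(tempfile.mkdtemp())
core_root = workdir / "core_dist"          # stands in for the core package
datasets_root = workdir / "datasets_dist"  # stands in for the datasets package

build_distribution(core_root, "runner", "NAME = 'runner from core'\n")
build_distribution(datasets_root, "io", "NAME = 'io from datasets'\n")

# Both roots are on the import path, as two installed distributions would be.
sys.path[:0] = [str(core_root), str(datasets_root)]
importlib.invalidate_caches()

runner = importlib.import_module("demo_ns.runner")
io_pkg = importlib.import_module("demo_ns.io")

# The namespace package's __path__ spans both distributions.
namespace_roots = sorted(
    Path(p).parent.name for p in importlib.import_module("demo_ns").__path__
)
print(runner.NAME, "|", io_pkg.NAME, "|", namespace_roots)
```

If either distribution shipped a `demo_ns/__init__.py`, that directory would become a regular package and hide the other distribution's subpackages, which is why the comment in the tree above matters.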
Pros
- Users have the same flow as they currently have, which is nice and simple: `pip install kedro[pandas.CSVDataSet]`. They don’t need to worry about the existence of `kedro-datasets` apart from if they want to manually update to a version beyond the “officially supported” one specified in the kedro requirements
- In theory you could `pip install kedro-datasets` and use the datasets by themselves (probably). In practice this might not happen much, but having a self-contained fully-functioning package feels like a clean way to split things up
- Definitions of `AbstractDataSet` etc. live in the same place as the implementations. e.g. if we change `AbstractDataSet` (which I think we should in the not too distant future: see https://github.com/kedro-org/kedro/issues/1778) then this is much easier to handle
- We break the circular dependencies problem of https://github.com/kedro-org/kedro/issues/1758 because point (1) no longer holds: `kedro-datasets` does not depend on `kedro`
- Documentation problem of https://github.com/kedro-org/kedro/issues/1651 becomes simpler, since all we need to check is which version of `kedro-datasets` to fetch based on kedro’s requirements.txt
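The circular-dependency point can be illustrated with a small dependency graph. This is a sketch under an assumption: that the cycle discussed in the linked issue is `kedro` requiring `kedro-datasets` while `kedro-datasets` also requires `kedro`. The helper `install_order` is made up for the example and uses the standard library's `graphlib` (Python 3.9+).

```python
from graphlib import CycleError, TopologicalSorter


def install_order(dependencies):
    """Return packages ordered so dependencies come first, or None if cyclic."""
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError:
        return None


# Each key maps a package to the set of packages it depends on.
circular = {"kedro": {"kedro-datasets"}, "kedro-datasets": {"kedro"}}
proposed = {"kedro": {"kedro-datasets"}, "kedro-datasets": set()}

circular_order = install_order(circular)
proposed_order = install_order(proposed)

print(circular_order)
print(proposed_order)
```

Under the proposal the graph has a single direction, so a valid order exists: `kedro-datasets` first, then `kedro`.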
Cons
- Arguably things like `AbstractDataSet` are closer to core components (e.g. runner, pipeline) than they are to dataset implementations, which are really the optional extra thing that should be split out.
- Probably some other things we haven’t thought of yet
Key questions
- Why was all the stuff in `kedro.io` supposed to remain in `kedro` rather than move to `kedro-datasets`? How important is that?
- Do the files I suggest moving make sense? Did I miss any dependencies here? I only glanced through to get a rough idea; would be good to do it more carefully.
- What did I miss here?
Issue Analytics
- Created a year ago
- Comments: 7 (7 by maintainers)
Top GitHub Comments
We discussed this issue in a Technical Design session:
- [Making] `kedro-datasets` a dependency of `kedro` seems like a good idea.
- [Should we move] `AbstractDataSet` to `kedro-datasets`? What was @idanov's reasoning to keep it in core `kedro`?
- [...] `kedro.io` should move to `kedro-datasets`. @AntonyMilneQB has already made a suggestion for this, but would like to get a second opinion. @noklam raised a good question above: “How will this work if `kedro.io.data_catalog` stays in `kedro` and `kedro.io.xxxx` stays in `kedro-datasets`”?

Following a discussion with @idanov, we are not going with this proposal. All of `io` will remain in `kedro` and just the datasets will go to `kedro-datasets`. Ivan thinks all implementations of `AbstractDataSet` should be treated like they are written by a third party and split out. If we change `AbstractDataSet` then all third parties (us + anyone else implementing a dataset) will need to modify to match the new interface. `MemoryDataSet` and the like are “core” dataset implementations and so remain in `kedro`. @Galileo-Galilei also made a very good point with his Question 2, point 2 in https://github.com/kedro-org/kedro/issues/1758 that would make the proposal here unworkable.
The solution will instead be to make `kedro` a dependency of `kedro-datasets` and forget about the “nice to have” `pip install kedro[...]`. See https://github.com/kedro-org/kedro/issues/1758 for more.
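The "treat every implementation as third party" argument can be sketched with plain abstract base classes. The classes below are simplified stand-ins, not kedro's real `AbstractDataSet`; the extra method `_validate` and the factory `make_third_party_dataset` are hypothetical and exist only to show what happens to an implementation when the interface it was written against gains a new required method.

```python
from abc import ABC, abstractmethod


class AbstractDataSetV1(ABC):
    """Stand-in for the current interface (simplified, not kedro's class)."""

    @abstractmethod
    def _load(self): ...

    @abstractmethod
    def _save(self, data): ...


class AbstractDataSetV2(AbstractDataSetV1):
    """Stand-in for a changed interface with one more required method."""

    @abstractmethod
    def _validate(self): ...  # hypothetical new requirement


def make_third_party_dataset(base):
    """Build a dataset written against the V1 interface, on the given base."""

    class InMemoryDataSet(base):
        def __init__(self):
            self._data = None

        def _load(self):
            return self._data

        def _save(self, data):
            self._data = data

    return InMemoryDataSet


# Works against the interface it was written for.
dataset = make_third_party_dataset(AbstractDataSetV1)()
dataset._save([1, 2, 3])
loaded = dataset._load()

# Breaks once the base interface changes, wherever the implementation lives.
try:
    make_third_party_dataset(AbstractDataSetV2)()
    broke_on_v2 = False
except TypeError:
    broke_on_v2 = True

print(loaded, broke_on_v2)
```

The same breakage applies to datasets shipped in `kedro-datasets` and to anyone else's, which is the sense in which the maintainers' own implementations are "third party" relative to the interface kept in `kedro`.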