Make kedro-datasets a dependency of kedro?

My latest proposal in the ongoing discussion of kedro-datasets. See https://github.com/kedro-org/kedro/issues/1758 for context.

Proposal

We continue with kedro-datasets becoming a separate namespace package, as we are currently doing in https://github.com/kedro-org/kedro-plugins/pull/49
We also move almost everything in kedro.io (see below for what) to kedro-datasets. This would be another namespace package, so import paths would remain the same as kedro.io
kedro-datasets becomes a core dependency of kedro, i.e. pip install kedro also does pip install kedro-datasets
Extra dependencies like pandas.CSVDataSet are defined now. We implement extras_require in kedro so that pip install kedro[pandas.CSVDataSet] also does pip install kedro-datasets[pandas.CSVDataSet]
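
The extras forwarding described in the last point could be sketched roughly as follows. This is only an illustration under assumptions: forward_extras is a hypothetical helper that does not exist in kedro, and the list of extras is a one-item placeholder. The resulting mapping is the kind of value that setuptools accepts for extras_require.

```python
from typing import Dict, Iterable, List


def forward_extras(
    extras: Iterable[str], package: str = "kedro-datasets"
) -> Dict[str, List[str]]:
    """Map each extra name to a requirement on the same extra of `package`."""
    return {extra: [f"{package}[{extra}]"] for extra in extras}


# Hypothetical subset of extras; a real list would mirror kedro-datasets.
EXTRAS_REQUIRE = forward_extras(["pandas.CSVDataSet"])

# In kedro's setup.py this mapping would be passed along the lines of:
#   setup(..., install_requires=["kedro-datasets"], extras_require=EXTRAS_REQUIRE)
```

With this mapping, pip install kedro[pandas.CSVDataSet] would resolve to a requirement on kedro-datasets[pandas.CSVDataSet], so the extras only need to be defined in one place.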

To be clear, the structure of kedro-datasets would be:

kedro
├── datasets
│   ├── __init__.py
│   └── all the stuff that's there now in https://github.com/kedro-org/kedro-plugins/pull/49
└── io
    ├── __init__.py
    ├── cached_dataset.py
    ├── core.py
    └── ...
# no __init__.py at this level!
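
To illustrate why the missing top-level __init__.py matters, here is a minimal sketch of an implicit namespace package shared by two separately installed directories. It uses a made-up package name, demo_ns, and temporary directories standing in for two distributions, so it says nothing about how kedro or kedro-datasets are actually laid out on disk.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_demo(root: Path) -> None:
    """Create two fake 'distributions' that both contribute to demo_ns."""
    for dist, sub in [("dist_a", "io"), ("dist_b", "datasets")]:
        pkg = root / dist / "demo_ns" / sub
        pkg.mkdir(parents=True)
        # Only the subpackages get an __init__.py; demo_ns itself has none.
        (pkg / "__init__.py").write_text(f"NAME = {sub!r}\n")
        sys.path.insert(0, str(root / dist))


root = Path(tempfile.mkdtemp())
build_demo(root)
importlib.invalidate_caches()

# Both subpackages are importable under the same top-level name.
io_mod = importlib.import_module("demo_ns.io")
datasets_mod = importlib.import_module("demo_ns.datasets")
namespace = sys.modules["demo_ns"]
```

If either directory contained demo_ns/__init__.py, that directory would become a regular package and the other directory's subpackage would no longer be found under the same name.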

Pros

Users have the same flow as they currently have, which is nice and simple: pip install kedro[pandas.CSVDataSet]. They don’t need to worry about the existence of kedro-datasets unless they want to manually update to a version beyond the “officially supported” one specified in the kedro requirements.
In theory you could pip install kedro-datasets and use the datasets by themselves (probably). In practice this might not happen much, but having a self-contained fully-functioning package feels like a clean way to split things up
Definitions of AbstractDataSet etc. live in the same place as the implementations. e.g. if we change AbstractDataSet (which I think we should in the not too distant future: see https://github.com/kedro-org/kedro/issues/1778) then this is much easier to handle
We break the circular dependencies problem of https://github.com/kedro-org/kedro/issues/1758 because point (1) no longer holds: kedro-datasets does not depend on kedro
Documentation problem of https://github.com/kedro-org/kedro/issues/1651 becomes simpler, since all we need to check is which version of kedro-datasets to fetch based on kedro’s requirements.txt
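
For the documentation point, the lookup could be as simple as reading the kedro-datasets line out of a requirements file. The sketch below is an assumption about how that might be done, not existing kedro tooling: find_specifier is a hypothetical helper, and the sample requirements text and version numbers are invented.

```python
import re
from typing import Optional


def find_specifier(requirements_text: str, package: str) -> Optional[str]:
    """Return the version specifier for `package`, or None if it is not listed."""
    pattern = re.compile(
        # Exact package name, optional [extras], then the specifier up to
        # any environment marker (;) or comment (#).
        rf"^{re.escape(package)}(?![A-Za-z0-9._-])(\[[^\]]*\])?\s*(?P<spec>[^#;]*)",
        re.IGNORECASE,
    )
    for line in requirements_text.splitlines():
        match = pattern.match(line.strip())
        if match:
            return match.group("spec").strip()
    return None


# Invented example content; the pin shown is not kedro's real requirement.
sample = "pandas>=1.3\nkedro-datasets~=1.0.0  # hypothetical pin\n"
spec = find_specifier(sample, "kedro-datasets")
```

The negative lookahead keeps a search for kedro from accidentally matching the kedro-datasets line.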

Cons

Arguably things like AbstractDataSet are closer to core components (e.g. runner, pipeline) than they are to dataset implementations, which are really the optional extra thing that should be split out.
Probably some other things we haven’t thought of yet

Key questions

Why was all the stuff in kedro.io supposed to remain in kedro rather than move to kedro-datasets? How important is that?
Do the files I suggest moving make sense? Did I miss any dependencies here? I only glanced through to get a rough idea; would be good to do it more carefully.
What did I miss here?

Top GitHub Comments

merelcht commented, Aug 25, 2022

We discussed this issue in a Technical Design session:

The team is in agreement that making kedro-datasets a dependency of kedro seems like a good idea.
Things that need to be clarified:
- Why should we not move AbstractDataSet to kedro-datasets? What was @idanov's reasoning to keep it in core kedro?
- What parts of kedro.io should move to kedro-datasets? @AntonyMilneQB has already made a suggestion for this, but would like to get a second opinion. @noklam raised a good question above: “How will this work if kedro.io.data_catalog stays in kedro and kedro.io.xxxx stays in kedro-datasets”?

AntonyMilneQB commented, Sep 1, 2022

Following a discussion with @idanov, we are not going with this proposal. All of io will remain in kedro and just the datasets will go to kedro-datasets. Ivan thinks all implementations of AbstractDataSet should be treated like they are written by a third party and split out. If we change AbstractDataSet then all third parties (us + anyone else implementing a dataset) will need to modify to match the new interface. MemoryDataSet and the like are “core” dataset implementations and so remain in kedro. @Galileo-Galilei also made a very good point with his Question 2, point 2 in https://github.com/kedro-org/kedro/issues/1758 that would make the proposal here unworkable.
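
The "treat every implementation as third party" argument can be pictured with a small sketch. The classes below are simplified stand-ins with hypothetical names, not kedro's actual AbstractDataSet or MemoryDataSet. They only show that an implementation depends on nothing but the abstract interface, which is why a change to that interface forces every implementer to update.

```python
from abc import ABC, abstractmethod
from typing import Any


class AbstractDataSetSketch(ABC):
    """Simplified stand-in for an abstract dataset interface (not kedro's class)."""

    def load(self) -> Any:
        return self._load()

    def save(self, data: Any) -> None:
        self._save(data)

    @abstractmethod
    def _load(self) -> Any:
        """Return the stored data."""

    @abstractmethod
    def _save(self, data: Any) -> None:
        """Persist the given data."""


class InMemoryDataSetSketch(AbstractDataSetSketch):
    """A 'third-party' implementation that relies only on the interface."""

    def __init__(self) -> None:
        self._data: Any = None

    def _load(self) -> Any:
        return self._data

    def _save(self, data: Any) -> None:
        self._data = data
```

If a new abstract method were added to the base class, InMemoryDataSetSketch could no longer be instantiated until it implemented that method, which is the kind of update every dataset author would have to make.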

The solution will instead be to make kedro a dependency of kedro-datasets and forget about the “nice to have” pip install kedro[...]. See https://github.com/kedro-org/kedro/issues/1758 for more.
