Make kedro-datasets a dependency of kedro?

My latest proposal in the ongoing discussion of kedro-datasets. See https://github.com/kedro-org/kedro/issues/1758 for context.

Proposal

  1. We continue with kedro-datasets becoming a separate namespace package, as we are currently doing in https://github.com/kedro-org/kedro-plugins/pull/49
  2. We move almost everything (see below for what) in kedro.io to kedro-datasets also. This would be another namespace package, so import paths would remain the same as kedro.io
  3. kedro-datasets becomes a core dependency of kedro. i.e. pip install kedro also does pip install kedro-datasets
  4. Extra dependencies like pandas.CSVDataSet are defined now. We implement extras_require in kedro so that pip install kedro[pandas.CSVDataSet] also does pip install kedro-datasets[pandas.CSVDataSet]
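
Points 3 and 4 could look roughly like the sketch below. This is only an illustration of the idea, not kedro's actual setup.py: the helper name forward_extras and the list of extras are assumptions made for the example.

```python
# Hypothetical sketch: names here are illustrative, not taken from kedro's setup.py.
DATASETS_REQUIREMENT = "kedro-datasets"  # distribution name used in the proposal


def forward_extras(extra_names):
    """Map each kedro extra to the same-named extra on kedro-datasets."""
    return {name: [f"{DATASETS_REQUIREMENT}[{name}]"] for name in extra_names}


extras_require = forward_extras(["pandas.CSVDataSet", "pandas.ParquetDataSet"])

# In a real setup.py this would be passed on roughly as:
# setup(
#     name="kedro",
#     install_requires=["kedro-datasets"],  # point 3: core dependency
#     extras_require=extras_require,        # point 4: forwarded extras
# )
```

With a mapping like this, pip install kedro[pandas.CSVDataSet] would resolve to kedro-datasets[pandas.CSVDataSet], and the extra's real third-party requirements would only need to be declared once, in kedro-datasets.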

To be clear, the structure of kedro-datasets would be:

kedro
├── datasets
│   ├── __init__.py
│   └── all the stuff that's there now in https://github.com/kedro-org/kedro-plugins/pull/49
└── io
    ├── __init__.py
    ├── cached_dataset.py
    ├── core.py
    └── ...
# no __init__.py at this level!
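
The reason the top level has no __init__.py is that Python then treats it as an implicit namespace package (PEP 420), so several distributions can contribute subpackages under one import name. The sketch below demonstrates the mechanism in isolation. It uses a made-up top-level name, demo_ns, and throwaway directories rather than kedro itself, so everything in it is an assumption for illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def make_distribution(root, subpackages):
    """Create root/demo_ns/<sub>/__init__.py for each subpackage.

    Deliberately writes no demo_ns/__init__.py, which is what makes
    demo_ns an implicit namespace package.
    """
    for name, source in subpackages.items():
        pkg_dir = root / "demo_ns" / name
        pkg_dir.mkdir(parents=True)
        (pkg_dir / "__init__.py").write_text(source)


workdir = Path(tempfile.mkdtemp())
core_root = workdir / "core_dist"          # stands in for the core package
datasets_root = workdir / "datasets_dist"  # stands in for the datasets package

make_distribution(core_root, {"runner": "NAME = 'runner'\n"})
make_distribution(
    datasets_root,
    {"io": "NAME = 'io'\n", "datasets": "NAME = 'datasets'\n"},
)

# Both roots go on sys.path, as if two distributions had been installed.
sys.path[:0] = [str(core_root), str(datasets_root)]
importlib.invalidate_caches()

runner = importlib.import_module("demo_ns.runner")
io_pkg = importlib.import_module("demo_ns.io")
datasets = importlib.import_module("demo_ns.datasets")
namespace = importlib.import_module("demo_ns")

# The namespace package spans both directories.
print(list(namespace.__path__))
```

Note that the mechanism only works if no distribution ships a top-level __init__.py for the shared name; a regular package found on sys.path takes precedence over namespace portions and hides them.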

Pros

  • Users have the same flow as they currently have, which is nice and simple: pip install kedro[pandas.CSVDataSet]. They don’t need to worry about the existence of kedro-datasets apart from if they want to manually update to a version beyond the “officially supported” one specified in the kedro requirements
  • In theory you could pip install kedro-datasets and use the datasets by themselves (probably). In practice this might not happen much, but having a self-contained fully-functioning package feels like a clean way to split things up
  • Definitions of AbstractDataSet etc. live in the same place as the implementations. e.g. if we change AbstractDataSet (which I think we should in the not too distant future: see https://github.com/kedro-org/kedro/issues/1778) then this is much easier to handle
  • We break the circular dependencies problem of https://github.com/kedro-org/kedro/issues/1758 because point (1) no longer holds: kedro-datasets does not depend on kedro
  • Documentation problem of https://github.com/kedro-org/kedro/issues/1651 becomes simpler, since all we need to check is which version of kedro-datasets to fetch based on kedro’s requirements.txt

Cons

  • Arguably things like AbstractDataSet are closer to core components (e.g. runner, pipeline) than they are to dataset implementations, which are really the optional extra thing that should be split out.
  • Probably some other things we haven’t thought of yet

Key questions

  • Why was all the stuff in kedro.io supposed to remain in kedro rather than move to kedro-datasets? How important is that?
  • Do the files I suggest moving make sense? Did I miss any dependencies here? I only glanced through to get a rough idea; would be good to do it more carefully.
  • What did I miss here?

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

merelcht commented, Aug 25, 2022 (2 reactions)

We discussed this issue in a Technical Design session:

  • The team agrees that making kedro-datasets a dependency of kedro seems like a good idea.
  • Things that need to be clarified:
    • Why should we not move AbstractDataSet to kedro-datasets? What was @idanov's reasoning to keep it in core kedro?
    • What parts of kedro.io should move to kedro-datasets? @AntonyMilneQB has already made a suggestion for this, but we would like to get a second opinion. @noklam raised a good question above: “How will this work if kedro.io.data_catalog stays in kedro and kedro.io.xxxx stays in kedro-datasets”?

AntonyMilneQB commented, Sep 1, 2022 (1 reaction)

Following a discussion with @idanov, we are not going with this proposal. All of io will remain in kedro and just the datasets will go to kedro-datasets. Ivan thinks all implementations of AbstractDataSet should be treated like they are written by a third party and split out. If we change AbstractDataSet then all third parties (us + anyone else implementing a dataset) will need to modify to match the new interface. MemoryDataSet and the like are “core” dataset implementations and so remain in kedro. @Galileo-Galilei also made a very good point with his Question 2, point 2 in https://github.com/kedro-org/kedro/issues/1758 that would make the proposal here unworkable.

The solution will instead be to make kedro a dependency of kedro-datasets and forget about the “nice to have” pip install kedro[...]. See https://github.com/kedro-org/kedro/issues/1758 for more.
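
The interface argument in the comment above can be illustrated with a small stand-in. This is a sketch only: the AbstractDataSet class here is a simplified imitation written for the example, not kedro's actual class, and both subclasses are hypothetical.

```python
from abc import ABC, abstractmethod


class AbstractDataSet(ABC):
    """Simplified stand-in for the interface kept in core (not kedro's real class)."""

    @abstractmethod
    def _load(self):
        """Return the stored data."""

    @abstractmethod
    def _save(self, data):
        """Persist the given data."""

    def load(self):
        return self._load()

    def save(self, data):
        self._save(data)


class InMemoryListDataSet(AbstractDataSet):
    """Hypothetical implementation, as if shipped by a separate package."""

    def __init__(self):
        self._data = None

    def _load(self):
        return self._data

    def _save(self, data):
        self._data = list(data)


class IncompleteDataSet(AbstractDataSet):
    """Hypothetical implementation that has not caught up with the interface."""

    def _load(self):
        return None


dataset = InMemoryListDataSet()
dataset.save((1, 2, 3))
print(dataset.load())

try:
    IncompleteDataSet()
except TypeError as error:
    # Instantiation fails because _save is not implemented.
    print(f"cannot instantiate: {error}")
```

Any change to the abstract methods affects every implementer in the same way, whether the implementation lives in kedro-datasets or in someone else's package, which is the sense in which the datasets are treated as third-party code.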
