kedro-datasets: dependencies and package structure. Are we doing the right thing?

Context

Let’s pause and take stock of where we are in https://github.com/kedro-org/kedro/issues/1457. This is where I think things stand:

  • @idanov planned for us to move kedro's datasets into a new package kedro-datasets. This would mean users do pip install kedro-datasets[pandas.CSVDataSet] and imports become from kedro_datasets import ...
  • @deepyaman suggested using a namespaced package for kedro-datasets. In short, this would mean that it’s still a separate pip installable package but the import path would still come from the kedro namespace: from kedro.datasets import ...
  • this was generally agreed to be a good idea. The motivation for splitting out kedro-datasets is more for distribution purposes rather than us suggesting that datasets could be used independently of kedro
  • this would mean that instead of doing pip install kedro[pandas.CSVDataSet], a user would do pip install kedro kedro-datasets[pandas.CSVDataSet]. I argued that this doesn't seem like such a smooth user journey, and it's also a bit confusing to pip install kedro-datasets but then import from kedro.datasets rather than from kedro_datasets
  • hence we decided we would maintain the "redirect" in which kedro's extras_require would ensure that doing pip install kedro[pandas.CSVDataSet] works as it does now. The intention is not purely backwards compatibility: this would be the recommended way to install kedro-datasets, so that e.g. even in requirements.txt files you would specify kedro[...] rather than kedro-datasets. See #1495 for more details
  • @noklam raised a very good question about how documentation would work for kedro-datasets: #1651. We decided that it should remain part of the core kedro documentation (i.e. live in the same place as the API docs on RTD, as it does now). I set out a plan for how we could achieve this, but it's very complicated and not 100% satisfactory
  • while trying to come up with a solution for the documentation question, my spidey sense started tingling. Something didn’t feel quite right, and I thought that beyond the complexity of handling documentation, there may be some deeper issue here with how we’re handling kedro-datasets. I discussed with @deepyaman briefly, who had some interesting ideas

Note. Regardless of whether it's a namespace package or not, most of the time something from kedro-datasets is used you wouldn't actually need to write the import explicitly, since in the data catalog you don't specify the full import path to the dataset type but just pandas.CSVDataSet.
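To make that concrete, here is a minimal sketch (the entry name and file path are made up) of how a catalog entry resolves the short string pandas.CSVDataSet without the user ever importing the class:

    # Minimal sketch: the dataset class is resolved from the string "pandas.CSVDataSet";
    # the user never writes an import for it. (Entry name and file path are made up.)
    from kedro.io import DataCatalog

    catalog = DataCatalog.from_config(
        {
            "companies": {
                "type": "pandas.CSVDataSet",  # short form, not a full import path
                "filepath": "data/01_raw/companies.csv",
            }
        }
    )
    df = catalog.load("companies")  # kedro locates and instantiates the dataset class

So whichever package ends up owning the implementation, most users would only ever see this string in catalog.yml.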

Concerns

The current concerns are (feel free to add if anyone has any others):

  • are we making bad circular dependencies?
  • is there a better packaging model altogether that we've not considered? e.g. a metapackage, which doesn't directly conflict with the current approach but would influence decisions we make now

Overall, the kedro-datasets work is quite complex. When it was first planned, we were not aware of the possibility of a namespace package, which changes the way we think about it quite a bit. I am concerned that we have not quite got the scheme right yet and might be missing something that would reduce overall complexity. My suggestion to resolve the situation:

  • let’s discuss the circular dependencies issue. Hopefully it’s not a problem at all, but I would like to feel more confident about this
  • let’s investigate how other libraries are handling similar situations. e.g. I believe the idea for kedro-datasets might have been inspired by how django packages different components (?). @deepyaman mentioned jupyter’s metapackage approach. Again, maybe what we are doing is the best approach, but I would like to feel more confident about this. Just as we missed the possibility of namespace packages in the first place, maybe we’re missing something big here

We don't need to completely pause work on kedro-datasets while we resolve these questions, but I think the outcome does affect some of the tickets (e.g. #1651, #1495). I do think, however, that we shouldn't release kedro-datasets before we're really confident about these.

Circular dependencies

This is what first set my spidey sense tingling.

  1. kedro is a dependency of kedro-datasets.
  2. to enable pip install kedro[pandas.CSVDataSet], kedro-datasets becomes an optional dependency of kedro through extras_require

(1) initially seemed non-negotiable to me, but @deepyaman pointed out that maybe it isn't (see the conversation below). We don't have to do (2), since we could just require people to pip install kedro-datasets, but it previously felt like at least a "nice to have".

Key question: is this form of circular dependency going to cause problems?

  • if yes, we need to change one of the above 2 points, i.e. either not specify kedro as a dependency of kedro-datasets or revert the decision to enable pip install kedro[pandas.CSVDataSet] and go back to pip install kedro-datasets. This would overall simplify things quite a bit but comes with some disadvantages (most important: not such a smooth user experience, less important: import paths don’t match package name)
  • if no, great. Let's continue as we are. But we need to think carefully about exactly what kedro's extras_require points to (e.g. kedro-datasets~=1.0 is the current plan) and likewise what kedro-datasets specifies as its kedro version specifier (a rough sketch of both sides follows below)
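To make the two sides of the potential cycle explicit, here is a rough sketch of the declarations involved (two separate setup.py files shown together; the version specifiers and extra names are assumptions for illustration, not the agreed plan):

    # kedro/setup.py (sketch): the extras_require "redirect"
    from setuptools import setup

    setup(
        name="kedro",
        # ... core install_requires ...
        extras_require={
            # pip install kedro[pandas.CSVDataSet] forwards to the matching
            # extra of kedro-datasets (assumed pin)
            "pandas.CSVDataSet": ["kedro-datasets[pandas.CSVDataSet]~=1.0"],
        },
    )

    # kedro-datasets/setup.py (sketch): kedro as a runtime dependency
    setup(
        name="kedro-datasets",
        install_requires=["kedro>=0.18"],  # assumed lower bound; the exact specifier is the open question
        extras_require={
            "pandas.CSVDataSet": ["pandas~=1.3"],  # assumed pandas pin
        },
    )

As far as I know pip can install such a pair because the loop only exists through an optional extra, but the interplay of the two specifiers is exactly what we need to check before releasing.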

My discussion with @deepyaman: [screenshot of the conversation, not reproduced here]

(Note that the last comment there considers that we should not allow pip install kedro[...] and should instead be explicit about pip install kedro-datasets.)

Are we missing something?

Maybe there is a whole different way of handling the kedro vs. kedro-datasets split which would resolve the question of dependencies, what a user should pip install, how to handle the namespace, etc. e.g. @deepyaman suggested a kedro metapackage in which kedro-framework and kedro-datasets are both namespaced packages underneath that.

We don't need to commit to implementing the kedro-framework split now if we don't want to, but I think it would be good to get a feeling for whether this is a route we might want to go down in future, because it influences our current decision on how to handle kedro-datasets, e.g. it might convince us that pip install kedro[pandas.CSVDataSet] is good or bad. [screenshots of the conversation, not reproduced here]
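For reference, a metapackage in the Jupyter sense is typically an (almost) empty distribution whose only job is to pin compatible versions of the real packages. A hypothetical sketch, with all names and pins made up for illustration:

    # Hypothetical "kedro" metapackage setup.py: it ships no modules of its own
    # and simply depends on the real packages (names and pins are illustrative only).
    from setuptools import setup

    setup(
        name="kedro",
        version="0.19.0",
        packages=[],  # no code here, just a bundle of dependencies
        install_requires=[
            "kedro-framework==0.19.0",  # hypothetical core package
            "kedro-datasets~=1.0",      # the datasets package
        ],
    )

Under that model pip install kedro would remain the easy path, while advanced users could install or upgrade the subpackages individually.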

Issue Analytics

  • State: closed
  • Created a year ago
  • Reactions: 4
  • Comments: 11 (11 by maintainers)

Top GitHub Comments

6 reactions
Galileo-Galilei commented, Sep 9, 2022

Hi kedro team,

I've followed your journey towards making kedro-datasets an independent package with great attention, and I'd like to share my thoughts on some of the questions that still seem open on this topic.

Question 1: Should kedro really split kedro-datasets into a separate package?

In my opinion, this is a big yes, because it will tremendously improve enterprise support, provided some specific implementation choices that I'll detail further below.

The major benefit I expect from this split, apart from the ones summarised above by @AntonyMilneQB, is the ability to upgrade only partially between major versions of the framework (technically, in SemVer terms, I am talking about minor versions, but you understand what I mean: kedro 0.16, kedro 0.17, kedro 0.18).

Kedro is becoming more and more prevalent in the industry, but users can't pay migration costs very often. My team moved this summer from 0.16.5 to 0.18.2, and reading the Discord or the various GitHub issues, it seems that many users are still stuck on 0.16 and 0.17 versions. The download statistics on pepy also indicate that the 0.17.x series is more used than 0.18.x, and that 0.16.x, albeit less downloaded, is not completely abandoned by users.

I feel from personal experience (it would need some user research to confirm or quantify this) that what scares users and prevents them from migrating is the template changes. This is a bit ironic, since changing the template is often a matter of a couple of minutes, but there is a cost to understanding where objects go in each new template. My intuition is that most users would migrate much more often if they could just pip install the newest kedro version.

Some good news though: the motivation for migration is very often (once again, based on personal experience) to get some improvements for datasets, for instance:

  • newer datasets that do not exist in old versions
  • annoying bugs in some datasets (e.g. old implementations of MatplotlibWriter)
  • incomplete features for some datasets (e.g. old implementations of ApiDataSet)
  • new fsspec protocol in more recent versions (smb, ftp, abfss…)
  • outdated dependencies in kedro's requirements which create conflicts with other libraries (e.g. fsspec<0.7 in kedro 0.16 breaks many packages!)

It would feel much more modular and safer to be able to upgrade an application in production gradually by upgrading only the kedro-datasets version in its requirements rather than modifying the entire template, and it would solve all of the common feature requests above.

Obviously, users will have to migrate entirely at some point, but being able to upgrade datasets much faster than we can now would be a tremendous improvement for production maintenance (my team has maintained custom plugins for fsspec connections with unsupported protocols for two years because we were not able to migrate; it would be awesome to just bump a version number with kedro-datasets!).

Question 2: What should the dependency relationship between kedro and kedro-datasets be?

Three scenarios are on the table at the moment. Based on question 1, assuming we want to enable upgrading kedro-datasets alongside very old kedro versions:

  1. kedro-datasets depends on kedro and vice versa. This scenario is a no-go for me. Apart from the circular dependency issue you are discussing above, it makes the desired ability to upgrade only kedro-datasets easily (cf. question 1) almost impossible to achieve. Indeed, kedro-datasets would reinstall a newer version of kedro incompatible with the template of your old project, unless the requirement bounds are very wide, which is unlikely.

  2. kedro depends on kedro-datasets but not the other way around.

This feels quite natural, because it avoids asking users to install both packages. However, I would find it very unpleasant if the upper bound on kedro-datasets was too tight and prevented me from upgrading easily. This is very likely if any upper bound is set, because many breaking changes in kedro-datasets would not be breaking from kedro's point of view (i.e. a breaking change will occur in one specific dataset implementation, but not in the "core" module: the AbstractDataSet will still have load and save methods). It is very likely that users will want to benefit from breaking changes to specific dataset implementations and be able to upgrade the package, which will raise pip VersionConflict errors if the upper bound is set too tight.

  3. Ask users to install kedro and kedro-datasets separately

This is my preferred option, because it would make upgrades across versions very easy, since the user would be responsible for managing the dependencies.

I understand that it is less user-friendly and that you would likely get a lot of users saying they'd like both installed automatically, but if they get a very clear error message on their first kedro run, I guess it should be pretty OK. Another possibility is to make kedro-datasets a dependency of kedro with no upper bound, but I'm pretty sure you won't like that option 😃

As a side note, I totally agree that the documentation should still be hosted in the same place, whatever is decided in the end, for two reasons:

  • it would make it clear that users have to install kedro-datasets
  • everything will be searchable in the same place

Question 3: What parts of the kedro.io folder should move to kedro-datasets?

Basically, there seems to be a consensus that all specific implementations plus the lambda/memory/partitioned/cached datasets (as well as the load_obj utils) should be moved to kedro-datasets, and that feels completely natural.

Regarding AbstractDataSet and AbstractVersionedDataSet, I am completely convinced they belong in kedro-datasets. The key arguments are:

  • not moving them would mean kedro-datasets has kedro as a dependency, which is the worst scenario described in question 2
  • it would make it possible to create custom datasets without importing kedro (see the sketch after this list). From an upgrade perspective, it would be great for a custom implementation inside a project to benefit from any improvement to these base classes.
  • kedro should not need to know how AbstractDataSet works under the hood. The only "contract" between the two is that a dataset has a load and a save method. You already rely on this kind of contract elsewhere, e.g. by assuming the pickle library has load and dumps methods.
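To illustrate that contract, here is a minimal custom dataset sketch, assuming the current kedro.io.AbstractDataSet interface (_load/_save/_describe); the class name and file handling are made up:

    # Minimal custom dataset sketch against the assumed AbstractDataSet contract.
    from pathlib import Path
    from typing import Any, Dict

    from kedro.io import AbstractDataSet  # would come from kedro-datasets after the split


    class TextDataSet(AbstractDataSet):
        """Hypothetical dataset that reads and writes a plain text file."""

        def __init__(self, filepath: str):
            self._filepath = Path(filepath)

        def _load(self) -> str:
            return self._filepath.read_text()

        def _save(self, data: str) -> None:
            self._filepath.write_text(data)

        def _describe(self) -> Dict[str, Any]:
            return {"filepath": str(self._filepath)}

Nothing here needs the rest of kedro, which is the point being made above.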

Regarding the DataCatalog, I have less strong feelings, but I feel it should be part of kedro-datasets too. It is the native "container" for datasets, I don't think people have ever customised it (but I may be wrong!), and if someone wants to use the package without the rest of kedro, it seems natural to have these utilities accessible directly.

Question 4: Should kedro namespace kedro-datasets?

Since the first time this idea was suggested, I have felt there are many more drawbacks than advantages, but I understand the arguments at stake here.

Overall, I think there are many cons to this:

  • the engineering setup cost seems higher than what you expected at first (though this is not really a con; it is entirely up to you to estimate whether it is worth the cost)
  • it makes it very confusing for users to know what's going on internally. It is quite easy to understand that kedro_datasets.pandas.CSVDataSet imports from that package (and easy to go and check the code), while kedro.pandas.CSVDataSet largely obscures the fact that the code lives in the kedro-datasets package.
  • if I understand correctly, the main motivation for this namespacing is to let people who use the absolute import path in their catalog instead of the usual short form (i.e. the ones who currently use kedro.extras.datasets.pandas.CSVDataSet instead of pandas.CSVDataSet) upgrade transparently. This does not seem like a good motivation because:
    • these people are very likely a small minority of users
    • these people will have to pay migration costs for 0.19.0 anyway, and I am deeply convinced that updating the catalog path will be extremely easy for them, because they understand the underlying import mechanism.
    • even worse, it may be counterproductive because:
      • it may be counterintuitive for them (why should I still use kedro.extras when the code is in kedro-datasets?)
      • it likely works against their initial motivation (I guess the point of using the absolute path is to make clear to readers where the code is, and if you read an import written as kedro.extras.datasets.pandas.CSVDataSet but there is no such folder in the kedro repo, that is very confusing)
  • there are very dangerous side effects:

Unanswered questions:

  • what is the right way to package and release a distribution of subpackages instead of a single package? I am no expert and don't know what the recommended best practices are here, but tidyverse is a well-known distribution in R which may be informative. The key idea is that you can install each package separately (e.g. kedro-datasets and kedro-framework) AND install the entire distribution (pip install kedro), so you get both flexibility and ease of upgrade (if packages are installed separately) and ease of install (if the user installs the entire distribution).
1 reaction
AhdraMeraliQB commented, Sep 15, 2022

Notes from technical design discussion on 14 September

After considering the four approaches outlined above, we agreed that the most correct way to proceed would be the metapackage (option 4), but the engineering costs involved were not justified by the added value of being able to import from kedro.datasets instead of kedro_datasets. Additionally, once implemented, metapackaging is very difficult to reverse while minimising the impact on users; it is currently just too high a commitment. As such, we will be closing this issue and #1693.

Points to follow up on

@yetudada highlighted the added complexity for users should we continue to separate out kedro-datasets without namespacing. We should conduct some user interviews to gauge how they feel about splitting out the datasets.
