kedro-datasets: dependencies and package structure. Are we doing the right thing?
Context
Let’s pause and take stock of where we are in https://github.com/kedro-org/kedro/issues/1457. This is where I think things stand:
- @idanov planned for us to move kedro datasets into a new package, kedro-datasets. This would mean users do pip install kedro-datasets[pandas.CSVDataSet] and imports become from kedro_datasets import ...
- @deepyaman suggested using a namespaced package for kedro-datasets. In short, this would mean that it’s still a separate pip install-able package but the import path would still come from the kedro namespace: from kedro.datasets import ...
- this was generally agreed to be a good idea. The motivation for splitting out kedro-datasets is more for distribution purposes rather than us suggesting that datasets could be used independently of kedro
- this would mean that instead of doing pip install kedro[pandas.CSVDataSet], a user would do pip install kedro kedro-datasets[pandas.CSVDataSet]. I argued that this doesn’t seem like such a smooth user journey and also it’s actually a bit confusing to pip install kedro-datasets but then import from kedro.datasets rather than from kedro_datasets
- hence we decided we would maintain the “redirect” in which kedro’s extras_require would ensure that doing pip install kedro[pandas.CSVDataSet] would work as it does now. The intention with this is not purely for backwards compatibility but the recommended way to install kedro-datasets, so that e.g. even in requirements.txt files you would not specify kedro-datasets but instead kedro[...]. See #1495 for more details
- @noklam raised a very good question about how documentation would work for kedro-datasets: #1651. We decided that it should remain part of the core kedro documentation (i.e. live in same place as API docs on RTD that it does now). I set out a plan for how we could achieve this, but it’s very complicated and not 100% satisfactory
- while trying to come up with a solution for the documentation question, my spidey sense started tingling. Something didn’t feel quite right, and I thought that beyond the complexity of handling documentation, there may be some deeper issue here with how we’re handling kedro-datasets. I discussed with @deepyaman briefly, who had some interesting ideas
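The namespaced-package idea from the list above can be illustrated with a minimal, self-contained sketch. It assumes implicit (PEP 420) namespace packages and uses a hypothetical namespace acme_ns in place of kedro, so it runs without touching a real Kedro installation. Two separately "installed" directories each contribute one module to the same import namespace.

```python
import importlib
import sys
import tempfile
from pathlib import Path

NAMESPACE = "acme_ns"  # hypothetical namespace, standing in for "kedro"


def make_distribution(root: Path, module: str, source: str) -> None:
    """Write <root>/acme_ns/<module>.py with no __init__.py next to it.

    Leaving out __init__.py is what makes acme_ns an implicit namespace package.
    """
    package_dir = root / NAMESPACE
    package_dir.mkdir(parents=True)
    (package_dir / f"{module}.py").write_text(source)


def import_from_two_distributions() -> dict:
    with tempfile.TemporaryDirectory() as tmp:
        core = Path(tmp) / "core_dist"        # stands in for the "kedro" distribution
        extras = Path(tmp) / "datasets_dist"  # stands in for "kedro-datasets"
        make_distribution(core, "framework", "NAME = 'framework'\n")
        make_distribution(extras, "datasets", "NAME = 'datasets'\n")

        # Simulate both distributions being installed side by side.
        sys.path[:0] = [str(core), str(extras)]
        try:
            importlib.invalidate_caches()
            framework = importlib.import_module(f"{NAMESPACE}.framework")
            datasets = importlib.import_module(f"{NAMESPACE}.datasets")
            namespace = importlib.import_module(NAMESPACE)
            return {
                "framework": framework.NAME,
                "datasets": datasets.NAME,
                # Number of directories that make up the one namespace.
                "locations": len(list(namespace.__path__)),
            }
        finally:
            sys.path.remove(str(core))
            sys.path.remove(str(extras))


result = import_from_two_distributions()
print(result)
```

Both modules import under the same top-level name even though they live in different directories, which is the property that would let a separately installed kedro-datasets be imported as kedro.datasets.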
Note. Regardless of whether it’s a namespace package or not, most times something from kedro-datasets is used you wouldn’t actually need to do this import explicitly, since in the data catalog you don’t need to specify the full import path to the dataset type but rather just pandas.CSVDataSet.
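To make the note concrete, here is a rough sketch of how a short type name could be resolved by trying a list of package prefixes. This is an illustration only, not Kedro's actual implementation: resolve_type and SEARCH_PREFIXES are hypothetical names, and standard-library classes stand in for datasets so the example runs without Kedro installed.

```python
import importlib
from typing import Iterable, Optional


def resolve_type(short_name: str, prefixes: Iterable[str]) -> Optional[type]:
    """Return the first class found by prepending each prefix to short_name."""
    for prefix in prefixes:
        module_path, _, class_name = f"{prefix}{short_name}".rpartition(".")
        try:
            module = importlib.import_module(module_path)
        except (ImportError, ValueError):
            # Module not found under this prefix (or empty module name): try the next.
            continue
        candidate = getattr(module, class_name, None)
        if candidate is not None:
            return candidate
    return None


# "decoder.JSONDecoder" plays the role of "pandas.CSVDataSet", and "json."
# plays the role of whichever package ends up hosting the datasets.
SEARCH_PREFIXES = ["not_a_real_package.", "json."]
resolved = resolve_type("decoder.JSONDecoder", SEARCH_PREFIXES)
print(resolved)
```

With a scheme like this, where the datasets physically live only changes the prefix list, not what users write in the catalog.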
Concerns
The current concerns are (feel free to add if anyone has any others):
- are we making bad circular dependencies?
- is there just a whole better packaging model that we’ve not considered? e.g. metapackage, which doesn’t directly conflict with the current approach but would influence decisions we make now
Overall, the kedro-datasets work is quite complex. When it was first planned, we were not aware of the possibility of a namespace package, which changes the way we think about it quite a bit. I am concerned that we have not quite got the scheme right yet and might be missing something that would reduce overall complexity. My suggestion to resolve the situation:
- let’s discuss the circular dependencies issue. Hopefully it’s not a problem at all, but I would like to feel more confident about this
- let’s investigate how other libraries are handling similar situations. e.g. I believe the idea for kedro-datasets might have been inspired by how django packages different components (?). @deepyaman mentioned jupyter’s metapackage approach. Again, maybe what we are doing is the best approach, but I would like to feel more confident about this. Just as we missed the possibility of namespace packages in the first place, maybe we’re missing something big here
We don’t need to completely pause work on kedro-datasets while we resolve these questions, but I think the outcome does affect some of the tickets (e.g. #1651, #1495). I do think, however, that we shouldn’t release kedro-datasets before we’re really confident on these.
Circular dependencies
This is what first set my spidey sense tingling.
- (1) kedro is a dependency of kedro-datasets.
- (2) to enable pip install kedro[pandas.CSVDataSet], kedro-datasets becomes an optional dependency of kedro through extras_require
(1) initially seemed to be non-negotiable to me but @deepyaman pointed out maybe that’s not right (see below conversation). We don’t have to do (2) since we can just require people to pip install kedro-datasets, but it felt like at least a “nice to have” before.
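The cycle formed by (1) and (2) can be sketched as a small dependency graph. The metadata below is a simplified, hypothetical model (optional extras are treated as ordinary edges) and find_cycle is an illustrative helper, not part of any packaging tool. It only shows that the cycle exists, not whether installers cope with it.

```python
from typing import Dict, List, Optional

# Hypothetical, simplified metadata: each package maps to the packages it pulls in.
REQUIREMENTS: Dict[str, List[str]] = {
    "kedro": ["kedro-datasets"],   # point (2): optional, via extras_require
    "kedro-datasets": ["kedro"],   # point (1): hard dependency
}


def find_cycle(graph: Dict[str, List[str]], start: str) -> Optional[List[str]]:
    """Depth-first search returning one dependency cycle reachable from start."""
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        if node in path:
            # Revisiting a package on the current path closes a cycle.
            return path[path.index(node):] + [node]
        path.append(node)
        for dependency in graph.get(node, []):
            cycle = visit(dependency)
            if cycle:
                return cycle
        path.pop()
        return None

    return visit(start)


print(find_cycle(REQUIREMENTS, "kedro"))

# Dropping point (2) removes the cycle.
without_extra = {"kedro": [], "kedro-datasets": ["kedro"]}
print(find_cycle(without_extra, "kedro-datasets"))
```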
Key question: is this form of circular dependency going to cause problems?
- if yes, we need to change one of the above 2 points, i.e. either not specify kedro as a dependency of kedro-datasets or revert the decision to enable pip install kedro[pandas.CSVDataSet] and go back to pip install kedro-datasets. This would overall simplify things quite a bit but comes with some disadvantages (most important: not such a smooth user experience, less important: import paths don’t match package name)
- if no, great. Let’s continue as we are. But we need to think carefully about exactly what kedro’s extras_require points to (e.g. kedro-datasets~=1.0 is the current plan) and likewise what kedro-datasets specifies as its kedro version specifier
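To show what a specifier such as kedro-datasets~=1.0 would and would not allow, here is a simplified sketch of the PEP 440 compatible-release rule. It assumes plain numeric versions only (no pre-releases, epochs or local versions); compatible_release is an illustrative helper, and the version numbers are made up. Real tooling would typically rely on the packaging library rather than a hand-rolled check.

```python
from typing import Tuple


def parse(version: str) -> Tuple[int, ...]:
    """Parse a plain dotted numeric version such as '1.4.5'."""
    return tuple(int(part) for part in version.split("."))


def compatible_release(candidate: str, spec: str) -> bool:
    """Simplified `~=` check.

    `~=X.Y` means `>= X.Y` and `== X.*`;
    `~=X.Y.Z` means `>= X.Y.Z` and `== X.Y.*`.
    """
    base = parse(spec)
    if len(base) < 2:
        raise ValueError("`~=` needs at least two release segments")
    version = parse(candidate)

    # Pad with zeros so that e.g. "1" compares equal to "1.0".
    width = max(len(version), len(base))
    version += (0,) * (width - len(version))
    padded_base = base + (0,) * (width - len(base))

    prefix = base[:-1]  # every segment except the last must match exactly
    return version >= padded_base and version[: len(prefix)] == prefix


print(compatible_release("1.3.0", "1.0"))  # within the 1.x series
print(compatible_release("2.0.0", "1.0"))  # next major series
```

Under this rule a new major release of kedro-datasets would fall outside ~=1.0, so a kedro release pinned that way would not pick it up until the pin itself is updated.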
My discussion with @deepyaman:
(Note the last comment here is considering that we should not allow pip install kedro[...] and instead be explicit about pip install kedro-datasets.)
Are we missing something?
Maybe there is a whole different way of handling the kedro vs. kedro-datasets split which would resolve the question of dependencies, what a user should pip install, how to handle the namespace, etc. e.g. @deepyaman suggested a kedro metapackage in which kedro-framework and kedro-datasets are both namespaced packages underneath that.
We don’t need to commit to implementing the kedro-framework split now if we don’t want to, but I think it would be good to get a feeling for whether this is a route we might want to go down in future because it influences our current decision on how to handle kedro-datasets. e.g. it might convince us that pip install kedro[pandas.CSVDataSet] is good or bad.
Top GitHub Comments
Hi kedro team,
I’ve followed with great attention your journey on making kedro-datasets an independent package, and I’d like to share my thoughts on some of the questions which seem still open on this topic.
Question 1 : Should kedro really split kedro-datasets in a separate package?
In my opinion, this is a big yes because it will tremendously improve enterprise support, provided some specific implementation that I’ll detail further.
The major benefit I expect from this split, apart from the ones summarised above by @AntonyMilneQB, is the ability to upgrade only partially between major versions of the framework (technically, in terms of SemVer, I am talking about minor versions, but you understand what I mean: kedro-0.16, kedro-0.17, kedro-0.18).
Kedro is becoming more and more prevalent in the industry, but users can’t pay migration costs very often. My team moved this summer from 0.16.5 to 0.18.2, and reading the discord or the various github issues, it seems that many users are still stuck on 0.16 and 0.17 versions. The download statistics on pepy also indicate that the 0.17.x series is more used than the 0.18.x series, and that 0.16.x, albeit less downloaded, is not completely abandoned by users.
I feel from personal experience (maybe it would need some user research to confirm / quantify it) that what scares users and prevents them from migrating are the template changes. This is a bit ironic since changing the template is often a matter of a couple of minutes, but there is a cost of understanding where objects go in each new template. My intuition is that most of them would migrate much more often if they could just pip install kedro with the newest version.
Some good news though: the motivation for migration is very often (once again, based on personal experience) to get some improvements for datasets, for instance:
- […] (MatplotlibWriter)
- […] (ApiDataSet)
- […] fsspec protocol in more recent versions (smb, ftp, abfss…)
- […] (fsspec<0.7 in kedro-0.16 is breaking many packages!)
It would feel much more modular and safer to be able to upgrade an application in production gradually by upgrading only the kedro-datasets version in its requirements rather than modifying the entire template, and it would make it possible to address all the common feature requests above.
Obviously, users will have to migrate entirely at some point, but being able to upgrade datasets much faster than we are able to do now would be a tremendous improvement for production maintenance (my team has maintained custom plugins for fsspec connections with unsupported protocols for two years because we were not able to migrate, while it would be awesome to just upgrade a version number with kedro-datasets!).
Question 2 : What should be the dependencies relationship between kedro and kedro-datasets?
Three scenarios are on the table at the moment. Based on q1, if we want to enable upgrading kedro-datasets with very old kedro-versions:
- kedro-datasets imports kedro and reciprocally. This scenario is a no-go for me. Apart from the circular dependency issue you are facing and discussing above, this makes the desired feature of easily upgrading only kedro-datasets (cf. question 1) almost impossible to achieve. Indeed, kedro-datasets would reinstall a newer version of kedro incompatible with the template of your old project, unless the requirement bounds are very wide, which is unlikely.
- kedro imports kedro-datasets but not the opposite. This feels quite natural, because it avoids asking users to install both packages. However, I would find this very unpleasant if the kedro-datasets upper bound was too tight and prevented me from upgrading easily. This is very likely if any upper bound is set, because many breaking changes in kedro-datasets would not be breaking from the kedro point of view (i.e. a breaking change will occur in one specific dataset implementation, but no breaking change in the “core” module, i.e. the AbstractDataset will still have load and save methods). It is very likely that users do want to benefit from breaking changes to specific dataset implementations and be able to upgrade the package, which will raise pip version conflicts if the upper bound is set too tight.
- kedro and kedro-datasets separately. This is my preferred option, because this would make the updates over versions very easy, since the user would be responsible for managing the dependencies. I understand that it is less user-friendly and that you would likely get a lot of users claiming that they’d like to have both installed automatically, but if they get a very clear error message on their first kedro run, I guess it should be pretty ok. Another possibility is to make kedro-datasets a dependency of kedro with no upper bound, but I’m pretty sure you won’t like this option 😃
As a side note, I totally agree that documentation should still be hosted in the same place whatever is decided in the end for 2 reasons:
Question 3 : what part of the kedro.io folder should move to kedro-datasets?
Basically it seems a consensus that all specific implementations + lambda / memory / partitioned / cached datasets (as well as load_obj utils) should be moved to kedro-datasets and it feels completely natural.
Regarding the AbstractDataSet and AbstractVersionedDataSet, I am completely convinced they belong to kedro-datasets. The key arguments are:
- […] kedro-datasets have kedro as a dependency, which is my worst scenario as described in question 2
- […] AbstractDataSet work under the hood. The only “contract” between the two is that a dataset has a load and a save method. This is already done because you assume the pickle library has load and dumps methods.
Regarding the DataCatalog, I have less strong feelings, but I feel that it should be part of kedro-datasets too. This is the native “container” for datasets, and I don’t think people have ever customized it (but I may be wrong!), and if someone wants to use the package without the rest of kedro, it seems natural to have these utilities accessible directly.
Question 4: should kedro namespace kedro-datasets ?
Ever since this idea was first suggested, I have felt that there are many more drawbacks than advantages, but I understand the arguments at stake here.
Overall, I think that there are many cons to this :
- kedro_datasets.pandas.CSVDataSet imports the module (and it is eventually easy to go check the code), while kedro.pandas.CSVDataSet obfuscates a lot the fact that the code lies in the kedro-datasets package.
- […] kedro.extras.datasets.pandas.CSVDataSet instead of pandas.CSVDataSet). This does not seem a good motivation because:
- […] kedro.extras when the code is in kedro-datasets?)
- […] kedro.extras.datasets.pandas.CSVDataSet but there is no such folder in the kedro repo, this is very confusing)
Unanswered questions:
- tidyverse is a well known distribution in R which may be informative. The key idea is that you can install each package separately (e.g. kedro-datasets and kedro-framework) AND install the entire distribution (pip install kedro) so you can have both flexibility and ease of upgrade (if packages are installed separately) and ease of install (if the user installs the entire distribution).
Notes from technical design discussion on 14 September
After consideration of the 4 approaches outlined above, we agreed that the most correct way to proceed would be to use a metapackage (option 4), but the engineering costs involved were not justified by the value added by being able to import from kedro.datasets instead of kedro_datasets. Additionally, once implemented, it is very difficult to reverse metapackaging whilst minimising the impact on users; it is currently just too high a commitment. As such, we will be closing this issue and #1693.
Points to follow up on
@yetudada highlighted the added complexity for users should we continue to separate out kedro-datasets without namespacing. We should conduct some user interviews to gauge how they feel about splitting out the datasets.