[KED-1273] Using transformers to specify Python objects in DataSets
I just wanted to say first off, I love the software so far. Thank you for releasing this!
I do, however, want to talk about datasets.
Kedro has structured all of its datasets around the pandas dataframe, and I’m not in love with this.
the issue
First off, pandas is a very large package to pull in just for IO. It also unfortunately suffers from a volatile API. For me, it’s not obvious that CSV files ought to be loaded into, or written from, a pandas DataFrame, and I really wouldn’t want to include pandas as a dependency for my project if I’m not using it at all, especially if I’m dockerizing the package. Returning a file descriptor, numpy array, pyarrow table, or something else might be a better choice depending on the use case. The name CSVLocalDataSet certainly doesn’t imply anything about the pandas library; I could easily imagine passing it a nested list instead.
the proposal
My proposal would be to consider adding in some additional modularity regarding what type of Python object you’d like to get in/out of your datasets. Namely, pandas DataFrames should be handled by a Transformer class inheriting from AbstractTransformer, compatible with each of these dataset types (with an optional pandas dependency on import if possible). I think it would be great to have each of the datasets return more natural file descriptors, then have numpy, pandas, dask, etc. transformers which specify how you want to save and load these things to/from a CSV.
This gives users some nice freedom to choose how to handle the dataset objects. It also promotes a nice way of thinking about datasets and transformations from various Python objects, abstracting the Python data classes out of the file IO.
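To make the proposal concrete, here is a minimal sketch using only the standard library. Everything in it is hypothetical: `TextDataSet` stands in for a dataset that only does file IO, and `RowsCSVTransformer` is an illustrative transformer whose `load`/`save` hooks are loosely modeled on my understanding of AbstractTransformer (a dataset name plus the wrapped load or save callable). It is not Kedro's actual implementation.

```python
import csv
import io
from typing import Callable, List


class TextDataSet:
    """Hypothetical dataset: only moves raw text in and out of storage."""

    def __init__(self, buffer: io.StringIO):
        self._buffer = buffer  # stands in for a file on disk, S3, etc.

    def load(self) -> str:
        return self._buffer.getvalue()

    def save(self, data: str) -> None:
        # Overwrite whatever was stored before.
        self._buffer.seek(0)
        self._buffer.truncate()
        self._buffer.write(data)


class RowsCSVTransformer:
    """Hypothetical transformer: CSV text <-> list of rows, no pandas needed."""

    def load(self, data_set_name: str, load: Callable[[], str]) -> List[List[str]]:
        # Call the dataset's own load, then parse the text it returns.
        return list(csv.reader(io.StringIO(load(), newline="")))

    def save(
        self,
        data_set_name: str,
        save: Callable[[str], None],
        data: List[List[str]],
    ) -> None:
        # Serialize the rows, then hand the text to the dataset's own save.
        out = io.StringIO(newline="")
        csv.writer(out, lineterminator="\n").writerows(data)
        save(out.getvalue())


data_set = TextDataSet(io.StringIO())
transformer = RowsCSVTransformer()
transformer.save("example", data_set.save, [["a", "b"], ["1", "2"]])
rows = transformer.load("example", data_set.load)
```

A pandas or numpy transformer would have the same shape, with only the parsing and serializing lines swapped out, so the dataset class itself never needs to import either library.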
Thoughts?
Issue Analytics
- Created 4 years ago
- Reactions: 2
- Comments: 8 (6 by maintainers)
Top GitHub Comments
That makes sense, but I guess what I might suggest is two things:
What’s awkward for me is creating a numpy or pure Python CSV loader. In extending the kedro io classes, I end up writing NumpyCSVLocalDataSet, ListCSVLocalDataSet, or DaskArrayCSVLocalDataSet, while the |X| element is obfuscated in the standard kedro io classes. It was a bit awkward for me to figure out how I ought to name these new datasets I had to create out of need, when the class description of CSVLocalDataSet also naturally describes them.
I understand this is getting verbose, but once you integrate fsspec it sounds like you’ll be eliminating the fs aspect, so you’d end up with: CSVDataSet
So, perhaps naming it PandasCSVDataSet or DataFrameCSVDataSet instead is clearer, and offers a natural naming convention for users implementing their own datasets for CSV io.
Thank you so much for raising this @dasturge. I’m going to answer this one by breaking your issue into two parts:
1. pandas as a dependency
On 1. you’ll be glad to know that we have it on our backlog to remove pandas and numpy as core dependencies in Kedro. The issue evolved out of a request to create a Kedro-Glue plugin and users not being able to do this because of our dependency on pandas and numpy for our built-in datasets (issue #57). So we’re implementing a version of #178 soon.
On 2. we’ll have a discussion on this one. We’re looking at ways to limit the insane amounts of DataSets in the Data Catalog and this might be a solve for this. One change you’ll see in kedro next year is the use of fsspec to abstract file storage to create CSVDataSet, eventually deprecating CSVLocalDataSet, CSVS3DataSet, CSVGCSDataSet but even in this system we would still have CSVDataSet and CSVDaskDataSet and so on. So I’ll circle back and get back to you on this.
Let me tag this issue with a ticket so we can add it to our backlog for discussion.
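On the dependency point, one common way to keep pandas out of a core install is to import it lazily, only inside the code path that needs it. The sketch below illustrates that general pattern; `require_optional` and `load_csv_as_dataframe` are hypothetical names, and this is not a description of how Kedro implemented the change.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, feature: str) -> ModuleType:
    """Import an optional dependency on demand, with a helpful error if absent."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{feature} needs the optional dependency '{module_name}'. "
            f"Install it with: pip install {module_name}"
        ) from exc


def load_csv_as_dataframe(path: str):
    """Hypothetical loader: pandas is imported only when this is called."""
    pd = require_optional("pandas", "load_csv_as_dataframe")
    return pd.read_csv(path)
```

With this shape, importing the module that defines `load_csv_as_dataframe` does not require pandas; the ImportError only surfaces if someone actually asks for a DataFrame without having it installed.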