
[KED-1273] Using transformers to specify Python objects in DataSets


I just wanted to say first off, I love the software so far. Thank you for releasing this!

I do, however, want to talk about datasets.

Kedro has structured all of its datasets around the pandas dataframe, and I’m not in love with this.

the issue

First off, pandas is a very large package to pull in just for I/O, and it unfortunately suffers from a volatile API. For me, it’s not obvious that CSV files ought to be loaded into, or written from, a pandas DataFrame, and I really wouldn’t want to include pandas as a dependency for my project if I’m not using it at all, especially if I’m dockerizing the package. Returning a file descriptor, numpy array, pyarrow table, or something else might be a better choice depending on the use case. The name CSVLocalDataSet certainly doesn’t imply anything about the pandas library; I could easily imagine passing it a nested list instead.

the proposal

My proposal is to consider adding some modularity around which type of Python object you get in and out of your datasets. Namely, pandas DataFrame support should live in a Transformer class that inherits from AbstractTransformer and is compatible with each of these dataset types (with an optional pandas dependency on import, if possible). I think it would be great to have each of the datasets return a more natural file descriptor, and then have numpy, pandas, dask, etc. transformers that specify how you want to save and load these objects to and from a CSV.

This gives users the freedom to choose how to handle the dataset objects. It also promotes a nice way of thinking about datasets and about transformations between various Python objects, abstracting the Python data classes out of the file I/O.
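A minimal sketch of the split proposed above, using only the standard library. The class names echo the issue, but everything here is a hypothetical stand-in defined for illustration, not Kedro's actual API: the dataset owns the file handling, and a pluggable transformer decides which Python object comes out.

```python
import csv
from typing import Any, List, TextIO


class AbstractTransformer:
    """Hypothetical base class: converts between a text handle and a Python object."""

    def load(self, handle: TextIO) -> Any:
        raise NotImplementedError

    def save(self, data: Any, handle: TextIO) -> None:
        raise NotImplementedError


class ListTransformer(AbstractTransformer):
    """Reads and writes CSV as a nested list, with no pandas involved."""

    def load(self, handle: TextIO) -> List[List[str]]:
        return list(csv.reader(handle))

    def save(self, data: List[List[str]], handle: TextIO) -> None:
        csv.writer(handle, lineterminator="\n").writerows(data)


class CSVLocalDataSet:
    """Hypothetical dataset that handles only file I/O, not the object type."""

    def __init__(self, filepath: str, transformer: AbstractTransformer):
        self._filepath = filepath
        self._transformer = transformer

    def load(self) -> Any:
        with open(self._filepath, newline="") as handle:
            return self._transformer.load(handle)

    def save(self, data: Any) -> None:
        with open(self._filepath, "w", newline="") as handle:
            self._transformer.save(data, handle)
```

Under this assumption, a pandas, numpy, or dask transformer would be another small subclass, and only that subclass would need to import the heavy library.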

Thoughts?

Issue Analytics

Created 4 years ago
Reactions: 2
Comments: 8 (6 by maintainers)

Top GitHub Comments

3 reactions

dasturge commented, Dec 19, 2019

That makes sense, but I guess what I might suggest is two things:

1. Make the pandas dependency optional, i.e. required only on import of the pandas io classes (for smaller kedro docker containers when possible).
2. Change the naming conventions on the datasets to reflect the use of pandas.
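One way to make a dependency optional is to defer the import until the dataset is actually used. The sketch below assumes a hypothetical `require` helper and a hypothetical `PandasCSVDataSet` class; neither is taken from Kedro.

```python
import importlib
from types import ModuleType


def require(module_name: str, feature: str) -> ModuleType:
    """Import an optional dependency on first use, with a readable error if absent."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{feature} needs the optional package '{module_name}'; "
            "install it to use this dataset."
        ) from exc


class PandasCSVDataSet:
    """Hypothetical dataset whose name states the object type it returns."""

    def __init__(self, filepath: str):
        # Constructing the dataset does not import pandas.
        self._filepath = filepath

    def load(self):
        # pandas is imported only when a DataFrame is actually requested.
        pd = require("pandas", "PandasCSVDataSet")
        return pd.read_csv(self._filepath)
```

With this pattern, a project that never loads a pandas-backed dataset never needs pandas installed, which is the point of the smaller container.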

What’s awkward for me is creating a numpy or pure Python CSV loader. In extending the kedro io classes, I end up writing:

NumpyCSVLocalDataSet, ListCSVLocalDataSet, or DaskArrayCSVLocalDataSet. Meanwhile, the |X| element is obfuscated in the standard kedro io classes. It was a bit awkward for me to figure out how I ought to name these new datasets I had to create out of necessity, when the class description of CSVLocalDataSet also naturally describes them.

I understand this is getting verbose, but once you integrate fsspec it sounds like you’ll be eliminating the fs aspect, so you’d end up with:

CSVDataSet

So perhaps naming it PandasCSVDataSet or DataFrameCSVDataSet instead would be clearer, and would offer a natural naming convention for users implementing their own datasets for CSV io.

1 reaction

yetudada commented, Dec 11, 2019

Thank you so much for raising this @dasturge. I’m going to answer this one by breaking your issue into two parts:

1. pandas as a dependency
2. Python object modularity for the datasets

On 1: you’ll be glad to know that removing pandas and numpy as core dependencies of Kedro is on our backlog. The issue evolved out of a request to create a Kedro-Glue plugin, which users could not do because of our dependency on pandas and numpy for our built-in datasets (issue #57). So we’re implementing a version of #178 soon.

On 2: we’ll have a discussion on this one. We’re looking at ways to limit the insane number of DataSets in the Data Catalog, and this might be a way to do it. One change you’ll see in Kedro next year is the use of fsspec to abstract file storage. That gives us a single CSVDataSet and eventually deprecates CSVLocalDataSet, CSVS3DataSet and CSVGCSDataSet, but even in this system we would still have CSVDataSet, CSVDaskDataSet and so on. So I’ll get back to you on this.

Let me tag this issue with a ticket so we can add it to our backlog for discussion.


