RFC Dataset API
In a few issues now we've been talking about supporting pandas.DataFrames either as input or output. I think it's worth taking a step back and thinking about why we'd like to do that.
First and foremost, it’s about the metadata. In #10733 we have:
> Users should be able to get datasets returned as [Sparse]DataFrames with named columns in this day and age
The way I see it, it makes sense to have some [meta]data attached to the columns [and rows], and in general to the dataset itself. Now it's true that we return a Bunch object with a description and possibly other info about the data in some of the load_... functions, but we don't really embrace that information in the rest of the library.
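For concreteness, this is where that metadata currently lives and stops (using the real load_iris loader):

```python
from sklearn.datasets import load_iris

bunch = load_iris()              # a Bunch: dict-like with attribute access
print(bunch.feature_names)       # column metadata exists here...
print(bunch.DESCR[:72])          # ...along with a free-text description,
X, y = bunch.data, bunch.target  # ...but estimators only ever see the bare arrays
```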
On the other hand, whatever we use other than a numpy.array, it would be because it gives us some extra functionality. For instance, we see in #11818:
> we would also set the DataFrame's index corresponding to the is_row_identifier attribute in OpenML.
That, however, raises a few questions. For instance, would we then use that index in any way? As a user, if I see some methods in sklearn giving me data in the form of a DataFrame, I'd expect the rest of sklearn to also understand DataFrames to a good extent. That may give the user the impression that we'd also understand a DataFrame with an index hierarchy, and so on.
#13120 by @daniel-cortez-stevenson is kind of an effort towards having a Dataset object, and cites the fact that some other packages such as PyTorch have a dataset object as a supporting argument. However, as @rth mentioned in https://github.com/scikit-learn/scikit-learn/pull/13120#issuecomment-461856418, one role of a dataset object is to load samples in batches when appropriate, and that's arguably the main reason for having a dataset in a library such as PyTorch.
On the other hand, we have the separate issue of routing sample properties throughout the pipeline, and #9566 by @jnothman (which itself lists over 10 issues it would potentially solve) is an effort to solve it, but it's pretty challenging and we don't seem to have a solution/consensus we're happy with yet.
Another related issue/quirk is that we only support transformation of the input X; hence there's a special TransformedTargetRegressor class which handles the transformation of the target before applying the regressor, instead of allowing the transformation of both X and y.
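For reference, this is roughly what the existing workaround looks like (TransformedTargetRegressor is real scikit-learn API; the toy data is just for illustration):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

X = np.arange(20, dtype=float).reshape(10, 2)
y = np.exp(X.sum(axis=1) / 10.0)

# y is log-transformed before fitting and predictions are exp'd back;
# a plain Pipeline cannot express this because transformers only touch X.
model = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp
)
model.fit(X, y)
model.predict(X[:2])
```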
Now assume we have a Dataset object which would include the metadata of the features and samples, along with some additional info that models could potentially use. It would:
- include feature metadata (including names)
- internally keep the data as a pandas.DataFrame if necessary
- include sample info (such as sample_weights)
- be the input and output to/from transformers, hence allowing transformation of the output y and sample_weights if that's what the transformer is doing
- let transformers attach extra information to the dataset

And clearly it would support input/output conversion from numpy arrays and probably pandas DataFrames.
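To make this concrete, here is a minimal sketch of what such a container might look like. Everything here is hypothetical and purely illustrative; no such class exists in scikit-learn:

```python
from dataclasses import dataclass, field
from typing import Optional

import numpy as np
import pandas as pd


@dataclass
class Dataset:
    """Hypothetical container bundling data with feature/sample metadata."""
    X: pd.DataFrame                       # feature data, kept as a DataFrame internally
    y: Optional[np.ndarray] = None        # target, transformable alongside X
    sample_props: dict = field(default_factory=dict)  # e.g. {"sample_weight": ...}
    feature_meta: dict = field(default_factory=dict)  # per-column metadata
    extra: dict = field(default_factory=dict)         # info attached by transformers

    @classmethod
    def from_arrays(cls, X, y=None, **sample_props):
        """Wrap plain ndarrays (input conversion)."""
        return cls(X=pd.DataFrame(X), y=y, sample_props=sample_props)

    def to_arrays(self):
        """Unwrap back to numpy (output conversion)."""
        return self.X.to_numpy(), self.y
```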
It would then be easy to handle some use cases such as:
- a transformer/model in the pipeline putting a pointer to its self in the dataset for a model down the line to potentially use (we had an issue about this which I cannot find now)
- modifying the sample_weights in the pipeline
- manipulating y in the pipeline
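Continuing the hypothetical sketch above, a transformer consuming and returning such a Dataset could resample X, y, and sample_weight together, and leave a pointer to itself behind for a later step; neither is expressible in a Pipeline today:

```python
class DropZeroWeight:
    """Illustrative only: drops samples whose weight is zero."""

    def fit(self, dataset):
        return self

    def transform(self, dataset):
        w = np.asarray(dataset.sample_props.get("sample_weight", []))
        if w.size == 0:
            return dataset
        mask = w > 0
        return Dataset(
            X=dataset.X.loc[mask],
            y=None if dataset.y is None else dataset.y[mask],
            sample_props={"sample_weight": w[mask]},
            feature_meta=dataset.feature_meta,
            # attach a pointer to this step for estimators downstream:
            extra={**dataset.extra, "resampled_by": self},
        )
```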
What I'm trying to argue here is that the DataFrame does solve some of the issues we have, but it would probably involve quite a lot of changes to the codebase; at that point, we could contemplate the idea of having a Dataset, which seems to have the capacity to solve quite a few more of the issues we're having.
Top GitHub Comments
Recarrays are difficult. They have a different shape.
Thanks for a detailed summary!
I agree these are things that are worthwhile to consider. A partial comment (mostly paraphrasing some of Gael's slides) is that using standard data containers (ndarrays, DataFrames) is what allows different packages in the ecosystem to interact. <end of approximate citation>
Currently, we are adding more pandas support, but there is still no full consensus on how far this should go (pandas has quite a lot of advanced features). The only reason pandas.DataFrames were considered is that they are now a standard format for columnar data in Python. Even if we could write some custom object that solved some of the above-mentioned issues, the fact that it would not be standard in the community is a very strong limitation.
Personally, as a user, I tend to somewhat resist any library that tries to push me to use its own dataset wrappers (DL libraries excluded). For instance, xgboost.DMatrix probably addresses some issues; as a user I have to admit I don't care too much, am too lazy to learn a new API, and will use ndarrays or DataFrames if I can. Maybe I am missing something in that particular case…
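For instance, data must first be converted into the library-specific container before training (real xgboost API, shown only to illustrate the extra wrapping step):

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(100, 4)
y = np.random.randint(2, size=100)

dtrain = xgb.DMatrix(X, label=y)  # the library-specific wrapper
booster = xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=10)
preds = booster.predict(xgb.DMatrix(X))  # prediction also requires a DMatrix
```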
As I mentioned in https://github.com/scikit-learn/scikit-learn/pull/13120, I think OpenML would also be a good community to think about dataset representation.
Side comment: I would avoid the term SparseDataFrames. These are not sparse in the sense usually used in scikit-learn (i.e. CSR/CSC arrays); from what I understood they are mostly just compressed dataframes (https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html), and last time I tried they were not usable for applications where we use sparse arrays (e.g. text processing). Xarray probably has a better shot at getting sparse labeled arrays one day, but it's not there yet: https://github.com/pydata/xarray/issues/1375
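To illustrate the distinction (both APIs below are real; the pandas sparse interface has shifted between versions, so treat that line as approximate):

```python
import numpy as np
import pandas as pd
from scipy import sparse

arr = np.array([[0, 0, 1],
                [0, 2, 0]])

csr = sparse.csr_matrix(arr)  # what scikit-learn means by "sparse": CSR/CSC
sdf = pd.DataFrame(arr).astype(pd.SparseDtype("int64", 0))  # pandas "sparse" columns

print(csr.nnz)     # 2 stored values, usable wherever sklearn accepts sparse input
print(sdf.dtypes)  # Sparse[int64, 0] per column; a compressed DataFrame, not CSR
```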