RFC Dataset API
In a few issues now we've been talking about supporting pandas.DataFrames either as input or output. I think it's worth taking a step back and thinking about why we'd like to do that.
First and foremost, it’s about the metadata. In #10733 we have:
> Users should be able to get datasets returned as [Sparse]DataFrames with named columns in this day and age
The way I see it, it makes sense to have some [meta]data attached to the columns [and rows], and in general to the dataset itself. Now it's true that we return a Bunch object with a description and possibly other info about the data in some of the load_... functions, but we don't really embrace that information in the rest of the library.
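For concreteness, this is where that metadata currently lives and stops (using the real load_iris loader):

```python
from sklearn.datasets import load_iris

bunch = load_iris()              # a Bunch: dict-like with attribute access
print(bunch.feature_names)       # column metadata exists here...
print(bunch.DESCR[:72])          # ...along with a free-text description,
X, y = bunch.data, bunch.target  # ...but estimators only ever see the bare arrays
```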
On the other hand, whatever we use other than a numpy.array, it would be because it gives us some extra functionality. For instance, we see in #11818:
> we would also set the DataFrame's index corresponding to the is_row_identifier attribute in OpenML.
That, however, raises a few questions. For instance, would we then use that index in any way? As a user, if I see some methods in sklearn giving me data in the form of a DataFrame, I'd expect the rest of sklearn to also understand DataFrames to a good extent. That may give the user the impression that we'd also understand a DataFrame with an index hierarchy, and so on.
#13120 by @daniel-cortez-stevenson is kind of an effort towards having a Dataset object, and cites the fact that some other packages such as PyTorch have a dataset object as a supporting argument. However, as @rth mentioned in https://github.com/scikit-learn/scikit-learn/pull/13120#issuecomment-461856418, one role of a dataset object is to load samples in batches when appropriate, and that's arguably the main reason for having a dataset in a library such as PyTorch.
On the other hand, we have the separate issue of routing sample properties throughout the pipeline, and #9566 by @jnothman (which itself lists over 10 issues it would potentially solve) is an effort to solve it, but it's pretty challenging and we don't seem to have a solution/consensus we're happy with yet.
Another related issue/quirk is that we only support transformation of the input X; hence there's a special TransformedTargetRegressor class which handles the transformation of the target before applying the regressor, instead of allowing the transformation of both X and y.
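For reference, this is roughly what the existing workaround looks like (TransformedTargetRegressor is real scikit-learn API; the toy data is just for illustration):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

X = np.arange(20, dtype=float).reshape(10, 2)
y = np.exp(X.sum(axis=1) / 10.0)

# y is log-transformed before fitting and predictions are exp'd back;
# a plain Pipeline cannot express this because transformers only touch X.
model = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp
)
model.fit(X, y)
model.predict(X[:2])
```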
Now assume we have a Dataset object which would include the metadata of the features and samples, along with some additional info that models could potentially use. It would:
- include feature metadata (including names)
- internally keep the data as a pandas.DataFrame if necessary
- include sample info (such as sample_weights)
- be the input and output to/from transformers, hence allowing transformation of the output y and sample_weights if that's what the transformer is doing
- let transformers attach extra information to the dataset

And clearly it would support input/output conversion from numpy arrays and probably pandas DataFrames.
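To make this concrete, here is a minimal sketch of what such a container might look like. Everything here is hypothetical and purely illustrative; no such class exists in scikit-learn:

```python
from dataclasses import dataclass, field
from typing import Optional

import numpy as np
import pandas as pd


@dataclass
class Dataset:
    """Hypothetical container bundling data with feature/sample metadata."""
    X: pd.DataFrame                       # feature data, kept as a DataFrame internally
    y: Optional[np.ndarray] = None        # target, transformable alongside X
    sample_props: dict = field(default_factory=dict)  # e.g. {"sample_weight": ...}
    feature_meta: dict = field(default_factory=dict)  # per-column metadata
    extra: dict = field(default_factory=dict)         # info attached by transformers

    @classmethod
    def from_arrays(cls, X, y=None, **sample_props):
        """Wrap plain ndarrays (input conversion)."""
        return cls(X=pd.DataFrame(X), y=y, sample_props=sample_props)

    def to_arrays(self):
        """Unwrap back to numpy (output conversion)."""
        return self.X.to_numpy(), self.y
```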
It would then be easy to handle some use cases such as:
- a transformer/model in the pipeline putting a pointer to its self in the dataset for a model down the line to potentially use (we had an issue about this which I cannot find now)
- modifying the sample_weights in the pipeline
- manipulating y in the pipeline
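Continuing the hypothetical sketch above, a transformer consuming and returning such a Dataset could resample X, y, and sample_weight together, and leave a pointer to itself behind for a later step; neither is expressible in a Pipeline today:

```python
class DropZeroWeight:
    """Illustrative only: drops samples whose weight is zero."""

    def fit(self, dataset):
        return self

    def transform(self, dataset):
        w = np.asarray(dataset.sample_props.get("sample_weight", []))
        if w.size == 0:
            return dataset
        mask = w > 0
        return Dataset(
            X=dataset.X.loc[mask],
            y=None if dataset.y is None else dataset.y[mask],
            sample_props={"sample_weight": w[mask]},
            feature_meta=dataset.feature_meta,
            # attach a pointer to this step for estimators downstream:
            extra={**dataset.extra, "resampled_by": self},
        )
```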
What I'm trying to argue here is that the DataFrame does solve some of the issues we have, but it would probably involve quite a lot of changes to the codebase; at that point, we could contemplate the idea of having a Dataset, which seems to have the capacity to solve quite a few more of the issues we're having.
Top GitHub Comments
Recarrays are difficult. They have a different shape.
Thanks for a detailed summary!
I agree these are things that are worthwhile to consider. A partial comment (mostly paraphrasing some of Gael's slides) is that using standard data containers (ndarrays, DataFrames) is what allows different packages in the ecosystem to interact. <end of approximate citation>
Currently, we are adding more pandas support, but there is still no full consensus on how far this should go (pandas has quite a lot of advanced features). The only reason pandas.DataFrames were considered is that they are now a standard format for columnar data in Python. Even if we could write some custom object that solved some of the above-mentioned issues, the fact that it would not be standard in the community is a very strong limitation.
Personally, as a user, I tend to somewhat resist any library that tries to push me to use its own dataset wrappers (DL libraries excluded). For instance, xgboost.DMatrix probably addresses some issues; as a user I have to admit I don't care too much, am too lazy to learn a new API, and will use ndarrays or DataFrames if I can. Maybe I am missing something in that particular case…
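For instance, data must first be converted into the library-specific container before training (real xgboost API, shown only to illustrate the extra wrapping step):

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(100, 4)
y = np.random.randint(2, size=100)

dtrain = xgb.DMatrix(X, label=y)  # the library-specific wrapper
booster = xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=10)
preds = booster.predict(xgb.DMatrix(X))  # prediction also requires a DMatrix
```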
As I mentioned in https://github.com/scikit-learn/scikit-learn/pull/13120, I think OpenML would also be a good community to think about dataset representation.
Side comment: I would avoid the term SparseDataFrames. These are not sparse in the sense usually used in scikit-learn (i.e. CSR/CSC arrays); from what I understood they are mostly just compressed dataframes (https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html), and last time I tried they were not usable for applications where we use sparse arrays (e.g. text processing). Xarray probably has a better shot at getting sparse labeled arrays one day, but it's not there yet: https://github.com/pydata/xarray/issues/1375
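To illustrate the distinction (both APIs below are real; the pandas sparse interface has shifted between versions, so treat that line as approximate):

```python
import numpy as np
import pandas as pd
from scipy import sparse

arr = np.array([[0, 0, 1],
                [0, 2, 0]])

csr = sparse.csr_matrix(arr)  # what scikit-learn means by "sparse": CSR/CSC
sdf = pd.DataFrame(arr).astype(pd.SparseDtype("int64", 0))  # pandas "sparse" columns

print(csr.nnz)     # 2 stored values, usable wherever sklearn accepts sparse input
print(sdf.dtypes)  # Sparse[int64, 0] per column; a compressed DataFrame, not CSR
```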