Reduce friction in the tabular dataset workflow by eliminating splits when a dataset is loaded
Feature request
Sorry for the cryptic name, but I'd like to explain using the code itself. When I want to load a specific dataset from a repository (for instance, this one: https://huggingface.co/datasets/inria-soda/tabular-benchmark):
```python
from datasets import load_dataset

dataset = load_dataset("inria-soda/tabular-benchmark", data_files=["reg_cat/house_sales.csv"], streaming=True)
print(next(iter(dataset["train"])))
```
The `datasets` library is essentially designed for people who'd like to use benchmark datasets across various modalities to fine-tune their models, and these benchmark datasets usually have pre-defined train and test splits. However, in tabular workflows, a fixed train/test split usually leads to the model overfitting to the validation split, so users prefer validation techniques like `StratifiedKFoldCrossValidation`, or `GridSearchCrossValidation` when they tune hyperparameters, and thus often create their own splits. Even in this paper a benchmark is introduced, but the split is done by the authors.

It's a bit confusing for the average tabular user to load a dataset and see `"train"`, so it would be nice if we did not load the dataset into a split called `train` by default.
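The validation workflow described above can be sketched with scikit-learn (a minimal sketch, assuming scikit-learn is installed; the synthetic data, estimator, and parameter grid are illustrative, not part of the benchmark):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic tabular data standing in for a benchmark CSV.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Stratified K-fold cross-validation: the user defines the splits,
# rather than relying on a fixed train/test split shipped with the dataset.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Grid search over hyperparameters, scored across the folds above.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=cv,
)
search.fit(X, y)
print(search.best_params_)
```

In this setup a pre-defined `train` split adds no value: the whole table is resplit by the cross-validator anyway.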
```diff
from datasets import load_dataset

dataset = load_dataset("inria-soda/tabular-benchmark", data_files=["reg_cat/house_sales.csv"], streaming=True)
-print(next(iter(dataset["train"])))
+print(next(iter(dataset)))
```
Motivation
I explained it above 😅
Your contribution
I think this is quite a big change that seems small (e.g. how do we determine which datasets will not be loaded into a `train` split?), so it's best if we discuss first!
Issue Analytics
- State:
- Created a year ago
- Comments: 33 (33 by maintainers)
Top GitHub Comments
Started to experiment with merging `Dataset` and `DatasetDict`. My plan is to define the splits of a `Dataset` in `Dataset.info.splits` (this already exists, but is never used). A `Dataset` would then be the concatenation of its splits, if they exist.
Not sure yet whether this is the way to go. My plan is to play with it and share it with you, so we can see if it makes sense from a UX point of view.
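A rough sketch of that semantics, using plain Python lists in place of the real `Dataset`/`DatasetDict` classes (all names here are illustrative stand-ins, not the actual `datasets` API):

```python
from itertools import chain


class MergedDataset:
    """Toy model of a dataset whose splits live in its info, and which
    iterates as the concatenation of those splits when they exist."""

    def __init__(self, splits):
        # splits maps split name -> list of rows
        # (a stand-in for Dataset.info.splits).
        self.splits = splits

    def __getitem__(self, split_name):
        # Indexing by split name still works, like DatasetDict today.
        return self.splits[split_name]

    def __iter__(self):
        # Iterating the dataset itself yields rows from all splits in order,
        # so users never have to mention "train".
        return chain.from_iterable(self.splits.values())


ds = MergedDataset({"train": [{"x": 1}, {"x": 2}], "test": [{"x": 3}]})
print(next(iter(ds)))  # first row, no split name needed: {'x': 1}
print(len(list(ds)))   # rows across all splits: 3
```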
Yeah, what I mean is this. So we do two things to improve the tabular experience: add `to_dataframe` to the root dict level so that users can simply call `df = load_dataset("blah").to_dataframe()` and have it in their `pandas.DataFrame` object.
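A minimal sketch of what that could look like, assuming pandas is available; `to_dataframe` here is a hypothetical free-standing helper, not an existing `datasets` method:

```python
import pandas as pd


def to_dataframe(rows):
    """Hypothetical helper: collect an iterable of row dicts
    (e.g. rows yielded by a streamed dataset) into a pandas DataFrame."""
    return pd.DataFrame(list(rows))


# Rows shaped like what a tabular load_dataset(...) stream might yield.
rows = [
    {"price": 310000, "bedrooms": 3},
    {"price": 425000, "bedrooms": 4},
]
df = to_dataframe(rows)
print(df.shape)  # (2, 2)
```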