Reduce friction in tabular dataset workflow by eliminating having splits when dataset is loaded

Feature request

Sorry for the cryptic name, but I'd like to explain using code. When I want to load a specific dataset from a repository (for instance, https://huggingface.co/datasets/inria-soda/tabular-benchmark), I currently write:

from datasets import load_dataset
dataset = load_dataset("inria-soda/tabular-benchmark", data_files=["reg_cat/house_sales.csv"], streaming=True)
print(next(iter(dataset["train"])))

The datasets library is essentially designed for people who want to fine-tune models on benchmark datasets across various modalities, and these benchmarks usually ship with pre-defined train and test splits. In tabular workflows, however, relying on a fixed train/test split usually ends with the model overfitting to the validation split. Users therefore tend to use validation techniques such as stratified k-fold cross-validation (StratifiedKFold in scikit-learn), or grid search with cross-validation (GridSearchCV) when tuning hyperparameters, so they typically create their own splits. Even in this paper, which introduces a benchmark, the split is done by the authors. It's a bit confusing for the average tabular user to load a dataset and see "train", so it would be nice if we did not load the dataset into a split called train by default.

from datasets import load_dataset
dataset = load_dataset("inria-soda/tabular-benchmark", data_files=["reg_cat/house_sales.csv"], streaming=True)
-print(next(iter(dataset["train"])))
+print(next(iter(dataset)))
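
To make the "users create their own splits" point concrete, here is a minimal, dependency-free sketch of stratified k-fold index generation over a single unsplit table. It is only an illustration: in practice one would reach for scikit-learn's StratifiedKFold, and the helper name `stratified_kfold_indices` and the toy labels are assumptions made up for this example, not part of datasets or scikit-learn.

```python
from collections import defaultdict


def stratified_kfold_indices(labels, n_splits):
    """Yield (train_idx, val_idx) pairs; hypothetical helper for illustration."""
    # Group row indices by class label.
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    # Deal each class's rows round-robin so every fold gets a similar class mix.
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        for pos, i in enumerate(indices):
            folds[pos % n_splits].append(i)

    # Each fold serves once as the validation set; the rest form the training set.
    for k in range(n_splits):
        val = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train, val


# Toy target column of a single, unsplit table (made-up values).
labels = [0, 0, 0, 0, 1, 1, 1, 1]
splits = list(stratified_kfold_indices(labels, n_splits=2))
```

Every row lands in a validation fold exactly once, which is why a pre-assigned "train" split adds little for this kind of workflow.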

Motivation

I explained it above 😅

Your contribution

I think this is quite a big change that seems small (e.g. how do we determine which datasets should not be loaded into a train split?), so it's best if we discuss first!

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 33 (33 by maintainers)

Top GitHub Comments

5 reactions
lhoestq commented, Nov 23, 2022

Started to experiment with merging Dataset and DatasetDict. My plan is to define the splits of a Dataset in Dataset.info.splits (already exists, but never used). A Dataset would then be the concatenation of its splits if they exist.

Not sure yet this is the way to go. My plan is to play with it and see and share it with you, so we can see if it makes sense from a UX point of view.
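
One way to picture the idea of a dataset being the concatenation of its splits is the rough sketch below. `SplitAwareTable` is a hypothetical stand-in written in plain Python, not the datasets API and not the maintainer's actual experiment; the rows are made-up values.

```python
class SplitAwareTable:
    """Hypothetical container: rows are stored per split, iteration spans all splits."""

    def __init__(self, splits):
        # Mapping of split name -> list of row dicts, kept in insertion order.
        self._splits = dict(splits)

    def __getitem__(self, split_name):
        # Named access still works for users who want a specific split.
        return self._splits[split_name]

    def __iter__(self):
        # Iterating the whole object walks every split back to back.
        for rows in self._splits.values():
            yield from rows

    def __len__(self):
        return sum(len(rows) for rows in self._splits.values())


table = SplitAwareTable(
    {"train": [{"price": 1}, {"price": 2}], "test": [{"price": 3}]}
)
first_row = next(iter(table))
```

With this shape, `next(iter(table))` works without naming a split, while `table["train"]` remains available for datasets that do define splits.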

1 reaction
adrinjalali commented, Nov 29, 2022

yeah what I mean is this:

dataset = load_dataset("blah")

# deal with a split of the dataset
train = dataset["train"]
train_df = dataset["train"].to_dataframe()

# deal with the whole dataset
dataset_df = dataset.to_dataframe()

So we do two things to improve tabular experience:

  • allow datasets to have a single split
  • add to_dataframe to the root dict level so that users can simply call df = load_dataset("blah").to_dataframe() and have it in their pandas.DataFrame object.