Reduce friction in the tabular dataset workflow by eliminating splits when a dataset is loaded
Feature request
Sorry for the cryptic name, but I'd like to explain using the code itself. When I want to load a specific dataset from a repository (for instance, this one: https://huggingface.co/datasets/inria-soda/tabular-benchmark):
```python
from datasets import load_dataset

dataset = load_dataset("inria-soda/tabular-benchmark", data_files=["reg_cat/house_sales.csv"], streaming=True)
print(next(iter(dataset["train"])))
```
The `datasets` library is essentially designed for people who'd like to use benchmark datasets across various modalities to fine-tune their models, and these benchmark datasets usually have pre-defined train and test splits. However, in tabular workflows, a fixed train/test split usually leads to the model overfitting to the validation split, so users prefer validation techniques like `StratifiedKFoldCrossValidation`, or `GridSearchCrossValidation` when they tune hyperparameters, and thus often create their own splits. Even in this paper a benchmark is introduced, but the split is done by the authors.

It's a bit confusing for the average tabular user to load a dataset and see `"train"`, so it would be nice if we did not load the dataset into a split called `train` by default.
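The validation workflow described above can be sketched with scikit-learn (a minimal sketch, assuming scikit-learn is installed; the synthetic data, estimator, and parameter grid are illustrative, not part of the benchmark):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic tabular data standing in for a benchmark CSV.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Stratified K-fold cross-validation: the user defines the splits,
# rather than relying on a fixed train/test split shipped with the dataset.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Grid search over hyperparameters, scored across the folds above.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=cv,
)
search.fit(X, y)
print(search.best_params_)
```

In this setup a pre-defined `train` split adds no value: the whole table is resplit by the cross-validator anyway.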
```diff
from datasets import load_dataset

dataset = load_dataset("inria-soda/tabular-benchmark", data_files=["reg_cat/house_sales.csv"], streaming=True)
-print(next(iter(dataset["train"])))
+print(next(iter(dataset)))
```
Motivation
I explained it above 😅
Your contribution
I think this is quite a big change that seems small (e.g. how do we determine which datasets will not be loaded into a `train` split?), so it's best if we discuss first!
Issue Analytics
- State:
- Created a year ago
- Comments: 33 (33 by maintainers)
Top GitHub Comments
Started to experiment with merging `Dataset` and `DatasetDict`. My plan is to define the splits of a `Dataset` in `Dataset.info.splits` (this already exists, but is never used). A `Dataset` would then be the concatenation of its splits, if they exist.
Not sure yet whether this is the way to go. My plan is to play with it and share it with you, so we can see if it makes sense from a UX point of view.
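A rough sketch of that semantics, using plain Python lists in place of the real `Dataset`/`DatasetDict` classes (all names here are illustrative stand-ins, not the actual `datasets` API):

```python
from itertools import chain


class MergedDataset:
    """Toy model of a dataset whose splits live in its info, and which
    iterates as the concatenation of those splits when they exist."""

    def __init__(self, splits):
        # splits maps split name -> list of rows
        # (a stand-in for Dataset.info.splits).
        self.splits = splits

    def __getitem__(self, split_name):
        # Indexing by split name still works, like DatasetDict today.
        return self.splits[split_name]

    def __iter__(self):
        # Iterating the dataset itself yields rows from all splits in order,
        # so users never have to mention "train".
        return chain.from_iterable(self.splits.values())


ds = MergedDataset({"train": [{"x": 1}, {"x": 2}], "test": [{"x": 3}]})
print(next(iter(ds)))  # first row, no split name needed: {'x': 1}
print(len(list(ds)))   # rows across all splits: 3
```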
Yeah, what I mean is this. So we do two things to improve the tabular experience: add `to_dataframe` to the root dict level so that users can simply call `df = load_dataset("blah").to_dataframe()` and have it in their `pandas.DataFrame` object.
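A minimal sketch of what that could look like, assuming pandas is available; `to_dataframe` here is a hypothetical free-standing helper, not an existing `datasets` method:

```python
import pandas as pd


def to_dataframe(rows):
    """Hypothetical helper: collect an iterable of row dicts
    (e.g. rows yielded by a streamed dataset) into a pandas DataFrame."""
    return pd.DataFrame(list(rows))


# Rows shaped like what a tabular load_dataset(...) stream might yield.
rows = [
    {"price": 310000, "bedrooms": 3},
    {"price": 425000, "bedrooms": 4},
]
df = to_dataframe(rows)
print(df.shape)  # (2, 2)
```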