Custom feature types in `load_dataset` from CSV
I am trying to load a local file with the `load_dataset` function and I want to predefine the feature types with the `features` argument. However, the types are always the same, independent of the value of `features`.
I am working with the local files from the emotion dataset. To get the data you can use the following code:
from pathlib import Path

import wget

EMOTION_PATH = Path("./data/emotion")
DOWNLOAD_URLS = [
    "https://www.dropbox.com/s/1pzkadrvffbqw6o/train.txt?dl=1",
    "https://www.dropbox.com/s/2mzialpsgf9k5l3/val.txt?dl=1",
    "https://www.dropbox.com/s/ikkqxfdbdec3fuj/test.txt?dl=1",
]

# Create the target directory, including the parent ./data, if it is missing.
EMOTION_PATH.mkdir(parents=True, exist_ok=True)

# Download each split into the target directory.
for url in DOWNLOAD_URLS:
    wget.download(url, str(EMOTION_PATH))
The first five lines of the train set are:
i didnt feel humiliated;sadness
i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake;sadness
im grabbing a minute to post i feel greedy wrong;anger
i am ever feeling nostalgic about the fireplace i will know that it is still on the property;love
i am feeling grouchy;anger
Here is the code to reproduce the issue:
from datasets import Features, Value, ClassLabel, load_dataset
class_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]
emotion_features = Features({'text': Value('string'), 'label': ClassLabel(names=class_names)})
file_dict = {'train': EMOTION_PATH/'train.txt'}
dataset = load_dataset('csv', data_files=file_dict, delimiter=';', column_names=['text', 'label'], features=emotion_features)
Observed behaviour:
dataset['train'].features
{'text': Value(dtype='string', id=None),
'label': Value(dtype='string', id=None)}
Expected behaviour:
dataset['train'].features
{'text': Value(dtype='string', id=None),
'label': ClassLabel(num_classes=6, names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'], names_file=None, id=None)}
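To make the expected encoding concrete, here is a minimal, library-independent sketch using only the Python standard library. It parses a few of the sample rows shown earlier with `;` as the delimiter and maps each label string to its index in `class_names`, which is the integer a `ClassLabel` with those names would be expected to store. `SAMPLE`, `label_to_id`, and `encode_rows` are hypothetical names introduced for illustration; they are not part of the `datasets` API.

```python
import csv
import io

# A few rows copied from the train set excerpt in this issue.
SAMPLE = (
    "i didnt feel humiliated;sadness\n"
    "im grabbing a minute to post i feel greedy wrong;anger\n"
    "i am feeling grouchy;anger\n"
)

class_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]

# Hypothetical helper mapping: label string -> integer id (its position in class_names).
label_to_id = {name: idx for idx, name in enumerate(class_names)}


def encode_rows(text):
    """Parse ';'-delimited rows and replace each label string with its integer id."""
    reader = csv.reader(io.StringIO(text), delimiter=";")
    return [{"text": sentence, "label": label_to_id[label]} for sentence, label in reader]


encoded = encode_rows(SAMPLE)
for row in encoded:
    print(row["label"], row["text"])
```

With the features applied as intended, the `label` column would hold these integer ids rather than the raw strings.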
Things I’ve tried:
- deleting the cache
- trying other types such as `int64`
Am I missing anything? Thanks for any pointer in the right direction.
Issue Analytics
- State:
- Created 3 years ago
- Comments:7 (7 by maintainers)
Thanks a lot for the PR and quick fix @lhoestq!
Currently `csv` doesn't support the `features` attribute (unlike `json`). What you can do for now is cast the features using the in-place transform `cast_`