Custom feature types in `load_dataset` from CSV

I am trying to load a local file with the `load_dataset` function and want to predefine the feature types with the `features` argument. However, the resulting types are always the same, regardless of the value passed to `features`.

I am working with the local files from the emotion dataset. To get the data you can use the following code:

from pathlib import Path
import wget  # third-party package: pip install wget

EMOTION_PATH = Path("./data/emotion")
DOWNLOAD_URLS = [
    "https://www.dropbox.com/s/1pzkadrvffbqw6o/train.txt?dl=1",
    "https://www.dropbox.com/s/2mzialpsgf9k5l3/val.txt?dl=1",
    "https://www.dropbox.com/s/ikkqxfdbdec3fuj/test.txt?dl=1",
]

# Create the target directory, including the missing parent ./data
EMOTION_PATH.mkdir(parents=True, exist_ok=True)
for url in DOWNLOAD_URLS:
    wget.download(url, str(EMOTION_PATH))

The first five lines of the train set are:

i didnt feel humiliated;sadness
i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake;sadness
im grabbing a minute to post i feel greedy wrong;anger
i am ever feeling nostalgic about the fireplace i will know that it is still on the property;love
i am feeling grouchy;anger
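
To make the file layout concrete, here is a minimal sketch that parses a few of the sample lines above with the standard-library csv module. It assumes every line holds exactly one `;` separating text from label, and `read_rows` is a hypothetical helper name, not part of datasets.

```python
import csv
import io

# Sample rows copied from the train set excerpt above.
SAMPLE = """i didnt feel humiliated;sadness
im grabbing a minute to post i feel greedy wrong;anger
i am feeling grouchy;anger
"""


def read_rows(handle):
    """Yield {'text': ..., 'label': ...} dicts from a ';'-delimited file.

    Assumes exactly two fields per line; a line with a different
    number of fields raises ValueError on unpacking.
    """
    reader = csv.reader(handle, delimiter=";")
    for text, label in reader:
        yield {"text": text, "label": label}


rows = list(read_rows(io.StringIO(SAMPLE)))
print(rows[0])  # {'text': 'i didnt feel humiliated', 'label': 'sadness'}
```

Both columns come back as plain strings here, which matches the observed behaviour reported below when no feature types are applied.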

Here is the code to reproduce the issue:

from datasets import Features, Value, ClassLabel, load_dataset

class_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]
emotion_features = Features({'text': Value('string'), 'label': ClassLabel(names=class_names)})
file_dict = {'train': EMOTION_PATH/'train.txt'}

dataset = load_dataset('csv', data_files=file_dict, delimiter=';', column_names=['text', 'label'], features=emotion_features)

Observed behaviour:

dataset['train'].features
{'text': Value(dtype='string', id=None),
 'label': Value(dtype='string', id=None)}

Expected behaviour:

dataset['train'].features
{'text': Value(dtype='string', id=None),
 'label': ClassLabel(num_classes=6, names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'], names_file=None, id=None)}
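
For readers unfamiliar with ClassLabel: it represents each label as an integer index into the list of names. The following is a rough pure-Python sketch of that name-to-id mapping, using the class names from the reproduction code. `encode_label` and `decode_label` are hypothetical helpers for illustration and are not the datasets API.

```python
class_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]

# Hypothetical lookup table: label name -> integer class id.
name_to_id = {name: idx for idx, name in enumerate(class_names)}


def encode_label(name):
    """Return the integer id for a label name, or raise ValueError."""
    try:
        return name_to_id[name]
    except KeyError:
        raise ValueError(f"unknown label: {name!r}") from None


def decode_label(idx):
    """Return the label name for an integer id."""
    return class_names[idx]


# Labels of the first five train rows shown earlier.
labels = ["sadness", "sadness", "anger", "love", "anger"]
encoded = [encode_label(label) for label in labels]
print(encoded)  # [0, 0, 3, 2, 3]
```

With the expected behaviour, the label column would hold these integer ids rather than the raw strings.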

Things I’ve tried:

  • deleting the cache
  • trying other types such as int64

Am I missing anything? Thanks for any pointer in the right direction.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

lewtun commented, Sep 30, 2020

Thanks a lot for the PR and quick fix @lhoestq!

lhoestq commented, Sep 17, 2020

Currently, the csv loader doesn’t support the features argument (unlike json). For now, you can cast the features using the in-place transform cast_:

from datasets import load_dataset

dataset = load_dataset('csv', data_files=file_dict, delimiter=';', column_names=['text', 'label'])
dataset.cast_(emotion_features)