Custom feature types in `load_dataset` from CSV

I am trying to load a local file with the `load_dataset` function, and I want to predefine the feature types with the `features` argument. However, the resulting types are always the same, regardless of the value passed to `features`.

I am working with the local files from the emotion dataset. To get the data you can use the following code:

from pathlib import Path
import wget

EMOTION_PATH = Path("./data/emotion")
DOWNLOAD_URLS = [
    "https://www.dropbox.com/s/1pzkadrvffbqw6o/train.txt?dl=1",
    "https://www.dropbox.com/s/2mzialpsgf9k5l3/val.txt?dl=1",
    "https://www.dropbox.com/s/ikkqxfdbdec3fuj/test.txt?dl=1",
]

# Create the target directory (and any missing parents) before downloading
EMOTION_PATH.mkdir(parents=True, exist_ok=True)
for url in DOWNLOAD_URLS:
    wget.download(url, str(EMOTION_PATH))

The first five lines of the train set are:

i didnt feel humiliated;sadness
i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake;sadness
im grabbing a minute to post i feel greedy wrong;anger
i am ever feeling nostalgic about the fireplace i will know that it is still on the property;love
i am feeling grouchy;anger
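
For reference, each row is a two-field record separated by `;`, with no header row, which is why the reproduction code passes `delimiter=';'` and `column_names`. A minimal sketch of that layout using only the standard library's `csv` module (the inlined `sample` string is just an illustration built from three of the rows above, not the real file):

```python
import csv
import io

# Three of the sample rows from the train set, inlined for illustration.
sample = (
    "i didnt feel humiliated;sadness\n"
    "im grabbing a minute to post i feel greedy wrong;anger\n"
    "i am feeling grouchy;anger\n"
)

# The file has no header, so the column names are supplied explicitly.
reader = csv.DictReader(
    io.StringIO(sample), fieldnames=["text", "label"], delimiter=";"
)
rows = list(reader)
labels = [row["label"] for row in rows]
print(labels)
```

Both columns come back as plain strings here, which matches the string-typed `label` column reported in the observed behaviour.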

Here is the code to reproduce the issue:

from datasets import Features, Value, ClassLabel, load_dataset

class_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]
emotion_features = Features({'text': Value('string'), 'label': ClassLabel(names=class_names)})
file_dict = {'train': EMOTION_PATH/'train.txt'}

dataset = load_dataset('csv', data_files=file_dict, delimiter=';', column_names=['text', 'label'], features=emotion_features)

Observed behaviour:

dataset['train'].features

{'text': Value(dtype='string', id=None),
 'label': Value(dtype='string', id=None)}

Expected behaviour:

dataset['train'].features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(num_classes=6, names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'], names_file=None, id=None)}

Things I’ve tried:

- deleting the cache
- trying other types such as int64

Am I missing anything? Thanks for any pointer in the right direction.

Top GitHub Comments

lewtun commented, Sep 30, 2020

Thanks a lot for the PR and quick fix @lhoestq!

lhoestq commented, Sep 17, 2020

Currently the csv loader doesn’t support the `features` argument (unlike json). For now, you can load the data without it and then cast the features using the in-place transform `cast_`:

from datasets import load_dataset

dataset = load_dataset('csv', data_files=file_dict, delimiter=';', column_names=['text', 'label'])
dataset.cast_(emotion_features)
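
To make the effect of the cast concrete: a class-label column stores integer ids, where each id is the position of the label in the `names` list. The following is a plain-Python sketch of that mapping, using `class_names` from the question. It is not the `datasets` implementation, and `encode_labels` is a hypothetical helper written only for this illustration.

```python
# Plain-Python illustration only; this does not call the datasets library.
class_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]

# Each label's id is its index in class_names.
name_to_id = {name: idx for idx, name in enumerate(class_names)}


def encode_labels(labels):
    """Map label strings to integer ids, failing loudly on unknown labels."""
    unknown = sorted(set(labels) - set(name_to_id))
    if unknown:
        raise ValueError(f"labels not in class_names: {unknown}")
    return [name_to_id[label] for label in labels]


# Labels of the first five training rows shown in the question.
sample_labels = ["sadness", "sadness", "anger", "love", "anger"]
encoded = encode_labels(sample_labels)
print(encoded)  # [0, 0, 3, 2, 3]
```

The order of `class_names` therefore matters: reordering the list changes which integer each emotion maps to.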
