How to choose proper download_mode in function load_dataset?

See original GitHub issue

Hi, I am new to datasets and I am trying to use it to load my CSV file, which looks like this:

text,label
"Effective but too-tepid biopic",3
"If you sometimes like to go to the movies to have fun , Wasabi is a good place to start .",4
"Emerges as something rare , an issue movie that 's so honest and keenly observed that it does n't feel like one .",5

First, I load the CSV file with this command:

from datasets import load_dataset
dataset = load_dataset('csv', data_files=['sst_test.csv'])

That works, but then I try to override convert_options to convert the ‘label’ column from int64 to float32, like this:

import pyarrow as pa
from pyarrow import csv
read_options = csv.ReadOptions(block_size=1024*1024)
parse_options = csv.ParseOptions()
convert_options = csv.ConvertOptions(column_types={'text': pa.string(), 'label': pa.float32()})
dataset = load_dataset('csv', data_files=['sst_test.csv'], read_options=read_options,
                       parse_options=parse_options, convert_options=convert_options)

The features stay the same, and ‘label’ is still int64:

Dataset(features: {'text': Value(dtype='string', id=None), 'label': Value(dtype='int64', id=None)}, num_rows: 2210)

I think this is caused by the “download_mode” parameter, which defaults to REUSE_DATASET_IF_EXISTS, because after I delete the cache_dir the result looks right.

Is this a bug? How should I choose the proper download_mode to avoid this issue?

Issue Analytics

State:
Created 3 years ago
Comments:5 (5 by maintainers)

Top GitHub Comments

lhoestq commented, Feb 16, 2022

It’s no big deal, but since it can be confusing to users I think it’s worth renaming it and deprecating GenerateMode until datasets 2.0 at least. IMO it’s confusing to have download_mode=GenerateMode.something.

lhoestq commented, Oct 28, 2020

download_mode=datasets.GenerateMode.FORCE_REDOWNLOAD should work. This makes me think we should rename this to DownloadMode.FORCE_REDOWNLOAD. Currently that’s confusing.
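
For illustration, here is a minimal sketch of how the suggestion above could be applied to the CSV from the question. The helper names force_reload_kwargs and reload_csv are hypothetical and not part of datasets. The sketch also assumes the installed datasets version accepts the string "force_redownload" for download_mode. On the older versions discussed in this thread, datasets.GenerateMode.FORCE_REDOWNLOAD would be passed instead.

```python
def force_reload_kwargs(csv_path, **builder_kwargs):
    """Build keyword arguments for datasets.load_dataset that bypass the cached dataset.

    Hypothetical helper for illustration only. "force_redownload" is assumed
    to be an accepted string value for download_mode in the installed version.
    """
    kwargs = {
        "path": "csv",                        # use the generic CSV builder
        "data_files": [csv_path],             # local file(s) to load
        "download_mode": "force_redownload",  # do not reuse a previously prepared dataset
    }
    # Extra options for the CSV builder are passed through unchanged.
    kwargs.update(builder_kwargs)
    return kwargs


def reload_csv(csv_path, **builder_kwargs):
    """Load a CSV file with datasets, rebuilding it instead of reusing the cache."""
    # Imported lazily so that the helper above stays usable without datasets installed.
    from datasets import load_dataset

    return load_dataset(**force_reload_kwargs(csv_path, **builder_kwargs))
```

With datasets installed, calling reload_csv('sst_test.csv') should rebuild the dataset from the file rather than returning the cached copy, so changed loading options would get a chance to take effect.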
