How to choose proper download_mode in function load_dataset?

Hi, I am a beginner with datasets and I am trying to use it to load my CSV file. My CSV file looks like this:

text,label
"Effective but too-tepid biopic",3
"If you sometimes like to go to the movies to have fun , Wasabi is a good place to start .",4
"Emerges as something rare , an issue movie that 's so honest and keenly observed that it does n't feel like one .",5

First, I tried this command to load my CSV file:

from datasets import load_dataset
dataset = load_dataset('csv', data_files=['sst_test.csv'])

That seems to work, but when I try to override convert_options to convert the ‘label’ column from int64 to float32, like this:

import pyarrow as pa
from pyarrow import csv
from datasets import load_dataset

# Read the CSV in 1 MiB blocks and keep the default parsing rules.
read_options = csv.ReadOptions(block_size=1024*1024)
parse_options = csv.ParseOptions()
# Ask pyarrow to read 'label' as float32 instead of the inferred int64.
convert_options = csv.ConvertOptions(column_types={'text': pa.string(), 'label': pa.float32()})
dataset = load_dataset('csv', data_files=['sst_test.csv'], read_options=read_options,
                       parse_options=parse_options, convert_options=convert_options)

The features stay the same, and 'label' is still int64:

Dataset(features: {'text': Value(dtype='string', id=None), 'label': Value(dtype='int64', id=None)}, num_rows: 2210)

I think this issue is caused by the “download_mode” parameter, which defaults to REUSE_DATASET_IF_EXISTS, because after I delete the cache_dir the result is correct.

Is it a bug? How should I choose a proper download_mode to avoid this issue?

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

lhoestq commented, Feb 16, 2022

It’s no big deal, but since it can be confusing to users, I think it’s worth renaming it and deprecating GenerateMode until at least datasets 2.0. IMO it’s confusing to have download_mode=GenerateMode.something.

lhoestq commented, Oct 28, 2020

download_mode=datasets.GenerateMode.FORCE_REDOWNLOAD should work. This makes me think we should rename this to DownloadMode.FORCE_REDOWNLOAD. Currently that’s confusing.
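
For anyone landing here later, the suggestion in this thread could look roughly like the sketch below. It is only a sketch under some assumptions: the helper name reload_csv_ignoring_cache is made up for illustration, and it looks for datasets.DownloadMode first and falls back to datasets.GenerateMode, since the thread discusses renaming the enum and which name exists likely depends on the installed version. Extra keyword arguments are passed straight through to load_dataset, so whether options like convert_options are accepted also depends on your version.

```python
def reload_csv_ignoring_cache(csv_path, **csv_kwargs):
    """Load a CSV with load_dataset while forcing the cached copy to be rebuilt.

    Hypothetical helper for illustration; it is not part of the datasets library.
    """
    # Imported lazily so this file can be read without datasets installed.
    import datasets

    # Assumption: newer releases expose DownloadMode, older ones GenerateMode.
    mode_enum = getattr(datasets, "DownloadMode", None)
    if mode_enum is None:
        mode_enum = datasets.GenerateMode

    return datasets.load_dataset(
        "csv",
        data_files=[csv_path],
        # Do not reuse the dataset already prepared in the cache directory.
        download_mode=mode_enum.FORCE_REDOWNLOAD,
        **csv_kwargs,
    )
```

With the options from the question, the call would be something like reload_csv_ignoring_cache('sst_test.csv', convert_options=convert_options). Forcing a rebuild on every call gives up the benefit of the cache, so it probably makes sense only right after changing the loading options.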
