
load_dataset for text files not working

See original GitHub issue

Trying the following snippet, I get different problems on Linux and Windows.

from datasets import load_dataset

dataset = load_dataset("text", data_files="data.txt")
# or
dataset = load_dataset("text", data_files=["data.txt"])

(P.S. This example shows that a plain string works as input for data_files, even though the signature is annotated as Union[Dict, List].)

On Linux, the script crashes with a CSV parse error (even though the input is not a CSV file). On Windows, the script simply freezes after loading the config file.
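For context on the Linux crash: the error suggests the text file is being fed through a CSV parser, which splits any line containing the delimiter into extra columns. A minimal illustration of that failure mode with the standard-library csv module (a sketch, not the actual loader code):

```python
import csv
import io

# A plain-text "dataset": the second line happens to contain a comma,
# so a CSV reader sees two columns instead of one.
text = "a plain line\nhello, world\n"
rows = list(csv.reader(io.StringIO(text)))
print([len(r) for r in rows])  # -> [1, 2]: column counts disagree across rows
```

A strict CSV reader (like pyarrow's) that expects a fixed column count rejects such input, which matches the "Expected 1 columns, got 2" error in the trace below.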

Linux stack trace:

PyTorch version 1.6.0+cu101 available.
Checking /home/bram/.cache/huggingface/datasets/b1d50a0e74da9a7b9822cea8ff4e4f217dd892e09eb14f6274a2169e5436e2ea.30c25842cda32b0540d88b7195147decf9671ee442f4bc2fb6ad74016852978e.py for additional imports.
Found main folder for dataset https://raw.githubusercontent.com/huggingface/datasets/1.0.1/datasets/text/text.py at /home/bram/.cache/huggingface/modules/datasets_modules/datasets/text
Found specific version folder for dataset https://raw.githubusercontent.com/huggingface/datasets/1.0.1/datasets/text/text.py at /home/bram/.cache/huggingface/modules/datasets_modules/datasets/text/7e13bc0fa76783d4ef197f079dc8acfe54c3efda980f2c9adfab046ede2f0ff7
Found script file from https://raw.githubusercontent.com/huggingface/datasets/1.0.1/datasets/text/text.py to /home/bram/.cache/huggingface/modules/datasets_modules/datasets/text/7e13bc0fa76783d4ef197f079dc8acfe54c3efda980f2c9adfab046ede2f0ff7/text.py
Couldn't find dataset infos file at https://raw.githubusercontent.com/huggingface/datasets/1.0.1/datasets/text/dataset_infos.json
Found metadata file for dataset https://raw.githubusercontent.com/huggingface/datasets/1.0.1/datasets/text/text.py at /home/bram/.cache/huggingface/modules/datasets_modules/datasets/text/7e13bc0fa76783d4ef197f079dc8acfe54c3efda980f2c9adfab046ede2f0ff7/text.json
Using custom data configuration default
Generating dataset text (/home/bram/.cache/huggingface/datasets/text/default-0907112cc6cd2a38/0.0.0/7e13bc0fa76783d4ef197f079dc8acfe54c3efda980f2c9adfab046ede2f0ff7)
Downloading and preparing dataset text/default-0907112cc6cd2a38 (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /home/bram/.cache/huggingface/datasets/text/default-0907112cc6cd2a38/0.0.0/7e13bc0fa76783d4ef197f079dc8acfe54c3efda980f2c9adfab046ede2f0ff7...
Dataset not on Hf google storage. Downloading and preparing it from source
Downloading took 0.0 min
Checksum Computation took 0.0 min
Unable to verify checksums.
Generating split train
Traceback (most recent call last):
  File "/home/bram/Python/projects/dutch-simplification/utils.py", line 45, in prepare_data
    dataset = load_dataset("text", data_files=dataset_f)
  File "/home/bram/.local/share/virtualenvs/dutch-simplification-NcpPZtDF/lib/python3.8/site-packages/datasets/load.py", line 608, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/bram/.local/share/virtualenvs/dutch-simplification-NcpPZtDF/lib/python3.8/site-packages/datasets/builder.py", line 468, in download_and_prepare
    self._download_and_prepare(
  File "/home/bram/.local/share/virtualenvs/dutch-simplification-NcpPZtDF/lib/python3.8/site-packages/datasets/builder.py", line 546, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/bram/.local/share/virtualenvs/dutch-simplification-NcpPZtDF/lib/python3.8/site-packages/datasets/builder.py", line 888, in _prepare_split
    for key, table in utils.tqdm(generator, unit=" tables", leave=False, disable=not_verbose):
  File "/home/bram/.local/share/virtualenvs/dutch-simplification-NcpPZtDF/lib/python3.8/site-packages/tqdm/std.py", line 1130, in __iter__
    for obj in iterable:
  File "/home/bram/.cache/huggingface/modules/datasets_modules/datasets/text/7e13bc0fa76783d4ef197f079dc8acfe54c3efda980f2c9adfab046ede2f0ff7/text.py", line 100, in _generate_tables
    pa_table = pac.read_csv(
  File "pyarrow/_csv.pyx", line 714, in pyarrow._csv.read_csv
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: CSV parse error: Expected 1 columns, got 2

Windows just seems to get stuck. Even with a tiny dataset of 10 lines, it has been stuck for 15 minutes already at this message:

Checking C:\Users\bramv\.cache\huggingface\datasets\b1d50a0e74da9a7b9822cea8ff4e4f217dd892e09eb14f6274a2169e5436e2ea.30c25842cda32b0540d88b7195147decf9671ee442f4bc2fb6ad74016852978e.py for additional imports.
Found main folder for dataset https://raw.githubusercontent.com/huggingface/datasets/1.0.1/datasets/text/text.py at C:\Users\bramv\.cache\huggingface\modules\datasets_modules\datasets\text
Found specific version folder for dataset https://raw.githubusercontent.com/huggingface/datasets/1.0.1/datasets/text/text.py at C:\Users\bramv\.cache\huggingface\modules\datasets_modules\datasets\text\7e13bc0fa76783d4ef197f079dc8acfe54c3efda980f2c9adfab046ede2f0ff7
Found script file from https://raw.githubusercontent.com/huggingface/datasets/1.0.1/datasets/text/text.py to C:\Users\bramv\.cache\huggingface\modules\datasets_modules\datasets\text\7e13bc0fa76783d4ef197f079dc8acfe54c3efda980f2c9adfab046ede2f0ff7\text.py
Couldn't find dataset infos file at https://raw.githubusercontent.com/huggingface/datasets/1.0.1/datasets/text\dataset_infos.json
Found metadata file for dataset https://raw.githubusercontent.com/huggingface/datasets/1.0.1/datasets/text/text.py at C:\Users\bramv\.cache\huggingface\modules\datasets_modules\datasets\text\7e13bc0fa76783d4ef197f079dc8acfe54c3efda980f2c9adfab046ede2f0ff7\text.json
Using custom data configuration default

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 3
  • Comments: 41 (26 by maintainers)

Top GitHub Comments

4 reactions · lhoestq commented, Oct 5, 2020

I found a way to implement it without a third-party library and without separator/delimiter logic. Creating a PR now.

I’d love to have your feedback on the PR @Skyy93 , hopefully this is the final iteration of the text dataset 😃

Let me know if it works on your side !

Until there’s a new release you can test it with

from datasets import load_dataset

d = load_dataset("text", data_files=..., script_version="master")
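The expected behavior of the text loader is simple: one example per line, with the trailing newline stripped and no delimiter handling at all. A minimal pure-Python sketch of that mapping (hypothetical; the helper name is made up and this is not the actual text.py implementation):

```python
def lines_to_examples(raw: str):
    """Map raw file contents to dataset examples: one example per line."""
    # splitlines() strips line endings and drops the empty trailing line,
    # so commas and tabs inside a line are left untouched.
    return [{"text": line} for line in raw.splitlines()]

examples = lines_to_examples("first line\nsecond, with a comma\n")
print(examples)  # -> [{'text': 'first line'}, {'text': 'second, with a comma'}]
```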
1 reaction · iliemihai commented, Oct 20, 2020

@lhoestq I’ve opened a new issue #743 and added a Colab notebook. When you have some free time, could you please take a look? I would appreciate it 😄. Thank you.

