
load_dataset method returns Unknown split "validation" even if this dir exists


Describe the bug

The datasets.load_dataset method raises ValueError: Unknown split "validation". Should be one of ['train', 'test'] when running load_dataset(local_data_dir_path, split="validation"), even though the validation sub-directory exists in the local data path.

The data directories are as follows and attached to this issue:

test_data1
  |_ train
      |_ 1012.png
      |_ metadata.jsonl
      ...
  |_ test
      ...
  |_ validation
      |_ 234.png
      |_ metadata.jsonl
      ...
test_data2
  |_ train
      |_ train_1012.png
      |_ metadata.jsonl
      ...
  |_ test
      ...
  |_ validation
      |_ val_234.png
      |_ metadata.jsonl
      ...

Both directories contain the same image files and metadata.jsonl, but the images in test_data2 have the split names prepended (e.g. train_1012.png, val_234.png), while the images in test_data1 do not (e.g. 1012.png, 234.png).

I saw in another issue that val was not recognized as a split name, but here I would expect the files to take their split from the parent directory name, i.e. files under validation/ should become part of the validation split.
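For anyone without the attached zips, here is a minimal sketch that recreates both layouts with placeholder images. It assumes Pillow is installed; the single 234.png per split and the label column are illustrative stand-ins, not taken from the attached data.

import json
from pathlib import Path

from PIL import Image


def make_split(root: Path, split: str, prefix: str = "") -> None:
    """Create one split directory with a placeholder image and metadata.jsonl."""
    split_dir = root / split
    split_dir.mkdir(parents=True, exist_ok=True)
    name = f"{prefix}234.png"  # illustrative file name
    Image.new("RGB", (8, 8)).save(split_dir / name)
    (split_dir / "metadata.jsonl").write_text(
        json.dumps({"file_name": name, "label": 0}) + "\n"
    )


prefixes = {"train": "train_", "test": "test_", "validation": "val_"}
for split in ["train", "test", "validation"]:
    make_split(Path("test_data1"), split)                   # plain file names
    make_split(Path("test_data2"), split, prefixes[split])  # split name prepended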

Steps to reproduce the bug

import datasets
datasets.logging.set_verbosity_error()
from datasets import load_dataset, get_dataset_split_names


# for test_data1, the train, test and validation splits are all found correctly
path = "./test_data1"
print("######################", get_dataset_split_names(path), "######################")

dataset_list = []
for spt in ["train", "test", "validation"]:
    dataset = load_dataset(path, split=spt)
    dataset_list.append(dataset)


# for test_data2, only the train and test splits are found
path = "./test_data2"
print("######################", get_dataset_split_names(path), "######################")

dataset_list = []
for spt in ["train", "test", "validation"]:
    dataset = load_dataset(path, split=spt)
    dataset_list.append(dataset)

Expected results

###################### ['train', 'test', 'validation'] ######################
###################### ['train', 'test', 'validation'] ######################

Actual results

Traceback (most recent call last):
  File "test_data_loader.py", line 11, in <module>

    dataset = load_dataset(path, split=spt)
  File "/home/venv/lib/python3.8/site-packages/datasets/load.py", line 1758, in load_dataset
    ds = builder_instance.as_dataset(split=split, ignore_verifications=ignore_verifications, in_memory=keep_in_memory)
  File "/home/venv/lib/python3.8/site-packages/datasets/builder.py", line 893, in as_dataset
    datasets = map_nested(
  File "/home/venv/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 385, in map_nested
    return function(data_struct)
  File "/home/venv/lib/python3.8/site-packages/datasets/builder.py", line 924, in _build_single_dataset
    ds = self._as_dataset(
  File "/home/venv/lib/python3.8/site-packages/datasets/builder.py", line 993, in _as_dataset
    dataset_kwargs = ArrowReader(self._cache_dir, self.info).read(
  File "/home/venv/lib/python3.8/site-packages/datasets/arrow_reader.py", line 211, in read
    files = self.get_file_instructions(name, instructions, split_infos)
  File "/home/venv/lib/python3.8/site-packages/datasets/arrow_reader.py", line 184, in get_file_instructions
    file_instructions = make_file_instructions(
  File "/home/venv/lib/python3.8/site-packages/datasets/arrow_reader.py", line 107, in make_file_instructions
    absolute_instructions = instruction.to_absolute(name2len)
  File "/home/venv/lib/python3.8/site-packages/datasets/arrow_reader.py", line 616, in to_absolute
    return [_rel_to_abs_instr(rel_instr, name2len) for rel_instr in self._relative_instructions]
  File "/home/venv/lib/python3.8/site-packages/datasets/arrow_reader.py", line 616, in <listcomp>
    return [_rel_to_abs_instr(rel_instr, name2len) for rel_instr in self._relative_instructions]
  File "/home/venv/lib/python3.8/site-packages/datasets/arrow_reader.py", line 433, in _rel_to_abs_instr
    raise ValueError(f'Unknown split "{split}". Should be one of {list(name2len)}.')
ValueError: Unknown split "validation". Should be one of ['train', 'test'].

Environment info

  • datasets version:
  • Platform: Linux Ubuntu 18.04
  • Python version: 3.8.12
  • PyArrow version: 9.0.0

Data files

test_data1.zip, test_data2.zip

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 17 (8 by maintainers)

Top GitHub Comments

shaneacton commented on Oct 6, 2022 (1 reaction):

@polinaeterna I have solved the issue. The solution was to call: load_dataset("csv", data_files={split: files}, split=split)
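Applying the same idea to the imagefolder layout in this issue, one could pass data_files explicitly so that split inference is skipped altogether. A sketch under that assumption (the glob patterns simply mirror the directory layout above):

from datasets import load_dataset

# Assign each split's directory explicitly instead of relying on
# filename-based split inference.
data_files = {
    "train": "test_data2/train/**",
    "test": "test_data2/test/**",
    "validation": "test_data2/validation/**",
}
ds = load_dataset("imagefolder", data_files=data_files)
print(ds)  # DatasetDict with train, test and validation splits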

mariosasko commented on Sep 15, 2022 (1 reaction):

This code indeed behaves as expected on main. But suppose val_234.png is renamed to some value that contains none of these keywords; in that case the issue becomes relevant again, because its real cause is the order in which we check the predefined split patterns when assigning data files to splits: first we assign data files based on filenames, and only if that fails (meaning not a single split is found) do we assign based on directory names. Here, val is not recognized in older versions of datasets, which results in an empty validation split.

@polinaeterna @lhoestq Perhaps one way to fix this would be to swap the order of the patterns if data_dir is specified (or if load_dataset(data_dir) is called)?
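To make the matching order described above concrete, here is a simplified, hypothetical illustration of the logic (not the actual datasets source): filename-based patterns run first, and any hit there prevents the directory-based fallback from ever running.

from pathlib import Path

# Keywords the filename-based patterns recognize; in older versions of
# datasets "valid" matched but a bare "val" prefix did not (a hypothetical
# simplification of the real pattern list).
FILENAME_KEYWORDS = {"train": "train", "test": "test", "validation": "valid"}


def infer_splits(root: str) -> dict:
    files = list(Path(root).rglob("*.png"))
    # 1) filename-based patterns are checked first
    by_name = {
        split: [f for f in files if keyword in f.name]
        for split, keyword in FILENAME_KEYWORDS.items()
    }
    by_name = {split: fs for split, fs in by_name.items() if fs}
    if by_name:
        # any filename hit means the directory patterns are never tried,
        # so val_234.png silently drops out of the validation split
        return by_name
    # 2) fall back to directory-based patterns
    return {
        split: [f for f in files if f.parent.name == split]
        for split in ["train", "test", "validation"]
    }


print(infer_splits("test_data2"))  # finds only 'train' and 'test'

Under the proposed fix, step 2 would run before step 1 whenever data_dir points at a local folder.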
