load_dataset method returns Unknown split "validation" even if this dir exists
Describe the bug
Calling datasets.load_dataset raises
ValueError: Unknown split "validation". Should be one of ['train', 'test'].
when running load_dataset(local_data_dir_path, split="validation"), even though the validation sub-directory exists in the local data path.
The data directories are as follows and attached to this issue:
test_data1
|_ train
   |_ 1012.png
   |_ metadata.jsonl
   ...
|_ test
   ...
|_ validation
   |_ 234.png
   |_ metadata.jsonl
   ...

test_data2
|_ train
   |_ train_1012.png
   |_ metadata.jsonl
   ...
|_ test
   ...
|_ validation
   |_ val_234.png
   |_ metadata.jsonl
   ...
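For readers without the attached archives: each metadata.jsonl follows the imagefolder convention of one JSON object per line with a required file_name column. A minimal sketch of such a file is below; the extra caption column is an assumption for illustration, not the contents of the attached data.

import json
from pathlib import Path

# Sketch: write a one-line metadata.jsonl in the imagefolder format.
# Only "file_name" is required; the "caption" column is assumed.
split_dir = Path("test_data2") / "validation"
split_dir.mkdir(parents=True, exist_ok=True)
with open(split_dir / "metadata.jsonl", "w") as f:
    f.write(json.dumps({"file_name": "val_234.png", "caption": "an example caption"}) + "\n")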
Both directories contain the same image files and metadata.jsonl files. The only difference is that the images in test_data2 have the split names prepended (i.e. train_1012.png, val_234.png), while the images in test_data1 do not (i.e. 1012.png, 234.png).
I saw in another issue that val was not recognized as a split name, but here I would expect the files to take their split from the parent directory name, i.e. val_234.png should become part of the validation split because it lives under validation/.
Steps to reproduce the bug
import datasets
datasets.logging.set_verbosity_error()
from datasets import load_dataset, get_dataset_split_names

# the following correctly finds the train, validation and test splits
path = "./test_data1"
print("######################", get_dataset_split_names(path), "######################")
dataset_list = []
for spt in ["train", "test", "validation"]:
    dataset = load_dataset(path, split=spt)
    dataset_list.append(dataset)

# the following only finds the train and test splits
path = "./test_data2"
print("######################", get_dataset_split_names(path), "######################")
dataset_list = []
for spt in ["train", "test", "validation"]:
    dataset = load_dataset(path, split=spt)
    dataset_list.append(dataset)
Expected results
###################### ['train', 'test', 'validation'] ######################
###################### ['train', 'test', 'validation'] ######################
Actual results
Traceback (most recent call last):
  File "test_data_loader.py", line 11, in <module>
    dataset = load_dataset(path, split=spt)
  File "/home/venv/lib/python3.8/site-packages/datasets/load.py", line 1758, in load_dataset
    ds = builder_instance.as_dataset(split=split, ignore_verifications=ignore_verifications, in_memory=keep_in_memory)
  File "/home/venv/lib/python3.8/site-packages/datasets/builder.py", line 893, in as_dataset
    datasets = map_nested(
  File "/home/venv/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 385, in map_nested
    return function(data_struct)
  File "/home/venv/lib/python3.8/site-packages/datasets/builder.py", line 924, in _build_single_dataset
    ds = self._as_dataset(
  File "/home/venv/lib/python3.8/site-packages/datasets/builder.py", line 993, in _as_dataset
    dataset_kwargs = ArrowReader(self._cache_dir, self.info).read(
  File "/home/venv/lib/python3.8/site-packages/datasets/arrow_reader.py", line 211, in read
    files = self.get_file_instructions(name, instructions, split_infos)
  File "/home/venv/lib/python3.8/site-packages/datasets/arrow_reader.py", line 184, in get_file_instructions
    file_instructions = make_file_instructions(
  File "/home/venv/lib/python3.8/site-packages/datasets/arrow_reader.py", line 107, in make_file_instructions
    absolute_instructions = instruction.to_absolute(name2len)
  File "/home/venv/lib/python3.8/site-packages/datasets/arrow_reader.py", line 616, in to_absolute
    return [_rel_to_abs_instr(rel_instr, name2len) for rel_instr in self._relative_instructions]
  File "/home/venv/lib/python3.8/site-packages/datasets/arrow_reader.py", line 616, in <listcomp>
    return [_rel_to_abs_instr(rel_instr, name2len) for rel_instr in self._relative_instructions]
  File "/home/venv/lib/python3.8/site-packages/datasets/arrow_reader.py", line 433, in _rel_to_abs_instr
    raise ValueError(f'Unknown split "{split}". Should be one of {list(name2len)}.')
ValueError: Unknown split "validation". Should be one of ['train', 'test'].
Environment info
- datasets version:
- Platform: Linux Ubuntu 18.04
- Python version: 3.8.12
- PyArrow version: 9.0.0
Top GitHub Comments
@polinaeterna I have solved the issue. The solution was to call:
load_dataset("csv", data_files={split: files}, split=split)
This code indeed behaves as expected on main. But suppose val_234.png is renamed to some other value that does not contain one of these keywords; in that case this issue becomes relevant again, because its real cause is the order in which we check the predefined split patterns when assigning data files to each split: first we assign data files based on file names, and only if this fails, meaning not a single split is found (val is not recognized here in older versions of datasets, which results in an empty validation split), do we assign based on directory names.

@polinaeterna @lhoestq Perhaps one way to fix this would be to swap the order of the patterns if data_dir is specified (or if load_dataset(data_dir) is called)?
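A rough illustration of the resolution order described above (simplified pseudo-logic, not the actual datasets implementation; the pattern strings and the resolve helper are assumptions):

# Sketch of "filename patterns first, directory patterns second".
# `resolve` is a hypothetical helper mapping a glob pattern to matching files.
FILENAME_PATTERNS = {"train": "**/*train*", "test": "**/*test*", "validation": "**/*valid*"}
DIRECTORY_PATTERNS = {"train": "train/**", "test": "test/**", "validation": "validation/**"}

def infer_splits(resolve):
    for patterns in (FILENAME_PATTERNS, DIRECTORY_PATTERNS):
        splits = {name: resolve(pattern) for name, pattern in patterns.items()}
        splits = {name: files for name, files in splits.items() if files}
        if splits:
            # The first pattern set that matches anything wins: train_*.png and the
            # test files satisfy the filename pass, so directory names are never
            # consulted and val_234.png (no "valid" keyword) ends up in no split.
            return splits
    return {}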