LocalDatasetModuleFactoryWithoutScript extracts invalid builder name
Describe the bug
Trying to load a local dataset raises an error indicating that the builder config must have a name. No error should be reported, since the call is completely valid.
Steps to reproduce the bug
from datasets import load_dataset
load_dataset("./data/some-dataset/", name="some-name")
Expected results
The dataset should be loaded.
Actual results
Traceback (most recent call last):
  File "train_lquad.py", line 19, in <module>
    load(tokenize_target_function, tokenize_target_function, {}, tokenizer)
  File "train_lquad.py", line 14, in load
    dataset = load_dataset("./data/lquad/", name="lquad")
  File "/net/pr2/scratch/people/plgapohl/python-3.8.6/lib/python3.8/site-packages/datasets/load.py", line 1708, in load_dataset
    builder_instance = load_dataset_builder(
  File "/net/pr2/scratch/people/plgapohl/python-3.8.6/lib/python3.8/site-packages/datasets/load.py", line 1560, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
  File "/net/pr2/scratch/people/plgapohl/python-3.8.6/lib/python3.8/site-packages/datasets/builder.py", line 269, in __init__
    self.config, self.config_id = self._create_builder_config(
  File "/net/pr2/scratch/people/plgapohl/python-3.8.6/lib/python3.8/site-packages/datasets/builder.py", line 403, in _create_builder_config
    raise ValueError(f"BuilderConfig must have a name, got {builder_config.name}")
ValueError: BuilderConfig must have a name, got
Environment info
- datasets version: 2.2.2
- Platform: Linux-4.18.0-348.20.1.el8_5.x86_64-x86_64-with-glibc2.2.5
- Python version: 3.8.6
- PyArrow version: 8.0.0
- Pandas version: 1.4.2
The error is probably in line 795 in load.py:
builder_kwargs = {
    "hash": hash,
    "data_files": data_files,
    "name": os.path.basename(self.path),
    "base_path": self.path,
    **builder_kwargs,
}
os.path.basename returns an empty string for a directory path that ends with a trailing slash (such as "./data/some-dataset/"), rather than the name of the directory.
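The behaviour can be checked in isolation with a short sketch. The helper dir_name below is hypothetical, introduced only for illustration; it is not part of datasets and is not necessarily what the eventual fix does. It assumes that normalizing the path first (which strips the trailing separator) is an acceptable way to recover the directory name.

```python
import os


def dir_name(path: str) -> str:
    """Hypothetical helper: return the last path component,
    ignoring any trailing separator."""
    # normpath collapses "./" and removes a trailing separator,
    # so basename then sees the directory name as the final component.
    return os.path.basename(os.path.normpath(path))


# With a trailing slash, basename sees an empty final component.
print(repr(os.path.basename("./data/some-dataset/")))  # ''

# Without the trailing slash, the directory name is returned.
print(repr(os.path.basename("./data/some-dataset")))   # 'some-dataset'

# Normalizing first gives the directory name in both cases.
print(repr(dir_name("./data/some-dataset/")))          # 'some-dataset'
```

If this reading is right, passing the path without the trailing slash, as in load_dataset("./data/some-dataset", name="some-name"), may avoid the empty name on the affected version, though that has not been verified here.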
Issue Analytics
- State:
- Created a year ago
- Comments:5 (4 by maintainers)
Top GitHub Comments
@mariosasko here we go:
https://github.com/huggingface/datasets/pull/4967
TBH I haven't tested it yet, but it should work, since this is a basic change.
The fix is: