codeparrot/github-code failing to load
Describe the bug
codeparrot/github-code fails to load with a TypeError: get_patterns_in_dataset_repository() missing 1 required positional argument: 'base_path'
Steps to reproduce the bug
from datasets import load_dataset
dataset = load_dataset("codeparrot/github-code")
Expected results
loaded dataset object
Actual results
[3]: dataset = load_dataset("codeparrot/github-code")
No config specified, defaulting to: github-code/all-all
Downloading and preparing dataset github-code/all-all to /home/bebr/.cache/huggingface/datasets/codeparrot___github-code/all-all/0.0.0/a55513bc0f81db773f9896c7aac225af0cff5b323bb9d2f68124f0a8cc3fb817...
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Input In [3], in <cell line: 1>()
----> 1 dataset = load_dataset("codeparrot/github-code")
File ~/miniconda3/envs/fastapi-kube/lib/python3.10/site-packages/datasets/load.py:1679, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, ignore_verifications, keep_in_memory, save_infos, revision, use_auth_token, task, streaming, **config_kwargs)
1676 try_from_hf_gcs = path not in _PACKAGED_DATASETS_MODULES
1678 # Download and prepare data
-> 1679 builder_instance.download_and_prepare(
1680 download_config=download_config,
1681 download_mode=download_mode,
1682 ignore_verifications=ignore_verifications,
1683 try_from_hf_gcs=try_from_hf_gcs,
1684 use_auth_token=use_auth_token,
1685 )
1687 # Build dataset for splits
1688 keep_in_memory = (
1689 keep_in_memory if keep_in_memory is not None else is_small_dataset(builder_instance.info.dataset_size)
1690 )
File ~/miniconda3/envs/fastapi-kube/lib/python3.10/site-packages/datasets/builder.py:704, in DatasetBuilder.download_and_prepare(self, download_config, download_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, **download_and_prepare_kwargs)
702 logger.warning("HF google storage unreachable. Downloading and preparing it from source")
703 if not downloaded_from_gcs:
--> 704 self._download_and_prepare(
705 dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
706 )
707 # Sync info
708 self.info.dataset_size = sum(split.num_bytes for split in self.info.splits.values())
File ~/miniconda3/envs/fastapi-kube/lib/python3.10/site-packages/datasets/builder.py:1221, in GeneratorBasedBuilder._download_and_prepare(self, dl_manager, verify_infos)
1220 def _download_and_prepare(self, dl_manager, verify_infos):
-> 1221 super()._download_and_prepare(dl_manager, verify_infos, check_duplicate_keys=verify_infos)
File ~/miniconda3/envs/fastapi-kube/lib/python3.10/site-packages/datasets/builder.py:771, in DatasetBuilder._download_and_prepare(self, dl_manager, verify_infos, **prepare_split_kwargs)
769 split_dict = SplitDict(dataset_name=self.name)
770 split_generators_kwargs = self._make_split_generators_kwargs(prepare_split_kwargs)
--> 771 split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
773 # Checksums verification
774 if verify_infos and dl_manager.record_checksums:
File ~/.cache/huggingface/modules/datasets_modules/datasets/codeparrot--github-code/a55513bc0f81db773f9896c7aac225af0cff5b323bb9d2f68124f0a8cc3fb817/github-code.py:169, in GithubCode._split_generators(self, dl_manager)
162 def _split_generators(self, dl_manager):
164 hfh_dataset_info = HfApi(datasets.config.HF_ENDPOINT).dataset_info(
165 _REPO_NAME,
166 timeout=100.0,
167 )
--> 169 patterns = datasets.data_files.get_patterns_in_dataset_repository(hfh_dataset_info)
170 data_files = datasets.data_files.DataFilesDict.from_hf_repo(
171 patterns,
172 dataset_info=hfh_dataset_info,
173 )
175 files = dl_manager.download_and_extract(data_files["train"])
TypeError: get_patterns_in_dataset_repository() missing 1 required positional argument: 'base_path'
Environment info
- datasets version: 2.3.2
- Platform: Linux-5.18.7-arch1-1-x86_64-with-glibc2.35
- Python version: 3.10.5
- PyArrow version: 8.0.0
- Pandas version: 1.4.2
Top GitHub Comments
The PR is merged, it’s working now! Closing this one 😃
Good catch! We recently made a breaking change in get_patterns_in_dataset_repository, I think we can revert it.
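For anyone wondering why a signature change produces exactly this error, here is a minimal sketch. The functions below are hypothetical stand-ins, not the actual code from the datasets library; they only mimic a helper that gains a required base_path argument while an existing caller (like the dataset script at line 169 of the traceback) still passes a single argument. The last variant shows one common way to keep old callers working, by giving the new parameter a default; that is an assumption for illustration, not necessarily what the merged PR did.

```python
# Hypothetical stand-ins: these are NOT the real datasets functions.

def get_patterns_old(dataset_info):
    """Earlier shape of the helper: one required argument."""
    return {"train": ["*"]}


def get_patterns_breaking(dataset_info, base_path):
    """After the change: a second required positional argument."""
    return {"train": [f"{base_path}*"]}


def get_patterns_compatible(dataset_info, base_path=""):
    """Backward-compatible variant: the new parameter has a default."""
    return {"train": [f"{base_path}*"]}


def call_like_loading_script(get_patterns, dataset_info):
    """Call the helper with one argument, as an older caller would."""
    try:
        return get_patterns(dataset_info)
    except TypeError as err:
        return f"TypeError: {err}"


info = object()  # placeholder for the repository info object

result_old = call_like_loading_script(get_patterns_old, info)
result_breaking = call_like_loading_script(get_patterns_breaking, info)
result_compatible = call_like_loading_script(get_patterns_compatible, info)

print(result_old)
print(result_breaking)
print(result_compatible)
```

The second call fails with the same kind of "missing 1 required positional argument: 'base_path'" message as in the report, while the first and third return the same patterns.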