codeparrot/github-code failing to load

Describe the bug

codeparrot/github-code fails to load with a TypeError: get_patterns_in_dataset_repository() missing 1 required positional argument: 'base_path'

Steps to reproduce the bug

from datasets import load_dataset
dataset = load_dataset("codeparrot/github-code")

Expected results

A loaded dataset object.

Actual results

In [3]: dataset = load_dataset("codeparrot/github-code")
No config specified, defaulting to: github-code/all-all
Downloading and preparing dataset github-code/all-all to /home/bebr/.cache/huggingface/datasets/codeparrot___github-code/all-all/0.0.0/a55513bc0f81db773f9896c7aac225af0cff5b323bb9d2f68124f0a8cc3fb817...
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [3], in <cell line: 1>()
----> 1 dataset = load_dataset("codeparrot/github-code")

File ~/miniconda3/envs/fastapi-kube/lib/python3.10/site-packages/datasets/load.py:1679, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, ignore_verifications, keep_in_memory, save_infos, revision, use_auth_token, task, streaming, **config_kwargs)
   1676 try_from_hf_gcs = path not in _PACKAGED_DATASETS_MODULES
   1678 # Download and prepare data
-> 1679 builder_instance.download_and_prepare(
   1680     download_config=download_config,
   1681     download_mode=download_mode,
   1682     ignore_verifications=ignore_verifications,
   1683     try_from_hf_gcs=try_from_hf_gcs,
   1684     use_auth_token=use_auth_token,
   1685 )
   1687 # Build dataset for splits
   1688 keep_in_memory = (
   1689     keep_in_memory if keep_in_memory is not None else is_small_dataset(builder_instance.info.dataset_size)
   1690 )

File ~/miniconda3/envs/fastapi-kube/lib/python3.10/site-packages/datasets/builder.py:704, in DatasetBuilder.download_and_prepare(self, download_config, download_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, **download_and_prepare_kwargs)
    702         logger.warning("HF google storage unreachable. Downloading and preparing it from source")
    703 if not downloaded_from_gcs:
--> 704     self._download_and_prepare(
    705         dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
    706     )
    707 # Sync info
    708 self.info.dataset_size = sum(split.num_bytes for split in self.info.splits.values())

File ~/miniconda3/envs/fastapi-kube/lib/python3.10/site-packages/datasets/builder.py:1221, in GeneratorBasedBuilder._download_and_prepare(self, dl_manager, verify_infos)
   1220 def _download_and_prepare(self, dl_manager, verify_infos):
-> 1221     super()._download_and_prepare(dl_manager, verify_infos, check_duplicate_keys=verify_infos)

File ~/miniconda3/envs/fastapi-kube/lib/python3.10/site-packages/datasets/builder.py:771, in DatasetBuilder._download_and_prepare(self, dl_manager, verify_infos, **prepare_split_kwargs)
    769 split_dict = SplitDict(dataset_name=self.name)
    770 split_generators_kwargs = self._make_split_generators_kwargs(prepare_split_kwargs)
--> 771 split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
    773 # Checksums verification
    774 if verify_infos and dl_manager.record_checksums:

File ~/.cache/huggingface/modules/datasets_modules/datasets/codeparrot--github-code/a55513bc0f81db773f9896c7aac225af0cff5b323bb9d2f68124f0a8cc3fb817/github-code.py:169, in GithubCode._split_generators(self, dl_manager)
    162 def _split_generators(self, dl_manager):
    164     hfh_dataset_info = HfApi(datasets.config.HF_ENDPOINT).dataset_info(
    165         _REPO_NAME,
    166         timeout=100.0,
    167     )
--> 169     patterns = datasets.data_files.get_patterns_in_dataset_repository(hfh_dataset_info)
    170     data_files = datasets.data_files.DataFilesDict.from_hf_repo(
    171         patterns,
    172         dataset_info=hfh_dataset_info,
    173     )
    175     files = dl_manager.download_and_extract(data_files["train"])

TypeError: get_patterns_in_dataset_repository() missing 1 required positional argument: 'base_path'

Environment info

datasets version: 2.3.2
Platform: Linux-5.18.7-arch1-1-x86_64-with-glibc2.35
Python version: 3.10.5
PyArrow version: 8.0.0
Pandas version: 1.4.2

Top GitHub Comments

4 reactions

lhoestq commented, Jul 5, 2022

PR is merged, it’s working now! Closing this one 😃

2 reactions

lhoestq commented, Jul 1, 2022

Good catch! We recently did a breaking change in get_patterns_in_dataset_repository, I think we can revert it.
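
To illustrate the kind of breaking change described in the comment above, here is a minimal sketch using toy stand-in functions. Everything in it is hypothetical: `get_patterns_old`, `get_patterns_new`, `get_patterns_compat`, and `call_like_loading_script` are not part of the `datasets` library, and the returned patterns are made up. The only detail taken from the issue is that the caller passes a single argument while the new signature also requires `base_path`. The default-argument variant is just one possible way to restore compatibility, not necessarily what the maintainers merged.

```python
# Hypothetical stand-ins: these are NOT the real `datasets` functions,
# only toy versions with the same shape as the change discussed in the issue.

def get_patterns_old(dataset_info):
    """Toy signature before the change: a single positional argument."""
    return {"train": ["**"]}


def get_patterns_new(dataset_info, base_path):
    """Toy signature after the change: `base_path` is now required."""
    prefix = f"{base_path}/" if base_path else ""
    return {"train": [f"{prefix}**"]}


def get_patterns_compat(dataset_info, base_path=""):
    """One possible backward-compatible variant (an assumption, not the
    actual fix): giving `base_path` a default keeps old callers working."""
    return get_patterns_new(dataset_info, base_path)


def call_like_loading_script(get_patterns):
    """Call the helper with one argument, as the loading script in the
    traceback does, and report a TypeError instead of crashing."""
    try:
        return get_patterns("dataset-info-placeholder")
    except TypeError as err:
        return f"TypeError: {err}"


if __name__ == "__main__":
    print(call_like_loading_script(get_patterns_old))     # works
    print(call_like_loading_script(get_patterns_new))     # reports the TypeError
    print(call_like_loading_script(get_patterns_compat))  # works again
```

The middle call fails for the same reason as the traceback: an existing caller that was written against the one-argument signature no longer supplies every required positional argument.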
