
codeparrot/github-code failing to load

See original GitHub issue

Describe the bug

codeparrot/github-code fails to load with a TypeError: get_patterns_in_dataset_repository() missing 1 required positional argument: 'base_path'

Steps to reproduce the bug

from datasets import load_dataset

dataset = load_dataset("codeparrot/github-code")

Expected results

loaded dataset object

Actual results

In [3]: dataset = load_dataset("codeparrot/github-code")
No config specified, defaulting to: github-code/all-all
Downloading and preparing dataset github-code/all-all to /home/bebr/.cache/huggingface/datasets/codeparrot___github-code/all-all/0.0.0/a55513bc0f81db773f9896c7aac225af0cff5b323bb9d2f68124f0a8cc3fb817...
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [3], in <cell line: 1>()
----> 1 dataset = load_dataset("codeparrot/github-code")

File ~/miniconda3/envs/fastapi-kube/lib/python3.10/site-packages/datasets/load.py:1679, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, ignore_verifications, keep_in_memory, save_infos, revision, use_auth_token, task, streaming, **config_kwargs)
   1676 try_from_hf_gcs = path not in _PACKAGED_DATASETS_MODULES
   1678 # Download and prepare data
-> 1679 builder_instance.download_and_prepare(
   1680     download_config=download_config,
   1681     download_mode=download_mode,
   1682     ignore_verifications=ignore_verifications,
   1683     try_from_hf_gcs=try_from_hf_gcs,
   1684     use_auth_token=use_auth_token,
   1685 )
   1687 # Build dataset for splits
   1688 keep_in_memory = (
   1689     keep_in_memory if keep_in_memory is not None else is_small_dataset(builder_instance.info.dataset_size)
   1690 )

File ~/miniconda3/envs/fastapi-kube/lib/python3.10/site-packages/datasets/builder.py:704, in DatasetBuilder.download_and_prepare(self, download_config, download_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, **download_and_prepare_kwargs)
    702         logger.warning("HF google storage unreachable. Downloading and preparing it from source")
    703 if not downloaded_from_gcs:
--> 704     self._download_and_prepare(
    705         dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
    706     )
    707 # Sync info
    708 self.info.dataset_size = sum(split.num_bytes for split in self.info.splits.values())

File ~/miniconda3/envs/fastapi-kube/lib/python3.10/site-packages/datasets/builder.py:1221, in GeneratorBasedBuilder._download_and_prepare(self, dl_manager, verify_infos)
   1220 def _download_and_prepare(self, dl_manager, verify_infos):
-> 1221     super()._download_and_prepare(dl_manager, verify_infos, check_duplicate_keys=verify_infos)

File ~/miniconda3/envs/fastapi-kube/lib/python3.10/site-packages/datasets/builder.py:771, in DatasetBuilder._download_and_prepare(self, dl_manager, verify_infos, **prepare_split_kwargs)
    769 split_dict = SplitDict(dataset_name=self.name)
    770 split_generators_kwargs = self._make_split_generators_kwargs(prepare_split_kwargs)
--> 771 split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
    773 # Checksums verification
    774 if verify_infos and dl_manager.record_checksums:

File ~/.cache/huggingface/modules/datasets_modules/datasets/codeparrot--github-code/a55513bc0f81db773f9896c7aac225af0cff5b323bb9d2f68124f0a8cc3fb817/github-code.py:169, in GithubCode._split_generators(self, dl_manager)
    162 def _split_generators(self, dl_manager):
    164     hfh_dataset_info = HfApi(datasets.config.HF_ENDPOINT).dataset_info(
    165         _REPO_NAME,
    166         timeout=100.0,
    167     )
--> 169     patterns = datasets.data_files.get_patterns_in_dataset_repository(hfh_dataset_info)
    170     data_files = datasets.data_files.DataFilesDict.from_hf_repo(
    171         patterns,
    172         dataset_info=hfh_dataset_info,
    173     )
    175     files = dl_manager.download_and_extract(data_files["train"])

TypeError: get_patterns_in_dataset_repository() missing 1 required positional argument: 'base_path'
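
The traceback pins down the mismatch: the Hub-hosted loading script (github-code.py, line 169 above) calls datasets.data_files.get_patterns_in_dataset_repository with a single argument, while the function in datasets 2.3.2 requires an additional positional base_path. A quick way to confirm this locally is to inspect the installed signature (a minimal sketch; only the base_path parameter is known from the error message):

import inspect

import datasets
from datasets.data_files import get_patterns_in_dataset_repository

print(datasets.__version__)  # 2.3.2 in the environment below
# On this version the signature includes a required positional `base_path`,
# which the one-argument call in github-code.py does not provide.
print(inspect.signature(get_patterns_in_dataset_repository))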

Environment info

  • datasets version: 2.3.2
  • Platform: Linux-5.18.7-arch1-1-x86_64-with-glibc2.35
  • Python version: 3.10.5
  • PyArrow version: 8.0.0
  • Pandas version: 1.4.2
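
The list above matches the report produced by the datasets issue template (typically generated with datasets-cli env); a short sketch that prints the same fields, using only the packages already listed, is:

import platform
import sys

import datasets
import pandas
import pyarrow

print("datasets version:", datasets.__version__)
print("Platform:", platform.platform())
print("Python version:", sys.version.split()[0])
print("PyArrow version:", pyarrow.__version__)
print("Pandas version:", pandas.__version__)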

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 8 (6 by maintainers)

Top GitHub Comments

4 reactions
lhoestq commented, Jul 5, 2022

PR is merged, it’s working now ! Closing this one 😃
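
With the fix merged, re-running the reproduction should succeed; here is a minimal check, assuming the revert shipped in a datasets release newer than 2.3.2 (the streaming flag below is a suggestion to avoid downloading the very large default config, not part of the original report):

# Assumed prerequisite: upgrade datasets past 2.3.2, e.g. `pip install -U datasets`.
from datasets import load_dataset

# With no config specified this defaults to "all-all", as in the log above.
ds = load_dataset("codeparrot/github-code", streaming=True)
print(next(iter(ds["train"])))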

2 reactions
lhoestq commented, Jul 1, 2022

Good catch ! We recently did a breaking change in get_patterns_in_dataset_repository, I think we can revert it
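
For context, the failing call can be reproduced in isolation with the names taken from github-code.py in the traceback; on datasets 2.3.2 this raises the same TypeError, and on a release that includes the revert it returns the repo's file patterns again (the HfApi import is an assumption about the script's own imports):

import datasets
from huggingface_hub import HfApi  # assumed import; the script uses HfApi with datasets.config.HF_ENDPOINT

# Names below mirror github-code.py's _split_generators as shown in the traceback.
hfh_dataset_info = HfApi(datasets.config.HF_ENDPOINT).dataset_info(
    "codeparrot/github-code",  # the script's _REPO_NAME
    timeout=100.0,
)

# One-argument call, exactly as in the script: TypeError on 2.3.2, fine after the revert.
patterns = datasets.data_files.get_patterns_in_dataset_repository(hfh_dataset_info)
print(patterns)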

Read more comments on GitHub >

