"Checksums didn't match for dataset source files" error while loading openwebtext dataset

Hi, I encountered this problem while loading the openwebtext dataset:

>>> dataset = load_dataset('openwebtext')
Downloading and preparing dataset openwebtext/plain_text (download: 12.00 GiB, generated: 37.04 GiB, post-processed: Unknown size, total: 49.03 GiB) to /home/admin/.cache/huggingface/datasets/openwebtext/plain_text/1.0.0/5c636399c7155da97c982d0d70ecdce30fbca66a4eb4fc768ad91f8331edac02...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/admin/workspace/anaconda3/envs/torch1.6-py3.7/lib/python3.7/site-packages/datasets/load.py", line 611, in load_dataset
    ignore_verifications=ignore_verifications,
  File "/home/admin/workspace/anaconda3/envs/torch1.6-py3.7/lib/python3.7/site-packages/datasets/builder.py", line 476, in download_and_prepare
    dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
  File "/home/admin/workspace/anaconda3/envs/torch1.6-py3.7/lib/python3.7/site-packages/datasets/builder.py", line 536, in _download_and_prepare
    self.info.download_checksums, dl_manager.get_recorded_sizes_checksums(), "dataset source files"
  File "/home/admin/workspace/anaconda3/envs/torch1.6-py3.7/lib/python3.7/site-packages/datasets/utils/info_utils.py", line 39, in verify_checksums
    raise NonMatchingChecksumError(error_msg + str(bad_urls))
datasets.utils.info_utils.NonMatchingChecksumError: Checksums didn't match for dataset source files:
['https://zenodo.org/record/3834942/files/openwebtext.tar.xz']

I think this happens because the released dataset file has changed. Or should I download the dataset manually?
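To check whether a downloaded file really differs from what was recorded, you can hash it yourself and compare. The following is a minimal sketch, not the `datasets` library's own implementation: it assumes the recorded checksums are SHA-256 hex digests, and the helper names `sha256_of_file` and `find_mismatches` are hypothetical.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(expected, recorded):
    """Return the URLs whose recorded checksum differs from the expected one.

    Both arguments are plain dicts mapping URL -> hex digest (an assumed,
    simplified layout for illustration).
    """
    return [url for url, checksum in expected.items() if recorded.get(url) != checksum]
```

If the digest of the freshly downloaded archive differs from the expected one, either the upstream file changed or the download was incomplete or corrupted.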

Sorry for releasing the unfinished issue by mistake.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions: 2
  • Comments: 8 (2 by maintainers)

Top GitHub Comments

3 reactions
RylanSchaeffer commented, Oct 10, 2021

Did anyone figure out how to fix this error?

1 reaction
OtwellResearch commented, Feb 17, 2022

Says fixed but I’m still getting it.

command:

dataset = load_dataset("ted_talks_iwslt", language_pair=("en", "es"), year="2014", download_mode="force_redownload")

got:

Using custom data configuration en_es_2014-35a2d3350a0f9823
Downloading and preparing dataset ted_talks_iwslt/en_es_2014 (download: 2.15 KiB, generated: Unknown size, post-processed: Unknown size, total: 2.15 KiB) to /home/ken/.cache/huggingface/datasets/ted_talks_iwslt/en_es_2014-35a2d3350a0f9823/1.1.0/43935b3fe470c753a023642e1f54b068c590847f9928bd3f2ec99f15702ad6a6…
Downloading: 2.21k/? [00:00<00:00, 141kB/s]

NonMatchingChecksumError: Checksums didn't match for dataset source files: ['https://drive.google.com/u/0/uc?id=1Cz1Un9p8Xn9IpEMMrg2kXSDt0dnjxc4z&export=download']
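One possible workaround, if you trust the source, is to turn off checksum verification. The traceback earlier in this thread shows that `load_dataset` takes an `ignore_verifications` argument in that version. The sketch below assumes, from my recollection of the `datasets` documentation rather than from this thread, that the argument was replaced by `verification_mode` in version 2.9.1. The helper names are hypothetical, and skipping verification means a corrupted or changed file will not be detected.

```python
import re


def verification_kwargs(datasets_version):
    """Pick the keyword that disables checksum verification for a given version.

    Assumption: `ignore_verifications` was replaced by `verification_mode`
    in datasets 2.9.1; check the documentation for your installed version.
    """
    # Keep only the leading numeric components, e.g. "2.14.0.dev0" -> (2, 14, 0).
    numbers = tuple(int(n) for n in re.findall(r"\d+", datasets_version)[:3])
    if numbers >= (2, 9, 1):
        return {"verification_mode": "no_checks"}
    return {"ignore_verifications": True}


def load_without_checks(name, **kwargs):
    """Load a dataset with checksum verification disabled."""
    import datasets  # imported lazily so verification_kwargs works without it

    kwargs.update(verification_kwargs(datasets.__version__))
    return datasets.load_dataset(name, **kwargs)
```

For example, `load_without_checks("openwebtext")` would forward whichever keyword matches the installed version.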
