"Checksums didn't match for dataset source files" error while loading openwebtext dataset
Hi, I encountered this problem while loading the openwebtext dataset:
>>> dataset = load_dataset('openwebtext')
Downloading and preparing dataset openwebtext/plain_text (download: 12.00 GiB, generated: 37.04 GiB, post-processed: Unknown size, total: 49.03 GiB) to /home/admin/.cache/huggingface/datasets/openwebtext/plain_text/1.0.0/5c636399c7155da97c982d0d70ecdce30fbca66a4eb4fc768ad91f8331edac02...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/admin/workspace/anaconda3/envs/torch1.6-py3.7/lib/python3.7/site-packages/datasets/load.py", line 611, in load_dataset
ignore_verifications=ignore_verifications,
File "/home/admin/workspace/anaconda3/envs/torch1.6-py3.7/lib/python3.7/site-packages/datasets/builder.py", line 476, in download_and_prepare
dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
File "/home/admin/workspace/anaconda3/envs/torch1.6-py3.7/lib/python3.7/site-packages/datasets/builder.py", line 536, in _download_and_prepare
self.info.download_checksums, dl_manager.get_recorded_sizes_checksums(), "dataset source files"
File "/home/admin/workspace/anaconda3/envs/torch1.6-py3.7/lib/python3.7/site-packages/datasets/utils/info_utils.py", line 39, in verify_checksums
raise NonMatchingChecksumError(error_msg + str(bad_urls))
datasets.utils.info_utils.NonMatchingChecksumError: Checksums didn't match for dataset source files:
['https://zenodo.org/record/3834942/files/openwebtext.tar.xz']
I think this problem occurs because the released dataset has changed. Or should I download the dataset manually?
Sorry for releasing the unfinished issue by mistake.
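For context, the traceback shows `verify_checksums` comparing the checksums stored with the dataset (`self.info.download_checksums`) against the ones recorded for the files that were actually downloaded, and raising when a URL differs. Here is a minimal sketch of that kind of check for a file you have downloaded yourself. It assumes SHA-256 digests, and the helper names `sha256_of_file` and `find_mismatches` are hypothetical; they are not part of the `datasets` library.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks
    so that a multi-gigabyte archive does not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns b"" at end of file.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(expected, recorded):
    """Return the URLs whose recorded checksum differs from the expected one.

    Both arguments are plain dicts mapping URL -> hex digest (an assumed,
    simplified shape used only for this illustration).
    """
    return [url for url, checksum in expected.items()
            if recorded.get(url) != checksum]
```

If the digest of a freshly downloaded archive differs from the stored one, either the file on the server has changed or the download was corrupted or replaced by something else (for example an error page). The traceback also shows an `ignore_verifications` argument being passed through `load_dataset`, which suggests the check can be skipped, at the cost of losing the integrity guarantee.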
Issue Analytics
- State:
- Created 3 years ago
- Reactions:2
- Comments:8 (2 by maintainers)
Did anyone figure out how to fix this error?
Says fixed but I’m still getting it.
command:
dataset = load_dataset("ted_talks_iwslt", language_pair=("en", "es"), year="2014", download_mode="force_redownload")
got:
Using custom data configuration en_es_2014-35a2d3350a0f9823
Downloading and preparing dataset ted_talks_iwslt/en_es_2014 (download: 2.15 KiB, generated: Unknown size, post-processed: Unknown size, total: 2.15 KiB) to /home/ken/.cache/huggingface/datasets/ted_talks_iwslt/en_es_2014-35a2d3350a0f9823/1.1.0/43935b3fe470c753a023642e1f54b068c590847f9928bd3f2ec99f15702ad6a6...
Downloading: 2.21k/? [00:00<00:00, 141kB/s]
NonMatchingChecksumError: Checksums didn't match for dataset source files: ['https://drive.google.com/u/0/uc?id=1Cz1Un9p8Xn9IpEMMrg2kXSDt0dnjxc4z&export=download']