Can't extract files from `.7z` zipfile using `download_and_extract`
Describe the bug
I'm adding a new dataset which is a `.7z` archive hosted on Google Drive and contains 3 JSON files. I'm able to download the data files using `download_and_extract`, but after downloading it throws this error:
>>> dataset = load_dataset("./datasets/mantis/")
Using custom data configuration default
Downloading and preparing dataset mantis/default to /Users/bhavitvyamalik/.cache/huggingface/datasets/mantis/default/1.1.0/611affa804ec53e2055a335cc1b8b213bb5a0b5142d919967729d5ee23c6bab4...
Downloading data: 100%|██████████| 77.2M/77.2M [00:23<00:00, 3.28MB/s]
/Users/bhavitvyamalik/.cache/huggingface/datasets/downloads/fc3d70123c9de8407587a59aa426c37819cf2bf016795d33270e8a1d558a34e6
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/bhavitvyamalik/Desktop/work/hf/datasets/src/datasets/load.py", line 1745, in load_dataset
use_auth_token=use_auth_token,
File "/Users/bhavitvyamalik/Desktop/work/hf/datasets/src/datasets/builder.py", line 595, in download_and_prepare
dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
File "/Users/bhavitvyamalik/Desktop/work/hf/datasets/src/datasets/builder.py", line 690, in _download_and_prepare
) from None
OSError: Cannot find data file.
Original error:
[Errno 20] Not a directory: '/Users/bhavitvyamalik/.cache/huggingface/datasets/downloads/fc3d70123c9de8407587a59aa426c37819cf2bf016795d33270e8a1d558a34e6/merged_train.json'
This happens just before generating the splits. I checked the `fc3d70123c9de8407587a59aa426c37819cf2bf016795d33270e8a1d558a34e6` file and it's a 7z archive (the same as the file downloaded from Google Drive), which means it didn't get extracted. Do I need to extract it separately and then pass the paths for the train, dev, and test files in `SplitGenerator`?
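One way to confirm this diagnosis is to check the file signature of the cached download. The sketch below is only an illustration: the helper name `looks_like_7z` is hypothetical and not part of `datasets`; it relies on the fixed six-byte signature that every 7z archive starts with.

```python
# First six bytes of every 7z archive (the format's fixed signature).
SEVEN_ZIP_MAGIC = b"7z\xbc\xaf\x27\x1c"


def looks_like_7z(path):
    """Return True if the file at `path` starts with the 7z signature.

    Hypothetical helper for illustration; not a `datasets` API.
    """
    with open(path, "rb") as f:
        return f.read(len(SEVEN_ZIP_MAGIC)) == SEVEN_ZIP_MAGIC
```

If this returns `True` for the hashed path under `downloads/`, the archive was downloaded but left unextracted, which would explain the `[Errno 20] Not a directory` error when `merged_train.json` is joined onto that path.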
Environment info
- `datasets` version: 1.18.4.dev0
- Platform: Darwin-19.6.0-x86_64-i386-64bit
- Python version: 3.7.8
- PyArrow version: 5.0.0
Issue Analytics
- State:
- Created a year ago
- Comments:5 (5 by maintainers)
Top GitHub Comments
Hi @bhavitvyamalik, thanks for reporting.
Yes, currently we do not support 7zip archive compression: I think we should.
As a workaround, you could uncompress it explicitly, as done in e.g. the `samsum` dataset: https://github.com/huggingface/datasets/blob/fedf891a08bfc77041d575fad6c26091bc0fce52/datasets/samsum/samsum.py#L106-L110
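A minimal sketch of that kind of explicit extraction, assuming the third-party `py7zr` package is installed. The helper names `extract_7z` and `split_file_paths` are hypothetical, and only `merged_train.json` is known from the traceback; the dev and test file names are placeholders.

```python
import os


def extract_7z(archive_path, output_dir):
    """Extract every member of a .7z archive into `output_dir`.

    Returns the member names found in the archive. Hypothetical helper;
    assumes `py7zr` is available (pip install py7zr).
    """
    import py7zr  # third-party; imported lazily so the module loads without it

    os.makedirs(output_dir, exist_ok=True)
    with py7zr.SevenZipFile(archive_path, mode="r") as archive:
        names = archive.getnames()
        archive.extractall(path=output_dir)
    return names


def split_file_paths(extracted_dir):
    """Map each split to its JSON file inside the extracted directory."""
    # `merged_train.json` appears in the traceback; the dev/test names
    # are assumptions made for this example.
    filenames = {
        "train": "merged_train.json",
        "dev": "merged_dev.json",
        "test": "merged_test.json",
    }
    return {split: os.path.join(extracted_dir, name) for split, name in filenames.items()}
```

In a dataset script, this would likely mean calling `dl_manager.download` instead of `download_and_extract` in `_split_generators`, extracting the returned path with a helper like the one above, and passing the resulting file paths to each `SplitGenerator` through `gen_kwargs`.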
Hi @bhavitvyamalik, thanks for your investigation.
On Monday, I started a PR that will eventually close this issue as well: I'm linking it to this.
Let me know what you think.