Can't extract files from `.7z` zipfile using `download_and_extract`

Describe the bug

I’m adding a new dataset that is a .7z archive hosted on Google Drive and contains 3 JSON files. I’m able to download it using download_and_extract, but after the download it throws this error:

>>> dataset = load_dataset("./datasets/mantis/")
Using custom data configuration default
Downloading and preparing dataset mantis/default to /Users/bhavitvyamalik/.cache/huggingface/datasets/mantis/default/1.1.0/611affa804ec53e2055a335cc1b8b213bb5a0b5142d919967729d5ee23c6bab4...
Downloading data: 100%|██████████| 77.2M/77.2M [00:23<00:00, 3.28MB/s]
/Users/bhavitvyamalik/.cache/huggingface/datasets/downloads/fc3d70123c9de8407587a59aa426c37819cf2bf016795d33270e8a1d558a34e6
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/bhavitvyamalik/Desktop/work/hf/datasets/src/datasets/load.py", line 1745, in load_dataset
    use_auth_token=use_auth_token,
  File "/Users/bhavitvyamalik/Desktop/work/hf/datasets/src/datasets/builder.py", line 595, in download_and_prepare
    dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
  File "/Users/bhavitvyamalik/Desktop/work/hf/datasets/src/datasets/builder.py", line 690, in _download_and_prepare
    ) from None
OSError: Cannot find data file. 
Original error:
[Errno 20] Not a directory: '/Users/bhavitvyamalik/.cache/huggingface/datasets/downloads/fc3d70123c9de8407587a59aa426c37819cf2bf016795d33270e8a1d558a34e6/merged_train.json'

This happens just before generating the splits. I checked the fc3d70123c9de8407587a59aa426c37819cf2bf016795d33270e8a1d558a34e6 file and it’s a 7z archive (the same as the file downloaded from Google Drive), which means it didn’t get extracted. Do I need to extract it separately and then pass the paths for the train, dev, and test files to SplitGenerator?
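
One way to confirm that the cached download is still a raw 7z archive is to look at its first six bytes, which for the 7z format are the signature 37 7A BC AF 27 1C. The sketch below uses only the standard library; the helper name is_7z_archive is made up for illustration and is not part of datasets. It also fits the traceback: "[Errno 20] Not a directory" is what the OS reports when a path such as .../fc3d.../merged_train.json treats a regular file as a directory.

```python
import os

# Signature found at the start of every 7z archive: '7', 'z', 0xBC, 0xAF, 0x27, 0x1C
SEVEN_ZIP_MAGIC = b"7z\xbc\xaf\x27\x1c"


def is_7z_archive(path):
    """Return True if `path` is a regular file that starts with the 7z signature.

    Hypothetical helper for illustration; not part of the datasets library.
    """
    if not os.path.isfile(path):
        # Directories (e.g. an already extracted folder) are not archives.
        return False
    with open(path, "rb") as f:
        return f.read(len(SEVEN_ZIP_MAGIC)) == SEVEN_ZIP_MAGIC
```

If this returns True for the path under ~/.cache/huggingface/datasets/downloads/, the file was downloaded but not extracted.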

Environment info

  • datasets version: 1.18.4.dev0
  • Platform: Darwin-19.6.0-x86_64-i386-64bit
  • Python version: 3.7.8
  • PyArrow version: 5.0.0

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

2 reactions
albertvillanova commented, Jul 11, 2022

Hi @bhavitvyamalik, thanks for reporting.

Yes, currently we do not support 7zip archive compression: I think we should.

As a workaround, you could uncompress it explicitly, as is done in, e.g., the samsum dataset:

https://github.com/huggingface/datasets/blob/fedf891a08bfc77041d575fad6c26091bc0fce52/datasets/samsum/samsum.py#L106-L110
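
A minimal sketch of what explicit extraction could look like, assuming the third-party py7zr package is installed (pip install py7zr). It is not necessarily what the linked script does line for line. The helper names extract_7z and split_paths are hypothetical. Of the file names in the usage note, only merged_train.json appears in the traceback; the dev and test names are placeholders.

```python
import os


def extract_7z(archive_path, out_dir):
    """Extract a .7z archive into `out_dir` and return `out_dir`."""
    # Imported lazily so this module still loads where py7zr is not installed.
    import py7zr  # third-party dependency (assumption: available in your env)

    os.makedirs(out_dir, exist_ok=True)
    with py7zr.SevenZipFile(archive_path, mode="r") as archive:
        archive.extractall(path=out_dir)
    return out_dir


def split_paths(extract_dir, filenames):
    """Map each split name to its extracted file, failing early if one is missing."""
    paths = {}
    for split, filename in filenames.items():
        path = os.path.join(extract_dir, filename)
        if not os.path.isfile(path):
            raise FileNotFoundError(f"{filename} not found in {extract_dir}")
        paths[split] = path
    return paths
```

Inside _split_generators, this could be used by calling dl_manager.download(url) instead of download_and_extract, passing the returned path to extract_7z, and then passing split_paths(extract_dir, {"train": "merged_train.json", "dev": "<dev file>", "test": "<test file>"}) on to each SplitGenerator through gen_kwargs.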

1 reaction
albertvillanova commented, Jul 13, 2022

Hi @bhavitvyamalik, thanks for your investigation.

On Monday, I started a PR that will eventually close this issue as well: I’m linking it to this.

Let me know what you think.
