Trouble with streaming frgfm/imagenette vision dataset with TAR archive
See original GitHub issue:
https://huggingface.co/datasets/frgfm/imagenette
Description
Hello there 👋
Thanks for the amazing work you’ve done with HF Datasets! I’ve just started playing with it, and managed to upload my first dataset. But for the second one, I’m having trouble with the preview since there is some archive extraction involved 😅
Basically, I get a:
Status code: 400
Exception: NotImplementedError
Message: Extraction protocol for TAR archives like 'https://s3.amazonaws.com/fast-ai-imageclas/imagenette2.tgz' is not implemented in streaming mode. Please use `dl_manager.iter_archive` instead.
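For context, the error means the script cannot use `dl_manager.download_and_extract` on the TAR URL in streaming mode; the archive has to be consumed sequentially. A minimal sketch of that sequential access, using only Python's stdlib `tarfile` module (`iter_tar_members` is a hypothetical stand-in for what `dl_manager.iter_archive` yields, and the in-memory archive is fabricated just to make the example runnable):

```python
import io
import tarfile

def iter_tar_members(fileobj):
    """Yield (path, bytes) pairs sequentially, roughly the way
    dl_manager.iter_archive does (hypothetical stand-in)."""
    # "r|*" opens the archive as a forward-only stream with
    # transparent compression: no seeking, no random access.
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if member.isfile():
                yield member.name, tar.extractfile(member).read()

# Build a tiny in-memory .tgz just to demonstrate the iteration order.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    for name, data in [("train/dog/1.jpeg", b"\xff\xd8fake"),
                       ("train/cat/2.jpeg", b"\xff\xd8fake")]:
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
buf.seek(0)

names = [path for path, _ in iter_tar_members(buf)]
print(names)  # members come back in archive order
```

Note that members come back strictly in the order they were written to the archive, which is exactly the constraint the error message is pointing at.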
I’ve tried several things and checked this issue https://github.com/huggingface/datasets/issues/4181 as well, but no luck so far!
Could you point me in the right direction please? 🙏
Issue Analytics
- Created: a year ago
- Comments: 5 (3 by maintainers)
Top GitHub Comments
Hi @albertvillanova 👋
Thanks! Since my last message, I went through https://huggingface.co/datasets/food101/blob/main/food101.py and managed to get it working in the end 🙏
Here it is: https://huggingface.co/datasets/frgfm/imagenette
I appreciate you opening an issue to document the process, it might help a few!
Hi @frgfm, streaming a dataset that contains a TAR file requires some tweaks because, contrary to ZIP files, a TAR archive does not allow random access to its member files. Instead, the members have to be accessed sequentially (in the order in which they were added when the archive was created) and yielded.
So when iterating over the TAR file content, each time an image file is found we need to yield it (and not keep it in memory, which would require a huge amount of RAM for large datasets). But when yielding an image file, we also need to yield its “metadata” with it: the class label and other textual information (for audio files, for example, we sometimes also add info such as the speaker ID, their sex, their age, …).
All this information is usually stored in what we call the metadata file: either a JSON or a CSV/TSV file.
But if this file is also inside the TAR archive, we need to find it first when iterating over the archive, so that we already have the metadata at hand when we reach an image file and can yield the image together with it.
Therefore, the order of the files inside the archive matters, and the loading script has to account for it.
See an example of this process we recently did for “google/fleurs” (its metadata files for “train” were at the end of the TAR archives, after all the audio files): https://huggingface.co/datasets/google/fleurs/discussions/4
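When the metadata file sits at the end of the archive, as in the FLEURS case, one way to cope (a sketch, not the actual google/fleurs implementation) is to iterate the archive twice: a first pass just to collect the labels, and a second to yield the images. The `open_archive` callable below is a hypothetical stand-in for re-opening the stream, which `dl_manager.iter_archive` effectively does when called again:

```python
import io
import tarfile

def two_pass_examples(open_archive):
    labels = {}
    # Pass 1: scan forward to the metadata file at the end of the archive.
    with tarfile.open(fileobj=open_archive(), mode="r|*") as tar:
        for member in tar:
            if member.name == "labels.tsv":
                lines = tar.extractfile(member).read().decode().splitlines()
                for line in lines:
                    name, label = line.split("\t")
                    labels[name] = label
    # Pass 2: re-open the stream and yield each image with its known label.
    with tarfile.open(fileobj=open_archive(), mode="r|*") as tar:
        for member in tar:
            if member.name.endswith(".jpeg"):
                yield member.name, labels.get(member.name)

def make_archive():
    # Hypothetical layout: images first, metadata last (the awkward case).
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in [("1.jpeg", b"x"), ("2.jpeg", b"y"),
                           ("labels.tsv", b"1.jpeg\tdog\n2.jpeg\tcat")]:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

data = make_archive()
pairs = list(two_pass_examples(lambda: io.BytesIO(data)))
print(pairs)
```

The cost is reading the whole archive twice, which is why metadata placed at the start of the archive makes streaming much more efficient.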