Trouble with streaming frgfm/imagenette vision dataset with TAR archive

Link

https://huggingface.co/datasets/frgfm/imagenette

Description

Hello there 👋

Thanks for the amazing work you’ve done with HF Datasets! I’ve just started playing with it, and managed to upload my first dataset. But for the second one, I’m having trouble with the preview since there is some archive extraction involved 😅

Basically, I get a:

Status code:   400
Exception:     NotImplementedError
Message:       Extraction protocol for TAR archives like 'https://s3.amazonaws.com/fast-ai-imageclas/imagenette2.tgz' is not implemented in streaming mode. Please use `dl_manager.iter_archive` instead.

I’ve tried several things and checked this issue https://github.com/huggingface/datasets/issues/4181 as well, but no luck so far!

Could you point me in the right direction please? 🙏

Owner

Yes

Issue Analytics

Created a year ago
Comments: 5 (3 by maintainers)

Top GitHub Comments

3 reactions

frgfm commented, Jul 25, 2022

Hi @albertvillanova 👋

Thanks! Since my last message, I went through the food101 loading script (https://huggingface.co/datasets/food101/blob/main/food101.py) and managed to get it to work in the end 🙏

Here it is: https://huggingface.co/datasets/frgfm/imagenette

I appreciate you opening an issue to document the process; it might help a few people!

2 reactions

albertvillanova commented, Jul 21, 2022

Hi @frgfm, streaming a dataset that contains a TAR file requires some tweaks because, unlike a ZIP file, a TAR archive does not allow random access to its member files. Instead, they have to be accessed sequentially (in the order in which they were added when the TAR file was created) and yielded one by one.
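The sequential-access constraint can be reproduced with the Python standard library alone. The following is a minimal sketch, not the `datasets` implementation: it opens an in-memory archive in `tarfile` stream mode (`"r|*"`), which only allows reading forward. The file names and the helper names `build_tar` and `stream_members` are made up for illustration.

```python
import io
import tarfile


def build_tar(files):
    """Build an in-memory TAR archive from (name, bytes) pairs, in the given order."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, data in files:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer


def stream_members(fileobj):
    """Yield (name, bytes) for each regular file, strictly in archive order."""
    # "r|*" opens the archive as a forward-only stream: members cannot be
    # looked up by name or revisited, only read as they come.
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if member.isfile():
                yield member.name, tar.extractfile(member).read()


# Hypothetical file names; the order here is the order written to the archive.
archive = build_tar([("train/cat/1.jpg", b"aaa"), ("train/dog/2.jpg", b"bbb")])
members = list(stream_members(archive))
names = [name for name, _ in members]
print(names)
```

Members come back in the order they were written, which is why the position of any metadata file inside the archive matters so much in streaming mode.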

So when iterating over the TAR file content, each time an image file is found we need to yield it right away rather than keep it in memory, which would require a huge amount of RAM for large datasets. But when yielding an image file, we also need to yield what we call its “metadata” along with it: the class label and other textual information (for audio files, for example, we sometimes also add info such as the speaker ID, sex, or age).

All this information is usually stored in what we call the metadata file: either a JSON or a CSV/TSV file.

But if this file is also inside the TAR archive, we need to encounter it first when iterating over the archive, so that we already have this information when we reach an image file and can yield the image file together with its metadata.
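As a rough sketch of that logic: `dl_manager.iter_archive` (the method named in the error message) yields `(path, file_obj)` pairs in archive order, so the generator below accepts any iterable of that shape. Everything else is an assumption for illustration: the `metadata.csv` name, its `file_name,label` columns, the `.jpg` filter, and the `iter_tar` helper that stands in for `iter_archive` using only the standard library.

```python
import csv
import io
import tarfile


def generate_examples(files, metadata_name="metadata.csv"):
    """Yield (key, example) pairs from an iterable of (path, file_obj) pairs.

    The metadata file name and its columns are hypothetical.
    """
    labels = None
    for path, file_obj in files:
        if path == metadata_name:
            # The metadata must be seen before any image it describes.
            text = file_obj.read().decode("utf-8")
            labels = {row["file_name"]: row["label"] for row in csv.DictReader(io.StringIO(text))}
        elif path.endswith(".jpg"):
            if labels is None:
                raise ValueError("metadata must come before the first image in the archive")
            # Yield immediately so only one image is held in memory at a time.
            yield path, {"image": {"path": path, "bytes": file_obj.read()}, "label": labels[path]}


def iter_tar(fileobj):
    """Stand-in for `dl_manager.iter_archive`: stream (path, file_obj) pairs."""
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if member.isfile():
                yield member.name, tar.extractfile(member)


def add_member(tar, name, data):
    """Append one file to an open TAR archive."""
    info = tarfile.TarInfo(name=name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))


# Toy archive with the metadata file written first.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as tar:
    add_member(tar, "metadata.csv", b"file_name,label\na.jpg,cat\nb.jpg,dog\n")
    add_member(tar, "a.jpg", b"\xff\xd8a")
    add_member(tar, "b.jpg", b"\xff\xd8b")
buffer.seek(0)

examples = list(generate_examples(iter_tar(buffer)))
```

If the metadata file were written after the images instead, this generator would raise on the first image, which is the situation the rest of the comment addresses.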

Therefore:

- either the TAR archive contains the metadata file as the first member when iterating it (something we cannot change, as it is fixed at the creation of the TAR file)
- or, if not, we need to have the metadata file elsewhere; in these cases, what we do (if the dataset license allows it) is:
  - we download the TAR file locally, extract the metadata file and host the metadata on the Hub
  - we modify the dataset loading script so that it first downloads the metadata file (and reads it) and only then starts iterating over the content of the TAR archive file
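The second option can be sketched as follows, again with the standard library only. The separately hosted metadata is read in full first, and the archive is streamed afterwards. The JSON layout (file name mapped to label), the file names, and the `iter_tar` stand-in for `dl_manager.iter_archive` are assumptions made for this example, not what any particular loading script does.

```python
import io
import json
import tarfile


def iter_tar(fileobj):
    """Stand-in for `dl_manager.iter_archive`: stream (path, file_obj) pairs."""
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if member.isfile():
                yield member.name, tar.extractfile(member)


def generate_examples(metadata_file, files):
    """Read the hosted metadata first, then yield examples while streaming."""
    # Step 1: the metadata file is small, so it is read in full up front.
    labels = json.load(metadata_file)
    # Step 2: only then iterate over the archive, yielding images as they appear.
    for path, file_obj in files:
        if path in labels:
            yield path, {"image": {"path": path, "bytes": file_obj.read()}, "label": labels[path]}


# Toy archive whose own metadata sits at the end, too late to be useful.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as tar:
    for name, data in [("a.jpg", b"img-a"), ("b.jpg", b"img-b"), ("labels.json", b"{}")]:
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
buffer.seek(0)

# Hypothetical copy of the metadata, hosted outside the archive.
hosted_metadata = io.StringIO(json.dumps({"a.jpg": "cat", "b.jpg": "dog"}))
examples = list(generate_examples(hosted_metadata, iter_tar(buffer)))
```

Because the labels are known before the first image arrives, the order of members inside the archive no longer matters.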

See an example of this process we recently did for “google/fleurs” (their metadata files for “train” were at the end of the TAR archives, after all audio files): https://huggingface.co/datasets/google/fleurs/discussions/4

- we uploaded the metadata file to the Hub
- we adapted the loading script to use it
