Trouble with streaming frgfm/imagenette vision dataset with TAR archive

Link

https://huggingface.co/datasets/frgfm/imagenette

Description

Hello there 👋

Thanks for the amazing work you’ve done with HF Datasets! I’ve just started playing with it, and managed to upload my first dataset. But for the second one, I’m having trouble with the preview since there is some archive extraction involved 😅

Basically, I get a:

Status code:   400
Exception:     NotImplementedError
Message:       Extraction protocol for TAR archives like 'https://s3.amazonaws.com/fast-ai-imageclas/imagenette2.tgz' is not implemented in streaming mode. Please use `dl_manager.iter_archive` instead.

I’ve tried several things and checked this issue https://github.com/huggingface/datasets/issues/4181 as well, but no luck so far!

Could you point me in the right direction please? 🙏

Owner

Yes

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

3 reactions
frgfm commented, Jul 25, 2022

Hi @albertvillanova 👋

Thanks! Since my last message, I went through the loading script at https://huggingface.co/datasets/food101/blob/main/food101.py and managed to get it to work in the end 🙏

Here it is: https://huggingface.co/datasets/frgfm/imagenette

I appreciate you opening an issue to document the process, it might help a few!

2 reactions
albertvillanova commented, Jul 21, 2022

Hi @frgfm, streaming a dataset that contains a TAR file requires some tweaks because (unlike a ZIP file) a TAR archive does not allow random access to its member files. Instead, they have to be accessed sequentially (in the order in which they were added to the TAR file when it was created) and yielded.
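To make the sequential-access point concrete, here is a minimal sketch using only Python's standard `tarfile` module (not the HF Datasets internals). The archive is built in memory and the member names are made up for illustration. Opening it with the stream mode `"r|*"` gives a forward-only reader, so members can only be consumed in the order they were written.

```python
import io
import tarfile


def build_demo_archive() -> bytes:
    """Create a small gzip-compressed TAR in memory (hypothetical member names)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name, payload in [("train/cat/1.jpg", b"aaa"), ("train/dog/2.jpg", b"bbbb")]:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def iter_members(archive_bytes: bytes):
    """Yield (path, content) pairs in the order they were written to the archive."""
    # "r|*" opens the archive as a forward-only stream: no seeking back to earlier members.
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r|*") as tar:
        for member in tar:
            if member.isfile():
                # Each member must be read before advancing to the next one.
                yield member.name, tar.extractfile(member).read()


members = list(iter_members(build_demo_archive()))
```

Each member is read and handed off as soon as it is encountered, which is the same constraint a streaming loader has to work under.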

So when iterating over the TAR file content, each time an image file is found we need to yield it right away (rather than keeping it in memory, which would require a huge amount of RAM for large datasets). But when yielding an image file, we also need to yield its “metadata” along with it: the class label and other textual information (for audio files, for example, we sometimes also add info such as the speaker ID, their sex, their age, …).

All this information is usually stored in what we call the metadata file: either a JSON or a CSV/TSV file.

But if this file is also inside the TAR archive, we need to come across it first when iterating over the archive, so that we already have this information when we find an image file and can yield the image file together with its metadata.

Therefore:

  • either the TAR archive contains the metadata file as the first member when iterating it (something we cannot change as it is done at the creation of the TAR file)
  • or if not, then we need to have the metadata file elsewhere
    • in these cases, what we do (if the dataset license allows it) is:
      • we download the TAR file locally, we extract the metadata file and we host the metadata on the Hub
      • we modify the dataset loading script so that it first downloads the metadata file (and reads it) and only then starts iterating the content of the TAR archive file
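The "metadata first, then iterate the archive" approach described above could be sketched roughly as follows. This is a plain-Python illustration, not the actual loading script. It assumes the archive iterator yields `(path, file_object)` pairs, which is the shape `dl_manager.iter_archive` (named in the error message) produces. The CSV columns `path` and `label`, the function name `generate_examples`, and the example dict layout are hypothetical.

```python
import csv
import io


def generate_examples(metadata_csv: str, archive_iter):
    """Yield (key, example) pairs, pairing each archive member with its metadata.

    `metadata_csv` stands in for the separately hosted metadata file, and
    `archive_iter` for the sequential (path, file_object) iterator over the TAR.
    """
    # Step 1: read the whole metadata file before touching the archive.
    labels = {row["path"]: row["label"] for row in csv.DictReader(io.StringIO(metadata_csv))}

    # Step 2: walk the archive once, yielding each image as soon as it is seen.
    for idx, (path, file_obj) in enumerate(archive_iter):
        if path in labels:  # skip members that have no metadata entry (e.g. a README)
            yield idx, {"image": {"path": path, "bytes": file_obj.read()}, "label": labels[path]}


# Simulated inputs; in a real script these would come from the download manager.
demo_metadata = "path,label\na.jpg,cat\nb.jpg,dog\n"
demo_archive = [
    ("a.jpg", io.BytesIO(b"x")),
    ("README", io.BytesIO(b"")),
    ("b.jpg", io.BytesIO(b"yy")),
]
examples = list(generate_examples(demo_metadata, demo_archive))
```

Because the label lookup is ready before the first image is seen, nothing from the archive needs to be buffered in memory while waiting for the metadata.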

See an example of this process we recently did for “google/fleurs” (their metadata files for “train” were at the end of the TAR archives, after all audio files): https://huggingface.co/datasets/google/fleurs/discussions/4

  • we uploaded the metadata file to the Hub
  • we adapted the loading script to use it

