data preprocess tool using hdf5 or tfrecord

🚀 Feature

A subpackage or tool using hdf5 or tfrecord to preprocess data into one single file.

Motivation

In some fields, such as ASR or CV, relying only on the PyTorch DataLoader is not ideal, because processing data online can slow training down: for example, computing fbank features (ASR) or applying transforms (CV). Preprocessing the data into HDF5 or TFRecord files can be a good way to avoid IO and CPU bottlenecks. I think it would be very helpful for our project to have a subpackage or tool that does this, covering both writing and reading. texar-pytorch already provides such a feature, see: https://texar-pytorch.readthedocs.io/en/latest/code/data.html#recorddata

Also, the dataloader utils should be adapted to this, because it may require an iterable dataset together with num_workers > 0 in the DataLoader, and the missing dataset length can be a problem for the training process.
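To illustrate the two concerns above, here is a minimal sketch in plain Python, not taken from the issue or from any library. It assumes that each worker reads the same record stream and keeps every num_workers-th record. The helper names shard_for_worker and shard_length are hypothetical. In PyTorch, worker_id and num_workers would typically come from torch.utils.data.get_worker_info() inside an IterableDataset.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def shard_for_worker(records: Iterable, worker_id: int, num_workers: int) -> Iterator:
    """Yield every num_workers-th record, starting at offset worker_id.

    Hypothetical helper: without a split like this, each worker would
    replay the whole stream and every record would be seen num_workers times.
    """
    return islice(records, worker_id, None, num_workers)


def shard_length(total: int, worker_id: int, num_workers: int) -> int:
    """Number of records shard_for_worker yields for a stream of `total` records.

    Assumes `total` was stored when the file was written, since a plain
    record stream does not know its own length.
    """
    if total <= worker_id:
        return 0
    return (total - worker_id + num_workers - 1) // num_workers


records = list(range(10))  # stand-in for records read from a single file
shards: List[List[int]] = [list(shard_for_worker(records, w, 3)) for w in range(3)]
```

With a split like this, every record is read by exactly one worker, and saving the record count at write time is one possible way to give the training loop a dataset length.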

Pitch

The link above can serve as an example, but there is still a need for writing and loading variable-length processed features (tensor dims like [1, sequence_length, feature_dim]) with HDF5, which can be a little complex.
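As a rough illustration of the variable-length problem, the sketch below stores features of shape [sequence_length, feature_dim] in a single file. It deliberately uses a simple length-prefixed binary layout built with the standard library, not real HDF5 or TFRecord. The layout and the names HEADER, write_feature and read_features are assumptions made for this example only.

```python
import io
import struct
from typing import BinaryIO, Iterator, List

# Assumed layout per record: sequence_length and feature_dim as
# little-endian uint32, followed by the values as float32.
HEADER = struct.Struct("<II")


def write_feature(f: BinaryIO, feature: List[List[float]]) -> None:
    """Append one [sequence_length, feature_dim] feature to the file."""
    seq_len = len(feature)
    feat_dim = len(feature[0]) if seq_len else 0
    f.write(HEADER.pack(seq_len, feat_dim))
    flat = [x for row in feature for x in row]
    f.write(struct.pack(f"<{len(flat)}f", *flat))


def read_features(f: BinaryIO) -> Iterator[List[List[float]]]:
    """Yield features one by one until the end of the file."""
    while True:
        header = f.read(HEADER.size)
        if not header:
            return
        seq_len, feat_dim = HEADER.unpack(header)
        n = seq_len * feat_dim
        flat = struct.unpack(f"<{n}f", f.read(4 * n))
        yield [list(flat[i * feat_dim:(i + 1) * feat_dim]) for i in range(seq_len)]


buffer = io.BytesIO()  # stand-in for a file opened in binary mode
features = [[[1.0, 2.0]], [[0.5, 1.5], [2.5, 3.5], [4.5, 5.5]]]
for feat in features:
    write_feature(buffer, feat)
buffer.seek(0)
restored = list(read_features(buffer))
```

Because each record carries its own shape, sequences of different lengths can sit side by side in one file. The trade-off is that the file can only be read sequentially, which is why such data usually ends up behind an iterable dataset.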

I tried to write a small tool for this purpose, https://github.com/tongjinle123/tfrecord_builder, but when I used it in our project some months ago, I found it hard to use directly because the iterable dataset is hard to work with.

Also, there are some awesome tools for this purpose, such as https://github.com/vahidk/tfrecord

Alternatives

Additional context

It would be very helpful if our project could take this into consideration. Please forgive my bad English : ) I hope I have fully expressed my idea.

Issue Analytics

Created 3 years ago
Reactions: 2
Comments: 8 (4 by maintainers)

Top GitHub Comments

Skylion007 commented, Mar 25, 2020 (1 reaction)

PyTorch XLA has first-party support for reading TFRecords now. We should just wrap that.

stale[bot] commented, Mar 25, 2021

This issue has been automatically marked as stale because it hasn’t had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!
