Data preprocessing tool using HDF5 or TFRecord
🚀 Feature
A subpackage or tool that uses HDF5 or TFRecord to preprocess data into a single file.
Motivation
In some fields, such as ASR or CV, relying on the plain PyTorch DataLoader is not ideal, because online data processing, such as computing fbank features (ASR) or applying transforms (CV), can slow training down. HDF5 or TFRecord can be a good choice to avoid IO and CPU bottlenecks, since the features are computed once and stored in a single file. I think it would be very helpful if our project had a subpackage or tool for this, covering both writing and reading. texar-pytorch already provides this kind of functionality, see:
https://texar-pytorch.readthedocs.io/en/latest/code/data.html#recorddata
Also, the dataloader utilities should be adapted to this, because reading such files may require an iterable dataset together with num_workers > 0 in the dataloader, and the missing dataset length can be a problem for the training process.
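To make the two concerns above concrete, here is a minimal sketch, assuming PyTorch's `IterableDataset` and `get_worker_info`. The class name `ShardedRecordDataset` and the in-memory list standing in for a record file are hypothetical. It shows one way to split records across workers and to expose a length when the record count is known at write time.

```python
from torch.utils.data import IterableDataset, get_worker_info


class ShardedRecordDataset(IterableDataset):
    """Hypothetical iterable dataset; a list stands in for a record file."""

    def __init__(self, records):
        self.records = list(records)

    def shard(self, worker_id, num_workers):
        # Worker k takes records k, k + num_workers, k + 2 * num_workers, ...
        return self.records[worker_id::num_workers]

    def __iter__(self):
        info = get_worker_info()  # None when running in the main process
        if info is None:
            return iter(self.records)
        return iter(self.shard(info.id, info.num_workers))

    def __len__(self):
        # Assumes the total record count is known, e.g. saved when writing.
        return len(self.records)


dataset = ShardedRecordDataset(range(10))
```

Passing this to `DataLoader(dataset, num_workers=2)` would give each worker a disjoint slice. Without the sharding step, every worker would iterate the full dataset and samples would be duplicated.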
Pitch
The link above can serve as an example, but there is still a need for writing and loading variable-length processed features (tensors with a shape like [1, sequence_length, feature_dim]) when using HDF5 (this can be a little complex).
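One possible way to handle variable-length features in HDF5 is to store each example as its own dataset, so no padding is needed. The sketch below assumes `h5py` and NumPy. The helper names, the `utt_{i}` key scheme, and the `num_records` attribute are hypothetical choices for illustration, not an existing API of this project.

```python
import os
import tempfile

import h5py
import numpy as np


def write_features(path, features):
    """Store each variable-length feature array as its own HDF5 dataset."""
    with h5py.File(path, "w") as f:
        for i, feat in enumerate(features):
            f.create_dataset(f"utt_{i}", data=feat)
        # Save the record count so a reader can report a dataset length.
        f.attrs["num_records"] = len(features)


def read_feature(path, index):
    with h5py.File(path, "r") as f:
        return f[f"utt_{index}"][()]  # [()] loads the full array into memory


def read_count(path):
    with h5py.File(path, "r") as f:
        return int(f.attrs["num_records"])


feature_dim = 80
rng = np.random.default_rng(0)
# Three fake utterances with different sequence lengths.
features = [
    rng.standard_normal((1, n, feature_dim)).astype(np.float32)
    for n in (50, 73, 120)
]

path = os.path.join(tempfile.mkdtemp(), "features.h5")
write_features(path, features)
restored = read_feature(path, 1)
```

Storing the record count as a file attribute is one way to address the missing-length problem mentioned in the motivation, since a reader can return it without scanning the file.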
I tried to write a small tool for this purpose, https://github.com/tongjinle123/tfrecord_builder, but when I used it in our project some months ago I found it hard to use directly, because the iterable dataset is hard to work with.
There are also some great tools for this purpose, such as: https://github.com/vahidk/tfrecord
Alternatives
Additional context
It would be very helpful if our project could take this into consideration. Please forgive my bad English : ) I hope I have fully expressed my idea.
Issue Analytics
- State:
- Created 3 years ago
- Reactions: 2
- Comments: 8 (4 by maintainers)
PyTorch XLA has first-party support for reading TFRecords now. We should just wrap that.
This issue has been automatically marked as stale because it hasn’t had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!