[to_tf_dataset] Use Feather for better compatibility with TensorFlow ?
To have better performance in TensorFlow, it is important to provide lists of data files in supported formats. For example, sharded TFRecords datasets are extremely performant. This is because tf.data can better leverage parallelism in this case, and load one file at a time in memory.
It seems that using `tensorflow_io` we could have something similar for `to_tf_dataset` if we provide sharded Feather files: https://www.tensorflow.org/io/api_docs/python/tfio/arrow/ArrowFeatherDataset
Feather is a format almost equivalent to the Arrow IPC Stream format we’re using in `datasets`: Feather V2 is equivalent to Arrow IPC File format, which is an extension of the stream format (it has an extra footer). Therefore we could store datasets as Feather instead of Arrow IPC Stream format without breaking the whole library.
Here are a few points to explore:
- check the performance of ArrowFeatherDataset in tf.data
- check what would change if we needed to switch to Feather; in particular, check that the following still work: memory mapping, typing, writing, reading to python objects, etc.
We would also need to implement sharding when loading a dataset (this will be done anyway for #546)
cc @Rocketknight1 @gante feel free to comment in case I missed anything!
I’ll share some files and scripts, so that we can benchmark performance of Feather files with tf.data
Top GitHub Comments
This has so much potential to be great! Also I think you tagged some poor random dude on the internet whose name is also Joao, lol, edited that for you!
@lhoestq the latest `tfio-nightly` (https://pypi.org/project/tensorflow-io-nightly/) supports direct bytes type. Here’s a notebook demonstrating that: https://gist.github.com/sayakpaul/f7d5cc312cd01cb31098fad3fd9c6b59#file-feather-v2-tfio-ipynb
Among its steps, the notebook builds a `tf.data.Dataset` object out of Feather files.
Cc: @Rocketknight1 @gante