[to_tf_dataset] Use Feather for better compatibility with TensorFlow?

To get good performance in TensorFlow, it is important to provide lists of data files in supported formats. For example, sharded TFRecord datasets are extremely performant, because tf.data can better leverage parallelism in this case and load one file at a time into memory.

It seems that using tensorflow_io we could have something similar for to_tf_dataset if we provide sharded Feather files: https://www.tensorflow.org/io/api_docs/python/tfio/arrow/ArrowFeatherDataset

Feather is a format almost equivalent to the Arrow IPC Stream format we’re using in datasets: Feather V2 is equivalent to Arrow IPC File format, which is an extension of the stream format (it has an extra footer). Therefore we could store datasets as Feather instead of Arrow IPC Stream format without breaking the whole library.

Here are a few points to explore:

  • check the performance of ArrowFeatherDataset in tf.data
  • check what would change if we switched to Feather, in particular that the following still work: memory mapping, typing, writing, reading to Python objects, etc.

We would also need to implement sharding when loading a dataset (this will be done anyway for #546)

cc @Rocketknight1 @gante feel free to comment in case I missed anything!

I’ll share some files and scripts so that we can benchmark the performance of Feather files with tf.data.

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 48 (48 by maintainers)

Top GitHub Comments

4 reactions
Rocketknight1 commented, Jun 22, 2022

This has so much potential to be great! Also I think you tagged some poor random dude on the internet whose name is also Joao, lol, edited that for you!

2 reactions
sayakpaul commented, Aug 22, 2022

@lhoestq the latest tfio-nightly (https://pypi.org/project/tensorflow-io-nightly/) supports direct bytes type. Here’s a notebook demonstrating that: https://gist.github.com/sayakpaul/f7d5cc312cd01cb31098fad3fd9c6b59#file-feather-v2-tfio-ipynb.

The notebook does the following:

  • Prepares feather files for an image classification dataset representing the images as bytes.
  • Prepares an end-to-end tf.data.Dataset object out of those feather files.

Cc: @Rocketknight1 @gante
