TFX StatisticsGen failing to decode image data
I've been trying to adapt the TFX Components.ipynb notebook to work with image data, but when I run StatisticsGen I get the following warning for each example:
WARNING:root:Feature "image/raw" has bytes value "None" which cannot be decoded as a UTF-8 string.
This is causing me two problems.
- It floods my Colab cell output causing the notebook to become unresponsive.
- I guess it’s not providing accurate statistics for the image data in my TFExamples.
After some digging, I found that the warning comes from _get_unicode_value() in tensorflow/data-validation.
I thought I might be encoding the image data incorrectly, so I tried an existing CIFAR-10 TFRecord from the tfx repo, but I encountered the same issue.
To reproduce:
I’ve added the heading ☹ ☹ ☹ ☹ WARNINGS OCCUR HERE ☹ ☹ ☹ ☹ above the offending cell.
Hi Folks,
The warnings you are seeing indicate that StatisticsGen is treating your raw image features as categorical string features. The image bytes are being decoded just fine. The issue is that when the stats (including the top-K examples) are written, the output proto expects a valid UTF-8 string but instead gets the raw image bytes. Nothing is wrong with your setup from what I can tell; this is an unintended side effect of a well-intentioned warning meant for the case where a categorical string feature can't be serialized. We'll look into finding a better default that handles image data more elegantly.
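To illustrate why raw image bytes trip this kind of check, here is a minimal standalone sketch. It is not TFDV's actual code: `is_valid_utf8` is a hypothetical helper, and the byte strings are made-up sample values. It only shows that typical image bytes (here, the start of a JPEG file) are not valid UTF-8, while an ordinary categorical string is.

```python
def is_valid_utf8(value: bytes) -> bool:
    """Return True if the bytes can be decoded as a UTF-8 string.

    Hypothetical helper for illustration only; not part of TFDV.
    """
    try:
        value.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A typical categorical string feature value, stored as bytes.
category_value = "sedan".encode("utf-8")

# The first bytes of a JPEG file. 0xFF can never appear in valid UTF-8.
jpeg_header = b"\xff\xd8\xff\xe0\x00\x10JFIF"

print(is_valid_utf8(category_value))  # True
print(is_valid_utf8(jpeg_header))     # False
```

Any bytes feature whose values fail this kind of decoding cannot be written into a string-valued statistics field, which would explain a warning being logged for every image example.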
In the meantime, to tell StatisticsGen that this feature is really an opaque blob, you can pass in a user-modified schema as described in the StatsGen docs. To generate this schema, run StatisticsGen and SchemaGen once (on a sample of the data) and then modify the inferred schema to annotate the image features. Here is a modified version of the colab from @tall-josh:
The additional steps are a bit verbose, but having a curated schema is often a good practice for other reasons. Here is the cell that I added to the notebook:
Hopefully you find this workaround useful. We'll also take a look at a better default experience for image-valued features.
Thank you for reaching out to us. This issue has been reported to the TFX team, and we will take a look.