TFX StatisticsGen failing to decode image data

I’ve been trying to adapt the TFX Components.ipynb to work with image data, but when I run the StatisticsGen I get the following warning for each example:

WARNING:root:Feature "image/raw" has bytes value "None" which cannot be decoded as a UTF-8 string.

This is causing me two problems:

  1. It floods my Colab cell output causing the notebook to become unresponsive.
  2. I guess it’s not providing accurate statistics for the image data in my TFExamples.

After some digging, I found that the warning comes from _get_unicode_value() in tensorflow/data-validation.

I thought maybe I was encoding the image data incorrectly, so I tried an existing CIFAR-10 TFRecord from the tfx repo, but I encountered the same issue.

To reproduce, open the linked Colab notebook. I’ve added the heading ☹ ☹ ☹ ☹ WARNINGS OCCUR HERE ☹ ☹ ☹ ☹ above the offending cell.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

embr commented, Apr 20, 2020 (1 reaction)

Hi Folks,

The warnings you are seeing indicate that StatisticsGen is trying to treat your raw image features like a categorical string feature. The image bytes are being decoded just fine. The issue is that when the stats (including top K examples) are being written, the output proto is expecting a UTF-8 valid string, but instead gets the raw image bytes. Nothing is wrong with your setups from what I can tell, but this is just an unintended side-effect of a well-intentioned warning in the event that you have a categorical string feature which can’t be serialized. We’ll look into finding a better default that handles image data more elegantly.
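To see why raw image bytes trip this warning, here is a minimal sketch using only the Python standard library. It is not the TFDV code path; the helper name `is_valid_utf8` and the sample bytes are made up for illustration. It only shows that typical image bytes are not a valid UTF-8 string, while an ordinary categorical value is.

```python
# The first bytes of a JPEG file: the SOI marker (0xFF 0xD8) followed by
# an APP0 segment header. The byte 0xFF can never appear in valid UTF-8.
jpeg_prefix = bytes([0xFF, 0xD8, 0xFF, 0xE0, 0x00, 0x10]) + b"JFIF\x00"


def is_valid_utf8(value: bytes) -> bool:
    """Hypothetical helper: True if `value` decodes cleanly as UTF-8."""
    try:
        value.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(b"cat"))        # True: an ordinary categorical string value
print(is_valid_utf8(jpeg_prefix))   # False: raw image bytes
```

Any feature whose values look like the second case cannot be stored in a field that requires a UTF-8 string, which is what the warning reports.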

In the meantime, to tell StatisticsGen that this feature is really an opaque blob, you can pass in a user-modified schema as described in the StatsGen docs. To generate this schema, you can run StatisticsGen and SchemaGen once (on a sample of data) and then modify the inferred schema to annotate the image features. Here is a modified version of the colab from @tall-josh:

Open In Colab

The additional steps are a bit verbose, but having a curated schema is often a good practice for other reasons. Here is the cell that I added to the notebook:

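import os

from google.protobuf import text_format
from tensorflow.python.lib.io import file_io
from tensorflow_metadata.proto.v0 import schema_pb2

# `tfx`, `schema_gen`, and `context` are assumed to be defined in earlier notebook cells.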

# Load autogenerated schema (using stats from small batch)

schema = tfx.utils.io_utils.SchemaReader().read(
    tfx.utils.io_utils.get_only_uri_in_dir(
        tfx.types.artifact_utils.get_single_uri(schema_gen.outputs['schema'].get())))

# Modify schema to indicate which string features are images.
# Ideally you would persist a golden version of this schema somewhere rather
# than regenerating it on every run.
for feature in schema.feature:
  if feature.name == 'image/raw':
    feature.image_domain.SetInParent()

# Write modified schema to local file
user_schema_dir = '/tmp/user-schema/'
tfx.utils.io_utils.write_pbtxt_file(
    os.path.join(user_schema_dir, 'schema.pbtxt'), schema)

# Create ImporterNode to make modified schema available to other components
user_schema_importer = tfx.components.ImporterNode(
    instance_name='import_user_schema',
    source_uri=user_schema_dir,
    artifact_type=tfx.types.standard_artifacts.Schema)

# Run the user schema ImporterNode
context.run(user_schema_importer)

Hopefully you find this workaround useful. In the meantime, we’ll take a look at a better default experience for image-valued features.
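Separately from the schema change, the flood of warnings in the notebook output can be muted with a standard `logging` filter. This is a rough sketch, assuming the warnings are emitted directly on the root logger (as the `WARNING:root:` prefix suggests). The names `DropUtf8DecodeWarnings` and `mute_utf8_decode_warnings` are hypothetical and not part of TFX or TFDV. It only hides the messages; it does not change how the statistics are computed.

```python
import logging

# Substring of the warning text quoted in the issue.
_NOISY_FRAGMENT = "cannot be decoded as a UTF-8 string"


class DropUtf8DecodeWarnings(logging.Filter):
    """Hypothetical filter that drops the UTF-8 decode warnings."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Returning False tells the logger to discard the record.
        return _NOISY_FRAGMENT not in record.getMessage()


def mute_utf8_decode_warnings(logger=None) -> logging.Filter:
    """Attach the filter to `logger` (the root logger by default)."""
    target = logger if logger is not None else logging.getLogger()
    noise_filter = DropUtf8DecodeWarnings()
    target.addFilter(noise_filter)
    return noise_filter


# Install before running the cell that produces the warnings.
installed_filter = mute_utf8_decode_warnings()
```

A filter attached to a logger applies only to records logged on that logger itself, so this would not catch the message if a library emitted it through a named child logger. Calling `logging.getLogger().removeFilter(installed_filter)` restores the original behaviour.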

nikelite commented, Apr 20, 2020 (1 reaction)

Thank you for reaching out to us. This issue is reported to the TFX team, and we will take a look.
