{SPAN} doesn't work as expected with GCS

System information

  • Have I specified the code to reproduce the issue (Yes, No): Yes
  • Environment in which the code is executed (e.g., Local(Linux/MacOS/Windows), Interactive Notebook, Google Cloud, etc): Vertex AI Pipeline, Vertex AI Notebook, GCS storage
  • TensorFlow version: 2.6
  • TFX Version: 1.2.0
  • Python version: 3.7
  • Python dependencies (from pip freeze output): None

Describe the current behavior

First of all, I have the CIFAR10 dataset at the following locations:

  • gs://cifar10-csp-public/cifar10/span-1/train/train.tfrecord
  • gs://cifar10-csp-public/cifar10/span-1/test/test.tfrecord

With ImportExampleGen defined as below, it fails to find the dataset from the specified path patterns.


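# Imports assumed from the usual TFX 1.x layout; the original post omitted them.
from tfx import v1 as tfx
from tfx.proto import example_gen_pb2

data_path = "gs://cifar10-csp-public"

# {SPAN} is the placeholder that ExampleGen is expected to resolve to a span number.
input_config = example_gen_pb2.Input(splits=[
    example_gen_pb2.Input.Split(name='train',
                                pattern='cifar10/span-{SPAN}/train/*'),
    example_gen_pb2.Input.Split(name='val',
                                pattern='cifar10/span-{SPAN}/test/*')
])

example_gen = tfx.components.ImportExampleGen(input_base=data_path, input_config=input_config)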

Inspecting the logs shows that it complains the files don’t exist.

OSError: No files found based on the file pattern gs://cifar10-csp-public/cifar10/span-{SPAN}/train/*
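
The error message still contains the literal {SPAN} placeholder, which suggests the pattern reached the file lookup without being substituted. The following is a minimal sketch of that effect using Python's standard fnmatch module; it is not the matching code that TFX or GCS actually uses, and the file list simply mirrors the paths given in this issue.

```python
import fnmatch

# File paths as listed in the issue.
files = [
    "gs://cifar10-csp-public/cifar10/span-1/train/train.tfrecord",
    "gs://cifar10-csp-public/cifar10/span-1/test/test.tfrecord",
]

# Pattern exactly as it appears in the error message.
raw_pattern = "gs://cifar10-csp-public/cifar10/span-{SPAN}/train/*"
# The same pattern after the placeholder is replaced with a concrete span.
resolved_pattern = raw_pattern.replace("{SPAN}", "1")

# Braces are ordinary characters for fnmatch, so "{SPAN}" only matches itself.
unresolved_matches = [f for f in files if fnmatch.fnmatchcase(f, raw_pattern)]
resolved_matches = [f for f in files if fnmatch.fnmatchcase(f, resolved_pattern)]

print(unresolved_matches)
print(resolved_matches)
```

In this sketch the unsubstituted pattern matches no file, while the substituted one matches the train file, which is consistent with the OSError above.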

Describe the expected behavior

The expected behaviour is that ImportExampleGen correctly retrieves the data when {SPAN} is specified in the pattern. Since it didn’t work as expected, I tried the code below, which resolves the span manually first.

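# Imports assumed from the usual TFX 1.x layout; the original post omitted them.
from tfx import v1 as tfx
from tfx.components.example_gen import utils
from tfx.proto import example_gen_pb2

data_path = "gs://cifar10-csp-public"

splits = [
    example_gen_pb2.Input.Split(name='train', pattern='span-{SPAN}/train/*'),
    example_gen_pb2.Input.Split(name='val', pattern='span-{SPAN}/test/*')
]
# Resolve the span (and version) up front; the fingerprint is not needed here.
_, span, version = utils.calculate_splits_fingerprint_span_and_version(data_path, splits)

# Build the input config with the concrete span number substituted into the patterns.
input_config = example_gen_pb2.Input(splits=[
    example_gen_pb2.Input.Split(name='train', pattern=f'span-{span}/train/*'),
    example_gen_pb2.Input.Split(name='val', pattern=f'span-{span}/test/*')
])

example_gen = tfx.components.ImportExampleGen(input_base=data_path, input_config=input_config)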

With the utility function calculate_splits_fingerprint_span_and_version, it now works fine. However, I wonder why it didn’t work in the first place. Doesn’t ImportExampleGen use the calculate_splits_fingerprint_span_and_version function internally?
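
To make the idea of span resolution concrete, here is a small pure-Python sketch of picking the latest span from a list of paths. It is only an illustration under stated assumptions: resolve_latest_span is a hypothetical helper, the span-2 and span-10 paths are made up for the example, and the real TFX utility does more than this (for instance, it also returns a fingerprint and a version).

```python
import re


def resolve_latest_span(paths, pattern):
    """Return the largest span number among paths matching pattern, or None.

    Hypothetical helper for illustration; not part of TFX.
    """
    # Escape the whole pattern, then turn the two placeholders back into regex.
    regex = re.escape(pattern)
    regex = regex.replace(re.escape("{SPAN}"), r"(?P<span>\d+)")
    regex = regex.replace(re.escape("*"), r".*")

    spans = []
    for path in paths:
        match = re.fullmatch(regex, path)
        if match:
            spans.append(int(match.group("span")))
    return max(spans) if spans else None


# Only span-1 comes from the issue; the other spans are invented for the example.
paths = [
    "cifar10/span-1/train/train.tfrecord",
    "cifar10/span-2/train/train.tfrecord",
    "cifar10/span-10/train/train.tfrecord",
]
pattern = "cifar10/span-{SPAN}/train/*"

latest = resolve_latest_span(paths, pattern)
concrete_pattern = pattern.replace("{SPAN}", str(latest))
print(latest, concrete_pattern)
```

Spans are compared as integers rather than strings, so span-10 counts as later than span-2. The resulting concrete pattern plays the same role as the f-string patterns in the workaround above.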

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 73 (3 by maintainers)

Top GitHub Comments

1025KB commented, Oct 6, 2021 (3 reactions)

FYI, this PR should fix {SPAN} for Vertex (KubeflowV2DagRunner): https://github.com/tensorflow/tfx/pull/4347
