Model consuming RaggedTensors fails during evaluation in a distributed setting

Please go to TF Forum for help and support:

https://discuss.tensorflow.org/tag/keras

If you open a GitHub issue, here is our policy:

It must be a bug, a feature request, or a significant problem with the documentation (for small docs fixes please send a PR instead). The form below must be filled out.

Here’s why we have that policy:

Keras developers respond to issues. We want to focus on work that benefits the whole community, e.g., fixing bugs and adding features. Support only helps individuals. GitHub also notifies thousands of people when issues are filed. We want them to see you communicating an interesting problem, rather than being redirected to Stack Overflow.

System information.

  • Have I written custom code (as opposed to using a stock example script provided in Keras): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Colab and Debian 10
  • TensorFlow installed from (source or binary): Binary
  • TensorFlow version (use command below): 2.6.0
  • Python version:
  • Bazel version (if compiling from source):
  • GPU model and memory: V100 (16 GB)
  • Exact command to reproduce:

You can collect some of this information using our environment capture script:

https://github.com/tensorflow/tensorflow/tree/master/tools/tf_env_collect.sh

You can obtain the TensorFlow version with: python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"

Describe the problem.

We have a model that consumes multiple ragged tensors in a batch. Our model runs perfectly fine on a single GPU. But the moment we introduce distributed training, its evaluation fails.

Note that training in the distributed setting proceeds smoothly; it is only during evaluation that the model fails. Since we cannot provide the original data and model, we are providing a minimal snippet in the notebook linked below that reproduces the issue. The issue can be reproduced on Colab as well as on a multi-GPU machine; we have verified it on both and it persists.
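To make the setup concrete, here is a minimal sketch of the pattern in question (illustrative only: the model, data, and layer choices here are not from the original notebook, and the actual reproduction depends on the ragged feature's rank being unknown, as happens after deserialization in the linked notebook):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

# A ragged feature: a batch of variable-length integer sequences.
features = tf.ragged.constant([[1, 2, 3], [4], [5, 6], [7, 8, 9, 1]],
                              dtype=tf.int32)
labels = tf.constant([0.0, 1.0, 0.0, 1.0])
dataset = tf.data.Dataset.from_tensor_slices((features, labels)).apply(
    tf.data.experimental.dense_to_ragged_batch(batch_size=2))

with strategy.scope():
    inputs = tf.keras.Input(shape=(None,), ragged=True, dtype=tf.int32)
    x = tf.keras.layers.Embedding(input_dim=16, output_dim=8)(inputs)
    x = tf.keras.layers.LSTM(8)(x)  # consumes the ragged sequence
    outputs = tf.keras.layers.Dense(1)(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")

model.fit(dataset, epochs=1)  # distributed training proceeds smoothly
model.evaluate(dataset)       # evaluation is where the reported failure appears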

Describe the current behavior.

Model consuming RaggedTensors fails during evaluation in a distributed setting.

Describe the expected behavior.

The model should run during evaluation without any errors when exposed to a distributed setting.

Contributing.

  • Do you want to contribute a PR? (yes/no): No.
  • If yes, please read this page for instructions
  • Briefly describe your candidate solution(if contributing):

Standalone code to reproduce the issue.

Colab Notebook: https://colab.research.google.com/drive/1U9oeed5OMAH1KvN5T455kAsB2Nsh1-KF?usp=sharing

Source code / logs.

ValueError: in user code:

    /usr/local/lib/python3.7/dist-packages/keras/engine/training.py:1330 test_function  *
        return step_function(self, iterator)
    /usr/local/lib/python3.7/dist-packages/keras/engine/training.py:1319 step_function  **
        data = next(iterator)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/input_lib.py:693 __next__
        return self.get_next()
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/input_lib.py:744 get_next
        self, self._strategy, return_per_replica=False)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/input_lib.py:611 _get_next_as_optional
        iterator._iterators[i].get_next_as_list())  # pylint: disable=protected-access
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/input_lib.py:1990 get_next_as_list
        strict=True,
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/util/dispatch.py:206 wrapper
        return target(*args, **kwargs)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/util/deprecation.py:549 new_func
        return func(*args, **kwargs)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/control_flow_ops.py:1254 cond
        return cond_v2.cond_v2(pred, true_fn, false_fn, name)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/cond_v2.py:95 cond_v2
        op_return_value=pred)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/func_graph.py:1007 func_graph_from_py_func
        func_outputs = python_func(*func_args, **func_kwargs)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/input_lib.py:1989 <lambda>
        lambda: _dummy_tensor_fn(data.element_spec),
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/input_lib.py:1853 _dummy_tensor_fn
        return nest.map_structure(create_dummy_tensor, value_structure)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/util/nest.py:869 map_structure
        structure[0], [func(*x) for x in entries],
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/util/nest.py:869 <listcomp>
        structure[0], [func(*x) for x in entries],
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/input_lib.py:1849 create_dummy_tensor
        dummy_tensor, (row_splits,) * spec._ragged_rank, validate=False)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/util/dispatch.py:206 wrapper
        return target(*args, **kwargs)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/ragged/ragged_tensor.py:745 from_nested_row_splits
        result = cls.from_row_splits(result, splits, validate=validate)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/util/dispatch.py:206 wrapper
        return target(*args, **kwargs)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/ragged/ragged_tensor.py:454 from_row_splits
        return cls._from_row_partition(values, row_partition, validate=validate)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/ragged/ragged_tensor.py:348 _from_row_partition
        return cls(values=values, internal=True, row_partition=row_partition)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/ragged/ragged_tensor.py:294 __init__
        values.shape.with_rank_at_least(1)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/tensor_shape.py:1078 with_rank_at_least
        raise ValueError("Shape %s must have rank at least %d" % (self, rank))

    ValueError: Shape () must have rank at least 1

Cc: @Nilabhra

Top GitHub Comments

1 reaction
edloper commented, Nov 19, 2021

It looks like there’s a bug in the create_dummy_tensor function in distribute/input_lib.py, where it does something fairly nonsensical if the rank of the feature is unknown. I’m not entirely clear on how these “dummy tensors” get used, but my best guess is that this could be fixed on TensorFlow’s end by a change such as this (new lines marked with “NEW”):

[tensorflow/python/distribute/input_lib.py, in create_dummy_tensor]
    if isinstance(spec, ragged_tensor.RaggedTensorSpec):
      if not dims:                                               ## NEW
        dummy_tensor = tf.zeros([0], feature_type)               ## NEW
      row_splits = array_ops.zeros(1, spec._row_splits_dtype)
      dummy_tensor = ragged_tensor.RaggedTensor.from_nested_row_splits(
          dummy_tensor, (row_splits,) * spec._ragged_rank, validate=False)
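For context, the constructor check that fires here requires a RaggedTensor's flat values to have rank at least 1, so a scalar dummy tensor fails exactly as in the traceback above. A minimal demonstration (ours, not from the issue):

import tensorflow as tf

row_splits = tf.zeros([1], dtype=tf.int64)  # row_splits [0]: an empty ragged structure

# Rank-1 flat values: fine.
ok = tf.RaggedTensor.from_nested_row_splits(tf.zeros([0]), (row_splits,))

# Rank-0 (scalar) flat values: raises
# ValueError: Shape () must have rank at least 1
try:
    tf.RaggedTensor.from_nested_row_splits(tf.zeros([]), (row_splits,),
                                           validate=False)
except ValueError as e:
    print(e)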

Alternatively, you could modify your data-loading code to ensure that at least the rank of the input feature is known. E.g., if I change read_ragged_feature in the linked colab to the following definition, then the colab works:

def read_ragged_feature(feature_name, feature, ragged_rank):
    ragged_feature = {}
    ragged_feature[feature_name] = deserialize_composite(
        feature, tf.RaggedTensorSpec(dtype=tf.int32, ragged_rank=ragged_rank),
    )
    ragged_feature[feature_name].flat_values.set_shape([None])  # NEW
    return ragged_feature

(This assumes that you statically know the rank of your input tensors – in this case, I was assuming that there are no “uniform inner” dimensions beyond the ragged dimensions, but you could adjust it if that’s not the case for you. If you don’t statically know the rank of your input tensors, then this won’t help, but I think having unknown ranks for input tensors is fairly rare.)
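If, unlike that assumption, the feature does have a uniform inner dimension of statically known width, the same trick still applies; you just pin the flat values to rank 2. A hypothetical variant (EMBED_DIM and the float dtype are illustrative, and deserialize_composite is the helper from the linked colab):

EMBED_DIM = 128  # illustrative width of the uniform inner dimension

def read_ragged_feature_with_inner_dim(feature_name, feature, ragged_rank):
    ragged_feature = {}
    ragged_feature[feature_name] = deserialize_composite(
        feature, tf.RaggedTensorSpec(dtype=tf.float32, ragged_rank=ragged_rank),
    )
    # flat_values now has static rank 2: [total_values, EMBED_DIM]
    ragged_feature[feature_name].flat_values.set_shape([None, EMBED_DIM])
    return ragged_feature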

0 reactions
Nilabhra commented, Nov 19, 2021

@edloper Thank you so much for taking the time to work on this bug. I can incorporate either of the solutions you provided, the second being the easier to apply. I hope the dev team takes notice of this and patches the bug soon.
