Cannot determine the cardinality of the output after applying glue_convert_examples_to_features [TF 2.2.0rcx]
🐛 Bug
Information
With TensorFlow 2.2.0 (2.2.0rc2) we should be able to get the number of entries in a dataset without iterating over it, using tf.data.experimental.cardinality.
One issue I found is that after applying glue_convert_examples_to_features,
tf.data.experimental.cardinality is no longer able to report the total number of entries. I first thought it was a bug in this TF 2.2.0 release candidate: https://github.com/tensorflow/tensorflow/issues/37998.
When using data from TensorFlow Datasets, tf.data.experimental.cardinality returns the number of elements:
print(data['train'])
print(tf.data.experimental.cardinality(data['train']))
<DatasetV1Adapter shapes: {idx: (), label: (), sentence: ()}, types: {idx: tf.int32, label: tf.int64, sentence: tf.string}>
tf.Tensor(67349, shape=(), dtype=int64)
Now, when I use the Hugging Face transformers function, which modifies the structure of the data:
train_dataset = glue_convert_examples_to_features(data['train'],
                                                  tokenizer,
                                                  max_length=128,
                                                  task='sst-2')
print(tf.data.experimental.cardinality(train_dataset))
<FlatMapDataset shapes: ({input_ids: (None,), attention_mask: (None,), token_type_ids: (None,)}, ()), types: ({input_ids: tf.int32, attention_mask: tf.int32, token_type_ids: tf.int32}, tf.int64)>
tf.Tensor(-2, shape=(), dtype=int64)
According to the TensorFlow issue above, when the input pipeline contains a flat_map it is generally not possible to statically determine the cardinality of the output from the cardinality of the input. However, I don't see any flat_map in this function. I am trying to identify which part of the code is responsible, and I am not 100% sure this is a transformers issue.
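For what it's worth, the -2 above is tf.data.experimental.UNKNOWN_CARDINALITY. One likely culprit (an assumption on my part, from reading transformers around v2.8.0) is that glue_convert_examples_to_features builds its output with tf.data.Dataset.from_generator, and TF cannot statically know how many items a Python generator will yield. A minimal sketch with a toy generator in place of the GLUE pipeline:

```python
import tensorflow as tf

# A dataset built from an in-memory sequence: the length is known statically.
known_ds = tf.data.Dataset.from_tensor_slices(list(range(10)))
known = tf.data.experimental.cardinality(known_ds)   # 10

# A dataset built from a generator, as glue_convert_examples_to_features
# appears to do internally (assumption): TF cannot know how many items
# the generator will yield, so cardinality reports UNKNOWN (-2).
gen_ds = tf.data.Dataset.from_generator(
    lambda: iter(range(10)), output_types=tf.int32)
unknown = tf.data.experimental.cardinality(gen_ds)
```

So the -2 is not specific to flat_map; any pipeline stage whose output length TF cannot infer statically produces it.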
To reproduce
Steps to reproduce the behavior:
import tensorflow as tf
import tensorflow_datasets
from transformers import (
BertConfig,
BertTokenizer,
TFBertModel,
TFBertForSequenceClassification,
glue_convert_examples_to_features,
glue_processors
)
data, info = tensorflow_datasets.load(name='glue/sst2',
                                      data_dir='/tmp/',
                                      with_info=True)
pretrained_weights = 'bert-base-multilingual-uncased'
# Load tokenizer
tokenizer = BertTokenizer.from_pretrained(pretrained_weights)
# recap of input dataset
print(data['train'])
print(tf.data.experimental.cardinality(data['train']))
# Prepare data for BERT
train_dataset = glue_convert_examples_to_features(data['train'],
                                                  tokenizer,
                                                  max_length=128,
                                                  task='sst-2')
# recap of pre processing dataset
print(train_dataset)
print(tf.data.experimental.cardinality(train_dataset))
Expected behavior
I expect tf.data.experimental.cardinality to still be able to report the total number of entries after transforming the data with glue_convert_examples_to_features.
Environment info
- transformers version: 2.8.0
- Platform: macOS 10.14.6
- Python version: 3.7.5
- Tensorflow version (CPU): 2.2.0rc2 (v2.2.0-rc1-34-ge6e5d6df2a 2.2.0-rc2)
Issue Analytics
- Created 3 years ago
- Reactions: 1
- Comments: 6 (4 by maintainers)
Top GitHub Comments
Good points 😃
I did not know about
tf.data.experimental.assert_cardinality(len_examples)
as I have not fully dived into TF 2.2 yet, but it looks very interesting, thanks for the hint 😃 I will be happy to discuss this again; sorry that I could not find a suitable solution to your issue.
Hi @jplu,
I understand that using an experimental feature of TensorFlow may introduce some instability. That is a fair point, and
tf.data.experimental.assert_cardinality
is only available with TF 2.2.0. My main points and use cases are:
1. Computing the number of elements in a dataset is very time consuming, since you need to loop over all elements.
2. Even if the data come from
tfds.load()
you usually need some cleaning or preprocessing steps, or you may want to resample your train/test/valid splits. In that case the total number from the metadata (info) no longer helps, since it has changed. This is a normal step in any ML project.
3. In the version of the code I was looking at, the length was computed anyway (this no longer seems to be the case after the big clean-up from 3 days ago). This was my main argument: since you compute the total number of elements in any case, why not simply assert the cardinality, so that any dataset produced by glue_convert_examples_to_features carries the total number of elements it contains, for free (no extra computation required)?
4. tf.data.experimental.assert_cardinality(len_examples) is experimental and requires TF 2.2.0, and at the head of the code the length no longer seems to be computed.
5. One issue is that as soon as I store the data preprocessed with glue_convert_examples_to_features as TFRecord files, the cardinality will be lost.
Conclusion: I will take care of asserting the cardinality in my own code, and I hope that once TF 2.2.0 is the main release and cardinality is more stable we can revisit this topic.
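The trade-off between counting by iteration and asserting a known length can be sketched as follows (a toy generator stands in for the output of glue_convert_examples_to_features or TFRecords read back from disk; in a real pipeline the known length would come from tfds metadata such as info.splits['train'].num_examples):

```python
import tensorflow as tf

# Toy stand-in for a dataset whose length TF cannot infer statically.
ds = tf.data.Dataset.from_generator(
    lambda: iter(range(1000)), output_types=tf.int32)

# Option 1: count by iterating over every element. Always works,
# but costs a full pass over the data.
count = int(ds.reduce(tf.constant(0, tf.int64), lambda n, _: n + 1))

# Option 2 (TF >= 2.2): stamp the externally known length onto the
# dataset. No pass over the data is required; TF raises at iteration
# time if the asserted cardinality turns out to be wrong.
ds = ds.apply(tf.data.experimental.assert_cardinality(count))
cardinality = int(tf.data.experimental.cardinality(ds))
```

After the assert_cardinality step, downstream code can query the dataset's length for free instead of repeating the full pass.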