Cannot determine the cardinality of the output after applying glue_convert_examples_to_features [TF 2.2.0rcx]
🐛 Bug
Information
With TensorFlow 2.2.0 (2.2.0rc2) we should be able to get the number of entries in a dataset without iterating over it, using tf.data.experimental.cardinality.
One issue I found is that after applying glue_convert_examples_to_features,
tf.data.experimental.cardinality is no longer able to report the total number of entries. I first thought it was a bug in this TF 2.2.0 release candidate: https://github.com/tensorflow/tensorflow/issues/37998.
When using data from TensorFlow Datasets, tf.data.experimental.cardinality returns the number of elements:
print(data['train'])
print(tf.data.experimental.cardinality(data['train']))
<DatasetV1Adapter shapes: {idx: (), label: (), sentence: ()}, types: {idx: tf.int32, label: tf.int64, sentence: tf.string}>
tf.Tensor(67349, shape=(), dtype=int64)
Now, when I use the Hugging Face transformers function, which modifies the structure of the data:
train_dataset = glue_convert_examples_to_features(data['train'],
                                                  tokenizer,
                                                  max_length=128,
                                                  task='sst-2')
print(tf.data.experimental.cardinality(train_dataset))
<FlatMapDataset shapes: ({input_ids: (None,), attention_mask: (None,), token_type_ids: (None,)}, ()), types: ({input_ids: tf.int32, attention_mask: tf.int32, token_type_ids: tf.int32}, tf.int64)>
tf.Tensor(-2, shape=(), dtype=int64)
According to the TensorFlow issue above, when the input pipeline contains a flat_map it is generally not possible to statically determine the cardinality of the output from the cardinality of the input. However, I don't see any flat_map in this function. I am trying to identify which part of the code is responsible, and I am not 100% sure this is a transformers issue.
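For what it's worth, the -2 above is tf.data.experimental.UNKNOWN_CARDINALITY. One likely culprit (an assumption on my part, from reading transformers around v2.8.0) is that glue_convert_examples_to_features builds its output with tf.data.Dataset.from_generator, and TF cannot statically know how many items a Python generator will yield. A minimal sketch with a toy generator in place of the GLUE pipeline:

```python
import tensorflow as tf

# A dataset built from an in-memory sequence: the length is known statically.
known_ds = tf.data.Dataset.from_tensor_slices(list(range(10)))
known = tf.data.experimental.cardinality(known_ds)   # 10

# A dataset built from a generator, as glue_convert_examples_to_features
# appears to do internally (assumption): TF cannot know how many items
# the generator will yield, so cardinality reports UNKNOWN (-2).
gen_ds = tf.data.Dataset.from_generator(
    lambda: iter(range(10)), output_types=tf.int32)
unknown = tf.data.experimental.cardinality(gen_ds)
```

So the -2 is not specific to flat_map; any pipeline stage whose output length TF cannot infer statically produces it.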
To reproduce
Steps to reproduce the behavior:
import tensorflow as tf
import tensorflow_datasets
from transformers import (
BertConfig,
BertTokenizer,
TFBertModel,
TFBertForSequenceClassification,
glue_convert_examples_to_features,
glue_processors
)
data, info = tensorflow_datasets.load(name='glue/sst2',
                                      data_dir='/tmp/',
                                      with_info=True)
pretrained_weights = 'bert-base-multilingual-uncased'
# Load tokenizer
tokenizer = BertTokenizer.from_pretrained(pretrained_weights)
# recap of input dataset
print(data['train'])
print(tf.data.experimental.cardinality(data['train']))
# Prepare data for BERT
train_dataset = glue_convert_examples_to_features(data['train'],
                                                  tokenizer,
                                                  max_length=128,
                                                  task='sst-2')
# recap of pre processing dataset
print(train_dataset)
print(tf.data.experimental.cardinality(train_dataset))
Expected behavior
I expect tf.data.experimental.cardinality to still be able to report the total number of entries after transforming the data with glue_convert_examples_to_features.
Environment info
- transformers version: 2.8.0
- Platform: macOS 10.14.6
- Python version: 3.7.5
- Tensorflow version (CPU): 2.2.0rc2 (v2.2.0-rc1-34-ge6e5d6df2a 2.2.0-rc2)
Issue Analytics
- Created 3 years ago
- Reactions: 1
- Comments: 6 (4 by maintainers)
Top GitHub Comments
Good points 😃
I did not know about
tf.data.experimental.assert_cardinality(len_examples)
as I have not fully dived into TF 2.2 yet, but it looks very interesting, thanks for the hint 😃 I will be happy to discuss this again; sorry that I could not find a suitable solution to your issue.
Hi @jplu,
I understand that using an experimental feature of TensorFlow may introduce some instability. That is a fair point, and
tf.data.experimental.assert_cardinality
is only available with TF 2.2.0. My main points and use cases are:
1. Computing the number of elements in a dataset is very time consuming, since you need to loop over all elements.
2. Even if the data come from
tfds.load()
you usually need some cleaning or preprocessing steps, or you may want to resample your train/test/valid splits. In that case the total number from the metadata (info) no longer helps, since it has changed. This is a normal step in any ML project.
3. In the version of the code I was looking at, the length was computed anyway (this no longer seems to be the case after the big clean-up from 3 days ago). This was my main argument: since you compute the total number of elements in any case, why not simply assert the cardinality, so that any dataset produced by glue_convert_examples_to_features carries the total number of elements it contains, for free (no extra computation required)?
4. tf.data.experimental.assert_cardinality(len_examples) is experimental and requires TF 2.2.0, and at the head of the code the length no longer seems to be computed.
5. One issue is that as soon as I store the data preprocessed with glue_convert_examples_to_features as TFRecord files, the cardinality will be lost.
Conclusion: I will take care of asserting the cardinality in my own code, and I hope that once TF 2.2.0 is the main release and cardinality is more stable we can revisit this topic.
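The trade-off between counting by iteration and asserting a known length can be sketched as follows (a toy generator stands in for the output of glue_convert_examples_to_features or TFRecords read back from disk; in a real pipeline the known length would come from tfds metadata such as info.splits['train'].num_examples):

```python
import tensorflow as tf

# Toy stand-in for a dataset whose length TF cannot infer statically.
ds = tf.data.Dataset.from_generator(
    lambda: iter(range(1000)), output_types=tf.int32)

# Option 1: count by iterating over every element. Always works,
# but costs a full pass over the data.
count = int(ds.reduce(tf.constant(0, tf.int64), lambda n, _: n + 1))

# Option 2 (TF >= 2.2): stamp the externally known length onto the
# dataset. No pass over the data is required; TF raises at iteration
# time if the asserted cardinality turns out to be wrong.
ds = ds.apply(tf.data.experimental.assert_cardinality(count))
cardinality = int(tf.data.experimental.cardinality(ds))
```

After the assert_cardinality step, downstream code can query the dataset's length for free instead of repeating the full pass.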