
cannot determine what will be the cardinality of the output after applying glue_convert_examples_to_features [TF 2.2.0rcx]


🐛 Bug

Information

With TensorFlow 2.2.0 (2.2.0rc2) we should be able to get the number of entries in a dataset without iterating over it, by using tf.data.experimental.cardinality.

One issue I found is that after applying glue_convert_examples_to_features, tf.data.experimental.cardinality is no longer able to determine the total number of entries. At first I thought it was a bug in this TF 2.2.0 release candidate: https://github.com/tensorflow/tensorflow/issues/37998.

When using data loaded from TensorFlow Datasets, tf.data.experimental.cardinality returns the number of entries:

print(data['train'])
print(tf.data.experimental.cardinality(data['train'])) 
<DatasetV1Adapter shapes: {idx: (), label: (), sentence: ()}, types: {idx: tf.int32, label: tf.int64, sentence: tf.string}>
tf.Tensor(67349, shape=(), dtype=int64)

Now, when I use the Hugging Face transformers function that modifies the structure of the data:

train_dataset = glue_convert_examples_to_features(data['train'], 
                                                  tokenizer, 
                                                  max_length=128, 
                                                  task='sst-2')

print(train_dataset)
print(tf.data.experimental.cardinality(train_dataset))
<FlatMapDataset shapes: ({input_ids: (None,), attention_mask: (None,), token_type_ids: (None,)}, ()), types: ({input_ids: tf.int32, attention_mask: tf.int32, token_type_ids: tf.int32}, tf.int64)>
tf.Tensor(-2, shape=(), dtype=int64)

When the input pipeline contains a flat_map, it is generally not possible to statically determine the cardinality of the output from the cardinality of the input (the -2 above is the tf.data.experimental.UNKNOWN_CARDINALITY sentinel). However, I don't see any flat_map in this function. I am trying to identify which part of the code is responsible; I am not 100% sure this is a transformers issue.
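For context, a minimal sketch (assuming eager execution under TF 2.2) of checking for the unknown-cardinality sentinel and falling back to counting by iteration, which is exactly the expensive step we would like to avoid:

import tensorflow as tf

# tf.data.experimental.cardinality returns -2 (UNKNOWN_CARDINALITY) when the
# size cannot be determined statically, and -1 for infinite datasets.
card = tf.data.experimental.cardinality(train_dataset)
if card.numpy() == tf.data.experimental.UNKNOWN_CARDINALITY:
    # Brute-force fallback: touches every element of the dataset.
    n = sum(1 for _ in train_dataset)
    print(f"counted {n} examples by iteration")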

To reproduce

Steps to reproduce the behavior:

import tensorflow as tf
import tensorflow_datasets

from transformers import (
    BertConfig,
    BertTokenizer,
    TFBertModel,
    TFBertForSequenceClassification,
    glue_convert_examples_to_features,
    glue_processors
)

# Load the SST-2 task of GLUE from TensorFlow Datasets
data, info = tensorflow_datasets.load(name='glue/sst2',
                                      data_dir='/tmp/',
                                      with_info=True)

pretrained_weights = 'bert-base-multilingual-uncased'

# Load tokenizer
tokenizer = BertTokenizer.from_pretrained(pretrained_weights)

# recap of input dataset
print(data['train'])
print(tf.data.experimental.cardinality(data['train']))

# Prepare data for BERT
train_dataset = glue_convert_examples_to_features(data['train'], 
                                                  tokenizer, 
                                                  max_length=128, 
                                                  task='sst-2')

# recap of pre processing dataset
print(train_dataset)
print(tf.data.experimental.cardinality(train_dataset))

Expected behavior

I expect tf.data.experimental.cardinality to still report the total number of entries after the data is transformed with glue_convert_examples_to_features.

Environment info

  • transformers version: 2.8.0
  • Platform: macOS 10.14.6
  • Python version: 3.7.5
  • Tensorflow version (CPU): 2.2.0rc2 (v2.2.0-rc1-34-ge6e5d6df2a 2.2.0-rc2)


Top GitHub Comments

jplu commented on Apr 13, 2020:

Good points 😃

  1. I fully agree with you, it is really painful.
  2. If you need to change the content of a dataset (whether through preprocessing or not), such as when doing cross-validation, then indeed you have to recompute the size.
  3. There is currently a project to fully review and rework the data processing part of the library, so it should be much more convenient to use once done. Until then, indeed, it is a bit of a mess.
  4. I was not aware of the new tf.data.experimental.assert_cardinality(len_examples), as I have not fully dived into TF 2.2 yet, but it looks very interesting, thanks for the hint 😃 (a sketch follows this list).
  5. Indeed, the size cannot be computed from TFRecords, which is a real problem IMHO. I hope it will be much easier to get the size of a dataset in future releases ^^
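A minimal sketch of what point 4 suggests, reusing the names from the reproduction script above; it assumes glue_convert_examples_to_features keeps every example, so the split size from the tfds metadata (info) is still valid for the converted dataset:

import tensorflow as tf

# Number of training examples reported by the tfds metadata loaded above.
# Assumption: the conversion drops no examples, so this count still applies.
len_examples = info.splits['train'].num_examples

# Attach the known cardinality to the converted dataset.
train_dataset = train_dataset.apply(
    tf.data.experimental.assert_cardinality(len_examples))

print(tf.data.experimental.cardinality(train_dataset))  # tf.Tensor(67349, ...)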

I will be happy to discuss this again; I am very sorry that I could not find a suitable solution to your issue.

tarrade commented on Apr 13, 2020:

Hi @jplu,

I understand that using experimental features of TensorFlow may introduce some instability. That is a fair point, and tf.data.experimental.assert_cardinality is only available with TF 2.2.0.

My main points and use cases are:

  1. Computing the number of elements in a sample is very time consuming, since you need to loop over all elements.
  2. Even if the data comes from tfds.load(), you normally need some cleaning or preprocessing steps, or you may want to resample your train/test/valid splits. In that case the total number from the metadata (info) will not help, since it has changed. This is a normal step in any ML project.
  3. In the version of the code I was looking at, the length was computed anyway (this no longer seems to be the case after the big clean-up from 3 days ago). This was my main argument: since the total number of examples is computed in any case, why not simply assert the cardinality, so that any dataset produced by glue_convert_examples_to_features carries the total number of entries it contains, for free (no extra computation required)?
  4. Now, tf.data.experimental.assert_cardinality(len_examples) is experimental and requires TF 2.2.0, and at the head of the code the length no longer seems to be computed.
  5. Another issue is that as soon as I store the data preprocessed with glue_convert_examples_to_features as TFRecord files, the cardinality will be lost (see the sketch after this list).
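A sketch of the TFRecord round-trip from point 5; the file path and the externally tracked example count are hypothetical, since TFRecord files do not store the number of records:

import tensorflow as tf

# Hypothetical TFRecord files written from the converted dataset; the count
# must be tracked externally, e.g. in a sidecar metadata file.
filenames = ['/tmp/sst2_train.tfrecord']
num_examples = 67349

raw_dataset = tf.data.TFRecordDataset(filenames)
print(tf.data.experimental.cardinality(raw_dataset))  # -2: unknown again

# Re-assert the cardinality that was known before serialization.
raw_dataset = raw_dataset.apply(
    tf.data.experimental.assert_cardinality(num_examples))
print(tf.data.experimental.cardinality(raw_dataset))  # 67349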

Conclusion: I will take care of asserting the cardinality in my own code, and I hope that once TF 2.2.0 is the main release and cardinality is more stable, we can revisit this topic.
