
Bug - TFBertForSequenceClassification on SQuAD data


šŸ› Bug

Information

I'm using TFBertForSequenceClassification on SQuAD v1 data.

The problem arises when using:

  • Both official example scripts and my own modified scripts

The tasks I am working on are:

  • the official SQuAD v1 dataset and my own SQuAD v1 data.

To reproduce

Try 1 - with the official SQuAD dataset via tensorflow_datasets.load("squad"), mimicking the following official reference:

https://github.com/huggingface/transformers#quick-tour-tf-20-training-and-pytorch-interoperability

import tensorflow as tf
from transformers import TFBertForSequenceClassification, BertTokenizer, \
    squad_convert_examples_to_features, SquadV1Processor
import tensorflow_datasets

model = TFBertForSequenceClassification.from_pretrained("bert-base-cased")
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

data = tensorflow_datasets.load("squad")
processor = SquadV1Processor()
examples = processor.get_examples_from_dataset(data, evaluate=False)

dataset_features = squad_convert_examples_to_features(examples=examples, tokenizer=tokenizer, max_seq_length=384, doc_stride=128, max_query_length=64, is_training=True, return_dataset='tf')
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(reduction=tf.keras.losses.Reduction.NONE, from_logits=True)
opt = tf.keras.optimizers.Adam(learning_rate=3e-5)

model.compile(optimizer=opt,
              loss={'start_position': loss_fn, 'end_position': loss_fn},
              loss_weights={'start_position': 1., 'end_position': 1.},
              metrics=['accuracy'])

model.fit(dataset_features, epochs=3)

Stack trace - the failure occurs in squad_convert_examples_to_features:

convert squad examples to features:   0%|             | 0/10570 [00:00<?, ?it/s]
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/site-packages/transformers/data/processors/squad.py", line 95, in squad_convert_example_to_features
    cleaned_answer_text = " ".join(whitespace_tokenize(example.answer_text))
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/site-packages/transformers/tokenization_bert.py", line 112, in whitespace_tokenize
    text = text.strip()
AttributeError: 'NoneType' object has no attribute 'strip'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ec2-user/yonatab/ZeroShot/transformers_experiments/src/examples_git/huggingface_tf_example_squad.py", line 18, in <module>
    dataset_features = squad_convert_examples_to_features(examples=examples, tokenizer=tokenizer, max_seq_length=384, doc_stride=128, max_query_length=64, is_training=True, return_dataset='tf')
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/site-packages/transformers/data/processors/squad.py", line 327, in squad_convert_examples_to_features
    disable=not tqdm_enabled,
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/site-packages/tqdm/std.py", line 1129, in __iter__
    for obj in iterable:
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/multiprocessing/pool.py", line 320, in <genexpr>
    return (item for chunk in result for item in chunk)
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/multiprocessing/pool.py", line 735, in next
    raise value
AttributeError: 'NoneType' object has no attribute 'strip'
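A plausible reading of this trace (an assumption, not confirmed in the thread) is that some examples returned by get_examples_from_dataset carry answer_text=None, which then crashes whitespace_tokenize. A minimal defensive filter, applied before feature conversion, would sidestep the crash; the attribute name answer_text follows the transformers SquadExample class, and drop_unanswered is a hypothetical helper, not part of the library:

```python
def drop_unanswered(examples):
    """Keep only SQuAD examples that actually have an answer text.

    Hypothetical workaround: examples whose `answer_text` is None would
    otherwise crash `whitespace_tokenize` inside
    `squad_convert_examples_to_features`.
    """
    return [ex for ex in examples
            if getattr(ex, "answer_text", None) is not None]
```

One would then call `examples = drop_unanswered(examples)` before `squad_convert_examples_to_features`. This hides, rather than explains, why the answers are missing, so it is a diagnostic aid more than a fix.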

Try 2 - reading data from file, mimicking the following official reference: https://colab.research.google.com/github/huggingface/nlp/blob/master/notebooks/Overview.ipynb

import tensorflow as tf
from transformers import TFBertForSequenceClassification, BertTokenizer, \
    squad_convert_examples_to_features, SquadV1Processor
import tensorflow_datasets

model = TFBertForSequenceClassification.from_pretrained("bert-base-cased")
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

data = tensorflow_datasets.load("squad", data_dir='/data/users/yonatab/zero_shot_data/datasets_refs')
processor = SquadV1Processor()
examples = processor.get_examples_from_dataset(data, evaluate=True)

dataset_features = squad_convert_examples_to_features(examples=examples, tokenizer=tokenizer, max_seq_length=384, doc_stride=128, max_query_length=64, is_training=True, return_dataset='tf')
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(reduction=tf.keras.losses.Reduction.NONE, from_logits=True)
opt = tf.keras.optimizers.Adam(learning_rate=3e-5)

model.compile(optimizer=opt,
              loss={'start_position': loss_fn, 'end_position': loss_fn},
              loss_weights={'start_position': 1., 'end_position': 1.},
              metrics=['accuracy'])

model.fit(dataset_features, epochs=3)

Stack trace - the failure occurs in the fit method:

Traceback (most recent call last):
  File "/home/ec2-user/yonatab/ZeroShot/transformers_experiments/src/examples_git/minimal_example_for_git.py", line 97, in <module>
    main()
  File "/home/ec2-user/yonatab/ZeroShot/transformers_experiments/src/examples_git/minimal_example_for_git.py", line 69, in main
    history = model.fit(tfdataset, epochs=1, steps_per_epoch=3)
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training.py", line 819, in fit
    use_multiprocessing=use_multiprocessing)
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 235, in fit
    use_multiprocessing=use_multiprocessing)
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 593, in _process_training_inputs
    use_multiprocessing=use_multiprocessing)
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 706, in _process_inputs
    use_multiprocessing=use_multiprocessing)
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/data_adapter.py", line 702, in __init__
    x = standardize_function(x)
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 660, in standardize_function
    standardize(dataset, extract_tensors_from_dataset=False)
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training.py", line 2360, in _standardize_user_data
    self._compile_from_inputs(all_inputs, y_input, x, y)
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training.py", line 2580, in _compile_from_inputs
    target, self.outputs)
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_utils.py", line 1341, in cast_if_floating_dtype_and_mismatch
    if target.dtype != out.dtype:
AttributeError: 'str' object has no attribute 'dtype'
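One reading of this trace (an assumption, not confirmed here) is a naming mismatch: compile() is given loss keys 'start_position' and 'end_position', but the subclassed TF model's outputs are not named that in TF 2.1, so Keras ends up treating the raw string key as a target tensor and fails on target.dtype. A minimal sketch of the remapping idea, with the output names 'output_1'/'output_2' assumed (they match the keys used in the maintainer's answer in this thread):

```python
def rename_labels(labels):
    """Remap dataset label keys onto the (assumed) Keras output names,
    so the keys passed to compile(loss=...) actually match model outputs."""
    return {"output_1": labels["start_position"],
            "output_2": labels["end_position"]}

# In practice this would be applied per dataset element, e.g.:
# dataset = dataset.map(lambda x, y: (x, rename_labels(y)))
```

The key point is that whatever names appear in the loss/loss_weights dicts must exist as output names of the model; otherwise Keras has nothing to bind them to.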

Try 3

import tensorflow as tf
from transformers import (BertTokenizer, TFBertForQuestionAnswering,
                          squad_convert_examples_to_features, SquadV1Processor)

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

processor = SquadV1Processor()
examples = processor.get_train_examples(args.data_dir, filename=args.train_file)
dataset_features = squad_convert_examples_to_features(examples=examples, tokenizer=tokenizer, max_seq_length=384,
                                                      doc_stride=128, max_query_length=64, is_training=True,
                                                      return_dataset='tf')

model = TFBertForQuestionAnswering.from_pretrained("bert-base-cased")

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(reduction=tf.keras.losses.Reduction.NONE, from_logits=True)
opt = tf.keras.optimizers.Adam(learning_rate=3e-5)

model.compile(optimizer=opt,
              loss={'start_position': loss_fn, 'end_position': loss_fn},
              loss_weights={'start_position': 1., 'end_position': 1.},
              metrics=['accuracy'])

history = model.fit(dataset_features, epochs=1)

Stack trace - the failure occurs in the fit method:

Traceback (most recent call last):
  File "/home/ec2-user/yonatab/ZeroShot/transformers_experiments/src/examples_git/reading_from_file.py", line 39, in <module>
    main()
  File "/home/ec2-user/yonatab/ZeroShot/transformers_experiments/src/examples_git/reading_from_file.py", line 32, in main
    history = model.fit(dataset_features, epochs=1)
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training.py", line 819, in fit
    use_multiprocessing=use_multiprocessing)
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 235, in fit
    use_multiprocessing=use_multiprocessing)
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 593, in _process_training_inputs
    use_multiprocessing=use_multiprocessing)
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 706, in _process_inputs
    use_multiprocessing=use_multiprocessing)
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/data_adapter.py", line 702, in __init__
    x = standardize_function(x)
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 660, in standardize_function
    standardize(dataset, extract_tensors_from_dataset=False)
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training.py", line 2360, in _standardize_user_data
    self._compile_from_inputs(all_inputs, y_input, x, y)
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training.py", line 2580, in _compile_from_inputs
    target, self.outputs)
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_utils.py", line 1341, in cast_if_floating_dtype_and_mismatch
    if target.dtype != out.dtype:
AttributeError: 'str' object has no attribute 'dtype'

Try 4 - after the first comment here

I'm using the code of run_tf_squad.py, and instead of the TFTrainer I'm trying to use fit. This is the only change I made - same dataset, same examples, same features. Just trying to use fit.

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(reduction=tf.keras.losses.Reduction.NONE, from_logits=True)
opt = tf.keras.optimizers.Adam(learning_rate=3e-5)

model.compile(optimizer=opt,
              loss={'output_1': loss_fn, 'output_2': loss_fn},
              loss_weights={'output_1': 1., 'output_2': 1.},
              metrics=['accuracy'])

history = model.fit(train_dataset, validation_data=eval_dataset, epochs=1)

And the same problem occurs:

Traceback (most recent call last):
  File "/home/ec2-user/yonatab/ZeroShot/transformers_experiments/src/run_squad_tf.py", line 257, in <module>
    main()
  File "/home/ec2-user/yonatab/ZeroShot/transformers_experiments/src/run_squad_tf.py", line 242, in main
    history = model.fit(train_dataset, validation_data=eval_dataset, epochs=1)
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training.py", line 819, in fit
    use_multiprocessing=use_multiprocessing)
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 235, in fit
    use_multiprocessing=use_multiprocessing)
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 593, in _process_training_inputs
    use_multiprocessing=use_multiprocessing)
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 706, in _process_inputs
    use_multiprocessing=use_multiprocessing)
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/data_adapter.py", line 702, in __init__
    x = standardize_function(x)
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 660, in standardize_function
    standardize(dataset, extract_tensors_from_dataset=False)
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training.py", line 2360, in _standardize_user_data
    self._compile_from_inputs(all_inputs, y_input, x, y)
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training.py", line 2580, in _compile_from_inputs
    target, self.outputs)
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_utils.py", line 1341, in cast_if_floating_dtype_and_mismatch
    if target.dtype != out.dtype:
AttributeError: 'str' object has no attribute 'dtype'

Expected behavior

I want to be able to use fit on my own squad data.

Environment info

  • transformers version: 2.9.1
  • Platform: Linux
  • Python version: 3.6.6
  • PyTorch version (GPU?): - Using tensorflow
  • Tensorflow version (GPU?): 2.1.0
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Edit: Keras has a new tutorial for it: https://keras.io/examples/nlp/text_extraction_with_bert/

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 11 (5 by maintainers)

Top GitHub Comments

1 reaction
jplu commented, May 24, 2020

I very quickly coded this, so it is not really tested, but it can give you an idea of how to use the .fit() method. It is based on the Colab version proposed for the nlp framework.

from transformers import (
    BertTokenizerFast,
    TFBertForQuestionAnswering,
)
import tensorflow_datasets as tfds
import tensorflow as tf

ds = tfds.load("squad")

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = TFBertForQuestionAnswering.from_pretrained("bert-base-cased")

def get_correct_alignement(context, gold_text, start_idx):
    end_idx = start_idx + len(gold_text)
    if context[start_idx:end_idx] == gold_text:
        return start_idx, end_idx       # When the gold label position is good
    elif context[start_idx-1:end_idx-1] == gold_text:
        return start_idx-1, end_idx-1   # When the gold label is off by one character
    elif context[start_idx-2:end_idx-2] == gold_text:
        return start_idx-2, end_idx-2   # When the gold label is off by two characters
    else:
        raise ValueError()

def convert_to_tf_features(example, training=True):
    encodings = tokenizer.encode_plus(example["context"].numpy().decode("utf-8"), example["question"].numpy().decode("utf-8"), pad_to_max_length=True, max_length=512)
    start_positions, end_positions = [], []

    if training:
        start_idx, end_idx = get_correct_alignement(example["context"].numpy().decode("utf-8"), example["answers"]["text"][0].numpy().decode("utf-8"), example["answers"]["answer_start"][0].numpy())
        start = encodings.char_to_token(0, start_idx)
        end = encodings.char_to_token(0, end_idx-1)

        if start is None or end is None:
            return None, None

        start_positions.append(start)
        end_positions.append(end)
    else:
        for start, text in zip(example["answers"]["answer_start"], example["answers"]["text"]):
            start_idx, end_idx = get_correct_alignement(example["context"].numpy().decode("utf-8"), text.numpy().decode("utf-8"), start.numpy())

            start = encodings.char_to_token(0, start_idx)
            end = encodings.char_to_token(0, end_idx-1)

            if start is None or end is None:
                return None, None

            start_positions.append(start)
            end_positions.append(end)

    if start_positions and end_positions:
        encodings.update({'output_1': start_positions,
                          'output_2': end_positions})

    return encodings, {'output_1': start_positions, 'output_2': end_positions}

train_features = {}
train_labels = {}
for item in ds["train"]:
    feature, label = convert_to_tf_features(item)
    if feature is not None and label is not None:
        for k, v in feature.items():
            train_features.setdefault(k, []).append(v)
        for k, v in label.items():
            train_labels.setdefault(k, []).append(v)

train_tfdataset = tf.data.Dataset.from_tensor_slices((train_features, train_labels)).batch(8)

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(reduction=tf.keras.losses.Reduction.NONE, from_logits=True)
opt = tf.keras.optimizers.Adam(learning_rate=3e-5)
model.compile(optimizer=opt,
              loss={'output_1': loss_fn, 'output_2': loss_fn},
              loss_weights={'output_1': 1., 'output_2': 1.},
              metrics=['accuracy'])

model.fit(train_tfdataset, epochs=1, steps_per_epoch=3)
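The get_correct_alignement helper above exists because the answer_start offsets recorded in SQuAD are occasionally off by one or two characters. A self-contained sketch of the same idea, runnable without the dataset or tokenizer (find_answer_span is a hypothetical name, not from the comment):

```python
def find_answer_span(context, gold_text, start_idx):
    """Same idea as get_correct_alignement above: the recorded answer
    offset in SQuAD can be off by one or two characters, so try small
    backward shifts until the gold text matches."""
    end_idx = start_idx + len(gold_text)
    for shift in (0, -1, -2):
        s, e = start_idx + shift, end_idx + shift
        if s >= 0 and context[s:e] == gold_text:
            return s, e
    raise ValueError("answer not found near the recorded offset")
```

The exact match is tried first, so a correct offset is returned unchanged; only misaligned offsets get shifted.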
0 reactions
Vanpesy commented, Jun 15, 2020

Hi, how did you solve the Try 1 problem? AttributeError: 'NoneType' object has no attribute 'strip'
