
Model not training beyond 1st epoch


Environment info

  • transformers version: 4.4.0.dev0
  • Platform: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.6.9
  • PyTorch version (GPU?): 1.7.0+cu101 (True)
  • Tensorflow version (GPU?): 2.4.1 (True)
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No (single GPU on Colab)

Who can help

Models:

Information

Model I am using (Bert, XLNet …): RoBERTa

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

First off, this issue is basically a continuation of #10055, but since that error was mostly resolved, I have opened a new issue. I am using a private dataset, so I am not at liberty to share it. However, I can give an idea of what the CSV looks like:


,ID,Text,Label
......................
Id_1, "Lorem Ipsum", 14
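
The train and val objects used in the code below are pandas DataFrames holding the two splits of this CSV; they are not defined in the snippet itself. As a rough sketch only (the file name, the index column, and the split ratio are assumptions, since the real dataset is private), they could be created like this:

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file name; the real dataset is private.
# index_col=0 assumes the first, unnamed column is just a row index.
df = pd.read_csv("data.csv", index_col=0)

# Simple 90/10 split into the `train` and `val` DataFrames used below.
train, val = train_test_split(df, test_size=0.1, random_state=42)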

This is the code:


!git clone https://github.com/huggingface/transformers.git
!cd transformers
!pip install -e .

train_text = list(train['Text'].values)
train_label = list(train['Label'].values)

val_text = list(val['Text'].values)
val_label = list(val['Label'].values)

from transformers import RobertaTokenizer, TFRobertaForSequenceClassification
import tensorflow as tf

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = TFRobertaForSequenceClassification.from_pretrained('roberta-base')

train_encodings = tokenizer(train_text, truncation=True, padding=True)
val_encodings = tokenizer(val_text, truncation=True, padding=True)

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_label
))
val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    val_label
))

#----------------------------------------------------------------------------------------------------------------------
# Since the Trainer does not work, I will use the native one
from transformers import TFTrainingArguments, TFTrainer

training_args = TFTrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

with training_args.strategy.scope():
    model = TFRobertaForSequenceClassification.from_pretrained("roberta-base")

trainer = TFTrainer(
    model=model,                         # the instantiated Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset             # evaluation dataset
)

trainer.train()
#----------------------------------------------------------------------------------------------------------------------
# Using native TensorFlow

from transformers import TFRobertaForSequenceClassification
import tensorflow as tf

model = TFRobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=1)

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-18)

loss_fn = tf.keras.losses.CategoricalCrossentropy(from_logits=True)

model.compile(optimizer=optimizer, loss=loss_fn, metrics=['accuracy']) # can also use any keras loss fn
model.fit(train_dataset.batch(8), validation_data = val_dataset.batch(64), epochs=15, batch_size=8)

The Problems:

  • Cannot train using TFTrainer: trainer.train() executes without error, but it does nothing and training never starts. This is not a major issue on its own, but it may be related to the problem below.
  • The model does not train beyond the first epoch. In the log below you can clearly see that the loss and accuracy never change after epoch 1; the remaining epochs simply repeat what the first one produced:
All model checkpoint layers were used when initializing TFRobertaForSequenceClassification.

Some layers of TFRobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Epoch 1/5
WARNING:tensorflow:The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
WARNING:tensorflow:AutoGraph could not transform <bound method Socket.send of <zmq.sugar.socket.Socket object at 0x7f5b14f1b6c8>> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: <cyfunction Socket.send at 0x7f5b323fb2a0> is not a module, class, method, function, traceback, frame, or code object
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING: AutoGraph could not transform <bound method Socket.send of <zmq.sugar.socket.Socket object at 0x7f5b14f1b6c8>> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: <cyfunction Socket.send at 0x7f5b323fb2a0> is not a module, class, method, function, traceback, frame, or code object
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert

WARNING:tensorflow:AutoGraph could not transform <function wrap at 0x7f5b301d3c80> and will run it as-is.
Cause: while/else statement not yet supported
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING: AutoGraph could not transform <function wrap at 0x7f5b301d3c80> and will run it as-is.
Cause: while/else statement not yet supported
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.
WARNING:tensorflow:The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
WARNING:tensorflow:The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.
180/180 [==============================] - ETA: 0s - loss: 0.0000e+00 - accuracy: 0.0022
WARNING:tensorflow:The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
WARNING:tensorflow:The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.
180/180 [==============================] - 150s 589ms/step - loss: 0.0000e+00 - accuracy: 0.0022 - val_loss: 0.0000e+00 - val_accuracy: 0.0077
Epoch 2/5
180/180 [==============================] - 105s 582ms/step - loss: 0.0000e+00 - accuracy: 0.0022 - val_loss: 0.0000e+00 - val_accuracy: 0.0077
Epoch 3/5
180/180 [==============================] - 105s 582ms/step - loss: 0.0000e+00 - accuracy: 0.0022 - val_loss: 0.0000e+00 - val_accuracy: 0.0077

I think the problem may be that the output activation does not match the loss: CategoricalCrossentropy expects class probabilities (or logits with from_logits=True) over all classes, and the output my model produces may not be compatible with that.

Can anyone tell me how exactly to change the activation function, or share other thoughts on the potential cause? I have tried changing the learning rate, with no effect.
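
For reference, here is a minimal sketch (not from the original issue) of how the loss and labels are usually paired for multi-class classification with TFRobertaForSequenceClassification, reusing the train_dataset and val_dataset built above. NUM_CLASSES and the learning rate are assumptions that would need to match the actual data:

import tensorflow as tf
from transformers import TFRobertaForSequenceClassification

NUM_CLASSES = 20  # assumption: set to the number of distinct values in the Label column

# One logit per class. With num_labels=1, softmax over a single logit is always 1,
# so CategoricalCrossentropy reports a loss of exactly 0 and nothing can be learned.
model = TFRobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=NUM_CLASSES)

# Integer class IDs (e.g. 14) pair with the sparse variant of cross-entropy.
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# 1e-18 is effectively zero; fine-tuning typically uses something around 2e-5.
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)

model.compile(optimizer=optimizer, loss=loss_fn, metrics=["accuracy"])
model.fit(train_dataset.shuffle(1000).batch(8),
          validation_data=val_dataset.batch(64),
          epochs=3)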

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 16 (8 by maintainers)

Top GitHub Comments

7 reactions
sgugger commented, Feb 11, 2021

“Not very pleased with your reply, please ask someone a question if you are unclear about something rather than trying to just close an issue.”

I want to jump in here and let you know that this kind of behavior is inappropriate. @NielsRogge is doing his best to help you here, and he is doing it in his own free time. “My model is not training” is very vague and doesn’t seem like a bug, so suggesting that you take this to the forums is entirely appropriate: more people will be able to help you there.

Please respect that this is an open-source project. No one has to help you solve your bug, so staying open-minded and kind will go a long way toward getting the help you need.

2 reactions
NielsRogge commented, Feb 11, 2021

Could you please post this on the forum, rather than here? The authors of HuggingFace like to keep this place for bugs or feature requests, and they’re more than happy to help you on the forum.

Looking at your code, this seems more like an issue with preparing the data correctly for the model.

Take a look at this example in the docs on how to perform text classification with the Trainer.
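
For context, the workflow that kind of example follows (wrapping the tokenized texts and integer labels in a dataset and handing them to the PyTorch Trainer) looks roughly like the sketch below. This is a generic outline rather than the exact code from the linked docs; the TextDataset wrapper, NUM_CLASSES, and the hyperparameters are illustrative assumptions, and train_text/train_label/val_text/val_label are reused from earlier in the issue.

import torch
from transformers import (RobertaTokenizer, RobertaForSequenceClassification,
                          Trainer, TrainingArguments)

class TextDataset(torch.utils.data.Dataset):
    """Wraps tokenizer output and integer labels for the Trainer."""
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

NUM_CLASSES = 20  # assumption: number of distinct values in the Label column

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
train_enc = tokenizer(train_text, truncation=True, padding=True)
val_enc = tokenizer(val_text, truncation=True, padding=True)

train_ds = TextDataset(train_enc, train_label)
val_ds = TextDataset(val_enc, val_label)

model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=NUM_CLASSES)

args = TrainingArguments(output_dir="./results", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)

Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=val_ds).train()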

