Model not training beyond 1st epoch
See original GitHub issueEnvironment info
transformers
version: 4.4.0.dev0- Platform: Linux-4.19.112±x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.6.9
- PyTorch version (GPU?): 1.7.0+cu101 (True)
- Tensorflow version (GPU?): 2.4.1 (True)
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No (Single GPU) --> COLAB
Who can help
Models:
- albert, bert, xlm: @LysandreJik
- tensorflow: @jplu
- trainer: @sgugger
Information
Model I am using (Bert, XLNet …): RoBERTa
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The tasks I am working on is:
- an official GLUE/SQUaD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
First off, this issue is basically a continuation of #10055 but since that error was mostly resolved, I have thus opened another issue. I am using a private dataset, so I am not at liberty to share it. However, I can provide a clue as to how the csv
looks like:-
,ID,Text,Label
......................
Id_1, "Lorem Ipsum", 14
This is the code:-
!git clone https://github.com/huggingface/transformers.git
!cd transformers
!pip install -e .
train_text = list(train['Text'].values)
train_label = list(train['Label'].values)
val_text = list(val['Text'].values)
val_label = list(val['Label'].values)
from transformers import RobertaTokenizer, TFRobertaForSequenceClassification
import tensorflow as tf
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = TFRobertaForSequenceClassification.from_pretrained('roberta-base')
train_encodings = tokenizer(train_text, truncation=True, padding=True)
val_encodings = tokenizer(val_text, truncation=True, padding=True)
train_dataset = tf.data.Dataset.from_tensor_slices((
dict(train_encodings),
train_label
))
val_dataset = tf.data.Dataset.from_tensor_slices((
dict(val_encodings),
val_label
))
#----------------------------------------------------------------------------------------------------------------------
#Since The trainer does not work, I will use the native one
from transformers import TFTrainingArguments, TFTrainer
training_args = TFTrainingArguments(
output_dir='./results', # output directory
num_train_epochs=3, # total number of training epochs
per_device_train_batch_size=16, # batch size per device during training
per_device_eval_batch_size=64, # batch size for evaluation
warmup_steps=500, # number of warmup steps for learning rate scheduler
weight_decay=0.01, # strength of weight decay
logging_dir='./logs', # directory for storing logs
logging_steps=10,
)
with training_args.strategy.scope():
model = TFRobertaForSequenceClassification.from_pretrained("roberta-base")
trainer = TFTrainer(
model=model, # the instantiated Transformers model to be trained
args=training_args, # training arguments, defined above
train_dataset=train_dataset, # training dataset
eval_dataset=val_dataset # evaluation dataset
)
trainer.train()
#----------------------------------------------------------------------------------------------------------------------
#Using Native Tensorflow
from transformers import TFRobertaForSequenceClassification
import tensorflow as tf
model = TFRobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=1)
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-18)
loss_fn = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss_fn, metrics=['accuracy']) # can also use any keras loss fn
model.fit(train_dataset.batch(8), validation_data = val_dataset.batch(64), epochs=15, batch_size=8)
The Problems:
- Cannot train using the
Trainer()
method. The cell successfully executes, but it does nothing - does not start training at all. This is not much of a major issue but it may be a factor in this problem. - Model does not train more than 1 epoch :—> I have shared this log for you, where you can clearly see that the model does not train beyond 1st epoch; The rest of epochs just do what the first accomplished:-
All model checkpoint layers were used when initializing TFRobertaForSequenceClassification.
Some layers of TFRobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Epoch 1/5
WARNING:tensorflow:The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).WARNING:tensorflow:AutoGraph could not transform <bound method Socket.send of <zmq.sugar.socket.Socket object at 0x7f5b14f1b6c8>> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: <cyfunction Socket.send at 0x7f5b323fb2a0> is not a module, class, method, function, traceback, frame, or code object
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING: AutoGraph could not transform <bound method Socket.send of <zmq.sugar.socket.Socket object at 0x7f5b14f1b6c8>> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: <cyfunction Socket.send at 0x7f5b323fb2a0> is not a module, class, method, function, traceback, frame, or code object
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:AutoGraph could not transform <function wrap at 0x7f5b301d3c80> and will run it as-is.
Cause: while/else statement not yet supported
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING: AutoGraph could not transform <function wrap at 0x7f5b301d3c80> and will run it as-is.
Cause: while/else statement not yet supported
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.
WARNING:tensorflow:The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
WARNING:tensorflow:The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.
180/180 [==============================] - ETA: 0s - loss: 0.0000e+00 - accuracy: 0.0022WARNING:tensorflow:The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
WARNING:tensorflow:The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.
180/180 [==============================] - 150s 589ms/step - loss: 0.0000e+00 - accuracy: 0.0022 - val_loss: 0.0000e+00 - val_accuracy: 0.0077
Epoch 2/5
180/180 [==============================] - 105s 582ms/step - loss: 0.0000e+00 - accuracy: 0.0022 - val_loss: 0.0000e+00 - val_accuracy: 0.0077
Epoch 3/5
180/180 [==============================] - 105s 582ms/step - loss: 0.0000e+00 - accuracy: 0.0022 - val_loss: 0.0000e+00 - val_accuracy: 0.0077
I think the problem may be that the
activation function
may be wrong. ForCategoricalCrossentropy
we need aSigmoid
loss but maybe the activation used in my code is not that.
Can anyone tell me how exactly to change the activation function, or maybe other thoughts on the potential problem? I have tried changing the learning rate with no effect.
Issue Analytics
- State:
- Created 3 years ago
- Comments:16 (8 by maintainers)
Top Results From Across the Web
why does model stops training after epoch 1 withouut any ...
I'm trying to run retinanet model on google colab with GPU support but after starting training for 1 epoch it quickly completes 1000...
Read more >One Epoch Is All You Need - arXiv Vanity
In this paper, we suggest to train on a larger dataset for only one epoch unlike the current practice, in which the unsupervised...
Read more >How does epoch affect accuracy? | Deepchecks
A very big epoch size does not always increase accuracy. After one epoch in a neural network, all of the training data had...
Read more >Training loss not decrease after certain epochs - Kaggle
I am training a deep neural network, both training and validation loss decrease as expected. But the question is after 80 epochs, both...
Read more >My model accuracy doesn't change after first epoch
Simply change the loss to mse (Mean square error) as accuracy is not loss we should be using in regression. – thanatoz. Apr ......
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I want to jump in here and let you know that this kind of behavior is inappropriate. @NielsRogge is doing his best to help you here and he is doing this on his own free time. “My model is not training” is very vague and doesn’t seem like a bug, so suggesting to take this on the forums is very appropriate: more people will be able to help you there.
Please respect that this is an open-source project. No one has to help you solve your bug so staying open-mined and kind will go a long way into getting the help you need.
Could you please post this on the forum, rather than here? The authors of HuggingFace like to keep this place for bugs or feature requests, and they’re more than happy to help you on the forum.
Looking at your code, this seems more like an issue with preparing the data correctly for the model.
Take a look at this example in the docs on how to perform text classification with the Trainer.