Training a new BERT from scratch does not seem to work
I tried to train a BERT model from scratch with "run_lm_finetuning.py" on the toy training data (samples/sample.txt) by changing the following:
#model = BertForPreTraining.from_pretrained(args.bert_model)
bert_config = BertConfig.from_json_file('bert_config.json')
model = BertForPreTraining(bert_config)
where the JSON file comes from the BERT-Base, Multilingual Cased release.
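For context, here is a minimal sketch of the change with its import; the config path is just a placeholder for wherever the downloaded bert_config.json lives:

from pytorch_pretrained_bert.modeling import BertConfig, BertForPreTraining

# Original line, which loads the released pretrained weights:
# model = BertForPreTraining.from_pretrained(args.bert_model)

# Replacement: build a randomly initialized model from the config only.
bert_config = BertConfig.from_json_file('bert_config.json')
model = BertForPreTraining(bert_config)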
To check the correctness of training, I printed the seq_relationship scores (used for the next-sentence prediction task) in "pytorch_pretrained_bert/modeling.py":
prediction_scores, seq_relationship_score = self.cls(sequence_output, pooled_output)
print(seq_relationship_score)
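A less invasive way to see the same thing, without editing modeling.py, is to run a batch through the model with no labels, in which case BertForPreTraining returns the two heads directly. This is only a sketch: the tensor names input_ids, segment_ids and input_mask are assumed to be the batch tensors already built in the training loop.

import torch
import torch.nn.functional as F

with torch.no_grad():
    # Without masked_lm_labels / next_sentence_label the model returns
    # (prediction_scores, seq_relationship_score) instead of the loss.
    prediction_scores, seq_relationship_score = model(input_ids, segment_ids, input_mask)
    probs = F.softmax(seq_relationship_score, dim=-1)  # shape: (batch_size, 2)
    print(probs)
    print(probs.argmax(dim=-1))  # predicted class per example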
The result was as follows (just picking a single batch as an example):
tensor([[-0.1078, -0.2696],
[-0.1425, -0.3207],
[-0.0179, -0.2271],
[-0.0260, -0.2963],
[-0.1410, -0.2506],
[-0.0566, -0.3013],
[-0.0874, -0.3330],
[-0.1568, -0.2580],
[-0.0144, -0.3072],
[-0.1527, -0.3178],
[-0.1288, -0.2998],
[-0.0439, -0.3267],
[-0.0641, -0.2566],
[-0.1496, -0.3696],
[ 0.0286, -0.2495],
[-0.0922, -0.3002]], device='cuda:0', grad_fn=<AddmmBackward>)
Notice that the scores in the first column were always higher than those in the second column, so the model predicted the same class (either "not next sentence" or "next sentence", depending on which index is which) for every example in the batch. This held for every batch, and I feel this shouldn't be the case.
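One way to quantify this would be to also print, right after the line above (i.e. inside BertForPreTraining.forward, where the next_sentence_label argument is available when labels are passed), a rough batch accuracy for the next-sentence head. A sketch:

# Rough sanity check: fraction of examples in the batch whose predicted
# next-sentence class matches the label (only valid when labels are passed).
if next_sentence_label is not None:
    pred = seq_relationship_score.argmax(dim=-1)
    nsp_acc = (pred == next_sentence_label.view(-1)).float().mean().item()
    print('next-sentence accuracy on this batch: %.3f' % nsp_acc)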
Hi guys,
Btw, there is an article on this topic http://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/
I was wondering whether someone has tried tweaking some parameters in the transformer so that it converges much faster (of course, maybe at the expense of accuracy), e.g.:
Personally, the allure of the transformer for me is not really the state-of-the-art accuracy, but having the same architecture applicable to any sort of NLP task (whereas QA tasks or SQuAD-like objectives may otherwise require custom engineering or non-transferrable models).
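To make that concrete, one way to shrink the model for faster (if less accurate) convergence experiments is simply to pass a smaller BertConfig instead of the released one; the numbers below are arbitrary examples, not recommendations:

from pytorch_pretrained_bert.modeling import BertConfig, BertForPreTraining

small_config = BertConfig(
    vocab_size_or_config_json_file=119547,  # must match the tokenizer vocab (here: multilingual cased)
    hidden_size=256,         # down from 768
    num_hidden_layers=4,     # down from 12
    num_attention_heads=4,   # hidden_size must stay divisible by this
    intermediate_size=1024,  # down from 3072
)
model = BertForPreTraining(small_config)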
Hi @thomwolf,
I trained the model for an hour, but the loss stays around 0.6-0.8 and never converges. I know it's computationally expensive to train BERT; that's why I chose the very small dataset (sample.txt, which only has 36 lines).
The main issue is that I have tried the same dataset with the original TensorFlow version of BERT, and it converges within 5 minutes.
That's why I'm wondering whether something is wrong with the model. I have also checked the output of each forward step and found that the rows of the "encoded_layers" matrix are very similar to each other:
encoded_layers = self.encoder(embedding_output, extended_attention_mask, output_all_encoded_layers=output_all_encoded_layers)
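For reference, here is a rough sketch of how that "rows are similar" observation could be quantified right after this call; it only looks at example 0 of the batch and the last encoded layer, and it assumes the names from the snippet above:

import torch

# encoded_layers is a list when output_all_encoded_layers=True, otherwise a tensor.
last_layer = encoded_layers[-1] if isinstance(encoded_layers, list) else encoded_layers
x = torch.nn.functional.normalize(last_layer[0], dim=-1)  # (seq_len, hidden_size)
sim = x @ x.t()                                           # pairwise cosine similarity of rows
off_diag = (sim.sum() - sim.diag().sum()) / (sim.numel() - sim.size(0))
print('mean off-diagonal row similarity:', off_diag.item())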