
Training a new BERT does not seem to work


I tried to train a BERT model from scratch with "run_lm_finetuning.py" on toy training data (samples/sample.txt) by changing the following:

    # model = BertForPreTraining.from_pretrained(args.bert_model)
    bert_config = BertConfig.from_json_file('bert_config.json')
    model = BertForPreTraining(bert_config)

where the JSON file comes from BERT-Base, Multilingual Cased.
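
For reference, a self-contained version of that change might look like the sketch below. It assumes the pytorch_pretrained_bert package is installed and that bert_config.json sits in the working directory; everything else in run_lm_finetuning.py is left untouched.

    from pytorch_pretrained_bert.modeling import BertConfig, BertForPreTraining

    # Build a randomly initialized model from the published config
    # instead of loading the pretrained weights.
    bert_config = BertConfig.from_json_file('bert_config.json')
    model = BertForPreTraining(bert_config)
    model.train()

    # forward() still returns the combined masked-LM + next-sentence loss
    # when masked_lm_labels and next_sentence_label are passed, so the
    # training loop of the script does not need to change.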

To check the correctness of training, I printed the next-sentence prediction scores (seq_relationship_score) in "pytorch_pretrained_bert/modeling.py":

    prediction_scores, seq_relationship_score = self.cls(sequence_output, pooled_output)
    print(seq_relationship_score)

And the result was (just picking an example from a single batch).

    tensor([[-0.1078, -0.2696],
            [-0.1425, -0.3207],
            [-0.0179, -0.2271],
            [-0.0260, -0.2963],
            [-0.1410, -0.2506],
            [-0.0566, -0.3013],
            [-0.0874, -0.3330],
            [-0.1568, -0.2580],
            [-0.0144, -0.3072],
            [-0.1527, -0.3178],
            [-0.1288, -0.2998],
            [-0.0439, -0.3267],
            [-0.0641, -0.2566],
            [-0.1496, -0.3696],
            [ 0.0286, -0.2495],
            [-0.0922, -0.3002]], device='cuda:0', grad_fn=<AddmmBackward>)

Notice that the scores in the first column are always higher than those in the second column, so the model predicts the same class (all "next sentence" or all "not next sentence") for the entire batch. This was true for every batch, and I feel this shouldn't be the case.
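
For anyone reproducing this, one way to check how skewed the predictions are is to turn the logits into per-class probabilities and class labels. This is just a sketch; it only assumes seq_relationship_score is the (batch_size, 2) tensor printed above.

    import torch
    import torch.nn.functional as F

    # seq_relationship_score: (batch_size, 2) logits from the next-sentence head
    probs = F.softmax(seq_relationship_score, dim=-1)  # per-class probabilities
    preds = probs.argmax(dim=-1)                       # predicted class per example

    # If the head has collapsed onto a single class, `preds` is identical
    # for every example in every batch.
    print(preds)
    print(preds.float().mean().item())  # fraction of examples assigned to class 1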

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 17 (8 by maintainers)

Top GitHub Comments

4 reactions
snakers4 commented, Jan 25, 2019

Hi guys,

See the paper for TPU training; one estimate puts GPU training time at about a week on 64 GPUs.

Btw, there is an article on this topic http://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/

I was wondering whether someone has tried tweaking some parameters in the transformer so that it converges much faster (ofc, maybe at the expense of accuracy), e.g.:

  • Initializing the embedding layer with FastText / your embeddings of choice - in our tests it boosted accuracy and convergence with plainer models (see the sketch after this list);
  • Using a more standard 200- or 300-dimensional embedding instead of 768 (also tweaking the hidden size accordingly);
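
A rough sketch of the first idea (initializing the word embeddings before training) is below. The aligned_fasttext_vectors.npy file and the alignment of FastText vectors to the WordPiece vocabulary are assumptions for illustration, not something from the thread.

    import numpy as np
    import torch
    from pytorch_pretrained_bert.modeling import BertConfig, BertForPreTraining

    config = BertConfig.from_json_file('bert_config.json')
    model = BertForPreTraining(config)

    # Hypothetical file: a (vocab_size, hidden_size) matrix of FastText vectors,
    # one row per WordPiece in the BERT vocabulary (projected or padded to
    # hidden_size if the original embedding dimension differs).
    pretrained_vectors = np.load('aligned_fasttext_vectors.npy')

    with torch.no_grad():
        model.bert.embeddings.word_embeddings.weight.copy_(
            torch.tensor(pretrained_vectors, dtype=torch.float32))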

Personally, the allure of the transformer for me is not really the state-of-the-art accuracy, but having the same architecture applicable to any sort of NLP task (whereas QA tasks or SQuAD-like objectives may otherwise require custom engineering or non-transferable models).

2 reactions
haoyudong-97 commented, Jan 18, 2019

Hi @thomwolf,

I trained the model for an hour but the loss stays around 0.6-0.8 and never converges. I know it's computationally expensive to train BERT; that's why I chose the very small dataset (sample.txt, which only has 36 lines).

The main issue is that I have tried the same dataset with the original TensorFlow version of BERT and it converges within 5 minutes:

    next_sentence_accuracy = 1.0
    next_sentence_loss = 0.00012585879

That's why I'm wondering if something is wrong with the model. I have also checked the output of each forward step and found that the rows of encoded_layers are similar to each other, i.e. every token position produces nearly the same vector:

    encoded_layers = self.encoder(embedding_output,
                                  extended_attention_mask,
                                  output_all_encoded_layers=output_all_encoded_layers)
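
One way to quantify "rows are similar to each other" is to look at pairwise cosine similarities of the token vectors in the last layer. A sketch, assuming encoded_layers is the list returned by the encoder, whose final element has shape (batch_size, seq_len, hidden_size):

    import torch
    import torch.nn.functional as F

    last_layer = encoded_layers[-1]            # (batch_size, seq_len, hidden_size)
    vecs = F.normalize(last_layer[0], dim=-1)  # first example, rows unit-normalized
    sims = vecs @ vecs.t()                     # (seq_len, seq_len) cosine similarities

    # Similarities close to 1.0 everywhere mean the token representations
    # have (nearly) collapsed onto a single vector.
    print(sims.mean().item(), sims.min().item())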


Top Results From Across the Web

  • How to Train a BERT Model From Scratch
    BERT is a powerful NLP model for many language tasks. In this article we will create our own model from scratch and train...
  • pre-trained BERT with Sigmoid not training - Stack Overflow
    I am using a pre-trained BERT model from the transformers library to fine-tune for text classification, i.e. two class text classification.
  • How to improve language model ex: BERT on unseen text in ...
    The problem comes when I deliberately leave sentences containing more specifically" out of training set, which results in all sentences ...
  • What to do when you get an error - Hugging Face Course
    This will prepare you for section 4, where we'll explore how to debug the training phase itself. What to do when you get...
  • How To Train BERT 15x Faster | NLP Summit 2020 - YouTube
    Get your Free Spark NLP and Spark OCR Free Trial: https://www.johnsnowlabs.com/spark-nlp-try-free/ Register for NLP Summit 2021: ...
