
Inability to reproduce results of simpletransformers article using ELECTRA on Esperanto data

See original GitHub issue

Describe the bug
I cannot reproduce the results of the simpletransformers article on training ELECTRA from scratch on Esperanto (https://towardsdatascience.com/understanding-electra-and-training-an-electra-language-model-3d33e3a9660d).

To Reproduce

Launch the attached script in a screen session with:
CUDA_VISIBLE_DEVICES=0 python run_mlm.py
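
For reference, below is a minimal sketch of the kind of script the article walks through; the real hyperparameters, generator/discriminator configs, and data paths are in the attached run_mlm.txt, so the values here are placeholders only:

import os
from simpletransformers.language_modeling import LanguageModelingModel

os.environ["TOKENIZERS_PARALLELISM"] = "true"

def main():
    # Placeholder args; the attached run_mlm.txt has the real values.
    train_args = {
        "reprocess_input_data": True,
        "overwrite_output_dir": True,
        "num_train_epochs": 3,
        "learning_rate": 1e-4,
        "vocab_size": 52000,
        "evaluate_during_training": False,  # evaluation removed to speed up training
    }

    # "electra" with model_name=None trains generator and discriminator from scratch;
    # train_files is also used to train the tokenizer.
    model = LanguageModelingModel(
        "electra",
        None,
        args=train_args,
        train_files="data/train.txt",
    )

    model.train_model("data/train.txt")

if __name__ == "__main__":
    main()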

Environment: the packages in the running environment are listed in the attached env.txt. FYI, the most important ones are:

cudatoolkit         11.0.221
simpletransformers  0.60.4
transformers        4.2.2
pytorch             1.7.0 (py3.8, cuda 11.0.221, cudnn 8.0.3)
tqdm                4.49.0
tokenizers          0.9.4

Expected behavior
I expected the code to work, but it throws the errors attached in the image files error1_2.jpg and error2_2.jpg.

Note that there is an NVIDIA error:

usr/bin/nvidia-modprobe: unrecognized option: "-s"

ERROR: Invalid commandline, please run `/usr/bin/nvidia-modprobe --help` for usage information.

/usr/bin/nvidia-modprobe: unrecognized option: "-s"

ERROR: Invalid commandline, please run `/usr/bin/nvidia-modprobe --help` for usage information.

Note that there is also a user warning:

/home/samuel/anaconda3/envs/pytorch-gpu/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:131: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
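
For context, that warning is about the standard PyTorch call order inside a training loop, not anything specific to this script; a generic, self-contained illustration (not code from the attached run_mlm.py) is:

import torch
from torch import nn, optim

# Since PyTorch 1.1.0 the LR scheduler must be stepped after the optimizer.
model = nn.Linear(10, 1)
optimizer = optim.AdamW(model.parameters(), lr=1e-4)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=1)

for _ in range(3):
    loss = model(torch.randn(8, 10)).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()    # update parameters first
    scheduler.step()    # then advance the learning-rate schedule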

Screenshots
The script used is attached.
Note that I removed evaluation during training to speed up training.
Note that the following line was also added to avoid a tokenizers warning:

os.environ["TOKENIZERS_PARALLELISM"] = "true"

FYI, the wandb link of the training run is here:
https://wandb.ai/sam_enac/Esperanto - ELECTRA/runs/xzrqcl7g?workspace=user-sam_enac

Note that around global step 436 there is a sudden increase in the training loss.

Note this warning between the end of epoch 1 and the start of epoch 2:

WARNING:root:NaN or Inf found in input tensor.

Finally, the full error is below:

Traceback (most recent call last):
  File "run_mlm.py", line 73, in <module>
    main()
  File "run_mlm.py", line 65, in main
    model.train_model(
  File "/home/samuel/anaconda3/envs/pytorch-gpu/lib/python3.8/site-packages/simpletransformers/language_modeling/language_modeling_model.py", line 376, in train_model
    global_step, training_details = self.train(
  File "/home/samuel/anaconda3/envs/pytorch-gpu/lib/python3.8/site-packages/simpletransformers/language_modeling/language_modeling_model.py", line 641, in train
    outputs = model(inputs, labels=labels) if args.mlm else model(inputs, labels=labels)
  File "/home/samuel/anaconda3/envs/pytorch-gpu/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/samuel/anaconda3/envs/pytorch-gpu/lib/python3.8/site-packages/simpletransformers/custom_models/models.py", line 533, in forward
    sampled_tokens = torch.multinomial(sample_probs, 1).view(-1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
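
For what it's worth, the RuntimeError means the sampling distribution passed to torch.multinomial already contains NaN/Inf (or negative values) by that point, i.e. the generator's outputs have diverged earlier in training. A hypothetical guard like the one below, which is not part of simpletransformers, would make that failure point explicit:

import torch

# Hypothetical helper (not from simpletransformers): validate the generator's
# sampling probabilities before the multinomial call that fails above.
def sample_tokens(sample_probs: torch.Tensor) -> torch.Tensor:
    if not torch.isfinite(sample_probs).all() or (sample_probs < 0).any():
        raise ValueError(
            "generator produced NaN/Inf or negative probabilities; "
            "the loss has likely diverged earlier in training"
        )
    return torch.multinomial(sample_probs, 1).view(-1)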

Desktop (please complete the following information):

system = 'Linux'
node = 'dormammu'
release = '5.4.0-62-generic'
version = '#70-Ubuntu SMP Tue Jan 12 12:45:47 UTC 2021'
machine = 'x86_64'
processor = 'x86_64'
gpus: "GeForce RTX 2080 Ti"

N.B.:
There are 8 GPUs on the server, but I used only one.
In another run I tried using 6 GPUs (n_gpus=6), but the computation was unexpectedly slower and the weird increase in training loss appeared there as well, so I killed the process.

Below are the attached files.

error1_2 error2_2 env.txt run_mlm.txt

This is my first time reporting an issue, so please don’t hesitate to tell me if my report is missing something.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 10 (6 by maintainers)

Top GitHub Comments

1 reaction
ThilinaRajapakse commented, Feb 13, 2021

“wordpieces_prefix” is not part of the Simple Transformers args so adding it to train_args won’t do anything. It’s from the Huggingface tokenizer, but it shouldn’t really affect anything. I think the reason it’s there on my wandb project but isn’t on yours is because of changes to the Huggingface Transformers library. It shouldn’t affect anything since I didn’t have any issues when I ran the code earlier today.

1 reaction
ThilinaRajapakse commented, Feb 13, 2021

You could also try setting fp16=False, just in case this is being caused by some driver/GPU issue.
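
A minimal sketch of that change, assuming the args dict from the attached run_mlm.py (fp16 is a standard Simple Transformers model arg; the train file path below is a placeholder):

from simpletransformers.language_modeling import LanguageModelingModel

train_args = {
    # ... existing options from run_mlm.py ...
    "fp16": False,  # disable mixed precision to rule out an fp16/driver problem
}

model = LanguageModelingModel(
    "electra",
    None,
    args=train_args,
    train_files="data/train.txt",  # placeholder path
)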

Read more comments on GitHub >
