Inability to reproduce results of Simple Transformers article using ELECTRA on Esperanto data
Describe the bug
I cannot reproduce the results of the Simple Transformers article on training ELECTRA from scratch on Esperanto (https://towardsdatascience.com/understanding-electra-and-training-an-electra-language-model-3d33e3a9660d).
To Reproduce
Launch the attached script in a screen session with:
CUDA_VISIBLE_DEVICES=0 python run_mlm.py
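Since the script itself is attached rather than inlined, the following is only a minimal sketch of what a Simple Transformers ELECTRA-from-scratch script typically looks like; the file path, hyperparameters, and generator/discriminator configs below are assumptions, not the exact contents of run_mlm.py:

```python
import logging

from simpletransformers.language_modeling import LanguageModelingModel

logging.basicConfig(level=logging.INFO)

# Assumed path and hyperparameters; the real values live in the attached run_mlm.py.
train_file = "data/eo_train.txt"
train_args = {
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
    "num_train_epochs": 3,
    "train_batch_size": 64,
    "learning_rate": 1e-4,
    "vocab_size": 52000,  # a tokenizer is trained from scratch on the Esperanto corpus
    "generator_config": {"embedding_size": 128, "hidden_size": 256, "num_hidden_layers": 3},
    "discriminator_config": {"embedding_size": 128, "hidden_size": 256},
}


def main():
    # model_name=None trains ELECTRA from scratch; train_files is used to fit the tokenizer.
    model = LanguageModelingModel("electra", None, args=train_args, train_files=train_file)
    model.train_model(train_file)


if __name__ == "__main__":
    main()
```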
Environment: the packages in the running environment are listed in the attached env.txt.
For reference, here are the most important packages:
cudatoolkit 11.0.221 h6bb024c_0
simpletransformers 0.60.4 pypi_0 pypi
transformers 4.2.2 pypi_0 pypi
pytorch 1.7.0 py3.8_cuda11.0.221_cudnn8.0.3_0 pytorch
tqdm 4.49.0 pypi_0 pypi
tokenizers 0.9.4 py38_0 huggingface
Expected behavior
I expected the code to work, but it throws the errors attached in the image files error1_2.jpg and error2_2.jpg.
Note that there is an NVIDIA error:
/usr/bin/nvidia-modprobe: unrecognized option: "-s"
ERROR: Invalid commandline, please run `/usr/bin/nvidia-modprobe --help` for usage information.
/usr/bin/nvidia-modprobe: unrecognized option: "-s"
ERROR: Invalid commandline, please run `/usr/bin/nvidia-modprobe --help` for usage information.
There is also a user warning:
/home/samuel/anaconda3/envs/pytorch-gpu/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:131: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
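For context, this warning is only about call order: since PyTorch 1.1.0 the optimizer step must come before the scheduler step. A generic illustration of the expected ordering (unrelated to the Simple Transformers internals):

```python
import torch

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)

for _ in range(3):
    loss = model(torch.randn(8, 4)).sum()
    loss.backward()
    optimizer.step()       # update parameters first...
    scheduler.step()       # ...then advance the learning-rate schedule
    optimizer.zero_grad()
```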
Screenshots
The script used is attached.
Note that I removed evaluation during training to speed training up.
I also added the following line to avoid a tokenizers warning:
os.environ["TOKENIZERS_PARALLELISM"] = "true"
For reference, the Weights & Biases (wandb) link for the training run is here:
https://wandb.ai/sam_enac/Esperanto - ELECTRA/runs/xzrqcl7g?workspace=user-sam_enac
Note that around global step 436 there is a sudden increase in the training loss.
Note also this warning between the end of epoch 1 and the start of epoch 2:
WARNING:root:NaN or Inf found in input tensor.
Finally, the full error is below:
Traceback (most recent call last):
File "run_mlm.py", line 73, in <module>
main()
File "run_mlm.py", line 65, in main
model.train_model(
File "/home/samuel/anaconda3/envs/pytorch-gpu/lib/python3.8/site-packages/simpletransformers/language_modeling/language_modeling_model.py", line 376, in train_model
global_step, training_details = self.train(
File "/home/samuel/anaconda3/envs/pytorch-gpu/lib/python3.8/site-packages/simpletransformers/language_modeling/language_modeling_model.py", line 641, in train
outputs = model(inputs, labels=labels) if args.mlm else model(inputs, labels=labels)
File "/home/samuel/anaconda3/envs/pytorch-gpu/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/samuel/anaconda3/envs/pytorch-gpu/lib/python3.8/site-packages/simpletransformers/custom_models/models.py", line 533, in forward
sampled_tokens = torch.multinomial(sample_probs, 1).view(-1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
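The failing call is ELECTRA's generator sampling step, and torch.multinomial rejects distributions containing NaN/Inf or negative entries. A hypothetical diagnostic wrapper (check_sample_probs is my own name, not part of simpletransformers) that would make the first bad batch explicit:

```python
import torch


def check_sample_probs(sample_probs: torch.Tensor) -> torch.Tensor:
    """Hypothetical diagnostic: validate the distribution before sampling."""
    if not torch.isfinite(sample_probs).all():
        raise ValueError("sample_probs contains NaN/Inf - the loss likely diverged upstream")
    if (sample_probs < 0).any():
        raise ValueError("sample_probs contains negative entries")
    return torch.multinomial(sample_probs, 1).view(-1)
```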
Desktop (please complete the following information):
system='Linux'
node='dormammu'
release='5.4.0-62-generic'
version='#70-Ubuntu SMP Tue Jan 12 12:45:47 UTC 2021'
machine='x86_64'
processor='x86_64'
gpus: "GeForce RTX 2080 Ti"
N.B.:
There are 8 GPUs on the server, but I used only one.
In another run, I tried using 6 GPUs (n_gpus=6), but the computation was unexpectedly slower and the weird increase in training loss also appeared, so I killed the process.
Below are the attached files.
This is my first time reporting an issue, please don’t hesitate to tell me if my report is missing something.
Top GitHub Comments
"wordpieces_prefix" is not part of the Simple Transformers args, so adding it to train_args won't do anything. It's from the Huggingface tokenizer, but it shouldn't really affect anything. I think the reason it's there on my wandb project but isn't on yours is because of changes to the Huggingface Transformers library. It shouldn't affect anything, since I didn't have any issues when I ran the code earlier today.
You could also try setting fp16=False just in case this is being caused by some driver/GPU issue.
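For concreteness, a sketch of how that suggestion would look in the args dict passed to LanguageModelingModel (the other keys are placeholders for whatever run_mlm.py already sets):

```python
train_args = {
    # ... existing args from run_mlm.py ...
    "fp16": False,  # disable mixed-precision training to rule out AMP/driver issues behind the NaNs
}
```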