Inability to reproduce results of Simple Transformers article using ELECTRA on Esperanto data
Describe the bug
I cannot reproduce the results of the Simple Transformers article on training ELECTRA from scratch on Esperanto (https://towardsdatascience.com/understanding-electra-and-training-an-electra-language-model-3d33e3a9660d).
To Reproduce
Launch the attached script in a screen session with:
CUDA_VISIBLE_DEVICES=0 python run_mlm.py
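Since the script itself is attached rather than inlined, the following is only a minimal sketch of what a Simple Transformers ELECTRA-from-scratch script typically looks like; the file path, hyperparameters, and generator/discriminator configs below are assumptions, not the exact contents of run_mlm.py:

```python
import logging

from simpletransformers.language_modeling import LanguageModelingModel

logging.basicConfig(level=logging.INFO)

# Assumed path and hyperparameters; the real values live in the attached run_mlm.py.
train_file = "data/eo_train.txt"
train_args = {
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
    "num_train_epochs": 3,
    "train_batch_size": 64,
    "learning_rate": 1e-4,
    "vocab_size": 52000,  # a tokenizer is trained from scratch on the Esperanto corpus
    "generator_config": {"embedding_size": 128, "hidden_size": 256, "num_hidden_layers": 3},
    "discriminator_config": {"embedding_size": 128, "hidden_size": 256},
}


def main():
    # model_name=None trains ELECTRA from scratch; train_files is used to fit the tokenizer.
    model = LanguageModelingModel("electra", None, args=train_args, train_files=train_file)
    model.train_model(train_file)


if __name__ == "__main__":
    main()
```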
Environment: the packages in the running environment are listed in the attached env.txt.
For reference, here are the most important packages:
cudatoolkit 11.0.221 h6bb024c_0
simpletransformers 0.60.4 pypi_0 pypi
transformers 4.2.2 pypi_0 pypi
pytorch 1.7.0 py3.8_cuda11.0.221_cudnn8.0.3_0 pytorch
tqdm 4.49.0 pypi_0 pypi
tokenizers 0.9.4 py38_0 huggingface
Expected behavior
I expected the code to work, but it throws the errors attached in the image files error1_2.jpg and error2_2.jpg.
Note that there is an NVIDIA error:
/usr/bin/nvidia-modprobe: unrecognized option: "-s"
ERROR: Invalid commandline, please run `/usr/bin/nvidia-modprobe --help` for usage information.
/usr/bin/nvidia-modprobe: unrecognized option: "-s"
ERROR: Invalid commandline, please run `/usr/bin/nvidia-modprobe --help` for usage information.
There is also a user warning:
/home/samuel/anaconda3/envs/pytorch-gpu/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:131: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
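For context, this warning is only about call order: since PyTorch 1.1.0 the optimizer step must come before the scheduler step. A generic illustration of the expected ordering (unrelated to the Simple Transformers internals):

```python
import torch

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)

for _ in range(3):
    loss = model(torch.randn(8, 4)).sum()
    loss.backward()
    optimizer.step()       # update parameters first...
    scheduler.step()       # ...then advance the learning-rate schedule
    optimizer.zero_grad()
```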
Screenshots
The script used is attached.
Note that I removed evaluation during training to speed training up.
I also added the following line to avoid a tokenizers warning:
os.environ["TOKENIZERS_PARALLELISM"] = "true"
For reference, the Weights & Biases (wandb) link for the training run is here:
https://wandb.ai/sam_enac/Esperanto - ELECTRA/runs/xzrqcl7g?workspace=user-sam_enac
Note that around global step 436 there is a sudden increase in the training loss.
Note also this warning between the end of epoch 1 and the start of epoch 2:
WARNING:root:NaN or Inf found in input tensor.
Finally, the full error is below:
Traceback (most recent call last):
File "run_mlm.py", line 73, in <module>
main()
File "run_mlm.py", line 65, in main
model.train_model(
File "/home/samuel/anaconda3/envs/pytorch-gpu/lib/python3.8/site-packages/simpletransformers/language_modeling/language_modeling_model.py", line 376, in train_model
global_step, training_details = self.train(
File "/home/samuel/anaconda3/envs/pytorch-gpu/lib/python3.8/site-packages/simpletransformers/language_modeling/language_modeling_model.py", line 641, in train
outputs = model(inputs, labels=labels) if args.mlm else model(inputs, labels=labels)
File "/home/samuel/anaconda3/envs/pytorch-gpu/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/samuel/anaconda3/envs/pytorch-gpu/lib/python3.8/site-packages/simpletransformers/custom_models/models.py", line 533, in forward
sampled_tokens = torch.multinomial(sample_probs, 1).view(-1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
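The failing call is ELECTRA's generator sampling step, and torch.multinomial rejects distributions containing NaN/Inf or negative entries. A hypothetical diagnostic wrapper (check_sample_probs is my own name, not part of simpletransformers) that would make the first bad batch explicit:

```python
import torch


def check_sample_probs(sample_probs: torch.Tensor) -> torch.Tensor:
    """Hypothetical diagnostic: validate the distribution before sampling."""
    if not torch.isfinite(sample_probs).all():
        raise ValueError("sample_probs contains NaN/Inf - the loss likely diverged upstream")
    if (sample_probs < 0).any():
        raise ValueError("sample_probs contains negative entries")
    return torch.multinomial(sample_probs, 1).view(-1)
```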
Desktop (please complete the following information):
system='Linux'
node='dormammu'
release='5.4.0-62-generic'
version='#70-Ubuntu SMP Tue Jan 12 12:45:47 UTC 2021'
machine='x86_64'
processor='x86_64'
gpus: "GeForce RTX 2080 Ti"
N.B.:
There are 8 GPUs on the server, but I used only one.
In another run, I tried using 6 GPUs (n_gpus=6), but the computation was unexpectedly slower and the weird increase in training loss also appeared, so I killed the process.
Below are the attached files.
This is my first time reporting an issue, please don’t hesitate to tell me if my report is missing something.
Top GitHub Comments
"wordpieces_prefix" is not part of the Simple Transformers args, so adding it to train_args won't do anything. It's from the Huggingface tokenizer, but it shouldn't really affect anything. I think the reason it's there on my wandb project but isn't on yours is because of changes to the Huggingface Transformers library. It shouldn't affect anything, since I didn't have any issues when I ran the code earlier today.
You could also try setting fp16=False just in case this is being caused by some driver/GPU issue.
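For concreteness, a sketch of how that suggestion would look in the args dict passed to LanguageModelingModel (the other keys are placeholders for whatever run_mlm.py already sets):

```python
train_args = {
    # ... existing args from run_mlm.py ...
    "fp16": False,  # disable mixed-precision training to rule out AMP/driver issues behind the NaNs
}
```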