Run Language Modeling on Colab TPU cores terminates
🐛 Bug
Information
Model I am using (Bert, XLNet …): GPT2
Language I am using the model on (English, Chinese …): English (wikitext-2)
The problem arises when using:
- [x] the official example scripts: (give details below)
- [ ] my own modified scripts: (give details below)

The task I am working on is:
- [ ] an official GLUE/SQuAD task: (give the name)
- [x] my own task or dataset: (give details below)
I'm trying to run run_language_modeling.py with GPT-2 on all 8 TPU cores.
Running on a single core gives the following error:
```
Epoch:   0%  0/3 [00:00<?, ?it/s]
Iteration: 0it [00:00, ?it/s]
Exception in device=TPU:0: 'NoneType' object cannot be interpreted as an integer
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 119, in _start_fn
    fn(gindex, *args)
  File "/content/transformers/examples/language-modeling/run_language_modeling.py", line 292, in _mp_fn
    main()
  File "/content/transformers/examples/language-modeling/run_language_modeling.py", line 260, in main
    trainer.train(model_path=model_path)
  File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 519, in train
    self.epoch = epoch + (step + 1) / len(epoch_iterator)
TypeError: 'NoneType' object cannot be interpreted as an integer
```
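For context, this TypeError is what Python's `len()` raises when an object's `__len__` returns `None`. A minimal, hypothetical reproduction (assuming, as the traceback suggests, that `epoch_iterator` is a `tqdm` wrapped around the TPU per-device loader, which apparently exposed no `__len__` at the time):

```python
# Hypothetical illustration of the failure mode above (an assumption, not code
# from transformers): tqdm wrapped around an iterable that has no __len__ keeps
# total=None, and len() then raises the TypeError seen in the traceback.
from tqdm import tqdm

def batches():
    # A generator has no __len__, much like the TPU per-device loader wrapper.
    yield from range(3)

epoch_iterator = tqdm(batches(), disable=True)
print(epoch_iterator.total)  # None

try:
    len(epoch_iterator)
except TypeError as err:
    print(err)  # 'NoneType' object cannot be interpreted as an integer
```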
Running on all 8 cores gives this one:
```
/usr/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
  len(cache))
Traceback (most recent call last):
  File "transformers/examples/xla_spawn.py", line 72, in <module>
    main()
  File "transformers/examples/xla_spawn.py", line 68, in main
    xmp.spawn(mod._mp_fn, args=(), nprocs=args.num_cores)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 182, in spawn
    start_method=start_method)
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 108, in join
    (error_index, name)
Exception: process 0 terminated with signal SIGKILL
```
I’m running this on a Colab TPU Notebook.
To reproduce
Steps to reproduce the behavior:
VERSION = "20200325" #@param ["1.5" , "20200325", "nightly"]
!curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
!python pytorch-xla-env-setup.py --version $VERSION
import torch_xla
import torch_xla.core.xla_model as xm
!pip install git+https://github.com/huggingface/transformers.git
!git clone https://github.com/huggingface/transformers.git
!curl https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip --output wikitext-2-v1.zip
!unzip wikitext-2-v1.zip
!rm wikitext-2-v1.zip
```
!python transformers/examples/xla_spawn.py --num_cores 1 \
  transformers/examples/language-modeling/run_language_modeling.py \
  --output_dir=output \
  --model_type=gpt2 \
  --model_name_or_path=gpt2 \
  --do_train \
  --train_data_file=wikitext-2/wiki.train.tokens \
  --do_eval \
  --eval_data_file=wikitext-2/wiki.test.tokens \
  --per_device_train_batch_size 1
```
Expected behavior
The model is fine-tuned and saved.
Environment info
- `transformers` version: 3.0.2
- Platform: Linux-4.19.104+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.6.9
- PyTorch version (GPU?): 1.5.0a0+d6149a7 (False)
- Tensorflow version (GPU?): 2.2.0 (False)
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: Yes and No
Update: for the single-core problem, removing the `/ len(epoch_iterator)` part from this line fixes the error, so I suggest precomputing the value with `len(train_loader)` before this if statement and using it later.
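As a minimal, self-contained sketch of that suggestion (hypothetical names, not the actual `Trainer` code): take the epoch length from the underlying `DataLoader` once, before it gets wrapped for TPU, and divide by that instead of `len(epoch_iterator)`:

```python
# Illustrative sketch only (hypothetical names, not the transformers Trainer):
# precompute the epoch length from the plain DataLoader, which always defines
# __len__, instead of calling len() on the wrapped epoch_iterator.
from torch.utils.data import DataLoader

train_loader = DataLoader(list(range(10)), batch_size=2)
steps_in_epoch = len(train_loader)  # always available on a DataLoader

def wrap_for_tpu(loader):
    # Stand-in for the ParallelLoader/tqdm wrapping that loses __len__.
    yield from loader

for epoch in range(1):
    epoch_iterator = wrap_for_tpu(train_loader)
    for step, batch in enumerate(epoch_iterator):
        current_epoch = epoch + (step + 1) / steps_in_epoch  # was: len(epoch_iterator)

print(current_epoch)  # 1.0
```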
The multi-core problem still exists; could it be related to the RAM limits in Google Colab?

The memory was the problem! At the beginning of the notebook, run the following cell to get the 35 GB RAM runtime instead of the 12 GB one:
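The cell itself did not survive in this copy of the issue. A trick commonly used on Colab at the time was to deliberately exhaust the default runtime's memory so that Colab offers to restart with the high-RAM runtime; for example (illustrative only, it crashes the current runtime on purpose):

```python
# Not the original cell (lost in this copy of the issue): intentionally exhaust
# memory so Colab kills the 12 GB runtime and offers a high-RAM one instead.
memory_hog = []
while True:
    memory_hog.append(' ' * 10**8)  # allocate ~100 MB per iteration until OOM
```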
Then, use this snippet of code to finetune GPT-2 on wikitext-2:
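That snippet is also missing here; presumably it is the reproduction command from above run with `--num_cores 8`, i.e. something like:

```
!python transformers/examples/xla_spawn.py --num_cores 8 \
  transformers/examples/language-modeling/run_language_modeling.py \
  --output_dir=output \
  --model_type=gpt2 \
  --model_name_or_path=gpt2 \
  --do_train \
  --train_data_file=wikitext-2/wiki.train.tokens \
  --do_eval \
  --eval_data_file=wikitext-2/wiki.test.tokens \
  --per_device_train_batch_size 1
```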
It would be helpful to put this in the documentation :3