
Run Language Modeling on Colab TPU cores terminates


🐛 Bug

Information

Model I am using (Bert, XLNet …): GPT2

Language I am using the model on (English, Chinese …): English (wikitext-2)

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

I’m trying to test run_language_modeling.py on GPT2 using all 8 TPU cores.

Running on 1 core gives the following error:

Epoch:   0% 0/3 [00:00<?, ?it/s]
Iteration: 0it [00:00, ?it/s]Exception in device=TPU:0: 'NoneType' object cannot be interpreted as an integer
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 119, in _start_fn
    fn(gindex, *args)
  File "/content/transformers/examples/language-modeling/run_language_modeling.py", line 292, in _mp_fn
    main()
  File "/content/transformers/examples/language-modeling/run_language_modeling.py", line 260, in main
    trainer.train(model_path=model_path)
  File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 519, in train
    self.epoch = epoch + (step + 1) / len(epoch_iterator)
TypeError: 'NoneType' object cannot be interpreted as an integer
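
For context, this TypeError is exactly what len() raises when an object's __len__ returns None, which suggests the TPU per-device iterator doesn't report a length. A minimal sketch reproducing the message (FakeLoader is hypothetical, not the actual torch_xla class):

class FakeLoader:
    def __len__(self):
        return None  # length unknown, e.g. when batches are streamed to a TPU core

len(FakeLoader())  # TypeError: 'NoneType' object cannot be interpreted as an integer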

Running on all 8 cores gives this one:

/usr/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
  len(cache))
Traceback (most recent call last):
  File "transformers/examples/xla_spawn.py", line 72, in <module>
    main()
  File "transformers/examples/xla_spawn.py", line 68, in main
    xmp.spawn(mod._mp_fn, args=(), nprocs=args.num_cores)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 182, in spawn
    start_method=start_method)
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 108, in join
    (error_index, name)
Exception: process 0 terminated with signal SIGKILL

I’m running this on a Colab TPU Notebook.

To reproduce

Steps to reproduce the behavior:

VERSION = "20200325"  #@param ["1.5" , "20200325", "nightly"]
!curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
!python pytorch-xla-env-setup.py --version $VERSION
import torch_xla
import torch_xla.core.xla_model as xm

!pip install git+https://github.com/huggingface/transformers.git

!git clone https://github.com/huggingface/transformers.git

!curl https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip --output wikitext-2-v1.zip
!unzip wikitext-2-v1.zip
!rm wikitext-2-v1.zip

!python transformers/examples/xla_spawn.py --num_cores 1 \
    transformers/examples/language-modeling/run_language_modeling.py \
    --output_dir=output \
    --model_type=gpt2 \
    --model_name_or_path=gpt2 \
    --do_train \
    --train_data_file=wikitext-2/wiki.train.tokens \
    --do_eval \
    --eval_data_file=wikitext-2/wiki.test.tokens \
    --per_device_train_batch_size 1

Expected behavior

The model is fine-tuned and saved.

Environment info

  • transformers version: 3.0.2
  • Platform: Linux-4.19.104+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.6.9
  • PyTorch version (GPU?): 1.5.0a0+d6149a7 (False)
  • Tensorflow version (GPU?): 2.2.0 (False)
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: Yes and No


Top GitHub Comments

AliOsm commented on Jul 13, 2020

Update: for the single-core problem, removing the / len(epoch_iterator) part from this line solves it, so I suggest precomputing the value with len(train_loader) before this if statement and using it later. The multi-core problem still exists; could it be related to the RAM limits in Google Colab?
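
A minimal sketch of that suggestion, with illustrative names (the actual Trainer code differs): take the epoch length from the underlying DataLoader, which always has one, instead of from the TPU per-device iterator, which may not:

# inside Trainer.train(); train_dataloader is the plain torch DataLoader
steps_in_epoch = len(train_dataloader)  # precomputed once, works on TPU too

for step, inputs in enumerate(epoch_iterator):
    # ... training step ...
    # was: self.epoch = epoch + (step + 1) / len(epoch_iterator)
    self.epoch = epoch + (step + 1) / steps_in_epoch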

AliOsm commented on Jul 25, 2020

Memory was the problem! At the beginning of the notebook, run the following cell to crash the session and get the 35 GB RAM runtime instead of the 12 GB one:

import torch

# Allocate far more memory than the default 12 GB runtime has; the session
# crashes, and Colab then offers to restart with a high-RAM runtime.
torch.tensor([10.] * 10000000000)
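
To confirm the switch worked, you can check the total RAM afterwards (a small sketch; psutil comes preinstalled in Colab):

import psutil

# Total system memory in GB; expect roughly 35 GB on the high-RAM runtime
print(f"{psutil.virtual_memory().total / 1e9:.1f} GB")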

Then, use this snippet to fine-tune GPT-2 on wikitext-2:

VERSION = "nightly"  #@param ["1.5" , "20200325", "nightly"]
!curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
!python pytorch-xla-env-setup.py --version $VERSION

!pip install git+https://github.com/huggingface/transformers.git

!git clone https://github.com/huggingface/transformers.git

!curl https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip --output wikitext-2-v1.zip
!unzip wikitext-2-v1.zip
!rm wikitext-2-v1.zip

!python transformers/examples/xla_spawn.py --num_cores 8 \
    transformers/examples/language-modeling/run_language_modeling.py \
    --output_dir=output \
    --model_type=gpt2 \
    --model_name_or_path=gpt2 \
    --do_train \
    --train_data_file=wikitext-2/wiki.train.tokens \
    --do_eval \
    --eval_data_file=wikitext-2/wiki.test.tokens \
    --per_device_train_batch_size 2 \
    --overwrite_output_dir
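
Note that xla_spawn starts one process per core, so the effective global batch size here is per_device_train_batch_size × num_cores = 2 × 8 = 16.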

It would be helpful to put this in the documentation :3
