
Run Language Modeling on Colab TPU cores terminates


🐛 Bug

Information

Model I am using (Bert, XLNet …): GPT2

Language I am using the model on (English, Chinese …): English (wikitext-2)

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

I’m trying to test run_language_modeling.py on GPT2 using all 8 TPU cores.

Running on 1 core gives the following error:

Epoch:   0% 0/3 [00:00<?, ?it/s]
Iteration: 0it [00:00, ?it/s]Exception in device=TPU:0: 'NoneType' object cannot be interpreted as an integer
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 119, in _start_fn
    fn(gindex, *args)
  File "/content/transformers/examples/language-modeling/run_language_modeling.py", line 292, in _mp_fn
    main()
  File "/content/transformers/examples/language-modeling/run_language_modeling.py", line 260, in main
    trainer.train(model_path=model_path)
  File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 519, in train
    self.epoch = epoch + (step + 1) / len(epoch_iterator)
TypeError: 'NoneType' object cannot be interpreted as an integer
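
For context, this TypeError is exactly what len() raises when an object's __len__ returns None, which suggests the TPU per-device iterator doesn't report a length. A minimal sketch reproducing the message (FakeLoader is hypothetical, not the actual torch_xla class):

class FakeLoader:
    def __len__(self):
        return None  # length unknown, e.g. when batches are streamed to a TPU core

len(FakeLoader())  # TypeError: 'NoneType' object cannot be interpreted as an integer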

Running on all 8 cores gives this one:

/usr/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
  len(cache))
Traceback (most recent call last):
  File "transformers/examples/xla_spawn.py", line 72, in <module>
    main()
  File "transformers/examples/xla_spawn.py", line 68, in main
    xmp.spawn(mod._mp_fn, args=(), nprocs=args.num_cores)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 182, in spawn
    start_method=start_method)
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 108, in join
    (error_index, name)
Exception: process 0 terminated with signal SIGKILL

I’m running this on a Colab TPU Notebook.

To reproduce

Steps to reproduce the behavior:

VERSION = "20200325"  #@param ["1.5" , "20200325", "nightly"]
!curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
!python pytorch-xla-env-setup.py --version $VERSION
import torch_xla
import torch_xla.core.xla_model as xm

!pip install git+https://github.com/huggingface/transformers.git

!git clone https://github.com/huggingface/transformers.git

!curl https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip --output wikitext-2-v1.zip
!unzip wikitext-2-v1.zip
!rm wikitext-2-v1.zip

!python transformers/examples/xla_spawn.py --num_cores 1 \
    transformers/examples/language-modeling/run_language_modeling.py \
    --output_dir=output \
    --model_type=gpt2 \
    --model_name_or_path=gpt2 \
    --do_train \
    --train_data_file=wikitext-2/wiki.train.tokens \
    --do_eval \
    --eval_data_file=wikitext-2/wiki.test.tokens \
    --per_device_train_batch_size 1

Expected behavior

The model is fine-tuned and saved.

Environment info

  • transformers version: 3.0.2
  • Platform: Linux-4.19.104+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.6.9
  • PyTorch version (GPU?): 1.5.0a0+d6149a7 (False)
  • Tensorflow version (GPU?): 2.2.0 (False)
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: Yes and No


Top GitHub Comments

AliOsm commented on Jul 13, 2020

Update: for the single-core problem, removing the / len(epoch_iterator) part from this line solves it, so I suggest precomputing the value with len(train_loader) before this if statement and using it later. The multi-core problem still exists; could it be related to the RAM limits in Google Colab?
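
A minimal sketch of that suggestion, with illustrative names (the actual Trainer code differs): take the epoch length from the underlying DataLoader, which always has one, instead of from the TPU per-device iterator, which may not:

# inside Trainer.train(); train_dataloader is the plain torch DataLoader
steps_in_epoch = len(train_dataloader)  # precomputed once, works on TPU too

for step, inputs in enumerate(epoch_iterator):
    # ... training step ...
    # was: self.epoch = epoch + (step + 1) / len(epoch_iterator)
    self.epoch = epoch + (step + 1) / steps_in_epoch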

AliOsm commented on Jul 25, 2020

Memory was the problem! At the beginning of the notebook, run the following cell to crash the session and get the 35 GB RAM runtime instead of the 12 GB one:

import torch

# Allocate far more memory than the default 12 GB runtime has; the session
# crashes, and Colab then offers to restart with a high-RAM runtime.
torch.tensor([10.] * 10000000000)
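
To confirm the switch worked, you can check the total RAM afterwards (a small sketch; psutil comes preinstalled in Colab):

import psutil

# Total system memory in GB; expect roughly 35 GB on the high-RAM runtime
print(f"{psutil.virtual_memory().total / 1e9:.1f} GB")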

Then, use this snippet to fine-tune GPT-2 on wikitext-2:

VERSION = "nightly"  #@param ["1.5" , "20200325", "nightly"]
!curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
!python pytorch-xla-env-setup.py --version $VERSION

!pip install git+https://github.com/huggingface/transformers.git

!git clone https://github.com/huggingface/transformers.git

!curl https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip --output wikitext-2-v1.zip
!unzip wikitext-2-v1.zip
!rm wikitext-2-v1.zip

!python transformers/examples/xla_spawn.py --num_cores 8 \
    transformers/examples/language-modeling/run_language_modeling.py \
    --output_dir=output \
    --model_type=gpt2 \
    --model_name_or_path=gpt2 \
    --do_train \
    --train_data_file=wikitext-2/wiki.train.tokens \
    --do_eval \
    --eval_data_file=wikitext-2/wiki.test.tokens \
    --per_device_train_batch_size 2 \
    --overwrite_output_dir
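
Note that xla_spawn starts one process per core, so the effective global batch size here is per_device_train_batch_size × num_cores = 2 × 8 = 16.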

It would be helpful to put this in the documentation :3
