
TPU not initialized when running official `run_mlm_flax.py` example.

See original GitHub issue

Environment info

  • transformers version: 4.9.0.dev0
  • Platform: Linux-5.4.0-1043-gcp-x86_64-with-glibc2.29
  • Python version: 3.8.5
  • PyTorch version (GPU?): 1.9.0+cu102 (False)
  • Tensorflow version (GPU?): 2.5.0 (False)
  • Flax version (CPU?/GPU?/TPU?): 0.3.4 (tpu)
  • Jax version: 0.2.16
  • JaxLib version: 0.1.68
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help

@avital @marcvanzee

Information

I am setting up a new TPU VM according to the Cloud TPU VM JAX quickstart and then following the installation steps described here: https://github.com/huggingface/transformers/tree/master/examples/research_projects/jax-projects#how-to-install-relevant-libraries to install flax, jax, transformers, and datasets.

Then, when running a simple example using the run_mlm_flax.py script, I’m encountering an error/warning:

INFO:absl:Starting the local TPU driver.
INFO:absl:Unable to initialize backend 'tpu_driver': Not found: Unable to find driver in registry given worker: local://
INFO:absl:Unable to initialize backend 'gpu': Not found: Could not find registered platform with name: "cuda". Available platform names are: TPU Interpreter Host

=> I am now unsure whether the code actually runs on the TPU or falls back to the CPU.
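
A quick way to resolve that uncertainty is to ask JAX directly which devices it picked up. The following is a minimal sanity check (not part of the original report) one could run on the TPU VM; the expected outputs in the comments assume a v3-8 TPU VM:

import jax

print(jax.devices())              # expect a list of TpuDevice entries, not just [CpuDevice(id=0)]
print(jax.device_count())         # e.g. 8 on a v3-8 TPU VM
print(jax.devices()[0].platform)  # "tpu" on a working setup, "cpu" if JAX fell back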

To reproduce

The problem can be easily reproduced by:

  1. SSHing into a TPU VM, e.g. patrick-test (Flax, JAX, & Transformers should already be installed)

If one goes into patrick-test, the libraries are already installed; on a “newly” created TPU VM, one can follow these steps to install the relevant libraries.

  2. Going to the home folder:
cd ~/
  3. Creating a new dir:
mkdir test && cd test
  4. Cloning a dummy repo into it:
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/patrickvonplaten/norwegian-roberta-als
  5. Linking the run_mlm_flax.py script:
ln -s $(realpath ~/transformers/examples/flax/language-modeling/run_mlm_flax.py) ./
  6. Running the following command (which should show the above warning/error again):
./run_mlm_flax.py \
    --output_dir="norwegian-roberta-als" \
    --model_type="roberta" \
    --config_name="norwegian-roberta-als" \
    --tokenizer_name="norwegian-roberta-als" \
    --dataset_name="oscar" \
    --dataset_config_name="unshuffled_deduplicated_als" \
    --max_seq_length="128" \
    --per_device_train_batch_size="8" \
    --per_device_eval_batch_size="8" \
    --learning_rate="3e-4" \
    --overwrite_output_dir \
    --num_train_epochs="3"

=> You should see a console print that says:

[10:15:48] - INFO - absl - Starting the local TPU driver.
[10:15:48] - INFO - absl - Unable to initialize backend 'tpu_driver': Not found: Unable to find driver in registry given worker: local://
[10:15:48] - INFO - absl - Unable to initialize backend 'gpu': Not found: Could not find registered platform with name: "cuda". Available platform names are: TPU Host Interpreter

Expected behavior

I think this warning/error should not be displayed, and the TPU should be correctly configured.
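
As a defensive measure, one could also add a small guard near the top of the training script so that a mis-configured VM fails fast instead of silently training on CPU. This is a hypothetical addition, not something run_mlm_flax.py actually contains:

import jax

# Abort early if JAX fell back to CPU; better than discovering it after hours of slow training.
platform = jax.devices()[0].platform
if platform != "tpu":
    raise RuntimeError(
        f"Expected a TPU backend but JAX reports '{platform}' with devices {jax.devices()}"
    )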

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 13 (10 by maintainers)

Top GitHub Comments

1 reaction
peregilk commented, Oct 26, 2021

@erensezener I think a lot has changed in the code here since this was written. I linked to my internal notes above; I have repeated that procedure several times and know it gets a working system up and running.

Just a wild guess: have you tried setting export USE_TORCH=False?

0 reactions
erensezener commented, Oct 29, 2021

Just a wild guess: have you tried setting export USE_TORCH=False?

This solves the issue indeed! Thank you, you saved me many more hours of debugging 😃
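
For background: transformers reads the USE_TORCH, USE_TF, and USE_FLAX environment variables at import time to decide which frameworks to load, so export USE_TORCH=False prevents the installed PyTorch from being imported, which is presumably what was interfering with JAX's TPU initialization here. A minimal sketch of applying the same fix from Python rather than the shell (the final print is just to confirm the effect):

import os

# Must be set before transformers is imported; the flag is read at import time.
os.environ["USE_TORCH"] = "False"

import transformers

print(transformers.is_torch_available())  # expected: False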
