
TPU not initialized when running official `run_mlm_flax.py` example.

See original GitHub issue

Environment info

  • transformers version: 4.9.0.dev0
  • Platform: Linux-5.4.0-1043-gcp-x86_64-with-glibc2.29
  • Python version: 3.8.5
  • PyTorch version (GPU?): 1.9.0+cu102 (False)
  • Tensorflow version (GPU?): 2.5.0 (False)
  • Flax version (CPU?/GPU?/TPU?): 0.3.4 (tpu)
  • Jax version: 0.2.16
  • JaxLib version: 0.1.68
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help

@avital @marcvanzee

Information

I am setting up a new TPU VM according to the Cloud TPU VM JAX quickstart and then following the installation steps described here: https://github.com/huggingface/transformers/tree/master/examples/research_projects/jax-projects#how-to-install-relevant-libraries to install flax, jax, transformers, and datasets.

Then, when running a simple example using the run_mlm_flax.py script, I’m encountering an error/warning:

INFO:absl:Starting the local TPU driver.
INFO:absl:Unable to initialize backend 'tpu_driver': Not found: Unable to find driver in registry given worker: local://
INFO:absl:Unable to initialize backend 'gpu': Not found: Could not find registered platform with name: "cuda". Available platform names are: TPU Interpreter Host

=> I am now unsure whether the code actually runs on the TPU or falls back to the CPU.
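
A quick way to resolve that uncertainty is to ask JAX directly which devices it picked up. The following is a minimal sanity check (not part of the original report) one could run on the TPU VM; the expected outputs in the comments assume a v3-8 TPU VM:

import jax

print(jax.devices())              # expect a list of TpuDevice entries, not just [CpuDevice(id=0)]
print(jax.device_count())         # e.g. 8 on a v3-8 TPU VM
print(jax.devices()[0].platform)  # "tpu" on a working setup, "cpu" if JAX fell back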

To reproduce

The problem can be easily reproduced by:

  1. SSHing into a TPU VM, e.g. patrick-test (Flax, JAX, & Transformers should already be installed)

If one goes into patrick-test, the libraries are already installed; on a “newly” created TPU VM, one can follow these steps to install the relevant libraries.

  2. Going to the home folder:
cd ~/
  3. Creating a new dir:
mkdir test && cd test
  4. Cloning a dummy repo into it:
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/patrickvonplaten/norwegian-roberta-als
  5. Linking the run_mlm_flax.py script:
ln -s $(realpath ~/transformers/examples/flax/language-modeling/run_mlm_flax.py) ./
  6. Running the following command (which should show the above warning/error again):
./run_mlm_flax.py \
    --output_dir="norwegian-roberta-als" \
    --model_type="roberta" \
    --config_name="norwegian-roberta-als" \
    --tokenizer_name="norwegian-roberta-als" \
    --dataset_name="oscar" \
    --dataset_config_name="unshuffled_deduplicated_als" \
    --max_seq_length="128" \
    --per_device_train_batch_size="8" \
    --per_device_eval_batch_size="8" \
    --learning_rate="3e-4" \
    --overwrite_output_dir \
    --num_train_epochs="3"

=> You should see a console print that says:

[10:15:48] - INFO - absl - Starting the local TPU driver.
[10:15:48] - INFO - absl - Unable to initialize backend 'tpu_driver': Not found: Unable to find driver in registry given worker: local://
[10:15:48] - INFO - absl - Unable to initialize backend 'gpu': Not found: Could not find registered platform with name: "cuda". Available platform names are: TPU Host Interpreter

Expected behavior

I think this warning/error should not be displayed, and the TPU should be correctly configured.
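
As a defensive measure, one could also add a small guard near the top of the training script so that a mis-configured VM fails fast instead of silently training on CPU. This is a hypothetical addition, not something run_mlm_flax.py actually contains:

import jax

# Abort early if JAX fell back to CPU; better than discovering it after hours of slow training.
platform = jax.devices()[0].platform
if platform != "tpu":
    raise RuntimeError(
        f"Expected a TPU backend but JAX reports '{platform}' with devices {jax.devices()}"
    )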

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 13 (10 by maintainers)

Top GitHub Comments

1 reaction
peregilk commented, Oct 26, 2021

@erensezener I think a lot has changed in the code here since this was written. I linked to my internal notes above; I have repeated that procedure several times and know it gets a working system up and running.

Just a wild guess: have you tried setting export USE_TORCH=False?

0 reactions
erensezener commented, Oct 29, 2021

Just a wild guess: have you tried setting export USE_TORCH=False?

This solves the issue indeed! Thank you, you saved me many more hours of debugging 😃
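
For background: transformers reads the USE_TORCH, USE_TF, and USE_FLAX environment variables at import time to decide which frameworks to load, so export USE_TORCH=False prevents the installed PyTorch from being imported, which is presumably what was interfering with JAX's TPU initialization here. A minimal sketch of applying the same fix from Python rather than the shell (the final print is just to confirm the effect):

import os

# Must be set before transformers is imported; the flag is read at import time.
os.environ["USE_TORCH"] = "False"

import transformers

print(transformers.is_torch_available())  # expected: False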
