Results on TPU worse than on GPU (using colab)
System Info
- `Accelerate` version: 0.11.0.dev0
- Platform: Linux-5.4.188+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.13
- Numpy version: 1.21.6
- PyTorch version (GPU?): 1.11.0+cu102 (False)
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: TPU
- mixed_precision: no
- use_cpu: False
- num_processes: 8
- machine_rank: 0
- num_machines: 1
- main_process_ip: None
- main_process_port: None
- main_training_function: main
- deepspeed_config: {}
- fsdp_config: {}
Information
- The official example scripts
- My own modified scripts
Tasks
- One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
- My own task or dataset (give details below)
Reproduction
I created a notebook for reproducing this, but the steps are very simple:
- Install all the libraries (I tried different versions; they all show the same behavior)
- Run accelerate config and choose TPU (or GPU with fp16)
- Run accelerate launch accelerate/examples/nlp_example.py
When I train on TPU, I get an F1 score of 0.848. When I train on GPU, I get more than 0.9. I also tried different scripts and always get much worse results on TPU. Maybe it is something Colab-specific, because in other TPU-related issues (for example) people get results similar to my GPU results.
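For completeness, this is roughly how the same run can be launched from inside the Colab notebook instead of via the CLI. It is only a sketch: it assumes the cloned accelerate/examples/nlp_example.py is on sys.path, and the (config, args) call and the argparse fields mirror that script as it looks right now, so adjust them if the script changes.

```python
# Sketch: notebook-based equivalent of `accelerate launch nlp_example.py` on a
# Colab TPU runtime. Assumes accelerate/examples is on sys.path so that
# nlp_example.training_function can be imported.
from argparse import Namespace

from accelerate import notebook_launcher
from nlp_example import training_function

config = {"lr": 2e-5, "num_epochs": 3, "seed": 42, "batch_size": 16}
args = Namespace(mixed_precision="no", cpu=False)  # stand-in for the script's argparse namespace

# Spawns 8 processes, one per TPU core, mirroring the TPU config above.
notebook_launcher(training_function, (config, args), num_processes=8)
```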
Expected behavior
When I run the example scripts in Colab, I should get similar results on TPU and GPU.
Issue Analytics
- State:
- Created a year ago
- Comments: 8
Top GitHub Comments
As it turned out, everything is much more complicated. When I removed the model from the training function and increased the lr, I was able to achieve normal results, but apparently part of the issue is also that in this case the model is initialized before we call set_seed. If we call set_seed before we start multiprocessing, the results drop again. It seems to me that in this example several different details simply came together: a relatively small dataset, a large initial lr, a fixed number of steps in the scheduler, model initialization inside the function, and a fixed seed inside each process (if I understand correctly, each fork should set a different seed).
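To make the seed placement concrete, here is a minimal sketch of the pattern I am describing, not the actual nlp_example.py; the tiny linear model and the per-process seed offset are only illustrative assumptions.

```python
# Minimal sketch of seeding inside each spawned process (not nlp_example.py).
import torch
from accelerate import Accelerator
from accelerate.utils import set_seed

def training_function(seed=42):
    accelerator = Accelerator()
    # Seed *inside* the forked process, offset by the process index, so the
    # 8 TPU workers do not all replay identical randomness.
    set_seed(seed + accelerator.process_index)

    # Because the model is created after the fork and after set_seed, each
    # worker now starts from its own weights; creating it before set_seed (or
    # outside the function) changes that, which is the interaction described
    # above.
    model = torch.nn.Linear(128, 2)
    model = accelerator.prepare(model)
    return model
```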
In my opinion, the following tweaks are worth adding:
Unfortunately, I can't be more precise - maybe it's actually something else, but I couldn't find it.
Well, I tried dividing or multiplying the lr/bs by powers of 2. And when I launch via notebook_launcher, I get similar results (well, exactly the same results). But it seems I found the root of the problem: when I init the model outside of train_fn and multiply the lr by 8, I can get 0.9 F1. I will do more checks tomorrow and will be more specific about whether this fixes the problem.
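Roughly, the workaround I am testing looks like the sketch below, assuming an 8-core Colab TPU and the bert-base-cased model from the example; the train_fn body is abbreviated and the learning-rate values are only illustrative.

```python
# Sketch of the workaround: build the model once outside the training
# function and scale the lr by the number of TPU processes.
from accelerate import Accelerator, notebook_launcher
from transformers import AdamW, AutoModelForSequenceClassification

NUM_PROCESSES = 8
BASE_LR = 2e-5

# Created before spawning, so every TPU process starts from the same weights.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", return_dict=True)

def train_fn(model, lr):
    accelerator = Accelerator()
    optimizer = AdamW(params=model.parameters(), lr=lr)
    model, optimizer = accelerator.prepare(model, optimizer)
    # ... dataloaders, scheduler and the usual training loop go here ...

# "multiply lr by 8": scale the base lr by the number of processes.
notebook_launcher(train_fn, (model, BASE_LR * NUM_PROCESSES), num_processes=NUM_PROCESSES)
```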