Results on TPU worse than on GPU (using colab)
System Info
- `Accelerate` version: 0.11.0.dev0
- Platform: Linux-5.4.188+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.13
- Numpy version: 1.21.6
- PyTorch version (GPU?): 1.11.0+cu102 (False)
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: TPU
- mixed_precision: no
- use_cpu: False
- num_processes: 8
- machine_rank: 0
- num_machines: 1
- main_process_ip: None
- main_process_port: None
- main_training_function: main
- deepspeed_config: {}
- fsdp_config: {}
Information
- The official example scripts
- My own modified scripts
Tasks
- One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
- My own task or dataset (give details below)
Reproduction
I created a notebook for reproducing this, but the steps are very simple:
- Install all the libraries (I tried different versions; they all show the same behavior)
- Run accelerate config and choose TPU (or GPU with fp16)
- Run accelerate launch accelerate/examples/nlp_example.py
When I train on TPU, I get an F1 score of 0.848. When I train on GPU, I get more than 0.9. I also tried different scripts and always get much worse results on TPU. Maybe it is something Colab-specific, because in other TPU-related issues (for example) people get results similar to my GPU results.
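For completeness, this is roughly how the same run can be launched from inside the Colab notebook instead of via the CLI. It is only a sketch: it assumes the cloned accelerate/examples/nlp_example.py is on sys.path, and the (config, args) call and the argparse fields mirror that script as it looks right now, so adjust them if the script changes.

```python
# Sketch: notebook-based equivalent of `accelerate launch nlp_example.py` on a
# Colab TPU runtime. Assumes accelerate/examples is on sys.path so that
# nlp_example.training_function can be imported.
from argparse import Namespace

from accelerate import notebook_launcher
from nlp_example import training_function

config = {"lr": 2e-5, "num_epochs": 3, "seed": 42, "batch_size": 16}
args = Namespace(mixed_precision="no", cpu=False)  # stand-in for the script's argparse namespace

# Spawns 8 processes, one per TPU core, mirroring the TPU config above.
notebook_launcher(training_function, (config, args), num_processes=8)
```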
Expected behavior
When I run the example scripts in Colab, I should get similar results on TPU and GPU.
Issue Analytics
- State:
- Created a year ago
- Comments: 8
Top GitHub Comments
As it turned out, everything is much more complicated. When I removed the model from the training function and increased the lr, I was able to achieve normal results, but apparently part of the issue is also that in this case the model is initialized before we call set_seed. If we call set_seed before we start multiprocessing, the results drop again. It seems to me that in this example several different details simply came together: a relatively small dataset, a large initial lr, a fixed number of steps in the scheduler, model initialization inside the function, and a fixed seed inside each process (if I understand correctly, each fork should set a different seed).
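To make the seed placement concrete, here is a minimal sketch of the pattern I am describing, not the actual nlp_example.py; the tiny linear model and the per-process seed offset are only illustrative assumptions.

```python
# Minimal sketch of seeding inside each spawned process (not nlp_example.py).
import torch
from accelerate import Accelerator
from accelerate.utils import set_seed

def training_function(seed=42):
    accelerator = Accelerator()
    # Seed *inside* the forked process, offset by the process index, so the
    # 8 TPU workers do not all replay identical randomness.
    set_seed(seed + accelerator.process_index)

    # Because the model is created after the fork and after set_seed, each
    # worker now starts from its own weights; creating it before set_seed (or
    # outside the function) changes that, which is the interaction described
    # above.
    model = torch.nn.Linear(128, 2)
    model = accelerator.prepare(model)
    return model
```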
In my opinion, the following tweaks are worth adding:
Unfortunately, I can't be more precise - maybe it's actually something else, but I couldn't find it.
Well, I tried dividing or multiplying the lr/bs by powers of 2. And when I launch via notebook_launcher, I get similar results (well, exactly the same results). But it seems I found the root of the problem: when I init the model outside of train_fn and multiply the lr by 8, I can get 0.9 F1. I will do more checks tomorrow and will be more specific about whether this fixes the problem.
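Roughly, the workaround I am testing looks like the sketch below, assuming an 8-core Colab TPU and the bert-base-cased model from the example; the train_fn body is abbreviated and the learning-rate values are only illustrative.

```python
# Sketch of the workaround: build the model once outside the training
# function and scale the lr by the number of TPU processes.
from accelerate import Accelerator, notebook_launcher
from transformers import AdamW, AutoModelForSequenceClassification

NUM_PROCESSES = 8
BASE_LR = 2e-5

# Created before spawning, so every TPU process starts from the same weights.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", return_dict=True)

def train_fn(model, lr):
    accelerator = Accelerator()
    optimizer = AdamW(params=model.parameters(), lr=lr)
    model, optimizer = accelerator.prepare(model, optimizer)
    # ... dataloaders, scheduler and the usual training loop go here ...

# "multiply lr by 8": scale the base lr by the number of processes.
notebook_launcher(train_fn, (model, BASE_LR * NUM_PROCESSES), num_processes=NUM_PROCESSES)
```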