Different performance when training with single GPU vs. multiple GPUs
I'm currently using accelerate to fine-tune a Hugging Face pretrained Transformer with some additional classification heads, and I'm finding that performance with multiple GPUs is much worse than with a single GPU, even with the same effective batch size, learning rate, and number of training steps/epochs. I'm using accelerate to parallelize the training loop over multiple GPUs, but the validation/test evaluation is a custom function that isn't easily adapted to accelerate, so I run that part on a single GPU in the main process. To switch the script between a single GPU and multiple GPUs, I just adjust the --num_processes argument for accelerate launch and the batch size to match, for example:
accelerate launch --num_processes 1 <script> --batch_size 32    (for 1 GPU)
accelerate launch --num_processes 4 <script> --batch_size 8     (for 4 GPUs)
The multi-GPU training itself seems to be running fine: nvidia-smi shows all 4 GPUs fully utilized, and the training data loader has the correct length for the given batch size (the same length for both commands above), but there is still a drop in performance in the multi-GPU case. When printing output on each GPU, the other processes do seem to wait for the main process to finish running the evaluation function, as expected. This also doesn't seem to be just an issue of running single-GPU evaluation within a multi-GPU training loop, since loading the saved model weights after training and re-running the evaluation on a single GPU gives the same degraded performance.
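The per-GPU printing mentioned above was along these lines (a minimal sketch rather than the exact script; the printed fields are just illustrative):

def check_setup(accelerator, loader):
    # plain print (not accelerator.print) so that every process reports
    # its own rank, device, and view of the prepared training data loader
    print(
        f"process {accelerator.process_index}/{accelerator.num_processes} "
        f"on {accelerator.device}: {len(loader)} batches per epoch"
    )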
Any help is appreciated, thanks!
Pseudocode:
from accelerate import Accelerator, DistributedDataParallelKwargs

# set up accelerator (find_unused_parameters because of the extra classification heads)
ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])

# set up device (for evaluation)
device = accelerator.device

model = ...          # initialize model
optimizer = ...      # initialize optimizer
loader = ...         # initialize training data loader
valid_dataset = ...  # initialize validation dataset
test_dataset = ...   # initialize test dataset

# prepare model, optimizer, and training data loader for distributed training
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

# training loop
for epoch in range(epochs):
    model.train()
    for inputs, targets in loader:
        outputs = model(inputs)
        loss = loss_function(outputs, targets)
        optimizer.zero_grad()
        accelerator.backward(loss)
        optimizer.step()

    # evaluate on validation set with unwrapped model in main process (single-GPU)
    if accelerator.is_main_process:
        unwrapped_model = accelerator.unwrap_model(model).to(device)
        unwrapped_model.eval()
        metrics = calculate_metrics(unwrapped_model, valid_dataset, device)
        print(metrics)

# evaluate on test set with unwrapped model in main process (single-GPU)
if accelerator.is_main_process:
    unwrapped_model = accelerator.unwrap_model(model).to(device)
    unwrapped_model.eval()
    metrics = calculate_metrics(unwrapped_model, test_dataset, device)
    print(metrics)
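For what it's worth, the waiting behaviour described above can also be made explicit with a barrier around the single-process evaluation; a minimal sketch (calculate_metrics is my own helper, as in the pseudocode):

accelerator.wait_for_everyone()   # all processes reach this point before evaluation starts
if accelerator.is_main_process:
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.eval()
    metrics = calculate_metrics(unwrapped_model, valid_dataset, device)
    print(metrics)
accelerator.wait_for_everyone()   # non-main processes wait here until evaluation finishes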
Version information:
- torch: 1.6.0
- transformers: 3.3.1
- accelerate: 0.3.0
- CUDA: 10.1
Top GitHub Comments
For reproducibility, you can still set a seed; you should just set a different one for each process. One way to do this easily with Accelerate is:
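A minimal sketch of that idea (not necessarily the exact snippet from the thread), offsetting a base seed by accelerator.process_index so each GPU gets its own random stream:

import random
import numpy as np
import torch
from accelerate import Accelerator

accelerator = Accelerator()

# different seed per process: random ops (dropout, shuffling, augmentation)
# are no longer identical across GPUs, but each run stays reproducible
seed = 42 + accelerator.process_index
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)

Later accelerate releases also ship a set_seed helper in accelerate.utils that wraps these calls.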
Got it! I think this solves the problem I was having, closing the issue now – thanks for your help!