
Different performance when training with single GPU vs. multiple GPUs

See original GitHub issue

I’m currently using accelerate to fine-tune a huggingface pretrained Transformer with some additional classification heads, and I’m finding that performance with multiple GPUs is much worse than with a single GPU, even with the same batch size, learning rate, and number of training steps/epochs. I’m using accelerate to parallelize the training loop over multiple GPUs, but the validation/test set evaluation is a custom function that isn’t easily adapted to accelerate, so I’m doing that part on a single GPU in the main process. To run the entire script on a single GPU vs. multiple GPUs, I just adjust the --num_processes argument for accelerate launch, as well as the per-process batch size so that the effective batch size matches, for example:

accelerate launch --num_processes 1 <script> --batch_size 32 (for 1 GPU)

accelerate launch --num_processes 4 <script> --batch_size 8 (for 4 GPUs)

The multi-GPU training seems to be running fine, in the sense that running nvidia-smi shows all 4 GPUs being fully utilized and the training data loader is the correct length for the given batch size (same length for both of the commands above), but there’s still a drop in performance in the multi-GPU case. When printing output on each GPU, the processes do seem to be waiting for the main process to finish running the evaluation function as expected. This also doesn’t seem to just be an issue of running single-GPU evaluation within a multi-GPU training loop, since loading the saved model weights after training and re-running the evaluation on a single GPU gives the same performance.

Any help is appreciated, thanks!

Pseudocode:

from accelerate import Accelerator, DistributedDataParallelKwargs

# set up accelerator (find_unused_parameters=True is passed through to the DDP wrapper)
ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])

# set up device (for evaluation)
device = accelerator.device

model = ... # initialize model
optimizer = ... # initialize optimizer
loader = ... # initialize training data loader

valid_dataset = ... # initialize validation dataset
test_dataset = ... # initialize test dataset

# prepare model, optimizer, data loader
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

# training loop
for epoch in range(epochs):
    model.train()
    for inputs, targets in loader:
        outputs = model(inputs)
        loss = loss_function(outputs, targets)
        optimizer.zero_grad()
        accelerator.backward(loss)
        optimizer.step()

    # evaluate on validation set with unwrapped model in main process (single-GPU)
    if accelerator.is_main_process:
        unwrapped_model = accelerator.unwrap_model(model).to(device)
        unwrapped_model.eval()
        metrics = calculate_metrics(unwrapped_model, valid_dataset, device)
        print(metrics)

# evaluate on test set with unwrapped model in main process (single-GPU)
if accelerator.is_main_process:
    unwrapped_model = accelerator.unwrap_model(model).to(device)
    unwrapped_model.eval()
    metrics = calculate_metrics(unwrapped_model, test_dataset, device)
    print(metrics)
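
As a quick sanity check on the distributed setup above (this snippet is an illustrative addition, not part of the original issue), each process can report what it sees right after accelerator.prepare(...); with --num_processes 4 and --batch_size 8, every process should print the same number of batches per epoch as the single-GPU run with --batch_size 32:

# hypothetical diagnostic, placed immediately after accelerator.prepare(...)
print(
    f"process {accelerator.process_index} of {accelerator.num_processes} "
    f"on {accelerator.device}: {len(loader)} batches per epoch"
)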

Version information:

  • torch: 1.6.0
  • transformers: 3.3.1
  • accelerate: 0.3.0
  • CUDA: 10.1

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

2 reactions
sgugger commented, May 18, 2021

For reproducibility, you can still set a seed; you should just set a different one for each process. One easy way to do this with Accelerate is:

np.random.seed(seed + accelerator.process_index)
torch.manual_seed(seed + accelerator.process_index)
torch.cuda.manual_seed_all(seed + accelerator.process_index)
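
For completeness, here is a sketch of how that per-process seeding could slot into the setup from the question; the seed_everything helper and the base seed value are illustrative, not from the original thread:

import random

import numpy as np
import torch

def seed_everything(base_seed, accelerator):
    # offset the base seed by the process rank so each GPU draws a different,
    # but still reproducible, random stream
    seed = base_seed + accelerator.process_index
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

# call once per process, after creating the Accelerator and before building
# the model and data loaders
seed_everything(42, accelerator)
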
1 reaction
rahuln commented, May 18, 2021

Got it! I think this solves the problem I was having, closing the issue now – thanks for your help!
