Different performance when training with single GPU vs. multiple GPUs
I'm currently using accelerate to fine-tune a Hugging Face pretrained Transformer with some additional classification heads, and I'm finding that performance with multiple GPUs is much worse than with a single GPU, even with the same effective batch size, learning rate, and number of training steps/epochs. I'm using accelerate to parallelize the training loop over multiple GPUs, but the validation/test evaluation is a custom function that isn't easily adapted to accelerate, so I run that part on a single GPU in the main process. To switch the script between a single GPU and multiple GPUs, I just adjust the --num_processes argument for accelerate launch and the batch size to match, for example:
accelerate launch --num_processes 1 <script> --batch_size 32    (for 1 GPU)
accelerate launch --num_processes 4 <script> --batch_size 8     (for 4 GPUs)
The multi-GPU training itself seems to be running fine: nvidia-smi shows all 4 GPUs fully utilized, and the training data loader has the correct length for the given batch size (the same length for both commands above), but there is still a drop in performance in the multi-GPU case. When printing output on each GPU, the other processes do seem to wait for the main process to finish running the evaluation function, as expected. This also doesn't seem to be just an issue of running single-GPU evaluation within a multi-GPU training loop, since loading the saved model weights after training and re-running the evaluation on a single GPU gives the same degraded performance.
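The per-GPU printing mentioned above was along these lines (a minimal sketch rather than the exact script; the printed fields are just illustrative):

def check_setup(accelerator, loader):
    # plain print (not accelerator.print) so that every process reports
    # its own rank, device, and view of the prepared training data loader
    print(
        f"process {accelerator.process_index}/{accelerator.num_processes} "
        f"on {accelerator.device}: {len(loader)} batches per epoch"
    )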
Any help is appreciated, thanks!
Pseudocode:
from accelerate import Accelerator, DistributedDataParallelKwargs

# set up accelerator (find_unused_parameters because of the extra classification heads)
ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])

# set up device (for evaluation)
device = accelerator.device

model = ...          # initialize model
optimizer = ...      # initialize optimizer
loader = ...         # initialize training data loader
valid_dataset = ...  # initialize validation dataset
test_dataset = ...   # initialize test dataset

# prepare model, optimizer, and training data loader for distributed training
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

# training loop
for epoch in range(epochs):
    model.train()
    for inputs, targets in loader:
        outputs = model(inputs)
        loss = loss_function(outputs, targets)
        optimizer.zero_grad()
        accelerator.backward(loss)
        optimizer.step()

    # evaluate on validation set with unwrapped model in main process (single-GPU)
    if accelerator.is_main_process:
        unwrapped_model = accelerator.unwrap_model(model).to(device)
        unwrapped_model.eval()
        metrics = calculate_metrics(unwrapped_model, valid_dataset, device)
        print(metrics)

# evaluate on test set with unwrapped model in main process (single-GPU)
if accelerator.is_main_process:
    unwrapped_model = accelerator.unwrap_model(model).to(device)
    unwrapped_model.eval()
    metrics = calculate_metrics(unwrapped_model, test_dataset, device)
    print(metrics)
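For what it's worth, the waiting behaviour described above can also be made explicit with a barrier around the single-process evaluation; a minimal sketch (calculate_metrics is my own helper, as in the pseudocode):

accelerator.wait_for_everyone()   # all processes reach this point before evaluation starts
if accelerator.is_main_process:
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.eval()
    metrics = calculate_metrics(unwrapped_model, valid_dataset, device)
    print(metrics)
accelerator.wait_for_everyone()   # non-main processes wait here until evaluation finishes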
Version information:
- torch: 1.6.0
- transformers: 3.3.1
- accelerate: 0.3.0
- CUDA: 10.1
Top GitHub Comments
For reproducibility, you can still set a seed; you should just set a different one for each process. One way to do this easily with Accelerate is:
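A minimal sketch of that idea (not necessarily the exact snippet from the thread), offsetting a base seed by accelerator.process_index so each GPU gets its own random stream:

import random
import numpy as np
import torch
from accelerate import Accelerator

accelerator = Accelerator()

# different seed per process: random ops (dropout, shuffling, augmentation)
# are no longer identical across GPUs, but each run stays reproducible
seed = 42 + accelerator.process_index
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)

Later accelerate releases also ship a set_seed helper in accelerate.utils that wraps these calls.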
Got it! I think this solves the problem I was having, closing the issue now – thanks for your help!