
Multi GPU does not work well

See original GitHub issue

System Info

- `Accelerate` version: 0.10.0
- Platform: Linux-5.13.0-28-generic-x86_64-with-glibc2.17
- Python version: 3.8.13
- Numpy version: 1.23.0
- PyTorch version (GPU?): 1.9.0+cu111 (True)
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: MULTI_GPU
        - mixed_precision: no
        - use_cpu: False
        - num_processes: 4
        - machine_rank: 0
        - num_machines: 1
        - main_process_ip: None
        - main_process_port: None
        - main_training_function: main
        - deepspeed_config: {}
        - fsdp_config: {}
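
As a side note, a minimal sketch (my assumption, not part of the report; attribute names are from the Accelerate 0.10 API) that prints what each process sees when a script is started with accelerate launch so that the config above is picked up. With this config it should report 4 ranks on 4 distinct CUDA devices:

# Hedged sketch: verify that the multi-GPU config above is actually active.
from accelerate import Accelerator

accelerator = Accelerator()
print(f"rank {accelerator.process_index}/{accelerator.num_processes} "
      f"on {accelerator.device}, distributed_type={accelerator.distributed_type}")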

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

import os, time 
import torch
from torch.optim import Adam, SGD
from accelerate import Accelerator
from PIL import Image
from torchvision import transforms
import torchvision 
from datasets import load_dataset
import datetime 
from loguru import logger
LOGGER = logger
LOGGER.add('/root/workspace/sdhan/multi_gpu/log/model_log.txt')


def training_function():
    # if accelerator is True:
    accelerator = Accelerator()
    device = accelerator.device 

    model = torchvision.models.resnet34(pretrained = True).to(device)
    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    datasets = torchvision.datasets.CIFAR10(root = '/root/workspace/sdhan/multi_gpu/datasets/', train = True, transform = preprocess, download = True)

    train_loader = torch.utils.data.DataLoader(
            datasets,
            batch_size=50,
            shuffle=True,
            drop_last=True, 
            num_workers = 8)

    optimizer = SGD(model.parameters(), lr = 3e-7)

    criterion = torch.nn.CrossEntropyLoss()
    # if accelerator is not None:
    model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)

    epoch = 300
    i = 0
    model.train()
    start_time, end_time= None, None 

    for epoch_num in range(epoch):
        start_epoch_time = datetime.datetime.now()
        if start_time is None:
            start_time = datetime.datetime.now()
        for image, target in train_loader:
            try:
                image.to(device)
                target.to(device)
                output = model(image)
                loss = criterion(output, target)

                accelerator.backward(loss)

                optimizer.step()
                i += 1 
                if i%100 == 0:
                    end_time = datetime.datetime.now()
                    LOGGER.info(f'time : {end_time - start_time}')
                    start_time = end_time 

            except Exception as e:
                print(e)
                break
        end_epoch_time = datetime.datetime.now()
        LOGGER.info(f'epoch time : {end_epoch_time - start_epoch_time}')
        LOGGER.info(f'epoch_loss : {loss.item()}')

def main():
    training_function()

if __name__ == "__main__":
    main()
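
As context for the timings discussed below, a rough sketch of the per-epoch arithmetic (assumptions: 50,000 is the CIFAR-10 train split size and 50 is the batch_size from the script above; nothing here comes from the issue itself). accelerator.prepare shards the DataLoader, so each process iterates only over its own slice of the data: the per-GPU batch stays at 50, but the number of optimizer steps per epoch shrinks with the number of processes.

# Illustrative only: how DataLoader sharding changes the work per epoch.
from accelerate import Accelerator

accelerator = Accelerator()

dataset_size = 50_000          # CIFAR-10 train split
per_gpu_batch_size = 50        # batch_size used in the script above

# Each process handles roughly dataset_size / num_processes samples per epoch.
steps_per_epoch = dataset_size // (per_gpu_batch_size * accelerator.num_processes)
effective_batch = per_gpu_batch_size * accelerator.num_processes

if accelerator.is_main_process:
    print(f"{accelerator.num_processes} process(es): {steps_per_epoch} optimizer "
          f"steps per epoch, effective batch size {effective_batch}")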

Expected behavior

I think I did everything the README.md says for multi-GPU training.
When I ran complete_nlp_example.py from the examples folder, it worked as expected: with 4 GPUs one epoch took around 3 seconds, compared with around 9 seconds on 1 GPU.

But with the code I pasted, one epoch took around 9 seconds on 4 GPUs, compared with around 8 seconds on 1 GPU. I expected it to take around 4 seconds at most with 4 GPUs.

Strangely, GPU utilization looks fine with 4 GPUs as well as with 1 GPU.

What is the problem? It is not only the code above: when I train another model (a denoising diffusion probabilistic model), I get the same result.

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 6

Top GitHub Comments

2 reactions
Hans-digit commented, Jul 2, 2022

Hello @pacman100, thanks a lot for your help!

I really appreciate it.

This is really interesting, and I will close this issue.

0 reactions
pacman100 commented, Jul 2, 2022

Hello @Hans-digit, the loss depends on the learning rate hyperparameter. With such a small learning rate, more optimizer steps are needed for the loss to decrease, and the single-GPU run gets more steps per epoch because its effective batch size is smaller. Tune the learning rate properly and you will see the difference disappear. Also, measure the loss over the whole epoch instead of printing only the last batch's loss in the epoch; please refer to the example scripts for how to do that. Using a learning rate of 1e-3 and measuring the epoch loss properly gives the results below.

2 GPU setup:

2022-07-02 07:42:09.937 | INFO     | __main__:training_function:68 - time : 0:00:10.736053
2022-07-02 07:42:09.943 | INFO     | __main__:training_function:68 - time : 0:00:10.742042
2022-07-02 07:42:20.144 | INFO     | __main__:training_function:68 - time : 0:00:10.206601
2022-07-02 07:42:20.147 | INFO     | __main__:training_function:68 - time : 0:00:10.203795
2022-07-02 07:42:30.342 | INFO     | __main__:training_function:68 - time : 0:00:10.194762
2022-07-02 07:42:30.342 | INFO     | __main__:training_function:68 - time : 0:00:10.198642
2022-07-02 07:42:40.537 | INFO     | __main__:training_function:68 - time : 0:00:10.194199
2022-07-02 07:42:40.541 | INFO     | __main__:training_function:68 - time : 0:00:10.199517
2022-07-02 07:42:50.738 | INFO     | __main__:training_function:68 - time : 0:00:10.197191
2022-07-02 07:42:50.742 | INFO     | __main__:training_function:68 - time : 0:00:10.205452
2022-07-02 07:42:50.797 | INFO     | __main__:training_function:75 - epoch time : 0:00:51.595806
2022-07-02 07:42:50.797 | INFO     | __main__:training_function:76 - epoch_loss : 1.2795213606357574
2022-07-02 07:42:50.807 | INFO     | __main__:training_function:75 - epoch time : 0:00:51.605771
2022-07-02 07:42:50.807 | INFO     | __main__:training_function:76 - epoch_loss : 1.297331861972809

1 GPU setup:

2022-07-02 07:43:36.676 | INFO     | __main__:training_function:68 - time : 0:00:10.159390
2022-07-02 07:43:46.483 | INFO     | __main__:training_function:68 - time : 0:00:09.806761
2022-07-02 07:43:56.286 | INFO     | __main__:training_function:68 - time : 0:00:09.802893
2022-07-02 07:44:06.089 | INFO     | __main__:training_function:68 - time : 0:00:09.802949
2022-07-02 07:44:15.886 | INFO     | __main__:training_function:68 - time : 0:00:09.797208
2022-07-02 07:44:25.685 | INFO     | __main__:training_function:68 - time : 0:00:09.798611
2022-07-02 07:44:35.482 | INFO     | __main__:training_function:68 - time : 0:00:09.797353
2022-07-02 07:44:45.282 | INFO     | __main__:training_function:68 - time : 0:00:09.799936
2022-07-02 07:44:55.089 | INFO     | __main__:training_function:68 - time : 0:00:09.806451
2022-07-02 07:45:04.896 | INFO     | __main__:training_function:68 - time : 0:00:09.807061
2022-07-02 07:45:04.945 | INFO     | __main__:training_function:75 - epoch time : 0:01:38.428492
2022-07-02 07:45:04.946 | INFO     | __main__:training_function:76 - epoch_loss : 1.4030447309613228

Your code with the suggested changes:

import os, time 
import torch
from torch.optim import Adam, SGD
from accelerate import Accelerator
from PIL import Image
from torchvision import transforms
import torchvision 
from datasets import load_dataset
import datetime 
from loguru import logger
LOGGER = logger
LOGGER.add('/tmp/model_log.txt')

LR = 1e-3 #3e-7

def training_function():
    # if accelerator is True:
    accelerator = Accelerator()
    device = accelerator.device 

    model = torchvision.models.resnet34(pretrained = True).to(device)
    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    datasets = torchvision.datasets.CIFAR10(root = '/tmp/datasets/', train = True, transform = preprocess, download = True)

    train_loader = torch.utils.data.DataLoader(
            datasets,
            batch_size=50,
            shuffle=True,
            drop_last=True, 
            num_workers = 8)

    optimizer = SGD(model.parameters(), lr = LR)

    criterion = torch.nn.CrossEntropyLoss()
    # if accelerator is not None:
    model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)

    epoch = 1
    i = 0
    model.train()
    start_time, end_time= None, None 

    for epoch_num in range(epoch):
        train_loss = 0
        start_epoch_time = datetime.datetime.now()
        if start_time is None:
            start_time = datetime.datetime.now()
        for image, target in train_loader:
            try:
                image.to(device)
                target.to(device)
                output = model(image)
                loss = criterion(output, target)
                train_loss += loss.item()

                accelerator.backward(loss)

                optimizer.step()
                i += 1 
                if i%100 == 0:
                    end_time = datetime.datetime.now()
                    LOGGER.info(f'time : {end_time - start_time}')
                    start_time = end_time 

            except Exception as e:
                print(e)
                break
        end_epoch_time = datetime.datetime.now()
        LOGGER.info(f'epoch time : {end_epoch_time - start_epoch_time}')
        LOGGER.info(f'epoch_loss : {train_loss/len(train_loader)}')

def main():
    training_function()

if __name__ == "__main__":
    main()
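
In the 2-GPU log above, each rank reports a slightly different epoch_loss (1.2795 vs 1.2973) because every process only averages the batches from its own shard of the data. If a single number is wanted, the per-rank values can be combined with accelerator.gather; a small sketch continuing from the snippet above (the reduction is an illustration on my part, not something suggested in the thread):

# Continuing from the training loop above: after accumulating train_loss on
# each process, combine the per-rank epoch losses into one value.
epoch_loss = torch.tensor([train_loss / len(train_loader)], device=accelerator.device)

# gather() returns one entry per process; the mean is identical on every rank.
mean_epoch_loss = accelerator.gather(epoch_loss).mean().item()

if accelerator.is_main_process:
    LOGGER.info(f'epoch_loss (averaged over ranks): {mean_epoch_loss}')
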
Read more comments on GitHub
