
Multi GPU does not work well

See original GitHub issue

System Info

- `Accelerate` version: 0.10.0
- Platform: Linux-5.13.0-28-generic-x86_64-with-glibc2.17
- Python version: 3.8.13
- Numpy version: 1.23.0
- PyTorch version (GPU?): 1.9.0+cu111 (True)
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: MULTI_GPU
        - mixed_precision: no
        - use_cpu: False
        - num_processes: 4
        - machine_rank: 0
        - num_machines: 1
        - main_process_ip: None
        - main_process_port: None
        - main_training_function: main
        - deepspeed_config: {}
        - fsdp_config: {}
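
As a side note, a minimal sketch (my assumption, not part of the report; attribute names are from the Accelerate 0.10 API) that prints what each process sees when a script is started with accelerate launch so that the config above is picked up. With this config it should report 4 ranks on 4 distinct CUDA devices:

# Hedged sketch: verify that the multi-GPU config above is actually active.
from accelerate import Accelerator

accelerator = Accelerator()
print(f"rank {accelerator.process_index}/{accelerator.num_processes} "
      f"on {accelerator.device}, distributed_type={accelerator.distributed_type}")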

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

import os, time 
import torch
from torch.optim import Adam, SGD
from accelerate import Accelerator
from PIL import Image
from torchvision import transforms
import torchvision 
from datasets import load_dataset
import datetime 
from loguru import logger
LOGGER = logger
LOGGER.add('/root/workspace/sdhan/multi_gpu/log/model_log.txt')


def training_function():
    # if accelerator is True:
    accelerator = Accelerator()
    device = accelerator.device 

    model = torchvision.models.resnet34(pretrained = True).to(device)
    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    datasets = torchvision.datasets.CIFAR10(root = '/root/workspace/sdhan/multi_gpu/datasets/', train = True, transform = preprocess, download = True)

    train_loader = torch.utils.data.DataLoader(
            datasets,
            batch_size=50,
            shuffle=True,
            drop_last=True, 
            num_workers = 8)

    optimizer = SGD(model.parameters(), lr = 3e-7)

    criterion = torch.nn.CrossEntropyLoss()
    # if accelerator is not None:
    model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)

    epoch = 300
    i = 0
    model.train()
    start_time, end_time= None, None 

    for epoch_num in range(epoch):
        start_epoch_time = datetime.datetime.now()
        if start_time is None:
            start_time = datetime.datetime.now()
        for image, target in train_loader:
            try:
                image.to(device)
                target.to(device)
                output = model(image)
                loss = criterion(output, target)

                accelerator.backward(loss)

                optimizer.step()
                i += 1 
                if i%100 == 0:
                    end_time = datetime.datetime.now()
                    LOGGER.info(f'time : {end_time - start_time}')
                    start_time = end_time 

            except Exception as e:
                print(e)
                break
        end_epoch_time = datetime.datetime.now()
        LOGGER.info(f'epoch time : {end_epoch_time - start_epoch_time}')
        LOGGER.info(f'epoch_loss : {loss.item()}')

def main():
    training_function()

if __name__ == "__main__":
    main()
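
As context for the timings discussed below, a rough sketch of the per-epoch arithmetic (assumptions: 50,000 is the CIFAR-10 train split size and 50 is the batch_size from the script above; nothing here comes from the issue itself). accelerator.prepare shards the DataLoader, so each process iterates only over its own slice of the data: the per-GPU batch stays at 50, but the number of optimizer steps per epoch shrinks with the number of processes.

# Illustrative only: how DataLoader sharding changes the work per epoch.
from accelerate import Accelerator

accelerator = Accelerator()

dataset_size = 50_000          # CIFAR-10 train split
per_gpu_batch_size = 50        # batch_size used in the script above

# Each process handles roughly dataset_size / num_processes samples per epoch.
steps_per_epoch = dataset_size // (per_gpu_batch_size * accelerator.num_processes)
effective_batch = per_gpu_batch_size * accelerator.num_processes

if accelerator.is_main_process:
    print(f"{accelerator.num_processes} process(es): {steps_per_epoch} optimizer "
          f"steps per epoch, effective batch size {effective_batch}")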

Expected behavior

I think I did everything the README.md says for multi-GPU training.
When I ran complete_nlp_example.py from the examples folder, it worked as expected: with 4 GPUs one epoch took around 3 seconds, compared with around 9 seconds on 1 GPU.

But with the code I pasted, one epoch took around 9 seconds on 4 GPUs, compared with around 8 seconds on 1 GPU. I expected it to take around 4 seconds at most with 4 GPUs.

Strangely, GPU utilization looks fine with 4 GPUs as well as with 1 GPU.

What is the problem? It is not only the code above: when I train another model (a denoising diffusion probabilistic model), I get the same result.

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 6

Top GitHub Comments

2 reactions
Hans-digit commented, Jul 2, 2022

Hello @pacman100, thanks a lot for your help!

I really appreciate it.

This is really interesting, and I will close this issue.

0 reactions
pacman100 commented, Jul 2, 2022

Hello @Hans-digit, the loss depends on the learning rate hyperparameter. With such a small learning rate, more optimizer steps are needed for the loss to decrease, and the single-GPU run gets more steps per epoch because its effective batch size is smaller. Tune the learning rate properly and you will see the difference disappear. Also, measure the loss over the whole epoch instead of printing only the last batch's loss in the epoch; please refer to the example scripts for how to do that. Using a learning rate of 1e-3 and measuring the epoch loss properly gives the results below.

2 GPU setup:

2022-07-02 07:42:09.937 | INFO     | __main__:training_function:68 - time : 0:00:10.736053
2022-07-02 07:42:09.943 | INFO     | __main__:training_function:68 - time : 0:00:10.742042
2022-07-02 07:42:20.144 | INFO     | __main__:training_function:68 - time : 0:00:10.206601
2022-07-02 07:42:20.147 | INFO     | __main__:training_function:68 - time : 0:00:10.203795
2022-07-02 07:42:30.342 | INFO     | __main__:training_function:68 - time : 0:00:10.194762
2022-07-02 07:42:30.342 | INFO     | __main__:training_function:68 - time : 0:00:10.198642
2022-07-02 07:42:40.537 | INFO     | __main__:training_function:68 - time : 0:00:10.194199
2022-07-02 07:42:40.541 | INFO     | __main__:training_function:68 - time : 0:00:10.199517
2022-07-02 07:42:50.738 | INFO     | __main__:training_function:68 - time : 0:00:10.197191
2022-07-02 07:42:50.742 | INFO     | __main__:training_function:68 - time : 0:00:10.205452
2022-07-02 07:42:50.797 | INFO     | __main__:training_function:75 - epoch time : 0:00:51.595806
2022-07-02 07:42:50.797 | INFO     | __main__:training_function:76 - epoch_loss : 1.2795213606357574
2022-07-02 07:42:50.807 | INFO     | __main__:training_function:75 - epoch time : 0:00:51.605771
2022-07-02 07:42:50.807 | INFO     | __main__:training_function:76 - epoch_loss : 1.297331861972809

1 GPU setup:

2022-07-02 07:43:36.676 | INFO     | __main__:training_function:68 - time : 0:00:10.159390
2022-07-02 07:43:46.483 | INFO     | __main__:training_function:68 - time : 0:00:09.806761
2022-07-02 07:43:56.286 | INFO     | __main__:training_function:68 - time : 0:00:09.802893
2022-07-02 07:44:06.089 | INFO     | __main__:training_function:68 - time : 0:00:09.802949
2022-07-02 07:44:15.886 | INFO     | __main__:training_function:68 - time : 0:00:09.797208
2022-07-02 07:44:25.685 | INFO     | __main__:training_function:68 - time : 0:00:09.798611
2022-07-02 07:44:35.482 | INFO     | __main__:training_function:68 - time : 0:00:09.797353
2022-07-02 07:44:45.282 | INFO     | __main__:training_function:68 - time : 0:00:09.799936
2022-07-02 07:44:55.089 | INFO     | __main__:training_function:68 - time : 0:00:09.806451
2022-07-02 07:45:04.896 | INFO     | __main__:training_function:68 - time : 0:00:09.807061
2022-07-02 07:45:04.945 | INFO     | __main__:training_function:75 - epoch time : 0:01:38.428492
2022-07-02 07:45:04.946 | INFO     | __main__:training_function:76 - epoch_loss : 1.4030447309613228

Your code with the suggested changes:

import os, time 
import torch
from torch.optim import Adam, SGD
from accelerate import Accelerator
from PIL import Image
from torchvision import transforms
import torchvision 
from datasets import load_dataset
import datetime 
from loguru import logger
LOGGER = logger
LOGGER.add('/tmp/model_log.txt')

LR = 1e-3 #3e-7

def training_function():
    # if accelerator is True:
    accelerator = Accelerator()
    device = accelerator.device 

    model = torchvision.models.resnet34(pretrained = True).to(device)
    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    datasets = torchvision.datasets.CIFAR10(root = '/tmp/datasets/', train = True, transform = preprocess, download = True)

    train_loader = torch.utils.data.DataLoader(
            datasets,
            batch_size=50,
            shuffle=True,
            drop_last=True, 
            num_workers = 8)

    optimizer = SGD(model.parameters(), lr = LR)

    criterion = torch.nn.CrossEntropyLoss()
    # if accelerator is not None:
    model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)

    epoch = 1
    i = 0
    model.train()
    start_time, end_time= None, None 

    for epoch_num in range(epoch):
        train_loss = 0
        start_epoch_time = datetime.datetime.now()
        if start_time is None:
            start_time = datetime.datetime.now()
        for image, target in train_loader:
            try:
                image.to(device)
                target.to(device)
                output = model(image)
                loss = criterion(output, target)
                train_loss += loss.item()

                accelerator.backward(loss)

                optimizer.step()
                i += 1 
                if i%100 == 0:
                    end_time = datetime.datetime.now()
                    LOGGER.info(f'time : {end_time - start_time}')
                    start_time = end_time 

            except Exception as e:
                print(e)
                break
        end_epoch_time = datetime.datetime.now()
        LOGGER.info(f'epoch time : {end_epoch_time - start_epoch_time}')
        LOGGER.info(f'epoch_loss : {train_loss/len(train_loader)}')

def main():
    training_function()

if __name__ == "__main__":
    main()
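
In the 2-GPU log above, each rank reports a slightly different epoch_loss (1.2795 vs 1.2973) because every process only averages the batches from its own shard of the data. If a single number is wanted, the per-rank values can be combined with accelerator.gather; a small sketch continuing from the snippet above (the reduction is an illustration on my part, not something suggested in the thread):

# Continuing from the training loop above: after accumulating train_loss on
# each process, combine the per-rank epoch losses into one value.
epoch_loss = torch.tensor([train_loss / len(train_loader)], device=accelerator.device)

# gather() returns one entry per process; the mean is identical on every rank.
mean_epoch_loss = accelerator.gather(epoch_loss).mean().item()

if accelerator.is_main_process:
    LOGGER.info(f'epoch_loss (averaged over ranks): {mean_epoch_loss}')
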
Read more comments on GitHub
