Multi-gpu training using flash fails for video classification

🐛 Bug

With strategy='ddp':

Traceback (most recent call last):
  File "/workspace/test_train.py", line 50, in <module>
    trainer.finetune(model, datamodule=datamodule, strategy=NoFreeze())
  File "/opt/conda/lib/python3.8/site-packages/flash/core/trainer.py", line 163, in finetune
    return super().fit(model, train_dataloader, val_dataloaders, datamodule)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 740, in fit
    self._call_and_handle_interrupt(
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
    self._dispatch()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch
    self.training_type_plugin.start_training(self)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
    self._results = trainer.run_stage()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage
    return self._run_train()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1319, in _run_train
    self.fit_loop.run()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 140, in run
    self.on_run_start(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 200, in on_run_start
    self.trainer.call_hook("on_train_start")
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1501, in call_hook
    output = model_fx(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/flash/video/classification/model.py", line 123, in on_train_start
    encoded_dataset = self.trainer.train_dataloader.loaders.dataset.dataset
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataset.py", line 163, in __getattr__
    raise AttributeError
AttributeError
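
The last frames show the chained lookup `self.trainer.train_dataloader.loaders.dataset.dataset` failing with a bare AttributeError. A minimal sketch of why a fixed-depth chain like this is fragile follows. The classes `Inner` and `Wrapper` and the helper `unwrap` are hypothetical stand-ins, not Flash or PyTorch code, and this is not necessarily the fix that was applied upstream. It only assumes that the number of wrapper layers around the dataset can differ between launch modes.

```python
class Inner:
    """Hypothetical stand-in for the innermost dataset."""

    def __init__(self, name):
        self.name = name


class Wrapper:
    """Hypothetical stand-in for a wrapper that exposes `.dataset`."""

    def __init__(self, dataset):
        self.dataset = dataset


def unwrap(obj, attr="dataset", max_depth=8):
    """Follow `attr` links until the current object no longer has one."""
    for _ in range(max_depth):
        inner = getattr(obj, attr, None)
        if inner is None:
            break
        obj = inner
    return obj


single = Wrapper(Inner("clips"))           # one wrapper layer
double = Wrapper(Wrapper(Inner("clips")))  # two wrapper layers

# A hard-coded two-level chain only works when both layers exist.
try:
    single.dataset.dataset
    fixed_chain_ok = True
except AttributeError:
    fixed_chain_ok = False

print(fixed_chain_ok, unwrap(single).name, unwrap(double).name)
```

Walking the chain with `getattr` and a default avoids assuming how many layers are present, at the cost of silently accepting unexpected shapes.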

Removing the strategy flag from the Trainer, i.e. falling back to the default strategy="ddp_spawn":

Traceback (most recent call last):
  File "test_train.py", line 50, in <module>
    trainer.finetune(model, datamodule=datamodule, strategy=NoFreeze())
  File "/opt/conda/lib/python3.8/site-packages/flash/core/trainer.py", line 163, in finetune
    return super().fit(model, train_dataloader, val_dataloaders, datamodule)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 740, in fit
    self._call_and_handle_interrupt(
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
    self._dispatch()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch
    self.training_type_plugin.start_training(self)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 173, in start_training
    self.spawn(self.new_process, trainer, self.mp_queue, return_result=False)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 201, in spawn
    mp.spawn(self._wrapped_function, args=(function, args, kwargs, return_queue), nprocs=self.num_processes)
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 179, in start_processes
    process.start()
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/opt/conda/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/opt/conda/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/opt/conda/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/opt/conda/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/opt/conda/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: cannot pickle 'torch._C.Generator' object

Training works fine on a single GPU.

To Reproduce

python3 test_train.py --gpus 8 --download True --backbone i3d_r50

Code sample

import os
from argparse import ArgumentParser

from torch.utils.data.sampler import RandomSampler

import flash
from flash.core.finetuning import NoFreeze
from flash.core.data.utils import download_data
from flash.video import VideoClassificationData, VideoClassifier

if __name__ == '__main__':
    parser = ArgumentParser()
    parser.add_argument('--seed', type=int, default=1234)
    parser.add_argument('--backbone', type=str, default="x3d_xs")
    parser.add_argument('--download', type=bool, default=True)
    parser.add_argument('--train_folder', type=str, default=os.path.join(os.getcwd(),
                        "./data/kinetics/train"))
    parser.add_argument('--val_folder', type=str, default=os.path.join(os.getcwd(),
                        "./data/kinetics/val"))
    parser.add_argument('--predict_folder', type=str, default=os.path.join(os.getcwd(),
                        "./data/kinetics/predict"))
    parser.add_argument('--max_epochs', type=int, default=1)
    parser.add_argument('--learning_rate', type=float, default=1e-3)
    parser.add_argument('--gpus', type=int, default=None)
    parser.add_argument('--fast_dev_run', type=int, default=False)
    args = parser.parse_args()


    if args.download:
        # Download a video clip dataset.
        # Find more datasets at https://pytorchvideo.readthedocs.io/en/latest/data.html
        download_data("https://pl-flash-data.s3.amazonaws.com/kinetics.zip",
                      os.path.join(os.getcwd(), "data/"))

    datamodule = VideoClassificationData.from_folders(
        train_folder=args.train_folder,
        val_folder=args.val_folder,
        predict_folder=args.predict_folder,
        batch_size=8,
        clip_sampler="uniform",
        clip_duration=2,
        video_sampler=RandomSampler,
        decode_audio=False,
        num_workers=8,
    )

    model = VideoClassifier(backbone=args.backbone, num_classes=datamodule.num_classes, pretrained=False)

    trainer = flash.Trainer(max_epochs=args.max_epochs, gpus=args.gpus, strategy='ddp', fast_dev_run=args.fast_dev_run)
    trainer.finetune(model, datamodule=datamodule, strategy=NoFreeze())
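
A side note on the script above, unrelated to the DDP failure: `--download` is declared with `type=bool`, and argparse simply calls `bool()` on the raw string, so any non-empty value (including "False") parses as True. The sketch below shows the behaviour and one possible workaround. The `str2bool` helper is hypothetical and not part of Flash or argparse.

```python
from argparse import ArgumentParser


def str2bool(value):
    """Hypothetical helper: map common true/false spellings to a bool."""
    text = value.strip().lower()
    if text in {"1", "true", "yes", "y"}:
        return True
    if text in {"0", "false", "no", "n"}:
        return False
    # argparse turns a ValueError from a type callable into a usage error.
    raise ValueError(f"expected a boolean, got {value!r}")


# Same declaration as in the reproduction script.
naive = ArgumentParser()
naive.add_argument("--download", type=bool, default=True)

# Variant that parses the string explicitly.
fixed = ArgumentParser()
fixed.add_argument("--download", type=str2bool, default=True)

naive_args = naive.parse_args(["--download", "False"])
fixed_args = fixed.parse_args(["--download", "False"])

print(naive_args.download, fixed_args.download)
```

With `--download True`, as used in the reproduction command, both variants give the same result, so this does not affect the reported failure.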

Expected behavior

Training should be 🚀 working fine.

Environment

  • OS (e.g., Linux): Linux
  • Python version: 3.8
  • PyTorch/Lightning/Flash Version (e.g., 1.9.0a0+c3d40fd/1.5.10/0.8.0dev):
  • GPU models and configuration: 8 gpus
  • Any other relevant information: PyTorch Lightning NGC container (ships with 1.4.0 by default)

Additional context

Also tested the same code with Flash 0.7.0, with the fixes to input.py mentioned in #1182 applied.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

1 reaction
ethanwharris commented, Feb 28, 2022

Hey @dudeperf3ct sorry it still isn’t working for you. Could you open an issue for that one on the PL repo? All device handling is managed by PL so they should be able to provide better support 😃

1 reaction
ethanwharris commented, Feb 23, 2022

Hey @dudeperf3ct I have a fix in #1189 - unfortunately it’s not possible for us to support DDP spawn as PyTorchVideo models can’t be pickled, but I have fixed support for strategy="ddp" and added video classification to our GPU CI so this doesn’t fail again in future 😃
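
To illustrate the pickling constraint mentioned here: the ddp_spawn traceback ends in `ForkingPickler(file, protocol).dump(obj)`, because the spawn start method serializes the process object (and everything it references) before starting each child process. The sketch below uses only the standard library, with a `threading.Lock` as an assumed stand-in for the `torch._C.Generator` from the traceback. The dictionaries are hypothetical placeholders for model state, not PyTorchVideo objects.

```python
import pickle
import threading


def can_pickle(obj):
    """Return True if `obj` survives pickle.dumps, else False."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError):
        return False


# Plain data pickles without trouble.
plain_state = {"weights": [0.1, 0.2, 0.3]}

# One unpicklable member is enough to make the whole object fail.
# The lock stands in for the torch._C.Generator seen in the traceback.
state_with_handle = {"weights": [0.1, 0.2, 0.3], "rng": threading.Lock()}

print(can_pickle(plain_state), can_pickle(state_with_handle))
```

A check like this can be run on a model or datamodule before launching to see whether a spawn-based strategy has any chance of working. With strategy="ddp" each process is started as a fresh script invocation, so this serialization step is not needed.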
