Multi-GPU training using Flash fails for video classification
🐛 Bug
Using strategy='ddp', training fails with:
Traceback (most recent call last):
File "/workspace/test_train.py", line 50, in <module>
trainer.finetune(model, datamodule=datamodule, strategy=NoFreeze())
File "/opt/conda/lib/python3.8/site-packages/flash/core/trainer.py", line 163, in finetune
return super().fit(model, train_dataloader, val_dataloaders, datamodule)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 740, in fit
self._call_and_handle_interrupt(
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
self._dispatch()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch
self.training_type_plugin.start_training(self)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
self._results = trainer.run_stage()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage
return self._run_train()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1319, in _run_train
self.fit_loop.run()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 140, in run
self.on_run_start(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 200, in on_run_start
self.trainer.call_hook("on_train_start")
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1501, in call_hook
output = model_fx(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/flash/video/classification/model.py", line 123, in on_train_start
encoded_dataset = self.trainer.train_dataloader.loaders.dataset.dataset
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataset.py", line 163, in __getattr__
raise AttributeError
AttributeError
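The bare AttributeError here comes from torch.utils.data.Dataset.__getattr__ (visible in the last frame of the traceback), which raises without a message for unknown attribute names. Flash's on_train_start chains self.trainer.train_dataloader.loaders.dataset.dataset, and under DDP the wrapped dataloader evidently no longer exposes that inner .dataset. A minimal sketch of the failing lookup, using a hypothetical Plain dataset as a stand-in:

from torch.utils.data import Dataset

class Plain(Dataset):
    # Hypothetical stand-in for whatever dataset object DDP hands to Flash.
    def __len__(self):
        return 0

    def __getitem__(self, idx):
        raise IndexError

ds = Plain()
ds.dataset  # AttributeError: the lookup fails just like the traceback above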
Removing the strategy flag from the Trainer, i.e. using the default strategy="ddp_spawn", fails with:
Traceback (most recent call last):
File "test_train.py", line 50, in <module>
trainer.finetune(model, datamodule=datamodule, strategy=NoFreeze())
File "/opt/conda/lib/python3.8/site-packages/flash/core/trainer.py", line 163, in finetune
return super().fit(model, train_dataloader, val_dataloaders, datamodule)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 740, in fit
self._call_and_handle_interrupt(
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
self._dispatch()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch
self.training_type_plugin.start_training(self)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 173, in start_training
self.spawn(self.new_process, trainer, self.mp_queue, return_result=False)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 201, in spawn
mp.spawn(self._wrapped_function, args=(function, args, kwargs, return_queue), nprocs=self.num_processes)
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 179, in start_processes
process.start()
File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/opt/conda/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
return Popen(process_obj)
File "/opt/conda/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/opt/conda/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/opt/conda/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/opt/conda/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
TypeError: cannot pickle 'torch._C.Generator' object
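This one is the generic limitation of the spawn start method: multiprocessing pickles every object it sends to the worker processes, and torch.Generator objects have no pickle support, so any model or data pipeline holding one (per the maintainer comment below, PyTorchVideo models can't be pickled) aborts at launch. A minimal sketch reproducing just the failing pickling step:

import pickle

import torch

# Spawned workers receive their arguments through a pickler (the
# ForkingPickler frame in the traceback above); a torch.Generator
# cannot survive that round trip.
try:
    pickle.dumps(torch.Generator())
except TypeError as err:
    print(err)  # cannot pickle 'torch._C.Generator' object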
Training works fine on a single GPU.
To Reproduce
python3 test_train.py --gpus 8 --download True --backbone i3d_r50
Code sample
import os
from argparse import ArgumentParser

from torch.utils.data.sampler import RandomSampler

import flash
from flash.core.finetuning import NoFreeze
from flash.core.data.utils import download_data
from flash.video import VideoClassificationData, VideoClassifier

if __name__ == '__main__':
    parser = ArgumentParser()
    parser.add_argument('--seed', type=int, default=1234)
    parser.add_argument('--backbone', type=str, default="x3d_xs")
    parser.add_argument('--download', type=bool, default=True)
    parser.add_argument('--train_folder', type=str,
                        default=os.path.join(os.getcwd(), "./data/kinetics/train"))
    parser.add_argument('--val_folder', type=str,
                        default=os.path.join(os.getcwd(), "./data/kinetics/val"))
    parser.add_argument('--predict_folder', type=str,
                        default=os.path.join(os.getcwd(), "./data/kinetics/predict"))
    parser.add_argument('--max_epochs', type=int, default=1)
    parser.add_argument('--learning_rate', type=float, default=1e-3)
    parser.add_argument('--gpus', type=int, default=None)
    parser.add_argument('--fast_dev_run', type=int, default=False)
    args = parser.parse_args()

    if args.download:
        # Download a video clip dataset.
        # Find more datasets at https://pytorchvideo.readthedocs.io/en/latest/data.html
        download_data("https://pl-flash-data.s3.amazonaws.com/kinetics.zip",
                      os.path.join(os.getcwd(), "data/"))

    datamodule = VideoClassificationData.from_folders(
        train_folder=args.train_folder,
        val_folder=args.val_folder,
        predict_folder=args.predict_folder,
        batch_size=8,
        clip_sampler="uniform",
        clip_duration=2,
        video_sampler=RandomSampler,
        decode_audio=False,
        num_workers=8,
    )

    model = VideoClassifier(backbone=args.backbone, num_classes=datamodule.num_classes,
                            pretrained=False)

    trainer = flash.Trainer(max_epochs=args.max_epochs, gpus=args.gpus, strategy='ddp',
                            fast_dev_run=args.fast_dev_run)
    trainer.finetune(model, datamodule=datamodule, strategy=NoFreeze())
Expected behavior
Training should work 🚀 across multiple GPUs.
Environment
- OS: Linux
- Python version: 3.8
- PyTorch/Lightning/Flash version: 1.9.0a0+c3d40fd / 1.5.10 / 0.8.0dev
- GPU models and configuration: 8 GPUs
- Any other relevant information: PyTorch Lightning NGC container (ships with 1.4.0 by default)
Additional context
Also tested the same code with Flash 0.7.0, including the fixes to input.py mentioned in #1182.

Hey @dudeperf3ct, sorry it still isn’t working for you. Could you open an issue for that one on the PL repo? All device handling is managed by PL, so they should be able to provide better support 😃
Hey @dudeperf3ct, I have a fix in #1189 - unfortunately it’s not possible for us to support DDP spawn as PyTorchVideo models can’t be pickled, but I have fixed support for strategy="ddp" and added video classification to our GPU CI so this doesn’t fail again in future 😃
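Based on that fix, a minimal working configuration should look like the reproduction script above with strategy="ddp" kept in place; this is a sketch assuming the model and datamodule from the code sample and the #1189 fix being installed:

import flash
from flash.core.finetuning import NoFreeze

# "ddp" re-launches the training script once per GPU instead of pickling
# the model into spawned workers, sidestepping the torch.Generator error.
trainer = flash.Trainer(max_epochs=1, gpus=8, strategy="ddp")
trainer.finetune(model, datamodule=datamodule, strategy=NoFreeze())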