ControlFlowCallback error in DDP

🐛 Bug Report

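ControlFlowCallback can't be pickled because of the lambdas defined inside _filter_fn_from_loaders.

It works fine when callbacks are created in get_callbacks, but fails when they are passed directly to the SupervisedRunner.train method. As the traceback shows, the error is raised when the engine spawns the DDP worker processes, which requires pickling the objects handed to them.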

  File "/usr/local/lib/python3.6/dist-packages/catalyst/runners/runner.py", line 515, in train
    self.run()
  File "/usr/local/lib/python3.6/dist-packages/catalyst/core/runner.py", line 854, in run
    self._run_event("on_exception")
  File "/usr/local/lib/python3.6/dist-packages/catalyst/core/runner.py", line 788, in _run_event
    getattr(self, event)(self)
  File "/usr/local/lib/python3.6/dist-packages/catalyst/core/runner.py", line 780, in on_exception
    raise self.exception
  File "/usr/local/lib/python3.6/dist-packages/catalyst/core/runner.py", line 850, in run
    self._run_experiment()
  File "/usr/local/lib/python3.6/dist-packages/catalyst/core/runner.py", line 840, in _run_experiment
    self.engine.spawn(self._run_stage)
  File "/usr/local/lib/python3.6/dist-packages/catalyst/engines/torch.py", line 460, in spawn
    fn, args=(self._world_size,), nprocs=self._world_size, join=True
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 179, in start_processes
    process.start()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/usr/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object '_filter_fn_from_loaders.<locals>.<lambda>'
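
The same class of failure can be reproduced without Catalyst or DDP. The sketch below is only an illustration: make_filter and try_pickle are hypothetical names, and make_filter merely stands in for a factory that returns a lambda, as _filter_fn_from_loaders is reported to do. The standard pickle module cannot serialize a function defined inside another function, and that is the serializer multiprocessing uses with the 'spawn' start method.

```python
import pickle


def make_filter(loaders):
    # Hypothetical stand-in for a lambda-returning factory.
    # The lambda is created inside a function, so pickle sees it as a
    # "local object" that cannot be looked up by name on import.
    return lambda stage, epoch, loader: loader in loaders


def try_pickle(obj):
    """Return None if obj can be pickled, otherwise the error message."""
    try:
        pickle.dumps(obj)
        return None
    except (pickle.PicklingError, AttributeError) as exc:
        return str(exc)


filter_fn = make_filter(["valid"])
print(filter_fn("stage", 1, "valid"))  # the lambda itself works in-process

error = try_pickle(filter_fn)
print(error)  # a "can't pickle" message; exact wording depends on the Python version
```

Calling the lambda in the parent process is fine; it only breaks once something tries to serialize it for a child process.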

Environment

Collecting environment information...
Catalyst version: 21.09
PyTorch version: 1.9.1+cu102
Is debug build: No
CUDA used to build PyTorch: 10.2
TensorFlow version: N/A
TensorBoard version: 2.6.0

OS: linux
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
CMake version: Could not collect

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration: 
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB

Nvidia driver version: 455.45.01
cuDNN version: Could not collect

Versions of relevant libraries:
[pip3] catalyst==21.9
[pip3] numpy==1.19.5
[pip3] tensorboard==2.6.0
[pip3] tensorboard-data-server==0.6.1
[pip3] tensorboard-plugin-wit==1.8.0
[pip3] tensorboardX==2.2
[pip3] torch==1.9.1
[pip3] torchvision==0.10.1
[conda] Could not collect

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

1 reaction
Scitator commented, Oct 14, 2021

@asteyo could you please help with an issue? I think some refactoring of filtering-fns from

def _filter_fn_from_XXX({params}):
    {filter-logic}

to

class _filter_fn_from_XXX:
    def __init__(self, {params}):
        pass

    def __call__(self, stage, epoch, loader):
        {filter-logic}

should solve the issue 🚀
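
A minimal sketch of that refactoring, under stated assumptions: FilterFnFromLoaders, its loaders and reverse_condition parameters, and the membership-based filter logic are illustrative and are not taken from the Catalyst source. The point is only that an instance of a module-level class with a __call__ method keeps the same call signature as the lambda while remaining picklable.

```python
import pickle


class FilterFnFromLoaders:
    """Callable filter that can be pickled (hypothetical name and logic)."""

    def __init__(self, loaders, reverse_condition=False):
        # State that the lambda would have captured in a closure is
        # stored on the instance instead, where pickle can reach it.
        self.loaders = set(loaders)
        self.reverse_condition = reverse_condition

    def __call__(self, stage, epoch, loader):
        # Assumed filter logic: match when the loader is in the list,
        # optionally inverted.
        matched = loader in self.loaders
        return (not matched) if self.reverse_condition else matched


filter_fn = FilterFnFromLoaders(["valid"])

# Round-trip through pickle, as multiprocessing does when spawning workers.
restored = pickle.loads(pickle.dumps(filter_fn))
print(restored("stage", 1, "valid"))  # True
print(restored("stage", 1, "train"))  # False
```

The class has to live at module level (not inside a function) so that the child process can import it by name when unpickling.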

1 reaction
ivan-chai commented, Oct 6, 2021

Yes, I made it a callable and it works fine.
