Training during active learning is failing
See original GitHub issue.
I am facing the error below while training the model as part of active learning. Please help me resolve this.
[2022-02-16 14:34:49.352][ INFO](ignite.engine.engine.SupervisedTrainer) - Epoch: 1/50, Iter: 1/2 -- train_loss: 0.0717
[2022-02-16 14:34:51.410][ERROR](ignite.engine.engine.SupervisedTrainer) - Current run is terminating due to exception: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 35 36 37 38 39 40 41 42
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
[2022-02-16 14:34:51.410][ERROR](ignite.engine.engine.SupervisedTrainer) - Exception: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 35 36 37 38 39 40 41 42
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
Traceback (most recent call last):
File "/home/ec2-user/.local/lib/python3.9/site-packages/ignite/engine/engine.py", line 834, in _run_once_on_dataset
self.state.output = self._process_function(self, self.state.batch)
File "/home/ec2-user/.local/lib/python3.9/site-packages/monailabel/deepedit/interaction.py", line 111, in __call__
return engine._iteration(engine, batchdata)
File "/home/ec2-user/.local/lib/python3.9/site-packages/monai/engines/trainer.py", line 200, in _iteration
_compute_pred_loss()
File "/home/ec2-user/.local/lib/python3.9/site-packages/monai/engines/trainer.py", line 186, in _compute_pred_loss
engine.state.output[Keys.PRED] = self.inferer(inputs, self.network, *args, **kwargs)
File "/home/ec2-user/.local/lib/python3.9/site-packages/monai/inferers/inferer.py", line 83, in __call__
return network(inputs, *args, **kwargs)
File "/home/ec2-user/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ec2-user/.local/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 873, in forward
if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 35 36 37 38 39 40 41 42
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
[2022-02-16 14:34:51.411][ERROR](ignite.engine.engine.SupervisedTrainer) - Engine run is terminating due to exception: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 35 36 37 38 39 40 41 42
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
[2022-02-16 14:34:51.412][ERROR](ignite.engine.engine.SupervisedTrainer) - Exception: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 35 36 37 38 39 40 41 42
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
Traceback (most recent call last):
File "/home/ec2-user/.local/lib/python3.9/site-packages/ignite/engine/engine.py", line 744, in _internal_run
time_taken = self._run_once_on_dataset()
File "/home/ec2-user/.local/lib/python3.9/site-packages/ignite/engine/engine.py", line 848, in _run_once_on_dataset
self._handle_exception(e)
File "/home/ec2-user/.local/lib/python3.9/site-packages/ignite/engine/engine.py", line 467, in _handle_exception
self._fire_event(Events.EXCEPTION_RAISED, e)
File "/home/ec2-user/.local/lib/python3.9/site-packages/ignite/engine/engine.py", line 424, in _fire_event
func(*first, *(event_args + others), **kwargs)
File "/home/ec2-user/.local/lib/python3.9/site-packages/monai/handlers/stats_handler.py", line 158, in exception_raised
raise e
File "/home/ec2-user/.local/lib/python3.9/site-packages/ignite/engine/engine.py", line 834, in _run_once_on_dataset
self.state.output = self._process_function(self, self.state.batch)
File "/home/ec2-user/.local/lib/python3.9/site-packages/monailabel/deepedit/interaction.py", line 111, in __call__
return engine._iteration(engine, batchdata)
File "/home/ec2-user/.local/lib/python3.9/site-packages/monai/engines/trainer.py", line 200, in _iteration
_compute_pred_loss()
File "/home/ec2-user/.local/lib/python3.9/site-packages/monai/engines/trainer.py", line 186, in _compute_pred_loss
engine.state.output[Keys.PRED] = self.inferer(inputs, self.network, *args, **kwargs)
File "/home/ec2-user/.local/lib/python3.9/site-packages/monai/inferers/inferer.py", line 83, in __call__
return network(inputs, *args, **kwargs)
File "/home/ec2-user/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ec2-user/.local/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 873, in forward
if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 35 36 37 38 39 40 41 42
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
Traceback (most recent call last):
File "/opt/conda/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/ec2-user/.local/lib/python3.9/site-packages/monailabel/interfaces/utils/app.py", line 132, in <module>
run_main()
File "/home/ec2-user/.local/lib/python3.9/site-packages/monailabel/interfaces/utils/app.py", line 117, in run_main
result = a.train(request)
File "/home/ec2-user/.local/lib/python3.9/site-packages/monailabel/interfaces/app.py", line 347, in train
result = task(request, self.datastore())
File "/home/ec2-user/.local/lib/python3.9/site-packages/monailabel/tasks/train/basic_train.py", line 362, in __call__
torch.multiprocessing.spawn(main_worker, nprocs=world_size, args=(world_size, req, datalist, self))
File "/home/ec2-user/.local/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/ec2-user/.local/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/home/ec2-user/.local/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/ec2-user/.local/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/home/ec2-user/.local/lib/python3.9/site-packages/monailabel/tasks/train/basic_train.py", line 574, in main_worker
task.train(rank, world_size, request, datastore)
File "/home/ec2-user/.local/lib/python3.9/site-packages/monailabel/tasks/train/basic_train.py", line 413, in train
context.trainer.run()
File "/home/ec2-user/.local/lib/python3.9/site-packages/monai/engines/trainer.py", line 56, in run
super().run()
File "/home/ec2-user/.local/lib/python3.9/site-packages/monai/engines/workflow.py", line 258, in run
super().run(data=self.data_loader, max_epochs=self.state.max_epochs)
File "/home/ec2-user/.local/lib/python3.9/site-packages/ignite/engine/engine.py", line 701, in run
return self._internal_run()
File "/home/ec2-user/.local/lib/python3.9/site-packages/ignite/engine/engine.py", line 774, in _internal_run
self._handle_exception(e)
File "/home/ec2-user/.local/lib/python3.9/site-packages/ignite/engine/engine.py", line 467, in _handle_exception
self._fire_event(Events.EXCEPTION_RAISED, e)
File "/home/ec2-user/.local/lib/python3.9/site-packages/ignite/engine/engine.py", line 424, in _fire_event
func(*first, *(event_args + others), **kwargs)
File "/home/ec2-user/.local/lib/python3.9/site-packages/monai/handlers/stats_handler.py", line 158, in exception_raised
raise e
File "/home/ec2-user/.local/lib/python3.9/site-packages/ignite/engine/engine.py", line 744, in _internal_run
time_taken = self._run_once_on_dataset()
File "/home/ec2-user/.local/lib/python3.9/site-packages/ignite/engine/engine.py", line 848, in _run_once_on_dataset
self._handle_exception(e)
File "/home/ec2-user/.local/lib/python3.9/site-packages/ignite/engine/engine.py", line 467, in _handle_exception
self._fire_event(Events.EXCEPTION_RAISED, e)
File "/home/ec2-user/.local/lib/python3.9/site-packages/ignite/engine/engine.py", line 424, in _fire_event
func(*first, *(event_args + others), **kwargs)
File "/home/ec2-user/.local/lib/python3.9/site-packages/monai/handlers/stats_handler.py", line 158, in exception_raised
raise e
File "/home/ec2-user/.local/lib/python3.9/site-packages/ignite/engine/engine.py", line 834, in _run_once_on_dataset
self.state.output = self._process_function(self, self.state.batch)
File "/home/ec2-user/.local/lib/python3.9/site-packages/monailabel/deepedit/interaction.py", line 111, in __call__
return engine._iteration(engine, batchdata)
File "/home/ec2-user/.local/lib/python3.9/site-packages/monai/engines/trainer.py", line 200, in _iteration
_compute_pred_loss()
File "/home/ec2-user/.local/lib/python3.9/site-packages/monai/engines/trainer.py", line 186, in _compute_pred_loss
engine.state.output[Keys.PRED] = self.inferer(inputs, self.network, *args, **kwargs)
File "/home/ec2-user/.local/lib/python3.9/site-packages/monai/inferers/inferer.py", line 83, in __call__
return network(inputs, *args, **kwargs)
File "/home/ec2-user/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ec2-user/.local/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 873, in forward
if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 35 36 37 38 39 40 41 42
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
/opt/conda/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 3 leaked semaphore objects to clean up at shutdown
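The RuntimeError above is DDP's unused-parameter check: parameter indices 35-42 on rank 0 produced no gradient in the first training iteration. The error text itself suggests two things to try: pass `find_unused_parameters=True` when wrapping the network in `torch.nn.parallel.DistributedDataParallel`, and set `TORCH_DISTRIBUTED_DEBUG` to see which parameters were skipped. Below is a minimal, generic sketch of that suggestion, assuming a single-process gloo group and a toy linear model; it is not the actual MONAI Label training code.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# TORCH_DISTRIBUTED_DEBUG (also suggested in the error) should be set in the
# environment before training starts; DETAIL reports which parameters missed grads.
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

model = torch.nn.Linear(8, 2)  # toy stand-in for the DeepEdit network
# find_unused_parameters=True lets DDP tolerate parameters that receive no
# gradient in a given iteration, at the cost of an extra graph traversal.
ddp_model = DDP(model, find_unused_parameters=True)

dist.destroy_process_group()
```

If enabling unused-parameter detection makes the error go away, the underlying cause is usually that part of the network does not contribute to the loss for some inputs, which is worth confirming before leaving the flag on permanently.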
Top GitHub Comments
@SachidanandAlle @diazandr3s I had monai 0.8.0, so I upgraded to today's version, 0.8.1, and also installed monailabel weekly. Training is now working on multiple GPUs, but I don't know which of the two changes resolved my issue. Anyhow, thanks. I still have a CUDA memory issue; I need to figure out which GPU-based instance will work for me if I want to label and train the model in the backend in parallel.
Glad to hear this!
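For reference, the resolution reported above was a package upgrade rather than a code change. A quick, generic way to confirm the installed versions (not part of the original thread) is:

```python
# Print the installed MONAI version and environment details; the comment above
# reports the error went away after upgrading monai to 0.8.1 and installing a
# recent monailabel weekly build.
import monai
from monai.config import print_config

print(monai.__version__)  # expect >= 0.8.1 per the comment above
print_config()            # MONAI, PyTorch and optional-dependency versions
```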