Training during active learning is failing

See original GitHub issue

I am facing the error below while training the model on multiple GPUs as part of active learning.

Please help me resolve this.

[2022-02-16 14:34:49.352][ INFO](ignite.engine.engine.SupervisedTrainer) - Epoch: 1/50, Iter: 1/2 -- train_loss: 0.0717
[2022-02-16 14:34:51.410][ERROR](ignite.engine.engine.SupervisedTrainer) - Current run is terminating due to exception: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 35 36 37 38 39 40 41 42
 In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
[2022-02-16 14:34:51.410][ERROR](ignite.engine.engine.SupervisedTrainer) - Exception: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 35 36 37 38 39 40 41 42
 In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
Traceback (most recent call last):
  File "/home/ec2-user/.local/lib/python3.9/site-packages/ignite/engine/engine.py", line 834, in _run_once_on_dataset
    self.state.output = self._process_function(self, self.state.batch)
  File "/home/ec2-user/.local/lib/python3.9/site-packages/monailabel/deepedit/interaction.py", line 111, in __call__
    return engine._iteration(engine, batchdata)
  File "/home/ec2-user/.local/lib/python3.9/site-packages/monai/engines/trainer.py", line 200, in _iteration
    _compute_pred_loss()
  File "/home/ec2-user/.local/lib/python3.9/site-packages/monai/engines/trainer.py", line 186, in _compute_pred_loss
    engine.state.output[Keys.PRED] = self.inferer(inputs, self.network, *args, **kwargs)
  File "/home/ec2-user/.local/lib/python3.9/site-packages/monai/inferers/inferer.py", line 83, in __call__
    return network(inputs, *args, **kwargs)
  File "/home/ec2-user/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/.local/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 873, in forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 35 36 37 38 39 40 41 42
 In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
[2022-02-16 14:34:51.411][ERROR](ignite.engine.engine.SupervisedTrainer) - Engine run is terminating due to exception: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 35 36 37 38 39 40 41 42
 In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
[2022-02-16 14:34:51.412][ERROR](ignite.engine.engine.SupervisedTrainer) - Exception: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 35 36 37 38 39 40 41 42
 In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
Traceback (most recent call last):
  File "/home/ec2-user/.local/lib/python3.9/site-packages/ignite/engine/engine.py", line 744, in _internal_run
    time_taken = self._run_once_on_dataset()
  File "/home/ec2-user/.local/lib/python3.9/site-packages/ignite/engine/engine.py", line 848, in _run_once_on_dataset
    self._handle_exception(e)
  File "/home/ec2-user/.local/lib/python3.9/site-packages/ignite/engine/engine.py", line 467, in _handle_exception
    self._fire_event(Events.EXCEPTION_RAISED, e)
  File "/home/ec2-user/.local/lib/python3.9/site-packages/ignite/engine/engine.py", line 424, in _fire_event
    func(*first, *(event_args + others), **kwargs)
  File "/home/ec2-user/.local/lib/python3.9/site-packages/monai/handlers/stats_handler.py", line 158, in exception_raised
    raise e
  File "/home/ec2-user/.local/lib/python3.9/site-packages/ignite/engine/engine.py", line 834, in _run_once_on_dataset
    self.state.output = self._process_function(self, self.state.batch)
  File "/home/ec2-user/.local/lib/python3.9/site-packages/monailabel/deepedit/interaction.py", line 111, in __call__
    return engine._iteration(engine, batchdata)
  File "/home/ec2-user/.local/lib/python3.9/site-packages/monai/engines/trainer.py", line 200, in _iteration
    _compute_pred_loss()
  File "/home/ec2-user/.local/lib/python3.9/site-packages/monai/engines/trainer.py", line 186, in _compute_pred_loss
    engine.state.output[Keys.PRED] = self.inferer(inputs, self.network, *args, **kwargs)
  File "/home/ec2-user/.local/lib/python3.9/site-packages/monai/inferers/inferer.py", line 83, in __call__
    return network(inputs, *args, **kwargs)
  File "/home/ec2-user/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/.local/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 873, in forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 35 36 37 38 39 40 41 42
 In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/ec2-user/.local/lib/python3.9/site-packages/monailabel/interfaces/utils/app.py", line 132, in <module>
    run_main()
  File "/home/ec2-user/.local/lib/python3.9/site-packages/monailabel/interfaces/utils/app.py", line 117, in run_main
    result = a.train(request)
  File "/home/ec2-user/.local/lib/python3.9/site-packages/monailabel/interfaces/app.py", line 347, in train
    result = task(request, self.datastore())
  File "/home/ec2-user/.local/lib/python3.9/site-packages/monailabel/tasks/train/basic_train.py", line 362, in __call__
    torch.multiprocessing.spawn(main_worker, nprocs=world_size, args=(world_size, req, datalist, self))
  File "/home/ec2-user/.local/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/ec2-user/.local/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/ec2-user/.local/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/ec2-user/.local/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/ec2-user/.local/lib/python3.9/site-packages/monailabel/tasks/train/basic_train.py", line 574, in main_worker
    task.train(rank, world_size, request, datastore)
  File "/home/ec2-user/.local/lib/python3.9/site-packages/monailabel/tasks/train/basic_train.py", line 413, in train
    context.trainer.run()
  File "/home/ec2-user/.local/lib/python3.9/site-packages/monai/engines/trainer.py", line 56, in run
    super().run()
  File "/home/ec2-user/.local/lib/python3.9/site-packages/monai/engines/workflow.py", line 258, in run
    super().run(data=self.data_loader, max_epochs=self.state.max_epochs)
  File "/home/ec2-user/.local/lib/python3.9/site-packages/ignite/engine/engine.py", line 701, in run
    return self._internal_run()
  File "/home/ec2-user/.local/lib/python3.9/site-packages/ignite/engine/engine.py", line 774, in _internal_run
    self._handle_exception(e)
  File "/home/ec2-user/.local/lib/python3.9/site-packages/ignite/engine/engine.py", line 467, in _handle_exception
    self._fire_event(Events.EXCEPTION_RAISED, e)
  File "/home/ec2-user/.local/lib/python3.9/site-packages/ignite/engine/engine.py", line 424, in _fire_event
    func(*first, *(event_args + others), **kwargs)
  File "/home/ec2-user/.local/lib/python3.9/site-packages/monai/handlers/stats_handler.py", line 158, in exception_raised
    raise e
  File "/home/ec2-user/.local/lib/python3.9/site-packages/ignite/engine/engine.py", line 744, in _internal_run
    time_taken = self._run_once_on_dataset()
  File "/home/ec2-user/.local/lib/python3.9/site-packages/ignite/engine/engine.py", line 848, in _run_once_on_dataset
    self._handle_exception(e)
  File "/home/ec2-user/.local/lib/python3.9/site-packages/ignite/engine/engine.py", line 467, in _handle_exception
    self._fire_event(Events.EXCEPTION_RAISED, e)
  File "/home/ec2-user/.local/lib/python3.9/site-packages/ignite/engine/engine.py", line 424, in _fire_event
    func(*first, *(event_args + others), **kwargs)
  File "/home/ec2-user/.local/lib/python3.9/site-packages/monai/handlers/stats_handler.py", line 158, in exception_raised
    raise e
  File "/home/ec2-user/.local/lib/python3.9/site-packages/ignite/engine/engine.py", line 834, in _run_once_on_dataset
    self.state.output = self._process_function(self, self.state.batch)
  File "/home/ec2-user/.local/lib/python3.9/site-packages/monailabel/deepedit/interaction.py", line 111, in __call__
    return engine._iteration(engine, batchdata)
  File "/home/ec2-user/.local/lib/python3.9/site-packages/monai/engines/trainer.py", line 200, in _iteration
    _compute_pred_loss()
  File "/home/ec2-user/.local/lib/python3.9/site-packages/monai/engines/trainer.py", line 186, in _compute_pred_loss
    engine.state.output[Keys.PRED] = self.inferer(inputs, self.network, *args, **kwargs)
  File "/home/ec2-user/.local/lib/python3.9/site-packages/monai/inferers/inferer.py", line 83, in __call__
    return network(inputs, *args, **kwargs)
  File "/home/ec2-user/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/.local/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 873, in forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 35 36 37 38 39 40 41 42
 In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
/opt/conda/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 3 leaked semaphore objects to clean up at shutdown
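
The RuntimeError itself suggests two diagnostics: wrap the model with find_unused_parameters=True and set TORCH_DISTRIBUTED_DEBUG. Below is a minimal sketch of how those would typically be applied when wrapping a network with DistributedDataParallel by hand; rank, world_size and build_network() are placeholders, and MONAI Label performs the DDP wrapping inside its train task, so treat this as an illustration of the flags rather than the exact change to make in basic_train.py.

import os
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Report exactly which parameters missed gradients (set before the workers start).
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"

def wrap_network(rank: int, world_size: int, build_network):
    # build_network is a placeholder for whatever constructs the DeepEdit network.
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    net = build_network().to(rank)
    # find_unused_parameters=True lets the DDP reducer tolerate parameters that do
    # not contribute to the loss in a given forward pass, at some extra overhead.
    return DDP(net, device_ids=[rank], find_unused_parameters=True)

If the unused parameters turn out to correspond to a head or branch that is intentionally skipped, the cleaner long-term fix is to make every forward output participate in the loss, since find_unused_parameters=True adds per-iteration overhead.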

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 9

Top GitHub Comments

2 reactions
j-sieger commented, Feb 17, 2022

@SachidanandAlle @diazandr3s I had MONAI 0.8.0, so I upgraded to today’s version 0.8.1 and also installed the monailabel weekly build. Training now works on multiple GPUs, but I don’t know which of the two resolved my issue. Anyhow, thanks.

I still have a CUDA memory issue. I need to figure out which GPU-based instance will work for me if I want to label and train the model in parallel in the backend.
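
For anyone hitting the same combination, it may help to first confirm which versions actually ended up installed after an upgrade like the one above; a small sketch using only the standard library (the package names are assumed to be the published torch, monai and monailabel distributions):

from importlib.metadata import version

# Print the installed versions of the packages involved in this stack.
for pkg in ("torch", "monai", "monailabel"):
    print(pkg, version(pkg))

monai.config.print_config() prints similar information plus optional-dependency details, if MONAI is already importable.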

1 reaction
diazandr3s commented, Feb 17, 2022

Glad to hear this!

Read more comments on GitHub >

