Custom MaskDINO training crashes with a RuntimeError: Global alloc not supported yet
When I run:
cd /home/jovyan/data/kamila/detrex && python tools/train_net.py --config-file projects/maskdino/configs/maskdino_r50_coco_instance_seg_50ep.py
I get the following exception:
[12/06 07:56:38 d2.engine.train_loop]: Starting training from iteration 0
/opt/conda/lib/python3.8/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/native/TensorShape.cpp:2156.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
ERROR [12/06 07:56:48 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
File "/home/jovyan/data/kamila/detrex/detectron2/detectron2/engine/train_loop.py", line 149, in train
self.run_step()
File "tools/train_net_graffiti.py", line 95, in run_step
loss_dict = self.model(data)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/jovyan/data/kamila/detrex/projects/maskdino/maskdino.py", line 162, in forward
losses = self.criterion(outputs, targets,mask_dict)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/jovyan/data/kamila/detrex/projects/maskdino/modeling/criterion.py", line 388, in forward
indices = self.matcher(aux_outputs, targets)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/home/jovyan/data/kamila/detrex/projects/maskdino/modeling/matcher.py", line 223, in forward
return self.memory_efficient_forward(outputs, targets, cost)
File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/home/jovyan/data/kamila/detrex/projects/maskdino/modeling/matcher.py", line 165, in memory_efficient_forward
cost_dice = batch_dice_loss_jit(out_mask, tgt_mask)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: Global alloc not supported yet
[12/06 07:56:48 d2.engine.hooks]: Overall training speed: 3 iterations in 0:00:04 (1.3375 s / it)
[12/06 07:56:48 d2.engine.hooks]: Total training time: 0:00:04 (0:00:00 on hooks)
[12/06 07:56:49 d2.utils.events]: eta: 4 days, 17:05:31 iter: 5 total_loss: 109.9 loss_ce: 4.103 loss_mask: 1.045 loss_dice: 1.17 loss_bbox: 0.2272 loss_giou: 0.1185 loss_ce_dn: 0.3122 loss_mask_dn: 1.603 loss_dice_dn: 1.205 loss_bbox_dn: 0.5139 loss_giou_dn: 0.328 loss_ce_0: 3.23 loss_mask_0: 1.752 loss_dice_0: 1.341 loss_bbox_0: 0.132 loss_giou_0: 0.1527 loss_ce_dn_0: 0.3276 loss_mask_dn_0: 2.752 loss_dice_dn_0: 3.534 loss_bbox_dn_0: 1.171 loss_giou_dn_0: 0.8217 loss_ce_1: 3.531 loss_mask_1: 1.334 loss_dice_1: 0.9349 loss_bbox_1: 0.09554 loss_giou_1: 0.111 loss_ce_dn_1: 0.2215 loss_mask_dn_1: 1.734 loss_dice_dn_1: 1.629 loss_bbox_dn_1: 0.795 loss_giou_dn_1: 0.4993 loss_ce_2: 3.148 loss_mask_2: 1.184 loss_dice_2: 1.401 loss_bbox_2: 0.1707 loss_giou_2: 0.1778 loss_ce_dn_2: 0.3696 loss_mask_dn_2: 1.644 loss_dice_dn_2: 1.608 loss_bbox_dn_2: 0.6091 loss_giou_dn_2: 0.4159 loss_ce_3: 3.638 loss_mask_3: 1.113 loss_dice_3: 1.233 loss_bbox_3: 0.2145 loss_giou_3: 0.1722 loss_ce_dn_3: 0.3632 loss_mask_dn_3: 1.652 loss_dice_dn_3: 1.442 loss_bbox_dn_3: 0.5413 loss_giou_dn_3: 0.3912 loss_ce_4: 3.436 loss_mask_4: 1.092 loss_dice_4: 1.122 loss_bbox_4: 0.2232 loss_giou_4: 0.1488 loss_ce_dn_4: 0.297 loss_mask_dn_4: 1.637 loss_dice_dn_4: 1.272 loss_bbox_dn_4: 0.5192 loss_giou_dn_4: 0.3572 loss_ce_5: 3.891 loss_mask_5: 1.315 loss_dice_5: 1.075 loss_bbox_5: 0.2148 loss_giou_5: 0.1458 loss_ce_dn_5: 0.2682 loss_mask_dn_5: 1.616 loss_dice_dn_5: 1.213 loss_bbox_dn_5: 0.5183 loss_giou_dn_5: 0.3461 loss_ce_6: 3.985 loss_mask_6: 1.168 loss_dice_6: 1.15 loss_bbox_6: 0.245 loss_giou_6: 0.1321 loss_ce_dn_6: 0.2713 loss_mask_dn_6: 1.57 loss_dice_dn_6: 1.197 loss_bbox_dn_6: 0.514 loss_giou_dn_6: 0.3357 loss_ce_7: 4.099 loss_mask_7: 1.093 loss_dice_7: 1.2 loss_bbox_7: 0.2284 loss_giou_7: 0.1184 loss_ce_dn_7: 0.3286 loss_mask_dn_7: 1.586 loss_dice_dn_7: 1.212 loss_bbox_dn_7: 0.5173 loss_giou_dn_7: 0.3303 loss_ce_8: 4.038 loss_mask_8: 1.049 loss_dice_8: 1.206 loss_bbox_8: 0.2284 loss_giou_8: 0.1187 loss_ce_dn_8: 0.3104 loss_mask_dn_8: 1.605 loss_dice_dn_8: 1.197 loss_bbox_dn_8: 0.51 loss_giou_dn_8: 0.3282 loss_ce_interm: 3.285 loss_mask_interm: 1.477 loss_dice_interm: 1.158 loss_bbox_interm: 0.6809 loss_giou_interm: 0.4867 time: 1.1044 data_time: 0.1023 lr: 0.0001 max_mem: 19044M
Traceback (most recent call last):
File "tools/train_net_graffiti.py", line 232, in <module>
launch(
File "/home/jovyan/data/kamila/detrex/detectron2/detectron2/engine/launch.py", line 82, in launch
main_func(*args)
File "tools/train_net_graffiti.py", line 227, in main
do_train(args, cfg)
File "tools/train_net_graffiti.py", line 211, in do_train
trainer.train(start_iter, cfg.train.max_iter)
File "/home/jovyan/data/kamila/detrex/detectron2/detectron2/engine/train_loop.py", line 149, in train
self.run_step()
File "tools/train_net_graffiti.py", line 95, in run_step
loss_dict = self.model(data)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/jovyan/data/kamila/detrex/projects/maskdino/maskdino.py", line 162, in forward
losses = self.criterion(outputs, targets,mask_dict)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/jovyan/data/kamila/detrex/projects/maskdino/modeling/criterion.py", line 388, in forward
indices = self.matcher(aux_outputs, targets)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/home/jovyan/data/kamila/detrex/projects/maskdino/modeling/matcher.py", line 223, in forward
return self.memory_efficient_forward(outputs, targets, cost)
File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/home/jovyan/data/kamila/detrex/projects/maskdino/modeling/matcher.py", line 165, in memory_efficient_forward
cost_dice = batch_dice_loss_jit(out_mask, tgt_mask)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: Global alloc not supported yet
So far I have figured out a possible reason why it appears.
The workaround discussed in the issue, which uses batch_dice_loss instead of batch_dice_loss_jit, seems to fix it; however, the training time increases.
I would really appreciate you looking at it.
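For reference, here is a minimal sketch of what the workaround amounts to. It assumes batch_dice_loss follows the common Mask2Former-style pairwise dice formulation and that batch_dice_loss_jit is simply its torch.jit.script-compiled version; the actual definitions in projects/maskdino/modeling/matcher.py may differ, and the tensor shapes below are made up for illustration.

```python
import torch


def batch_dice_loss(inputs: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Pairwise dice cost (assumed Mask2Former-style formulation).

    inputs:  (N, P) raw mask logits for N predictions
    targets: (M, P) binary ground-truth masks stored as floats
    returns: (N, M) cost matrix, lower means better overlap
    """
    probs = inputs.sigmoid().flatten(1)
    targets = targets.flatten(1)
    # Overlap between every prediction and every target.
    numerator = 2 * torch.einsum("nc,mc->nm", probs, targets)
    denominator = probs.sum(-1)[:, None] + targets.sum(-1)[None, :]
    # The +1 terms smooth the ratio so empty masks do not divide by zero.
    return 1 - (numerator + 1) / (denominator + 1)


# Assumption: the JIT variant is the same function compiled with TorchScript.
batch_dice_loss_jit = torch.jit.script(batch_dice_loss)

# Illustrative inputs only: 3 predicted masks, 2 targets, 16 sampled points.
out_mask = torch.randn(3, 16)
tgt_mask = (torch.rand(2, 16) > 0.5).float()

# Workaround: call the eager function where the matcher called the JIT one.
cost_dice = batch_dice_loss(out_mask, tgt_mask)
```

Both variants compute the same values, so the matching result should not change; the eager call only skips TorchScript compilation and fusion, which is why a difference in speed is plausible.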
Have you solved your problem? I have merged your PR.
It does seem to solve it; however, I am not sure about the training speed yet. I created a PR for the fix.