Custom MaskDINO training crashes with a RuntimeError: Global alloc not supported yet

When I run:

cd /home/jovyan/data/kamila/detrex && python tools/train_net.py --config-file projects/maskdino/configs/maskdino_r50_coco_instance_seg_50ep.py

I get the following exception:

[12/06 07:56:38 d2.engine.train_loop]: Starting training from iteration 0
/opt/conda/lib/python3.8/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  /opt/pytorch/pytorch/aten/src/ATen/native/TensorShape.cpp:2156.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
ERROR [12/06 07:56:48 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
  File "/home/jovyan/data/kamila/detrex/detectron2/detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
  File "tools/train_net_graffiti.py", line 95, in run_step
    loss_dict = self.model(data)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jovyan/data/kamila/detrex/projects/maskdino/maskdino.py", line 162, in forward
    losses = self.criterion(outputs, targets,mask_dict)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jovyan/data/kamila/detrex/projects/maskdino/modeling/criterion.py", line 388, in forward
    indices = self.matcher(aux_outputs, targets)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/jovyan/data/kamila/detrex/projects/maskdino/modeling/matcher.py", line 223, in forward
    return self.memory_efficient_forward(outputs, targets, cost)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/jovyan/data/kamila/detrex/projects/maskdino/modeling/matcher.py", line 165, in memory_efficient_forward
    cost_dice = batch_dice_loss_jit(out_mask, tgt_mask)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: Global alloc not supported yet

[12/06 07:56:48 d2.engine.hooks]: Overall training speed: 3 iterations in 0:00:04 (1.3375 s / it)
[12/06 07:56:48 d2.engine.hooks]: Total training time: 0:00:04 (0:00:00 on hooks)
[12/06 07:56:49 d2.utils.events]:  eta: 4 days, 17:05:31  iter: 5  total_loss: 109.9  loss_ce: 4.103  loss_mask: 1.045  loss_dice: 1.17  loss_bbox: 0.2272  loss_giou: 0.1185  loss_ce_dn: 0.3122  loss_mask_dn: 1.603  loss_dice_dn: 1.205  loss_bbox_dn: 0.5139  loss_giou_dn: 0.328  loss_ce_0: 3.23  loss_mask_0: 1.752  loss_dice_0: 1.341  loss_bbox_0: 0.132  loss_giou_0: 0.1527  loss_ce_dn_0: 0.3276  loss_mask_dn_0: 2.752  loss_dice_dn_0: 3.534  loss_bbox_dn_0: 1.171  loss_giou_dn_0: 0.8217  loss_ce_1: 3.531  loss_mask_1: 1.334  loss_dice_1: 0.9349  loss_bbox_1: 0.09554  loss_giou_1: 0.111  loss_ce_dn_1: 0.2215  loss_mask_dn_1: 1.734  loss_dice_dn_1: 1.629  loss_bbox_dn_1: 0.795  loss_giou_dn_1: 0.4993  loss_ce_2: 3.148  loss_mask_2: 1.184  loss_dice_2: 1.401  loss_bbox_2: 0.1707  loss_giou_2: 0.1778  loss_ce_dn_2: 0.3696  loss_mask_dn_2: 1.644  loss_dice_dn_2: 1.608  loss_bbox_dn_2: 0.6091  loss_giou_dn_2: 0.4159  loss_ce_3: 3.638  loss_mask_3: 1.113  loss_dice_3: 1.233  loss_bbox_3: 0.2145  loss_giou_3: 0.1722  loss_ce_dn_3: 0.3632  loss_mask_dn_3: 1.652  loss_dice_dn_3: 1.442  loss_bbox_dn_3: 0.5413  loss_giou_dn_3: 0.3912  loss_ce_4: 3.436  loss_mask_4: 1.092  loss_dice_4: 1.122  loss_bbox_4: 0.2232  loss_giou_4: 0.1488  loss_ce_dn_4: 0.297  loss_mask_dn_4: 1.637  loss_dice_dn_4: 1.272  loss_bbox_dn_4: 0.5192  loss_giou_dn_4: 0.3572  loss_ce_5: 3.891  loss_mask_5: 1.315  loss_dice_5: 1.075  loss_bbox_5: 0.2148  loss_giou_5: 0.1458  loss_ce_dn_5: 0.2682  loss_mask_dn_5: 1.616  loss_dice_dn_5: 1.213  loss_bbox_dn_5: 0.5183  loss_giou_dn_5: 0.3461  loss_ce_6: 3.985  loss_mask_6: 1.168  loss_dice_6: 1.15  loss_bbox_6: 0.245  loss_giou_6: 0.1321  loss_ce_dn_6: 0.2713  loss_mask_dn_6: 1.57  loss_dice_dn_6: 1.197  loss_bbox_dn_6: 0.514  loss_giou_dn_6: 0.3357  loss_ce_7: 4.099  loss_mask_7: 1.093  loss_dice_7: 1.2  loss_bbox_7: 0.2284  loss_giou_7: 0.1184  loss_ce_dn_7: 0.3286  loss_mask_dn_7: 1.586  loss_dice_dn_7: 1.212  loss_bbox_dn_7: 0.5173  loss_giou_dn_7: 0.3303  loss_ce_8: 4.038  loss_mask_8: 1.049  loss_dice_8: 1.206  loss_bbox_8: 0.2284  loss_giou_8: 0.1187  loss_ce_dn_8: 0.3104  loss_mask_dn_8: 1.605  loss_dice_dn_8: 1.197  loss_bbox_dn_8: 0.51  loss_giou_dn_8: 0.3282  loss_ce_interm: 3.285  loss_mask_interm: 1.477  loss_dice_interm: 1.158  loss_bbox_interm: 0.6809  loss_giou_interm: 0.4867  time: 1.1044  data_time: 0.1023  lr: 0.0001  max_mem: 19044M
Traceback (most recent call last):
  File "tools/train_net_graffiti.py", line 232, in <module>
    launch(
  File "/home/jovyan/data/kamila/detrex/detectron2/detectron2/engine/launch.py", line 82, in launch
    main_func(*args)
  File "tools/train_net_graffiti.py", line 227, in main
    do_train(args, cfg)
  File "tools/train_net_graffiti.py", line 211, in do_train
    trainer.train(start_iter, cfg.train.max_iter)
  File "/home/jovyan/data/kamila/detrex/detectron2/detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
  File "tools/train_net_graffiti.py", line 95, in run_step
    loss_dict = self.model(data)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jovyan/data/kamila/detrex/projects/maskdino/maskdino.py", line 162, in forward
    losses = self.criterion(outputs, targets,mask_dict)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jovyan/data/kamila/detrex/projects/maskdino/modeling/criterion.py", line 388, in forward
    indices = self.matcher(aux_outputs, targets)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/jovyan/data/kamila/detrex/projects/maskdino/modeling/matcher.py", line 223, in forward
    return self.memory_efficient_forward(outputs, targets, cost)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/jovyan/data/kamila/detrex/projects/maskdino/modeling/matcher.py", line 165, in memory_efficient_forward
    cost_dice = batch_dice_loss_jit(out_mask, tgt_mask)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: Global alloc not supported yet
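
For reference, the failure can probably be isolated from the training loop by scripting the dice cost on its own. The sketch below is an assumption-heavy reproduction attempt: batch_dice_loss is paraphrased from the Mask2Former-style helper that matcher.py presumably wraps with torch.jit.script, the tensor shapes and the CUDA device are made up, and on PyTorch builds that are not affected it will simply run without error.

import torch

def batch_dice_loss(inputs: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # Paraphrased batch dice cost: pairwise dice loss between every
    # predicted mask and every target mask, both flattened to (N, H*W).
    inputs = inputs.sigmoid().flatten(1)
    numerator = 2 * torch.einsum("nc,mc->nm", inputs, targets)
    denominator = inputs.sum(-1)[:, None] + targets.sum(-1)[None, :]
    return 1 - (numerator + 1) / (denominator + 1)

# matcher.py presumably scripts the helper the same way before calling it.
batch_dice_loss_jit = torch.jit.script(batch_dice_loss)

# Made-up shapes (e.g. 100 predicted masks vs. 7 targets, 112*112 sampled points);
# a CUDA GPU is assumed, since the failing run is on GPU.
out_mask = torch.randn(100, 112 * 112, device="cuda")
tgt_mask = torch.rand(7, 112 * 112, device="cuda")

# The profiling executor usually only optimizes/fuses after a few calls,
# which would match the training crashing a handful of iterations in.
for _ in range(5):
    batch_dice_loss_jit(out_mask, tgt_mask)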

So far, I have narrowed down what seems to trigger it. As discussed in the issue, the workaround of using batch_dice_loss instead of batch_dice_loss_jit fixes it; however, the training time increases.
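
Concretely, the workaround is a one-line swap at the call site shown in the traceback (a sketch; the surrounding code in projects/maskdino/modeling/matcher.py is elided and exact line numbers may differ):

# inside memory_efficient_forward, where the traceback points

# original call, which fails in the TorchScript interpreter:
# cost_dice = batch_dice_loss_jit(out_mask, tgt_mask)

# workaround: call the plain eager implementation instead
cost_dice = batch_dice_loss(out_mask, tgt_mask)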

Would really appreciate you looking at it.

Issue Analytics

  • State: open
  • Created: 9 months ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

1 reaction
FengLi-ust commented, Dec 7, 2022

Have you solved your problem? I have merged your PR.

0 reactions
alrightkami commented, Dec 9, 2022

It does seem to solve it; however, I'm not sure about the training speed yet. I created a PR for the fix.
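
If the eager fallback does turn out to be noticeably slower, one possible middle ground (a sketch only, not necessarily what the merged PR does) is to try the scripted kernel and drop back to the eager batch_dice_loss the first time it fails. batch_dice_loss and batch_dice_loss_jit are the helpers already present in matcher.py; batch_dice_loss_safe and _USE_JIT are hypothetical names introduced here:

_USE_JIT = True  # flipped to False after the first TorchScript failure

def batch_dice_loss_safe(out_mask, tgt_mask):
    # Prefer the scripted kernel; fall back to the eager implementation if it
    # raises (e.g. "RuntimeError: Global alloc not supported yet") and remember
    # the decision so later iterations skip the failing path entirely.
    global _USE_JIT
    if _USE_JIT:
        try:
            return batch_dice_loss_jit(out_mask, tgt_mask)
        except RuntimeError:
            _USE_JIT = False
    return batch_dice_loss(out_mask, tgt_mask)

The matcher would then call batch_dice_loss_safe(out_mask, tgt_mask) in place of the direct batch_dice_loss_jit call.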

Top Results From Across the Web

A training problem about Global alloc not supported yet #4
Custom MaskDINO training crashes with a RuntimeError: Global alloc not supported yet IDEA-Research/detrex#161.