aspect ratio grouping error

❓ Questions and Help

I added a new loss and it works fine if I use a single GPU. However, it fails on “losses.backward()” if I use multiple GPUs. It seems this error relates to “torch.distributed”. The error information is below:

  File "tools/train_net.py", line 170, in <module>
    main()
  File "tools/train_net.py", line 163, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 73, in train
    arguments,
  File "/home/maskrcnn_benchmark/engine/trainer.py", line 77, in do_train
    losses.backward()
  File "/usr/local/lib/python3.5/dist-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib/python3.5/dist-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/deprecated/distributed.py", line 342, in reduction_fn_nccl
    group=self.nccl_reduction_group_id)
  File "/usr/local/lib/python3.5/dist-packages/torch/distributed/deprecated/__init__.py", line 317, in all_reduce_multigpu
    return torch._C._dist_all_reduce_multigpu(tensor_list, op, group)

Issue Analytics

  • State: open
  • Created 5 years ago
  • Comments: 21 (9 by maintainers)

Top GitHub Comments

7 reactions
fmassa commented, Oct 29, 2018

Oh, there might indeed be a problem with the GroupedBatchSampler. As a quick workaround, I’d recommend setting ASPECT_RATIO_GROUPING to False in the config. I’ll need to dig a bit further to identify the contexts in which the issue you are facing arises.
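
For readers who want to try this workaround from Python, here is a minimal sketch. It assumes the option lives under the DATALOADER section of the maskrcnn_benchmark config (the comment above names only ASPECT_RATIO_GROUPING, so the full key is an assumption) and that the config object is the yacs-style one imported from maskrcnn_benchmark.config. The import is guarded so the snippet still runs where the library is not installed.

```python
# Override list in the flat [key, value, ...] form that yacs-style configs accept.
# Assumption: the full key is DATALOADER.ASPECT_RATIO_GROUPING.
OVERRIDES = ["DATALOADER.ASPECT_RATIO_GROUPING", False]

try:
    # Only available when maskrcnn_benchmark is installed.
    from maskrcnn_benchmark.config import cfg
except ImportError:
    cfg = None

if cfg is not None:
    # Must happen before the config is frozen.
    cfg.merge_from_list(OVERRIDES)
    print(cfg.DATALOADER.ASPECT_RATIO_GROUPING)
```

The same key/value pair can usually be passed as trailing arguments to tools/train_net.py, which merges them into the config before training starts.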

1 reaction
MhLiao commented, Oct 30, 2018

That’s right. When setting ASPECT_RATIO_GROUPING to False, everything is OK. I printed the value of merged at this line, but I cannot find any differences between using a single GPU and using multiple GPUs.

Multiple GPUs:

(tensor([9]), tensor([187]), tensor([63]), tensor([48]), tensor([159]), tensor([172]), tensor([176]), tensor([75]), tensor(
[221]), tensor([131]), tensor([56]), tensor([191]), tensor([99]), tensor([46]), tensor([80]), tensor([124]), tensor([161]),
 tensor([184]), tensor([166]), tensor([141]), tensor([155]), tensor([175]), tensor([214]), tensor([89]), tensor([93]), tens
or([144]), tensor([64]), tensor([69]), tensor([174]))
(tensor([109]), tensor([200]), tensor([211]), tensor([189]), tensor([17]), tensor([59]), tensor([104]), tensor([31]), tenso
r([180]), tensor([137]), tensor([51]), tensor([5]), tensor([183]), tensor([44]), tensor([60]), tensor([138]), tensor([158])
, tensor([15]), tensor([185]), tensor([30]), tensor([142]), tensor([204]), tensor([216]), tensor([206]), tensor([190]), ten
sor([165]), tensor([164]), tensor([24]), tensor([111]))
(tensor([122]), tensor([121]), tensor([209]), tensor([133]), tensor([162]), tensor([81]), tensor([227]), tensor([128]), ten
sor([57]), tensor([68]), tensor([218]), tensor([169]), tensor([21]), tensor([149]), tensor([47]), tensor([156]), tensor([8]
), tensor([148]), tensor([18]), tensor([207]), tensor([62]), tensor([210]), tensor([73]), tensor([12]), tensor([192]), tens
or([103]), tensor([96]), tensor([107]), tensor([152]))
(tensor([123]), tensor([130]), tensor([113]), tensor([153]), tensor([32]), tensor([181]), tensor([170]), tensor([222]), ten
sor([7]), tensor([115]), tensor([91]), tensor([61]), tensor([199]), tensor([43]), tensor([22]), tensor([19]), tensor([26]),
 tensor([145]), tensor([49]), tensor([127]), tensor([88]), tensor([28]), tensor([53]), tensor([208]), tensor([114]), tensor
([100]), tensor([194]), tensor([215]), tensor([39]))
(tensor([114]), tensor([100]), tensor([194]), tensor([151]), tensor([92]), tensor([224]), tensor([219]), tensor([182]), ten
sor([116]), tensor([72]), tensor([87]), tensor([71]), tensor([90]), tensor([52]), tensor([117]), tensor([27]), tensor([157]
), tensor([45]), tensor([97]), tensor([112]), tensor([220]), tensor([140]), tensor([84]), tensor([193]), tensor([173]), ten
sor([78]), tensor([34]), tensor([226]), tensor([79]), tensor([], dtype=torch.int64))
(tensor([177]), tensor([106]), tensor([14]), tensor([203]), tensor([83]), tensor([205]), tensor([74]), tensor([129]), tenso
r([86]), tensor([38]), tensor([225]), tensor([201]), tensor([147]), tensor([120]), tensor([101]), tensor([217]), tensor([20
]), tensor([160]), tensor([23]), tensor([29]), tensor([6]), tensor([65]), tensor([212]), tensor([171]), tensor([198]), tens
or([40]), tensor([10]), tensor([94]), tensor([126]))
(tensor([146]), tensor([167]), tensor([95]), tensor([2]), tensor([36]), tensor([3]), tensor([35]), tensor([119]), tensor([4
2]), tensor([41]), tensor([1]), tensor([82]), tensor([228]), tensor([143]), tensor([196]), tensor([50]), tensor([33]), tens
or([195]), tensor([202]), tensor([54]), tensor([150]), tensor([58]), tensor([0]), tensor([16]), tensor([135]), tensor([125]
), tensor([188]), tensor([163]), tensor([108]))
(tensor([197]), tensor([37]), tensor([178]), tensor([118]), tensor([98]), tensor([4]), tensor([67]), tensor([136]), tensor(
[132]), tensor([168]), tensor([186]), tensor([77]), tensor([13]), tensor([223]), tensor([11]), tensor([134]), tensor([66]),
 tensor([179]), tensor([55]), tensor([70]), tensor([154]), tensor([102]), tensor([213]), tensor([110]), tensor([76]), tenso
r([139]), tensor([105]), tensor([25]), tensor([85]))

Single GPU:

(tensor([67]), tensor([104]), tensor([44]), tensor([59]), tensor([190]), tensor([187]), tensor([12]), tensor([65]), tensor(
[2]), tensor([26]), tensor([92]), tensor([221]), tensor([198]), tensor([34]), tensor([32]), tensor([61]), tensor([71]), ten
sor([156]), tensor([131]), tensor([178]), tensor([49]), tensor([121]), tensor([136]), tensor([188]), tensor([135]), tensor(
[123]), tensor([64]), tensor([179]), tensor([142]), tensor([83]), tensor([79]), tensor([109]), tensor([127]), tensor([48]),
 tensor([11]), tensor([163]), tensor([118]), tensor([52]), tensor([66]), tensor([170]), tensor([84]), tensor([63]), tensor(
[186]), tensor([87]), tensor([96]), tensor([207]), tensor([195]), tensor([191]), tensor([103]), tensor([211]), tensor([101]
), tensor([138]), tensor([75]), tensor([114]), tensor([20]), tensor([201]), tensor([143]), tensor([141]), tensor([177]), te
nsor([76]), tensor([95]), tensor([113]), tensor([112]), tensor([51]), tensor([23]), tensor([46]), tensor([157]), tensor([19
6]), tensor([228]), tensor([199]), tensor([153]), tensor([145]), tensor([205]), tensor([159]), tensor([45]), tensor([9]), t
ensor([224]), tensor([4]), tensor([144]), tensor([100]), tensor([81]), tensor([214]), tensor([154]), tensor([173]), tensor(
[150]), tensor([7]), tensor([91]), tensor([42]), tensor([184]), tensor([164]), tensor([213]), tensor([62]), tensor([115]),
tensor([53]), tensor([148]), tensor([18]), tensor([110]), tensor([133]), tensor([89]), tensor([47]), tensor([158]), tensor(
[200]), tensor([217]), tensor([220]), tensor([194]), tensor([5]), tensor([175]), tensor([226]), tensor([28]), tensor([222])
, tensor([19]), tensor([29]), tensor([146]), tensor([82]), tensor([204]), tensor([60]), tensor([15]), tensor([165]), tensor
([192]), tensor([223]), tensor([202]), tensor([90]), tensor([203]), tensor([225]), tensor([68]), tensor([216]), tensor([30]
), tensor([149]), tensor([209]), tensor([210]), tensor([77]), tensor([6]), tensor([193]), tensor([116]), tensor([78]), tens
or([122]), tensor([147]), tensor([168]), tensor([180]), tensor([160]), tensor([128]), tensor([72]), tensor([93]), tensor([2
2]), tensor([55]), tensor([139]), tensor([13]), tensor([182]), tensor([212]), tensor([73]), tensor([10]), tensor([130]), te
nsor([137]), tensor([98]), tensor([183]), tensor([86]), tensor([125]), tensor([151]), tensor([169]), tensor([197]), tensor(
[107]), tensor([172]), tensor([161]), tensor([124]), tensor([102]), tensor([41]), tensor([185]), tensor([132]), tensor([140
]), tensor([35]), tensor([57]), tensor([166]), tensor([181]), tensor([40]), tensor([50]), tensor([88]), tensor([227]), tens
or([74]), tensor([58]), tensor([97]), tensor([208]), tensor([56]), tensor([176]), tensor([36]), tensor([206]), tensor([171]
), tensor([33]), tensor([117]), tensor([105]), tensor([155]), tensor([17]), tensor([219]), tensor([54]), tensor([70]), tens
or([21]), tensor([16]), tensor([43]), tensor([129]), tensor([119]), tensor([167]), tensor([0]), tensor([80]), tensor([120])
, tensor([38]), tensor([1]), tensor([189]), tensor([218]), tensor([106]), tensor([99]), tensor([27]), tensor([162]), tensor
([37]), tensor([3]), tensor([8]), tensor([134]), tensor([31]), tensor([14]), tensor([152]), tensor([111]), tensor([25]), te
nsor([85]), tensor([69]), tensor([24]), tensor([39]), tensor([174]), tensor([108]), tensor([215]), tensor([126]), tensor([9
4]))
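
Dumps like these are hard to compare by eye, so a small audit can help. The sketch below is not part of the original thread: audit_batches is a hypothetical helper that reports empty entries and indices that occur more than once. The sample data is a shortened excerpt of the multi-GPU dump above, with tensors written as plain lists; as printed, the fifth tuple there ends with an empty tensor and begins with 114, 100 and 194, which also close the fourth tuple. Whether that is what breaks the distributed all-reduce is not established here.

```python
from collections import Counter


def audit_batches(batches):
    """Return (positions of empty entries, indices that occur more than once)."""
    empty = []
    counts = Counter()
    for b, batch in enumerate(batches):
        for pos, entry in enumerate(batch):
            # Works for plain lists and for 1-D tensors alike.
            values = [int(v) for v in entry]
            if not values:
                empty.append((b, pos))
            counts.update(values)
    duplicates = sorted(i for i, n in counts.items() if n > 1)
    return empty, duplicates


# Shortened excerpt of the multi-GPU dump, tensors written as lists.
batches = [
    [[114], [100], [194], [215], [39]],  # tail of the fourth tuple
    [[114], [100], [194], [151], []],    # head of the fifth tuple plus its trailing empty tensor
]

empty, duplicates = audit_batches(batches)
print(empty)       # [(1, 4)]
print(duplicates)  # [100, 114, 194]
```

Running the same helper on the real merged value in each process would show whether every rank receives non-empty, non-overlapping batches.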