I am a researcher at Microsoft. When I use mmaction for training, I get errors related to distributed training. I sincerely look forward to your reply. Thank you very much.

Thanks for your error report and we appreciate it a lot.

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. I have read the FAQ documentation but cannot get the expected help.
  3. The bug has not been fixed in the latest version.

Describe the bug

A clear and concise description of what the bug is.

The log is:

  File "/home/kny/anaconda3/envs/mmd/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 443, in _conv_forward
    self.padding, self.dilation, self.groups)
  File "/home/kny/anaconda3/envs/mmd/lib/python3.7/site-packages/apex/amp/wrap.py", line 21, in wrapper
    args[i] = utils.cached_cast(cast_fn, args[i], handle.cache)
  File "/home/kny/anaconda3/envs/mmd/lib/python3.7/site-packages/apex/amp/utils.py", line 97, in cached_cast
    if cached_x.grad_fn.next_functions[1][0].variable is not x:
IndexError: tuple index out of range
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 25385) of binary: /home/kny/anaconda3/envs/mmd/bin/python
Traceback (most recent call last):
  File "/home/kny/anaconda3/envs/mmd/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/kny/anaconda3/envs/mmd/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/kny/anaconda3/envs/mmd/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/kny/anaconda3/envs/mmd/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/kny/anaconda3/envs/mmd/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/kny/anaconda3/envs/mmd/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/home/kny/anaconda3/envs/mmd/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/kny/anaconda3/envs/mmd/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

./tools/train.py FAILED

Failures: <NO_OTHER_FAILURES>

Reproduction

  1. What command or script did you run? bash run.sh, python train.py

  2. Did you make any modifications on the code or config? Did you understand what you have modified? no

  3. What dataset did you use? COCO

Environment

PyTorch 1.10, NVIDIA RTX 3090, CUDA 11.3

  1. Please run python mmdet/utils/collect_env.py to collect necessary environment information and paste it here.
  2. You may add additional information that may be helpful for locating the problem, such as
    • How you installed PyTorch [e.g., pip, conda, source]: pip
    • Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.): python 3.7

Error traceback

If applicable, paste the error traceback here.

A placeholder for traceback.

Bug fix

If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
xianglei3 commented, Nov 26, 2021

yes,

1 reaction
RangiLyu commented, Nov 26, 2021

Are you using apex to accelerate training? Try running without apex to see whether apex is causing the error.
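Before and after following that suggestion, it can help to confirm which packages the training environment can actually import. The snippet below is only a sketch using the Python standard library; the helper names `is_importable` and `report` are made up for illustration and are not part of mmdet, mmaction, apex, or PyTorch. It only checks whether a module can be found and does not change any training configuration.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if `module_name` can be located on the current Python path.

    Hypothetical helper for illustration; it does not import the module.
    """
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # find_spec can raise for broken or partially initialised modules.
        return False


def report(modules=("torch", "apex")):
    """Map each module name to whether it is importable in this environment."""
    return {name: is_importable(name) for name in modules}


if __name__ == "__main__":
    for name, found in report().items():
        print(f"{name}: {'found' if found else 'not found'}")
```

If apex is still reported as found after you intended to remove it, the traceback may keep passing through apex/amp; how apex gets enabled depends on your own config and launch script, which this check does not inspect.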
