ERROR train.py: Default process group is not initialized
I get this error when training on a single GPU, when calling the function distributed() to disable tqdm.
To avoid this, I simply wrapped distributed() like this:
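    import torch.distributed as dist

    def distributed():
        # True only when torch.distributed is usable and a process group exists.
        try:
            return dist.is_available() and dist.is_initialized()
        except Exception:
            # Treat any failure here as "not running distributed".
            return False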
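For context, here is a minimal sketch of how such a guard could be used to decide whether to show a progress bar. It is not the repository's actual code: the helpers safe_rank() and show_progress() are hypothetical names, and the import fallback is an assumption added so the sketch also runs on a machine without PyTorch.

```python
try:
    import torch.distributed as dist  # assumed alias, matching the snippet above
except ImportError:
    dist = None  # lets this sketch run on a machine without PyTorch


def distributed() -> bool:
    """Return True only if a default process group has been initialized."""
    if dist is None:
        return False
    try:
        return dist.is_available() and dist.is_initialized()
    except Exception:
        return False


def safe_rank() -> int:
    """Hypothetical helper: rank of this process, or 0 when not distributed."""
    return dist.get_rank() if distributed() else 0


def show_progress() -> bool:
    """Hypothetical helper: show the progress bar on the main process only."""
    return safe_rank() == 0


if __name__ == "__main__":
    # With tqdm installed, the flag would be passed as:
    #     for batch in tqdm(loader, disable=not show_progress()): ...
    print("distributed:", distributed())
    print("show progress bar:", show_progress())
```

On a single GPU with no process group, distributed() returns False, so safe_rank() falls back to 0 and the progress bar stays enabled without ever calling dist.get_rank().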
Issue Analytics
- State:
- Created 4 years ago
- Comments: 5
Top GitHub Comments
@jongwook Yes, I can make it work with the changes I have made so far. I will investigate NCCL further, by the way. Thanks.
Can you paste the output of the following bash script, so we can check your system information?
I suspect two possibilities: