How to run Catalyst with distributed training?
Hi,
I am trying to run distributed training, but I haven't had success yet.
It seems that we need to launch with a command like python -m torch.distributed.launch
to start distributed training. However, I haven't seen any documentation related to this feature.
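For context, a simplified sketch of the rank bookkeeping that python -m torch.distributed.launch performs: it spawns one worker process per GPU and hands each one its RANK, LOCAL_RANK, and WORLD_SIZE via environment variables. The function below is an illustration of that arithmetic, not the real launcher code.

```python
# Sketch of the environment that `python -m torch.distributed.launch`
# prepares for each worker it spawns (simplified illustration).

def worker_envs(nnodes: int, nproc_per_node: int, node_rank: int) -> list:
    """Return the distributed env vars each spawned worker would receive."""
    world_size = nnodes * nproc_per_node  # total processes across all nodes
    envs = []
    for local_rank in range(nproc_per_node):
        envs.append({
            "WORLD_SIZE": str(world_size),
            # Global rank is unique across the whole job.
            "RANK": str(node_rank * nproc_per_node + local_rank),
            # Local rank picks the GPU on this machine.
            "LOCAL_RANK": str(local_rank),
        })
    return envs

# Single machine, 2 GPUs: two workers with ranks 0 and 1, world size 2.
print(worker_envs(nnodes=1, nproc_per_node=2, node_rank=0))
```

The point is that WORLD_SIZE counts processes (one per GPU), not machines, and every process needs a distinct RANK.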
I played around with some settings as follows. My environment: 4 GPUs, of which I want to use 2 for distributed training.
- config.yml:
distributed_params:
  opt_level: O1
  rank: 0
- Bash file:
#!/usr/bin/env bash
export CUDA_VISIBLE_DEVICES=2,3
export MASTER_PORT=1235
export MASTER_ADDR=0.0.0.0
export WORLD_SIZE=1
export RANK=0
catalyst-dl run \
--config=<config> \
--logdir=$LOGDIR \
--out_dir=$LOGDIR:str \
--verbose
- Results: I can get the program running with the settings above. However, only one GPU is used. Maybe the reason is here. I wonder how I can make 2 GPUs run in this situation.
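For comparison, a sketch of how the bash file above might be adjusted so that both GPUs participate. The key assumptions (not verified against any particular Catalyst version): WORLD_SIZE must be the total number of processes (2, one per GPU), not the number of nodes, and each process needs a distinct RANK and its own GPU.

```shell
#!/usr/bin/env bash
# Sketch only: two worker processes, one per GPU, using the same
# env-variable rendezvous the original script relies on.
export MASTER_PORT=1235
export MASTER_ADDR=127.0.0.1
export WORLD_SIZE=2   # total processes, one per GPU

# Rank 0 on GPU 2, rank 1 on GPU 3.
RANK=0 CUDA_VISIBLE_DEVICES=2 catalyst-dl run \
    --config=<config> --logdir=$LOGDIR --verbose &
RANK=1 CUDA_VISIBLE_DEVICES=3 catalyst-dl run \
    --config=<config> --logdir=$LOGDIR --verbose
wait
```

Whether catalyst-dl picks these variables up directly, or whether the script should instead be driven through python -m torch.distributed.launch --nproc_per_node=2, depends on the Catalyst version; treat this as a sketch of the env-variable layout rather than a verified recipe.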
Issue Analytics
- State:
- Created 4 years ago
- Comments: 7 (7 by maintainers)
Thanks for your information.
I upgraded to version 19.10. However, the problem above still occurs. Here is my script:
Here is my log:
Without distributed learning, the number of mini-batches is exactly 3343x2. Here is nvidia-smi:
Hi,
Based on our internal experiments, it looks like everything now works like a charm. Could you please re-check the issue with the latest Catalyst version?