
How to run Catalyst with distributed training?

See original GitHub issue

Hi, I am trying to run distributed training, but I haven't had any success yet. It seems that we need to launch with a command like python -m torch.distributed.launch to start distributed training, but I haven't seen any documentation for this feature.
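
For context, torch.distributed.launch spawns one worker process per GPU and fills in the distributed environment (rank, world size, master address/port) for each of them, passing each worker its local rank. A minimal sketch of such an invocation, where train.py is just a placeholder script and the port/GPU choices are illustrative:

#!/usr/bin/env bash
# Sketch only: "train.py" is a placeholder, not a file from this issue.
# torch.distributed.launch starts --nproc_per_node worker processes and sets up
# the distributed environment (rank, world size, master address/port) for each.
export CUDA_VISIBLE_DEVICES=2,3

python -m torch.distributed.launch \
    --nproc_per_node=2 \
    --master_addr=127.0.0.1 \
    --master_port=29500 \
    train.py

The same environment variables are what the bash scripts below set by hand.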

I played around with some settings as follows. My environment has 4 GPUs, and I want to use 2 of them for distributed training.

  • config.yml:
distributed_params:
  opt_level: O1
  rank: 0
  • Bash file:
#!/usr/bin/env bash

export CUDA_VISIBLE_DEVICES=2,3

export MASTER_PORT=1235
export MASTER_ADDR=0.0.0.0
export WORLD_SIZE=1
export RANK=0

catalyst-dl run \
    --config=<config> \
    --logdir=$LOGDIR \
    --out_dir=$LOGDIR:str \
    --verbose
  • Results: With the settings above I can get the program to run, but only one GPU is actually used. Maybe the reason is here. How can I get both GPUs running in this situation? (A sketch of a two-worker setup follows below.)
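
For illustration only: WORLD_SIZE is the total number of worker processes in the group, so with WORLD_SIZE=1 a single process (and therefore a single GPU) is all that ever joins. A rough sketch of what a two-worker launch could look like, reusing the flags and environment variables from the script above (an assumption about the intended setup, not output from the issue):

#!/usr/bin/env bash
# Sketch: two workers, one per GPU, sharing MASTER_ADDR / MASTER_PORT / WORLD_SIZE
# and differing only in RANK and LOCAL_RANK. $LOGDIR is assumed to be set as above.
export CUDA_VISIBLE_DEVICES=2,3
export MASTER_ADDR=127.0.0.1
export MASTER_PORT=29500
export WORLD_SIZE=2   # total number of worker processes, not GPUs per worker

RANK=0 LOCAL_RANK=0 catalyst-dl run --config=<config> --logdir=$LOGDIR --verbose &
RANK=1 LOCAL_RANK=1 catalyst-dl run --config=<config> --logdir=$LOGDIR --verbose &
wait

This is essentially the layout the script in the first comment below converges on.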

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

1 reaction
ngxbac commented, Oct 8, 2019

Thanks for the information.

I upgraded to version 19.10. However, the problem above still occurs. Here is my script:

#!/usr/bin/env bash

export CUDA_VISIBLE_DEVICES=2,3
RUN_CONFIG=config.yml


export MASTER_ADDR="127.0.0.1"
export MASTER_PORT=29500
export WORLD_SIZE=2  # number of gpus


model=se_resnext50_32x4d
for fold in 0; do
    #stage 1
    log_name=${model}-mw-512-distributed-$fold
    LOGDIR=/logs/rsna/test/${log_name}/
    RANK=0 LOCAL_RANK=0  catalyst-dl run \
        --config=./configs/${RUN_CONFIG} \
        --logdir=$LOGDIR \
        --out_dir=$LOGDIR:str \
        --model_params/model_name=${model}:str \
        --monitoring_params/name=${log_name}:str \
        --verbose \
        --distributed_params/rank=0:int &

    sleep 5

    RANK=1 LOCAL_RANK=1 catalyst-dl run \
        --config=./configs/${RUN_CONFIG} \
        --logdir=$LOGDIR \
        --out_dir=$LOGDIR:str \
        --model_params/model_name=${model}:str \
        --monitoring_params/name=${log_name}:str \
        --verbose \
        --distributed_params/rank=1:int &
done
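
A quick way to sanity-check that both ranks actually started is to look at the process list and GPU utilisation with standard tools (nothing Catalyst-specific; shown here as a sketch):

# Both workers should show up as separate catalyst-dl processes...
ps aux | grep catalyst-dl
# ...and both GPU 2 and GPU 3 should report non-zero utilisation once training starts.
watch -n 1 nvidia-smi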

Here is my log:

Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Image Size: [512, 512]
./csv/patient2_kfold/train_0.csv
Image Size: [512, 512]
./csv/patient2_kfold/train_0.csv
13435 5434 13435 10868
13435 5434 13435 10868
./csv/patient2_kfold/valid_0.csv
./csv/patient2_kfold/valid_0.csv
0/3 * Epoch (train):   0% 0/3343 [00:00<?, ?it/s]

Without distributed training, the number of mini-batches is exactly 3343 × 2.
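
As a quick arithmetic check (not from the issue): if the non-distributed loader has 3343 × 2 = 6686 mini-batches, an even split across two ranks gives 3343 per process, which matches the progress bar in the log above.

# 6686 mini-batches in the non-distributed run, split across WORLD_SIZE=2 ranks:
echo $(( 6686 / 2 ))   # -> 3343, the per-process count shown in the progress bar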

Here is nvidia-smi:

+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-DGXS...  On   | 00000000:0E:00.0 Off |                    0 |
| N/A   38C    P0    61W / 300W |   5892MiB / 16128MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-DGXS...  On   | 00000000:0F:00.0 Off |                    0 |
| N/A   38C    P0    52W / 300W |   3211MiB / 16128MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

1 reaction
Scitator commented, Oct 8, 2019

Hi,

Based on our internal experiments, it looks like everything now works like a charm. Could you please try the latest Catalyst version and update the issue?
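
For reference, upgrading is a one-liner, assuming Catalyst was installed from PyPI (the exact version pin is up to you):

pip install -U catalyst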

Read more comments on GitHub >

Top Results From Across the Web

  • Distributed training tutorial — Catalyst 22.04 documentation
    If you have multiple GPUs, the most reliable way to utilize their full potential during training is to use the distributed package from...
  • Catalyst - @Pytorch - framework for Deep Learning ... - Twitter
    You get a training loop with metrics, model checkpointing, advanced logging and distributed training support without the boilerplate. Break the cycle - use...
  • Catalyst — A PyTorch Framework for Accelerated Deep ...
    Before we start, let's visualise typical Deep Learning SGD train loop: ... deep learning techniques usage like distributed or mixed-precision training.
  • PyTorch Community Voices | Catalyst | Sergey Kolesnikov
    ... of Catalyst, a high-level PyTorch framework for Deep Learning Rese... ... advanced logging, and distributed training support without the ...
  • Sample Efficient Ensemble Learning with Catalyst.RL - arXiv
    Main features of Catalyst.RL include large-scale asynchronous distributed training, efficient implementations of various RL algorithms and auxiliary tricks, ...
