
How to run Catalyst with distributed training?

See original GitHub issue

Hi, I am trying to run distributed training, but I haven't had any success yet. It seems that we need to launch with a command like python -m torch.distributed.launch to start distributed training, but I haven't seen any documentation for this feature.
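
For context, torch.distributed.launch spawns one worker process per GPU and fills in the distributed environment (rank, world size, master address/port) for each of them, passing each worker its local rank. A minimal sketch of such an invocation, where train.py is just a placeholder script and the port/GPU choices are illustrative:

#!/usr/bin/env bash
# Sketch only: "train.py" is a placeholder, not a file from this issue.
# torch.distributed.launch starts --nproc_per_node worker processes and sets up
# the distributed environment (rank, world size, master address/port) for each.
export CUDA_VISIBLE_DEVICES=2,3

python -m torch.distributed.launch \
    --nproc_per_node=2 \
    --master_addr=127.0.0.1 \
    --master_port=29500 \
    train.py

The same environment variables are what the bash scripts below set by hand.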

I played around with some settings as follows. My environment has 4 GPUs, and I want to use 2 of them for distributed training.

  • config.yml:
distributed_params:
  opt_level: O1
  rank: 0
  • Bash file:
#!/usr/bin/env bash

export CUDA_VISIBLE_DEVICES=2,3

export MASTER_PORT=1235
export MASTER_ADDR=0.0.0.0
export WORLD_SIZE=1
export RANK=0

catalyst-dl run \
    --config=<config> \
    --logdir=$LOGDIR \
    --out_dir=$LOGDIR:str \
    --verbose
  • Results: With the settings above I can get the program to run, but only one GPU is actually used. Maybe the reason is here. How can I get both GPUs running in this situation? (A sketch of a two-worker setup follows below.)
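
For illustration only: WORLD_SIZE is the total number of worker processes in the group, so with WORLD_SIZE=1 a single process (and therefore a single GPU) is all that ever joins. A rough sketch of what a two-worker launch could look like, reusing the flags and environment variables from the script above (an assumption about the intended setup, not output from the issue):

#!/usr/bin/env bash
# Sketch: two workers, one per GPU, sharing MASTER_ADDR / MASTER_PORT / WORLD_SIZE
# and differing only in RANK and LOCAL_RANK. $LOGDIR is assumed to be set as above.
export CUDA_VISIBLE_DEVICES=2,3
export MASTER_ADDR=127.0.0.1
export MASTER_PORT=29500
export WORLD_SIZE=2   # total number of worker processes, not GPUs per worker

RANK=0 LOCAL_RANK=0 catalyst-dl run --config=<config> --logdir=$LOGDIR --verbose &
RANK=1 LOCAL_RANK=1 catalyst-dl run --config=<config> --logdir=$LOGDIR --verbose &
wait

This is essentially the layout the script in the first comment below converges on.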

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

1 reaction
ngxbac commented, Oct 8, 2019

Thanks for the information.

I upgraded to version 19.10. However, the problem above still occurs. Here is my script:

#!/usr/bin/env bash

export CUDA_VISIBLE_DEVICES=2,3
RUN_CONFIG=config.yml


export MASTER_ADDR="127.0.0.1"
export MASTER_PORT=29500
export WORLD_SIZE=2  # number of gpus


model=se_resnext50_32x4d
for fold in 0; do
    #stage 1
    log_name=${model}-mw-512-distributed-$fold
    LOGDIR=/logs/rsna/test/${log_name}/
    RANK=0 LOCAL_RANK=0  catalyst-dl run \
        --config=./configs/${RUN_CONFIG} \
        --logdir=$LOGDIR \
        --out_dir=$LOGDIR:str \
        --model_params/model_name=${model}:str \
        --monitoring_params/name=${log_name}:str \
        --verbose \
        --distributed_params/rank=0:int &

    sleep 5

    RANK=1 LOCAL_RANK=1 catalyst-dl run \
        --config=./configs/${RUN_CONFIG} \
        --logdir=$LOGDIR \
        --out_dir=$LOGDIR:str \
        --model_params/model_name=${model}:str \
        --monitoring_params/name=${log_name}:str \
        --verbose \
        --distributed_params/rank=1:int &
done
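
A quick way to sanity-check that both ranks actually started is to look at the process list and GPU utilisation with standard tools (nothing Catalyst-specific; shown here as a sketch):

# Both workers should show up as separate catalyst-dl processes...
ps aux | grep catalyst-dl
# ...and both GPU 2 and GPU 3 should report non-zero utilisation once training starts.
watch -n 1 nvidia-smi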

Here is my log:

Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Image Size: [512, 512]
./csv/patient2_kfold/train_0.csv
Image Size: [512, 512]
./csv/patient2_kfold/train_0.csv
13435 5434 13435 10868
13435 5434 13435 10868
./csv/patient2_kfold/valid_0.csv
./csv/patient2_kfold/valid_0.csv
0/3 * Epoch (train):   0% 0/3343 [00:00<?, ?it/s]

Without distributed training, the number of mini-batches is exactly 3343 × 2.
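
As a quick arithmetic check (not from the issue): if the non-distributed loader has 3343 × 2 = 6686 mini-batches, an even split across two ranks gives 3343 per process, which matches the progress bar in the log above.

# 6686 mini-batches in the non-distributed run, split across WORLD_SIZE=2 ranks:
echo $(( 6686 / 2 ))   # -> 3343, the per-process count shown in the progress bar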

Here is nvidia-smi:

+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-DGXS...  On   | 00000000:0E:00.0 Off |                    0 |
| N/A   38C    P0    61W / 300W |   5892MiB / 16128MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-DGXS...  On   | 00000000:0F:00.0 Off |                    0 |
| N/A   38C    P0    52W / 300W |   3211MiB / 16128MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

1 reaction
Scitator commented, Oct 8, 2019

Hi,

Based on our internal experiments, it looks like everything now works like a charm. Could you please try the latest Catalyst version and update the issue?
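
For reference, upgrading is a one-liner, assuming Catalyst was installed from PyPI (the exact version pin is up to you):

pip install -U catalyst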

Read more comments on GitHub >

Top Results From Across the Web

  • Distributed training tutorial — Catalyst 22.04 documentation
    If you have multiple GPUs, the most reliable way to utilize their full potential during training is to use the distributed package from...
  • Catalyst - @Pytorch - framework for Deep Learning ... - Twitter
    You get a training loop with metrics, model checkpointing, advanced logging and distributed training support without the boilerplate. Break the cycle - use...
  • Catalyst — A PyTorch Framework for Accelerated Deep ...
    Before we start, let's visualise typical Deep Learning SGD train loop: ... deep learning techniques usage like distributed or mixed-precision training.
  • PyTorch Community Voices | Catalyst | Sergey Kolesnikov
    ... of Catalyst, a high-level PyTorch framework for Deep Learning Rese... ... advanced logging, and distributed training support without the ...
  • Sample Efficient Ensemble Learning with Catalyst.RL - arXiv
    Main features of Catalyst.RL include large-scale asynchronous distributed training, efficient implementations of various RL algorithms and auxiliary tricks, ...
