Checkpointing may cause an NCCL error
Hi! I’m trying to use DDP with the NCCL backend, and I found that checkpointing combined with buffer broadcasting may cause an NCCL error and a crash. The error looks like:
dist._broadcast_coalesced(
RuntimeError: Detected mismatch between collectives on ranks. Rank 0 is running inconsistent collective: CollectiveFingerPrint(OpType=BROADCAST, TensorShape=[4], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))
It only appears when intra-epoch checkpointing is enabled; I’ve never seen this error when I run with --ckpt_interval_minutes=0.0. The relevant code is here:
https://github.com/speechbrain/speechbrain/blob/2e78d2cc9aca596a68ee54308534b572e702b13a/speechbrain/core.py#L1046
When intra-epoch checkpointing is enabled, the time before the crash correlates with ckpt_interval_minutes.
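For context, the intra-epoch checkpoint trigger behind that link is time-based. A rough sketch of such a predicate (illustrative names only, not SpeechBrain’s actual code) is shown below; it matters later in the discussion, since each DDP process evaluates it against its own wall clock.

import time

# Illustrative sketch only: every DDP process keeps its own timer and
# evaluates this predicate independently, so near the interval boundary
# two ranks can disagree on whether to save for a given batch.
class IntraEpochCkptTimer:
    def __init__(self, interval_minutes):
        self.interval_minutes = interval_minutes
        self.last_ckpt_time = time.time()

    def should_save(self):
        if self.interval_minutes <= 0:
            return False
        elapsed_minutes = (time.time() - self.last_ckpt_time) / 60.0
        return elapsed_minutes >= self.interval_minutes

    def mark_saved(self):
        self.last_ckpt_time = time.time()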
My configuration is:
- 4x Tesla V100 (the repro also works with 2 of them)
- Ubuntu 20.04.2
- CUDA 11.1
- torch 1.10
- speechbrain 0.5.9
I made a minimal repro. config.yaml:
name: ddp_crash_repro
output_folder: !ref experiments/<name>
save_folder: !ref <output_folder>/save
train_log: !ref <output_folder>/<name>_log.txt
batch_size: 64
seed: 3456
number_of_epochs: 500
__set_seed: !!python/object/apply:torch.manual_seed [!ref <seed>]
train_logger: !new:speechbrain.utils.train_logger.FileTrainLogger
    save_file: !ref <train_log>
dataset_samples_count: 12800
dataset_features_count: 24
dataset_features_informative: 15
opt_class: !name:torch.optim.Adam
loss: !new:torch.nn.modules.loss.BCEWithLogitsLoss
epoch_counter: !new:speechbrain.utils.epoch_loop.EpochCounter
    limit: !ref <number_of_epochs>
checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
    checkpoints_dir: !ref <save_folder>
    recoverables:
        counter: !ref <epoch_counter>
train.py:
import sys
import speechbrain as sb
import torch
import torch.nn as nn
from hyperpyyaml import load_hyperpyyaml
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from torch.utils.data import TensorDataset, DataLoader
from torch.distributed.elastic.multiprocessing.errors import record
class TestClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList()
        self.layers.append(
            nn.Conv1d(in_channels=1, out_channels=2, kernel_size=2, stride=2)
        )
        self.layers.append(nn.ReLU())
        self.layers.append(nn.BatchNorm1d(2))
        self.layers.append(nn.Conv1d(2, 4, 2, 2))
        self.layers.append(nn.ReLU())
        self.layers.append(nn.Conv1d(4, 8, 2, 2))
        self.layers.append(nn.ReLU())
        self.layers.append(nn.Conv1d(8, 1, 3))

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x
class TestBrain(sb.Brain):
    def __init__(self, modules=None, opt_class=None, hparams=None, run_opts=None, checkpointer=None):
        super().__init__(modules, opt_class, hparams, run_opts, checkpointer)
        self.loss = hparams['loss']

    @record
    def fit(self,
            epoch_counter,
            train_set,
            valid_set=None,
            progressbar=None,
            train_loader_kwargs={},
            valid_loader_kwargs={},
            ):
        super(TestBrain, self).fit(epoch_counter, train_set, valid_set, progressbar, train_loader_kwargs, valid_loader_kwargs)

    def compute_objectives(self, predictions, batch, stage):
        _, labels = batch
        return self.loss(predictions, labels.to(self.device))

    def compute_forward(self, batch, stage):
        data, _ = batch
        return self.modules['model'](data.to(self.device)).squeeze()
def get_loaders():
    seed = int(hparams['seed'])
    X, y = make_classification(hparams['dataset_samples_count'], hparams['dataset_features_count'],
                               shuffle=False, random_state=seed)
    X_train, X_test, y_train, y_test = train_test_split(X[:, None, :], y, test_size=0.2, shuffle=True,
                                                        random_state=seed)
    train_loader = DataLoader(TensorDataset(torch.Tensor(X_train), torch.Tensor(y_train)),
                              batch_size=hparams['batch_size'], shuffle=False)
    test_loader = DataLoader(TensorDataset(torch.Tensor(X_test), torch.Tensor(y_test)),
                             batch_size=hparams['batch_size'], shuffle=False)
    return train_loader, test_loader
if __name__ == "__main__":
    hparams_file, run_opts, overrides = sb.parse_arguments(sys.argv[1:])

    # Initialize ddp (useful only for multi-GPU DDP training)
    sb.utils.distributed.ddp_init_group(run_opts)

    with open(hparams_file) as fin:
        hparams = load_hyperpyyaml(fin, overrides)

    train_loader, test_loader = get_loaders()
    modules = {'model': TestClassifier()}
    brain = TestBrain(modules, hparams['opt_class'], hparams, run_opts, hparams['checkpointer'])
    brain.fit(hparams['epoch_counter'], train_loader, test_loader)
It crashes in ~2 minutes if I run:
TORCH_DISTRIBUTED_DEBUG=DETAIL CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 train.py config.yaml --ckpt_interval_minutes=0.001 --distributed_launch --distributed_backend=nccl
Everything is fine if I run:
TORCH_DISTRIBUTED_DEBUG=DETAIL CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 train.py config.yaml --ckpt_interval_minutes=0.0 --distributed_launch --distributed_backend=nccl
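The “Detected mismatch between collectives” message itself comes from TORCH_DISTRIBUTED_DEBUG=DETAIL, which makes PyTorch compare the collective each rank is about to run. A minimal standalone sketch of the same class of desynchronisation outside SpeechBrain (the file name is just an example; launch it with the same --nproc_per_node=2 command) would be:

# desync_sketch.py -- illustrative only: rank 0 issues a broadcast that the
# other rank never joins, the same kind of desync suspected in this issue.
import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)
    t = torch.ones(4, device=f"cuda:{rank}")
    if rank == 0:
        # Only rank 0 enters this collective; under
        # TORCH_DISTRIBUTED_DEBUG=DETAIL this is typically reported as a
        # collective mismatch, otherwise it may simply hang.
        dist.broadcast(t, src=0)
    # Both ranks then issue a different collective.
    dist.all_reduce(t)
    dist.destroy_process_group()


if __name__ == "__main__":
    main()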
Top GitHub Comments
Thanks for the report and especially for the minimal reproducing example! The error message doesn’t seem to be very helpful, but does it have some extra context or a traceback? Could you paste that too?
I have an idea where this could come from. We’re using the time since the last checkpoint to decide whether we should checkpoint now, and that time will vary very slightly between the different processes. Eventually, there comes a batch during which one of these two processes enters the intra-epoch checkpoint saving block and the other does not. In the intra-epoch saving block, the run_on_main function has a ddp.barrier synchronisation routine. I think in this case we don’t need synchronisation; we just need to run the saving on the main process exclusively. If that is so, the fix is luckily simple.

It fixes the problem. Thank you!
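For reference, a minimal sketch of the change described above (names are illustrative, not SpeechBrain’s exact internals): either save on the main process with no synchronisation at all, or make the decision itself collective so every rank agrees before any collective is issued.

import torch
import torch.distributed as dist


def maybe_save_intra_epoch_ckpt(checkpointer, should_save):
    # Sketch of the proposed fix: no barrier and no collective -- only the
    # main process saves, the other ranks simply continue training.
    if should_save and (not dist.is_initialized() or dist.get_rank() == 0):
        checkpointer.save_checkpoint()  # illustrative save call


def maybe_save_intra_epoch_ckpt_synced(checkpointer, should_save, device):
    # Alternative: broadcast rank 0's verdict so every rank enters (or
    # skips) the saving block together, keeping the collectives matched.
    flag = torch.tensor([int(should_save)], device=device)
    dist.broadcast(flag, src=0)
    if flag.item() and dist.get_rank() == 0:
        checkpointer.save_checkpoint()  # illustrative save call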