Checkpointing may cause an NCCL error
Hi! I’m trying to use DDP with the NCCL backend, and I found that checkpointing combined with buffer broadcasting may cause an NCCL error and a crash. The error looks like:
dist._broadcast_coalesced(
RuntimeError: Detected mismatch between collectives on ranks. Rank 0 is running inconsistent collective: CollectiveFingerPrint(OpType=BROADCAST, TensorShape=[4], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))
It only appears when intra-epoch checkpointing is enabled; I’ve never seen this error when I run with --ckpt_interval_minutes=0.0. The relevant code is here:
https://github.com/speechbrain/speechbrain/blob/2e78d2cc9aca596a68ee54308534b572e702b13a/speechbrain/core.py#L1046
When intra-epoch checkpointing is enabled, the time before the crash correlates with ckpt_interval_minutes.
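For context, the intra-epoch checkpoint trigger behind that link is time-based. A rough sketch of such a predicate (illustrative names only, not SpeechBrain’s actual code) is shown below; it matters later in the discussion, since each DDP process evaluates it against its own wall clock.

import time

# Illustrative sketch only: every DDP process keeps its own timer and
# evaluates this predicate independently, so near the interval boundary
# two ranks can disagree on whether to save for a given batch.
class IntraEpochCkptTimer:
    def __init__(self, interval_minutes):
        self.interval_minutes = interval_minutes
        self.last_ckpt_time = time.time()

    def should_save(self):
        if self.interval_minutes <= 0:
            return False
        elapsed_minutes = (time.time() - self.last_ckpt_time) / 60.0
        return elapsed_minutes >= self.interval_minutes

    def mark_saved(self):
        self.last_ckpt_time = time.time()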
My configuration is:
- 4x Tesla V100 (the repro also works with 2 of them)
- Ubuntu 20.04.2
- CUDA 11.1
- torch 1.10
- speechbrain 0.5.9
I made a minimal repro. config.yaml:
name: ddp_crash_repro
output_folder: !ref experiments/<name>
save_folder: !ref <output_folder>/save
train_log: !ref <output_folder>/<name>_log.txt
batch_size: 64
seed: 3456
number_of_epochs: 500
__set_seed: !!python/object/apply:torch.manual_seed [!ref <seed>]
train_logger: !new:speechbrain.utils.train_logger.FileTrainLogger
    save_file: !ref <train_log>
dataset_samples_count: 12800
dataset_features_count: 24
dataset_features_informative: 15
opt_class: !name:torch.optim.Adam
loss: !new:torch.nn.modules.loss.BCEWithLogitsLoss
epoch_counter: !new:speechbrain.utils.epoch_loop.EpochCounter
    limit: !ref <number_of_epochs>
checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
    checkpoints_dir: !ref <save_folder>
    recoverables:
        counter: !ref <epoch_counter>
train.py:
import sys
import speechbrain as sb
import torch
import torch.nn as nn
from hyperpyyaml import load_hyperpyyaml
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from torch.utils.data import TensorDataset, DataLoader
from torch.distributed.elastic.multiprocessing.errors import record
class TestClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList()
        self.layers.append(
            nn.Conv1d(in_channels=1, out_channels=2, kernel_size=2, stride=2)
        )
        self.layers.append(nn.ReLU())
        self.layers.append(nn.BatchNorm1d(2))
        self.layers.append(nn.Conv1d(2, 4, 2, 2))
        self.layers.append(nn.ReLU())
        self.layers.append(nn.Conv1d(4, 8, 2, 2))
        self.layers.append(nn.ReLU())
        self.layers.append(nn.Conv1d(8, 1, 3))

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x
class TestBrain(sb.Brain):
    def __init__(self, modules=None, opt_class=None, hparams=None, run_opts=None, checkpointer=None):
        super().__init__(modules, opt_class, hparams, run_opts, checkpointer)
        self.loss = hparams['loss']

    @record
    def fit(self,
            epoch_counter,
            train_set,
            valid_set=None,
            progressbar=None,
            train_loader_kwargs={},
            valid_loader_kwargs={},
            ):
        super(TestBrain, self).fit(epoch_counter, train_set, valid_set, progressbar, train_loader_kwargs, valid_loader_kwargs)

    def compute_objectives(self, predictions, batch, stage):
        _, labels = batch
        return self.loss(predictions, labels.to(self.device))

    def compute_forward(self, batch, stage):
        data, _ = batch
        return self.modules['model'](data.to(self.device)).squeeze()
def get_loaders():
    seed = int(hparams['seed'])
    X, y = make_classification(hparams['dataset_samples_count'], hparams['dataset_features_count'],
                               shuffle=False, random_state=seed)
    X_train, X_test, y_train, y_test = train_test_split(X[:, None, :], y, test_size=0.2, shuffle=True,
                                                        random_state=seed)
    train_loader = DataLoader(TensorDataset(torch.Tensor(X_train), torch.Tensor(y_train)),
                              batch_size=hparams['batch_size'], shuffle=False)
    test_loader = DataLoader(TensorDataset(torch.Tensor(X_test), torch.Tensor(y_test)),
                             batch_size=hparams['batch_size'], shuffle=False)
    return train_loader, test_loader
if __name__ == "__main__":
    hparams_file, run_opts, overrides = sb.parse_arguments(sys.argv[1:])

    # Initialize ddp (useful only for multi-GPU DDP training)
    sb.utils.distributed.ddp_init_group(run_opts)

    with open(hparams_file) as fin:
        hparams = load_hyperpyyaml(fin, overrides)

    train_loader, test_loader = get_loaders()
    modules = {'model': TestClassifier()}
    brain = TestBrain(modules, hparams['opt_class'], hparams, run_opts, hparams['checkpointer'])
    brain.fit(hparams['epoch_counter'], train_loader, test_loader)
It crashes in ~2 minutes if I run:
TORCH_DISTRIBUTED_DEBUG=DETAIL CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 train.py config.yaml --ckpt_interval_minutes=0.001 --distributed_launch --distributed_backend=nccl
Everything is fine if I run:
TORCH_DISTRIBUTED_DEBUG=DETAIL CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 train.py config.yaml --ckpt_interval_minutes=0.0 --distributed_launch --distributed_backend=nccl
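The “Detected mismatch between collectives” message itself comes from TORCH_DISTRIBUTED_DEBUG=DETAIL, which makes PyTorch compare the collective each rank is about to run. A minimal standalone sketch of the same class of desynchronisation outside SpeechBrain (the file name is just an example; launch it with the same --nproc_per_node=2 command) would be:

# desync_sketch.py -- illustrative only: rank 0 issues a broadcast that the
# other rank never joins, the same kind of desync suspected in this issue.
import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)
    t = torch.ones(4, device=f"cuda:{rank}")
    if rank == 0:
        # Only rank 0 enters this collective; under
        # TORCH_DISTRIBUTED_DEBUG=DETAIL this is typically reported as a
        # collective mismatch, otherwise it may simply hang.
        dist.broadcast(t, src=0)
    # Both ranks then issue a different collective.
    dist.all_reduce(t)
    dist.destroy_process_group()


if __name__ == "__main__":
    main()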
Top GitHub Comments
Thanks for the report and especially for the minimal reproducing example! The error message doesn’t seem to be very helpful, but does it have some extra context or a traceback? Could you paste that too?
I have an idea where this could come from. We’re using the time since the last checkpoint to decide whether we should checkpoint now, and that time will vary very slightly between the different processes. Eventually, there comes a batch during which one of these two processes enters the intra-epoch checkpoint saving block and the other does not. In the intra-epoch saving block, the run_on_main function has a ddp.barrier synchronisation routine. I think in this case we don’t need synchronisation; we just need to run the saving on the main process exclusively. If that is so, the fix is luckily simple.

It fixes the problem. Thank you!
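For reference, a minimal sketch of the change described above (names are illustrative, not SpeechBrain’s exact internals): either save on the main process with no synchronisation at all, or make the decision itself collective so every rank agrees before any collective is issued.

import torch
import torch.distributed as dist


def maybe_save_intra_epoch_ckpt(checkpointer, should_save):
    # Sketch of the proposed fix: no barrier and no collective -- only the
    # main process saves, the other ranks simply continue training.
    if should_save and (not dist.is_initialized() or dist.get_rank() == 0):
        checkpointer.save_checkpoint()  # illustrative save call


def maybe_save_intra_epoch_ckpt_synced(checkpointer, should_save, device):
    # Alternative: broadcast rank 0's verdict so every rank enters (or
    # skips) the saving block together, keeping the collectives matched.
    flag = torch.tensor([int(should_save)], device=device)
    dist.broadcast(flag, src=0)
    if flag.item() and dist.get_rank() == 0:
        checkpointer.save_checkpoint()  # illustrative save call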