DebugUnderflowOverflow crashes with Multi-GPU training
Environment info
- transformers version: 4.8.2
- Platform: Linux-4.15.0-29-generic-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.6.9
- PyTorch version (GPU?): 1.9.0+cu102 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: Yes
Who can help
Information
Model I am using (Bert, XLNet …): Bart (but this is inessential)
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQuAD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
- Instantiate a debug_utils.DebugUnderflowOverflow for a model (I do this by passing debug="underflow_overflow" to a TrainingArguments).
- Train using a multi-GPU setup.
- The debug hook added by DebugUnderflowOverflow to run after forward() will crash because of a bad lookup in the class’s module_names dict.
This does not happen on single-GPU training.
Here’s an example that crashes if I run it on my (4-GPU) machine, but does not crash if I restrict it to a single GPU (by calling export CUDA_VISIBLE_DEVICES=0 before invoking the script):
from torch.utils.data import Dataset
from transformers import (BartForConditionalGeneration, BartModel, BartConfig,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer)


class DummyDataset(Dataset):
    def __len__(self):
        return 5

    def __getitem__(self, idx):
        return {'input_ids': list(range(idx, idx+3)),
                'labels': list(range(idx, idx+3))}


def main():
    train_dataset = DummyDataset()
    # tiny Bart config just to keep the repro small and fast
    config = BartConfig(vocab_size=10, max_position_embeddings=10, d_model=8,
                        encoder_layers=1, decoder_layers=1,
                        encoder_attention_heads=1, decoder_attention_heads=1,
                        decoder_ffn_dim=8, encoder_ffn_dim=8)
    model = BartForConditionalGeneration(config)
    args = Seq2SeqTrainingArguments(output_dir="tmp", do_train=True,
                                    debug="underflow_overflow")
    trainer = Seq2SeqTrainer(model=model, args=args, train_dataset=train_dataset)
    trainer.train()


if __name__ == '__main__':
    main()
I get the following stack trace on my multi-GPU machine:
Traceback (most recent call last):
File "./kbp_dbg.py", line 29, in <module>
main()
File "./kbp_dbg.py", line 26, in main
trainer.train()
File "[...]/transformers/trainer.py", line 1269, in train
tr_loss += self.training_step(model, inputs)
File "[...]/transformers/trainer.py", line 1762, in training_step
loss = self.compute_loss(model, inputs)
File "[...]/transformers/trainer.py", line 1794, in compute_loss
outputs = model(**inputs)
File "[...]/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "[...]/torch/nn/parallel/data_parallel.py", line 168, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "[...]/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "[...]/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "[...]/torch/_utils.py", line 425, in reraise
raise self.exc_type(msg)
KeyError: Caught KeyError in replica 0 on device 0.
Original Traceback (most recent call last):
File "[...]/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "[...]/torch/nn/modules/module.py", line 1071, in _call_impl
result = forward_call(*input, **kwargs)
File "[...]/transformers/models/bart/modeling_bart.py", line 1308, in forward
return_dict=return_dict,
File "[...]/torch/nn/modules/module.py", line 1071, in _call_impl
result = forward_call(*input, **kwargs)
File "[...]/transformers/models/bart/modeling_bart.py", line 1173, in forward
return_dict=return_dict,
File "[...]/torch/nn/modules/module.py", line 1071, in _call_impl
result = forward_call(*input, **kwargs)
File "[...]/transformers/models/bart/modeling_bart.py", line 756, in forward
inputs_embeds = self.embed_tokens(input_ids) * self.embed_scale
File "[...]/torch/nn/modules/module.py", line 1076, in _call_impl
hook_result = hook(self, input, result)
File "[...]/transformers/debug_utils.py", line 246, in forward_hook
self.create_frame(module, input, output)
File "[...]/transformers/debug_utils.py", line 193, in create_frame
self.expand_frame(f"{self.prefix} {self.module_names[module]} {module.__class__.__name__}")
KeyError: Embedding(10, 8, padding_idx=1)
And, again, it completes totally fine if you restrict visibility to a single GPU via CUDA_VISIBLE_DEVICES before running this.
Having looked into it, I strongly suspect what’s happening is the following:
- The DebugUnderflowOverflow instantiated in Trainer.train populates a module_names dict from nn.Modules to names in its constructor (link).
- If multi-GPU training is enabled, nn.DataParallel is called (I think here in trainer.py).
- This calls torch.nn.replicate on its forward pass (link), which I believe calls nn.Module._replicate_for_data_parallel() here, replicating the model once per GPU.
- This results in different hash() values for the replicated nn.Module objects across different GPUs, in general.
- Finally, the module_names lookup in create_frame() will crash on the GPUs, since the differently replicated nn.Module objects on which the forward-pass hooks are called now have different hash() values after multi-GPU replication.
You can confirm that a module has a different hash value after replication via the following snippet:
import torch

m = torch.nn.Embedding(7, 5)
m2 = m._replicate_for_data_parallel()
print(hash(m))
print(hash(m2))
Which will print something like:
8730470757753
8730462809617
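And here is a minimal sketch of why that breaks the lookup in the hook, assuming (as described above) that module_names is a plain dict keyed by the module objects; the name string here is just illustrative:
import torch

m = torch.nn.Embedding(7, 5)
module_names = {m: "model.shared"}        # keyed by the module object, like DebugUnderflowOverflow's dict
m2 = m._replicate_for_data_parallel()     # what DataParallel does for each GPU

print(module_names[m])                    # fine: prints 'model.shared'
print(module_names.get(m2, "<missing>"))  # prints '<missing>'; a plain module_names[m2] raises KeyError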
Not sure what the best way to fix this is. If there’s a way to instantiate a different DebugUnderflowOverflow for each GPU, after replication, that would maybe solve this issue (since per-replica hashes would presumably be consistent then), but I’m not sure if that’s feasible or the best way to do it.
One could also just make the miscreant f-string construction in create_frame use a .get() with a default for the module_names lookup, rather than square brackets, but that would probably make the debugging traces too uninformative. So I figured I’d just open a bug report.
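For concreteness, that .get() fallback would mean replacing the lookup shown in the traceback above (the self.module_names[module] line in debug_utils.create_frame) with something roughly like this (a sketch only, not a tested patch):
# inside DebugUnderflowOverflow.create_frame -- sketch of the .get() fallback only
module_name = self.module_names.get(module, "<unknown module (DataParallel replica?)>")
self.expand_frame(f"{self.prefix} {module_name} {module.__class__.__name__}")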
Expected behavior
I’d expect it either to not crash or to fail with a clear "overflow/underflow debug unsupported for multi-GPU" error or something. I originally thought this was a bug in the model code rather than a multi-GPU thing.
Top GitHub Comments
No, the only reason I’m using DP rather than DDP is that I’m using the Trainer framework with its default behaviors on a multi-GPU machine. Using DDP is a good suggestion, thanks!

Nothing comes to mind as a simple fix at the moment, so for now let’s just do a clean assert as you suggested: https://github.com/huggingface/transformers/pull/12816
If someone is stuck and can’t use DDP, we will revisit this.
And of course, if you or someone else would like to work on an actual solution, it’d be very welcome. I don’t see nn.DataParallel having any hooks, so most likely this will require overriding torch.nn.parallel.data_parallel.replicate to refresh the model references after the replication. You can see the source code here: https://pytorch.org/docs/stable/_modules/torch/nn/parallel/data_parallel.html#DataParallel
So it can be done, but it is not really worth it, IMHO.
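For anyone who wants to experiment with that direction, here is a very rough, untested sketch of the "refresh the references after replication" idea (this is not the assert that was merged in the PR above; DebugFriendlyDataParallel and the debug_tracker argument are made-up names). It subclasses DataParallel and, after each replication, registers the replica modules in the debug object's module_names dict under the names from named_modules(), so the shared forward hook can resolve them:
import torch

class DebugFriendlyDataParallel(torch.nn.DataParallel):
    """Hypothetical sketch: after replicating the model, teach the
    DebugUnderflowOverflow instance about the replica modules so its
    module_names lookup no longer raises KeyError."""

    def __init__(self, module, debug_tracker, **kwargs):
        super().__init__(module, **kwargs)
        self.debug_tracker = debug_tracker  # assumed to be the DebugUnderflowOverflow created in Trainer.train

    def replicate(self, module, device_ids):
        replicas = super().replicate(module, device_ids)
        for replica in replicas:
            # the replicas have the same hierarchy as the original model,
            # so named_modules() yields the same names in the same order
            for name, submodule in replica.named_modules():
                self.debug_tracker.module_names.setdefault(submodule, name)
        return replicas
Note that DataParallel re-replicates the model on every forward pass, so module_names would accumulate entries for stale replicas over a training run; that is tolerable for a short debugging session, but it is another reason a clean fix is not trivial.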