
DebugUnderflowOverflow crashes with Multi-GPU training


Environment info

  • transformers version: 4.8.2
  • Platform: Linux-4.15.0-29-generic-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.6.9
  • PyTorch version (GPU?): 1.9.0+cu102 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: Yes

Who can help

@stas00 @sgugger

Information

Model I am using (Bert, XLNet …): Bart (but this is inessential)

The problem arises when using:

  • my own modified scripts (see the reproduction script below)

The task I am working on is:

  • my own task or dataset (a dummy dataset, see below)

To reproduce

Steps to reproduce the behavior:

  1. Instantiate a debug_utils.DebugUnderflowOverflow for a model (I do this by passing debug="underflow_overflow" to TrainingArguments; a sketch of the direct usage follows this list).
  2. Train using a multi-GPU setup.
  3. The forward hook that DebugUnderflowOverflow registers crashes because of a failed lookup in the class's module_names dict.
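
For reference, step 1 amounts to registering the debug hooks directly, roughly as below. This is a minimal sketch based on the documented DebugUnderflowOverflow API and the tiny Bart config from the reproduction script; it is not taken from the issue itself:

from transformers import BartConfig, BartForConditionalGeneration
from transformers.debug_utils import DebugUnderflowOverflow

# Build the same tiny Bart model used in the reproduction script below.
config = BartConfig(vocab_size=10, max_position_embeddings=10, d_model=8,
                    encoder_layers=1, decoder_layers=1,
                    encoder_attention_heads=1, decoder_attention_heads=1,
                    decoder_ffn_dim=8, encoder_ffn_dim=8)
model = BartForConditionalGeneration(config)

# Registers forward hooks on every submodule; each hook inspects weights and
# activations for inf/nan values after forward() and keeps a trace of recent frames.
debug_overflow = DebugUnderflowOverflow(model)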

This does not happen on single-GPU training.

Here's an example that crashes when I run it on my (4-GPU) machine, but does not crash if I restrict it to a single GPU (by setting export CUDA_VISIBLE_DEVICES=0 before invoking the script):

from torch.utils.data import Dataset
from transformers import (BartForConditionalGeneration, BartModel, BartConfig,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer)


class DummyDataset(Dataset):
  def __len__(self):
    return 5

  def __getitem__(self, idx):
    return {'input_ids': list(range(idx, idx+3)),
            'labels': list(range(idx, idx+3))}


def main():
  train_dataset = DummyDataset()
  config = BartConfig(vocab_size=10, max_position_embeddings=10, d_model=8,
                      encoder_layers=1, decoder_layers=1,
                      encoder_attention_heads=1, decoder_attention_heads=1,
                      decoder_ffn_dim=8, encoder_ffn_dim=8)
  model = BartForConditionalGeneration(config)
  args = Seq2SeqTrainingArguments(output_dir="tmp", do_train=True,
                                  debug="underflow_overflow")
  trainer = Seq2SeqTrainer(model=model, args=args, train_dataset=train_dataset)
  trainer.train()

if __name__ == '__main__':
  main()

I get the following stack trace on my multi-GPU machine:

Traceback (most recent call last):
  File "./kbp_dbg.py", line 29, in <module>
    main()
  File "./kbp_dbg.py", line 26, in main
    trainer.train()
  File "[...]/transformers/trainer.py", line 1269, in train
    tr_loss += self.training_step(model, inputs)
  File "[...]/transformers/trainer.py", line 1762, in training_step
    loss = self.compute_loss(model, inputs)
  File "[...]/transformers/trainer.py", line 1794, in compute_loss
    outputs = model(**inputs)
  File "[...]/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "[...]/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "[...]/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "[...]/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "[...]/torch/_utils.py", line 425, in reraise
    raise self.exc_type(msg)
KeyError: Caught KeyError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "[...]/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "[...]/torch/nn/modules/module.py", line 1071, in _call_impl
    result = forward_call(*input, **kwargs)
  File "[...]/transformers/models/bart/modeling_bart.py", line 1308, in forward
    return_dict=return_dict,
  File "[...]/torch/nn/modules/module.py", line 1071, in _call_impl
    result = forward_call(*input, **kwargs)
  File "[...]/transformers/models/bart/modeling_bart.py", line 1173, in forward
    return_dict=return_dict,
  File "[...]/torch/nn/modules/module.py", line 1071, in _call_impl
    result = forward_call(*input, **kwargs)
  File "[...]/transformers/models/bart/modeling_bart.py", line 756, in forward
    inputs_embeds = self.embed_tokens(input_ids) * self.embed_scale
  File "[...]/torch/nn/modules/module.py", line 1076, in _call_impl
    hook_result = hook(self, input, result)
  File "[...]/transformers/debug_utils.py", line 246, in forward_hook
    self.create_frame(module, input, output)
  File "[...]/transformers/debug_utils.py", line 193, in create_frame
    self.expand_frame(f"{self.prefix} {self.module_names[module]} {module.__class__.__name__}")
KeyError: Embedding(10, 8, padding_idx=1)

And, again, it completes totally fine if you restrict visibility to a single GPU via CUDA_VISIBLE_DEVICES before running this.

Having looked into it, I strongly suspect what’s happening is the following:

  • The DebugUnderflowOverflow instantiated in Trainer.train populates, in its constructor, a module_names dict mapping nn.Module objects to names (link).
  • If multi-GPU training is enabled, nn.DataParallel is used (I think here in trainer.py).
  • On its forward pass, DataParallel calls torch.nn.parallel.replicate (link), which I believe calls nn.Module._replicate_for_data_parallel() here, replicating the model once per GPU.
  • In general, this results in different hash() values for the replicated nn.Module objects on different GPUs.
  • Finally, the module_names lookup in create_frame() crashes, since the forward hooks now fire on the replicated nn.Module objects, whose hash() values no longer match the original modules used as dict keys.

You can confirm that a module hashes differently after replication with the following snippet:

import torch

m = torch.nn.Embedding(7, 5)
m2 = m._replicate_for_data_parallel()
print(hash(m))
print(hash(m2))

This will print something like:

8730470757753
8730462809617
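
The same mismatch is what breaks the lookup: module_names is keyed by the original module objects, so indexing it with a replica raises KeyError. Here is a minimal sketch of that failure mode (the dict and the name used here are illustrative, not the actual transformers internals):

import torch

m = torch.nn.Embedding(7, 5)
module_names = {m: "model.shared"}          # analogous to DebugUnderflowOverflow's module_names

replica = m._replicate_for_data_parallel()  # what nn.DataParallel does per GPU

print(module_names[m])        # prints "model.shared"
print(module_names[replica])  # raises KeyError: the replica is a distinct object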

I'm not sure what the best way to fix this is. If there's a way to instantiate a separate DebugUnderflowOverflow for each GPU after replication, that might solve the issue (since per-replica hashes would presumably be consistent then), but I'm not sure whether that's feasible or the best approach.

One could also make the offending f-string construction in create_frame use a .get() with a default for the module_names lookup, rather than square brackets, but that would probably make the debugging traces too uninformative. So I figured I'd just open a bug.
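
For illustration, that .get()-based variant of the lookup shown at the bottom of the traceback might look like this; it is a hypothetical sketch, not a proposed patch:

# Hypothetical fallback for the lookup in debug_utils.create_frame: avoids the
# KeyError, but replicated modules would only ever be labeled with a placeholder.
self.expand_frame(
    f"{self.prefix} {self.module_names.get(module, '<unknown module>')} "
    f"{module.__class__.__name__}"
)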

Expected behavior

I'd expect it either not to crash, or to fail fast with a clear "overflow/underflow debugging is unsupported for multi-GPU" error. I originally thought this was a bug in the model code rather than a multi-GPU issue.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

kpich commented on Jul 20, 2021 (1 reaction)

No, the only reason I'm using DP rather than DDP is that I'm using the Trainer framework with its default behavior on a multi-GPU machine. Using DDP is a good suggestion, thanks!
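
For reference, with the Trainer that means launching the same script with the distributed launcher instead of letting it fall back to nn.DataParallel, e.g. (assuming the 4-GPU machine from above and the PyTorch 1.9-era launcher):

python -m torch.distributed.launch --nproc_per_node=4 kbp_dbg.py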

stas00 commented on Jul 21, 2021

Nothing comes to mind as a simple fix at the moment, so for now let’s just do a clean assert as you suggested. https://github.com/huggingface/transformers/pull/12816

If someone is stuck and can’t use DDP we will revisit this.

And of course, if you or someone else would like to work on an actual solution, it'd be very welcome. I don't see nn.DataParallel having any hooks, so most likely this will require overriding torch.nn.parallel.data_parallel.replicate to refresh the model references after replication. You can see the source code here: https://pytorch.org/docs/stable/_modules/torch/nn/parallel/data_parallel.html#DataParallel. So it can be done, but it is not really worth it, IMHO.
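
For context, the "clean assert" mentioned above amounts to refusing the underflow/overflow debug option under naive nn.DataParallel. A hypothetical version of such a guard (not the actual diff from the linked PR) might look like:

# Hypothetical guard in the spirit of the fix discussed above: fail fast with a
# clear message instead of crashing inside the forward hook.
if args.n_gpu > 1 and "underflow_overflow" in str(args.debug):
    raise ValueError(
        "debug='underflow_overflow' is not supported with nn.DataParallel "
        "(multi-GPU without torch.distributed.launch); use DDP or a single GPU."
    )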
