LayoutLMv2 model not supporting training on more than 1 GPU when using PyTorch Data Parallel
Environment info
- transformers version: 4.11.2
- Platform: Linux-5.4.0-66-generic-x86_64-with-glibc2.10
- Python version: 3.8.8
- PyTorch version (GPU?): 1.9.1+cu102 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: Yes
Who can help
Models: LayoutLMv2 @NielsRogge
Information
Model I am using: LayoutLMv2
The problem arises when using:
- my own modified scripts
The task I am working on is:
- token classification on FUNSD
To reproduce
Steps to reproduce the behavior:
- Run the script below with more than 1 GPU
from datasets import load_dataset, load_metric
from datasets import Features, Sequence, ClassLabel, Value, Array2D, Array3D
from PIL import Image
import torch
from torch.nn import DataParallel
from torch.utils.data import DataLoader
from transformers import LayoutLMv2Processor, LayoutLMv2ForTokenClassification, AdamW
from tqdm.notebook import tqdm

use_cuda = torch.cuda.is_available()
device = torch.device('cuda:0' if use_cuda else 'cpu')
print(device)
device_ids = [0, 1]

datasets = load_dataset("nielsr/funsd")
labels = datasets['train'].features['ner_tags'].feature.names
print(labels)
id2label = {idx: label for idx, label in enumerate(labels)}
label2id = {label: idx for idx, label in enumerate(labels)}
## Next, let's use `LayoutLMv2Processor` to prepare the data for the model.
processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased", revision="no_ocr")

# we need to define custom features
features = Features({
    'image': Array3D(dtype="int64", shape=(3, 224, 224)),
    'input_ids': Sequence(feature=Value(dtype='int64')),
    'attention_mask': Sequence(Value(dtype='int64')),
    'token_type_ids': Sequence(Value(dtype='int64')),
    'bbox': Array2D(dtype="int64", shape=(512, 4)),
    'labels': Sequence(ClassLabel(names=labels)),
})

def preprocess_data(examples):
    images = [Image.open(path).convert("RGB") for path in examples['image_path']]
    words = examples['words']
    boxes = examples['bboxes']
    word_labels = examples['ner_tags']
    encoded_inputs = processor(images, words, boxes=boxes, word_labels=word_labels,
                               padding="max_length", truncation=True)
    return encoded_inputs

train_dataset = datasets['train'].map(preprocess_data, batched=True,
                                      remove_columns=datasets['train'].column_names,
                                      features=features)
test_dataset = datasets['test'].map(preprocess_data, batched=True,
                                    remove_columns=datasets['test'].column_names,
                                    features=features)
processor.tokenizer.decode(train_dataset['input_ids'][0])
print(train_dataset['labels'][0])
## Finally, let's set the format to PyTorch and place everything on the GPU:
train_dataset.set_format(type="torch", device=device)
test_dataset.set_format(type="torch", device=device)
train_dataset.features.keys()

## Next, we create the corresponding dataloaders.
train_dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=2)

## Let's verify a batch:
batch = next(iter(train_dataloader))
for k, v in batch.items():
    print(k, v.shape)
## Train the model
## Here we train the model in native PyTorch. We use the AdamW optimizer.
model = LayoutLMv2ForTokenClassification.from_pretrained('microsoft/layoutlmv2-base-uncased',
                                                         num_labels=len(labels))
if use_cuda:
    model = DataParallel(model, device_ids=device_ids)
model.to(device)

optimizer = AdamW(model.parameters(), lr=5e-5)

global_step = 0
num_train_epochs = 6
t_total = len(train_dataloader) * num_train_epochs  # total number of training steps

# put the model in training mode
model.train()
for epoch in range(num_train_epochs):
    print("Epoch:", epoch)
    for batch in tqdm(train_dataloader):
        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = model(**batch)
        loss = outputs.loss

        # print loss every 100 steps
        if global_step % 100 == 0:
            print(f"Loss after {global_step} steps: {loss.item()}")

        loss.backward()
        optimizer.step()
        global_step += 1
## Evaluation
# Next, let's evaluate the model on the test set.
metric = load_metric("seqeval")

# put model in evaluation mode
model.eval()
for batch in tqdm(test_dataloader, desc="Evaluating"):
    with torch.no_grad():
        input_ids = batch['input_ids'].to(device)
        bbox = batch['bbox'].to(device)
        image = batch['image'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        token_type_ids = batch['token_type_ids'].to(device)
        labels = batch['labels'].to(device)

        # forward pass
        outputs = model(input_ids=input_ids, bbox=bbox, image=image, attention_mask=attention_mask,
                        token_type_ids=token_type_ids, labels=labels)

        # predictions
        predictions = outputs.logits.argmax(dim=2)

        # Remove ignored index (special tokens)
        true_predictions = [
            [id2label[p.item()] for (p, l) in zip(prediction, label) if l != -100]
            for prediction, label in zip(predictions, labels)
        ]
        true_labels = [
            [id2label[l.item()] for (p, l) in zip(prediction, label) if l != -100]
            for prediction, label in zip(predictions, labels)
        ]

        metric.add_batch(predictions=true_predictions, references=true_labels)

final_score = metric.compute()
print(final_score)
Error
Epoch: 0
0%| | 0/38 [00:00<?, ?it/s]
/home/puneetm/anaconda3/lib/python3.8/site-packages/torch/_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at ../aten/src/ATen/native/BinaryOps.cpp:467.)
return torch.floor_divide(self, other)
Traceback (most recent call last):
File "llmv2_demo.py", line 111, in <module>
outputs = model(**batch)
File "/home/puneetm/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/puneetm/anaconda3/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/puneetm/anaconda3/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/puneetm/anaconda3/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/home/puneetm/anaconda3/lib/python3.8/site-packages/torch/_utils.py", line 425, in reraise
raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
File "/home/puneetm/anaconda3/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/home/puneetm/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/puneetm/anaconda3/lib/python3.8/site-packages/transformers/models/layoutlmv2/modeling_layoutlmv2.py", line 1167, in forward
outputs = self.layoutlmv2(
File "/home/puneetm/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/puneetm/anaconda3/lib/python3.8/site-packages/transformers/models/layoutlmv2/modeling_layoutlmv2.py", line 898, in forward
visual_emb = self._calc_img_embeddings(
File "/home/puneetm/anaconda3/lib/python3.8/site-packages/transformers/models/layoutlmv2/modeling_layoutlmv2.py", line 762, in _calc_img_embeddings
visual_embeddings = self.visual_proj(self.visual(image))
File "/home/puneetm/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/puneetm/anaconda3/lib/python3.8/site-packages/transformers/models/layoutlmv2/modeling_layoutlmv2.py", line 590, in forward
features = self.backbone(images_input)
File "/home/puneetm/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/puneetm/anaconda3/lib/python3.8/site-packages/detectron2/modeling/backbone/fpn.py", line 126, in forward
bottom_up_features = self.bottom_up(x)
File "/home/puneetm/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/puneetm/anaconda3/lib/python3.8/site-packages/detectron2/modeling/backbone/resnet.py", line 449, in forward
x = stage(x)
File "/home/puneetm/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/puneetm/anaconda3/lib/python3.8/site-packages/torch/nn/modules/container.py", line 139, in forward
input = module(input)
File "/home/puneetm/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/puneetm/anaconda3/lib/python3.8/site-packages/detectron2/modeling/backbone/resnet.py", line 195, in forward
out = self.conv1(x)
File "/home/puneetm/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/puneetm/anaconda3/lib/python3.8/site-packages/detectron2/layers/wrappers.py", line 84, in forward
x = F.conv2d(
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking arugment for argument weight in method wrapper_cudnn_convolution)
Are you running all of this in a notebook or as a script? The authors defined everything in a Python script, which they then launch with torch.distributed.launch. That's the recommended way to train deep learning models with PyTorch on multiple GPUs. torch.distributed.launch is a helper utility that can be used to launch multiple processes per node for distributed training.
It would be great if we could add an example script for LayoutLMv2/LayoutXLM to the examples folder of HuggingFace Transformers. It would mean updating the Python script so that it works with HuggingFace Transformers instead of the original unilm repository.
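For reference, a multi-GPU run with this utility is typically started with one process per GPU, along these lines (a sketch only; the GPU count and script name are placeholders, not the authors' exact command):

```
python -m torch.distributed.launch --nproc_per_node=2 your_training_script.py
```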
Are you interested in contributing this?
Hi @NielsRogge, thanks for your quick response. I looked at that repo just a couple of minutes ago. The problem I face with that solution is that it gives this error:
I read the above-linked post. The OP there faces the same problem, and you recommend the following:
Using this in the code forces me to implement DistributedDataParallel instead of the conventional DataParallel. Can you suggest something to help further?
It requires setting up the backend, rank, and world_size for DistributedDataParallel. Is this the way to go? Can you give an example of a running script that handles batch synchronization without forcing DataParallel?
Currently, I have added the following lines of code to my script:
The terminal hangs and no output is displayed.
Any help on this would be highly appreciated! Thanks once again!
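For context, below is a minimal sketch of what a DistributedDataParallel setup typically looks like when launched with torch.distributed.launch. Everything here is illustrative: the file name, num_labels value, batch size, learning rate, and the train_dataset argument (assumed to be the preprocessed FUNSD split from the script above, formatted as torch tensors without a device) are placeholders, not the code actually used in this issue.

```python
# ddp_sketch.py -- hypothetical skeleton, not the script used in this issue
# Run with one process per GPU via torch.distributed.launch (see the command above).
import argparse

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler
from transformers import AdamW, LayoutLMv2ForTokenClassification


def train_ddp(train_dataset, num_labels, local_rank, num_epochs=6):
    # train_dataset is assumed to be the preprocessed FUNSD split built earlier
    # in this issue, with set_format(type="torch") and *no* device argument.
    torch.cuda.set_device(local_rank)
    # torch.distributed.launch exports MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE,
    # so the default env:// initialization works here.
    dist.init_process_group(backend="nccl")

    model = LayoutLMv2ForTokenClassification.from_pretrained(
        "microsoft/layoutlmv2-base-uncased", num_labels=num_labels
    )
    model.to(local_rank)
    model = DDP(model, device_ids=[local_rank], output_device=local_rank)

    # DistributedSampler gives every process its own shard of the data,
    # so the per-process batches stay disjoint and in sync.
    sampler = DistributedSampler(train_dataset)
    dataloader = DataLoader(train_dataset, batch_size=4, sampler=sampler)

    optimizer = AdamW(model.parameters(), lr=5e-5)
    model.train()
    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)  # reshuffle the shards every epoch
        for batch in dataloader:
            batch = {k: v.to(local_rank) for k, v in batch.items()}
            optimizer.zero_grad()
            loss = model(**batch).loss  # a scalar in each process
            loss.backward()             # DDP averages gradients across processes
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # torch.distributed.launch passes --local_rank to every process it spawns
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()
    # Building train_dataset (the preprocessed FUNSD split) is omitted here;
    # plug it in and call, e.g.:
    # train_ddp(train_dataset, num_labels=len(labels), local_rank=args.local_rank)
```

Each process owns a single GPU: DDP averages gradients during the backward pass, and DistributedSampler keeps the per-process batches disjoint, which is what handles the batch synchronization asked about above.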