LayoutLMv2 model not supporting training on more than 1 GPU when using PyTorch Data Parallel
Environment info
- transformers version: 4.11.2
- Platform: Linux-5.4.0-66-generic-x86_64-with-glibc2.10
- Python version: 3.8.8
- PyTorch version (GPU?): 1.9.1+cu102 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: Yes
Who can help
Models: LayoutLMv2 @NielsRogge
Information
Model I am using: LayoutLMv2
The problem arises when using:
- my own modified scripts
The task I am working on is:
- token classification on FUNSD
To reproduce
Steps to reproduce the behavior:
- Run the script below with more than 1 GPU
from datasets import load_dataset, load_metric
from datasets import Features, Sequence, ClassLabel, Value, Array2D, Array3D
from PIL import Image
import torch
from torch.nn import DataParallel
from torch.utils.data import DataLoader
from transformers import LayoutLMv2Processor, LayoutLMv2ForTokenClassification, AdamW
from tqdm.notebook import tqdm

use_cuda = torch.cuda.is_available()
device = torch.device('cuda:0' if use_cuda else 'cpu')
print(device)
device_ids = [0, 1]

datasets = load_dataset("nielsr/funsd")
labels = datasets['train'].features['ner_tags'].feature.names
print(labels)
id2label = {idx: label for idx, label in enumerate(labels)}
label2id = {label: idx for idx, label in enumerate(labels)}
## Next, let's use `LayoutLMv2Processor` to prepare the data for the model.
processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased", revision="no_ocr")

# we need to define custom features
features = Features({
    'image': Array3D(dtype="int64", shape=(3, 224, 224)),
    'input_ids': Sequence(feature=Value(dtype='int64')),
    'attention_mask': Sequence(Value(dtype='int64')),
    'token_type_ids': Sequence(Value(dtype='int64')),
    'bbox': Array2D(dtype="int64", shape=(512, 4)),
    'labels': Sequence(ClassLabel(names=labels)),
})

def preprocess_data(examples):
    images = [Image.open(path).convert("RGB") for path in examples['image_path']]
    words = examples['words']
    boxes = examples['bboxes']
    word_labels = examples['ner_tags']
    encoded_inputs = processor(images, words, boxes=boxes, word_labels=word_labels,
                               padding="max_length", truncation=True)
    return encoded_inputs

train_dataset = datasets['train'].map(preprocess_data, batched=True,
                                      remove_columns=datasets['train'].column_names,
                                      features=features)
test_dataset = datasets['test'].map(preprocess_data, batched=True,
                                    remove_columns=datasets['test'].column_names,
                                    features=features)
processor.tokenizer.decode(train_dataset['input_ids'][0])
print(train_dataset['labels'][0])
## Finally, let's set the format to PyTorch and place everything on the GPU:
train_dataset.set_format(type="torch", device=device)
test_dataset.set_format(type="torch", device=device)
train_dataset.features.keys()

## Next, we create the corresponding dataloaders.
train_dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=2)

## Let's verify a batch:
batch = next(iter(train_dataloader))
for k, v in batch.items():
    print(k, v.shape)
## Train the model
## Here we train the model in native PyTorch. We use the AdamW optimizer.
model = LayoutLMv2ForTokenClassification.from_pretrained('microsoft/layoutlmv2-base-uncased',
                                                         num_labels=len(labels))
if use_cuda:
    model = DataParallel(model, device_ids=device_ids)
model.to(device)

optimizer = AdamW(model.parameters(), lr=5e-5)

global_step = 0
num_train_epochs = 6
t_total = len(train_dataloader) * num_train_epochs  # total number of training steps

# put the model in training mode
model.train()
for epoch in range(num_train_epochs):
    print("Epoch:", epoch)
    for batch in tqdm(train_dataloader):
        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = model(**batch)
        loss = outputs.loss

        # print loss every 100 steps
        if global_step % 100 == 0:
            print(f"Loss after {global_step} steps: {loss.item()}")

        loss.backward()
        optimizer.step()
        global_step += 1
## Evaluation
# Next, let's evaluate the model on the test set.
metric = load_metric("seqeval")

# put model in evaluation mode
model.eval()
for batch in tqdm(test_dataloader, desc="Evaluating"):
    with torch.no_grad():
        input_ids = batch['input_ids'].to(device)
        bbox = batch['bbox'].to(device)
        image = batch['image'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        token_type_ids = batch['token_type_ids'].to(device)
        labels = batch['labels'].to(device)

        # forward pass
        outputs = model(input_ids=input_ids, bbox=bbox, image=image, attention_mask=attention_mask,
                        token_type_ids=token_type_ids, labels=labels)

        # predictions
        predictions = outputs.logits.argmax(dim=2)

        # Remove ignored index (special tokens)
        true_predictions = [
            [id2label[p.item()] for (p, l) in zip(prediction, label) if l != -100]
            for prediction, label in zip(predictions, labels)
        ]
        true_labels = [
            [id2label[l.item()] for (p, l) in zip(prediction, label) if l != -100]
            for prediction, label in zip(predictions, labels)
        ]

        metric.add_batch(predictions=true_predictions, references=true_labels)

final_score = metric.compute()
print(final_score)
Error
Epoch: 0
0%| | 0/38 [00:00<?, ?it/s]
/home/puneetm/anaconda3/lib/python3.8/site-packages/torch/_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at ../aten/src/ATen/native/BinaryOps.cpp:467.)
return torch.floor_divide(self, other)
Traceback (most recent call last):
File "llmv2_demo.py", line 111, in <module>
outputs = model(**batch)
File "/home/puneetm/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/puneetm/anaconda3/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/puneetm/anaconda3/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/puneetm/anaconda3/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/home/puneetm/anaconda3/lib/python3.8/site-packages/torch/_utils.py", line 425, in reraise
raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
File "/home/puneetm/anaconda3/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/home/puneetm/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/puneetm/anaconda3/lib/python3.8/site-packages/transformers/models/layoutlmv2/modeling_layoutlmv2.py", line 1167, in forward
outputs = self.layoutlmv2(
File "/home/puneetm/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/puneetm/anaconda3/lib/python3.8/site-packages/transformers/models/layoutlmv2/modeling_layoutlmv2.py", line 898, in forward
visual_emb = self._calc_img_embeddings(
File "/home/puneetm/anaconda3/lib/python3.8/site-packages/transformers/models/layoutlmv2/modeling_layoutlmv2.py", line 762, in _calc_img_embeddings
visual_embeddings = self.visual_proj(self.visual(image))
File "/home/puneetm/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/puneetm/anaconda3/lib/python3.8/site-packages/transformers/models/layoutlmv2/modeling_layoutlmv2.py", line 590, in forward
features = self.backbone(images_input)
File "/home/puneetm/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/puneetm/anaconda3/lib/python3.8/site-packages/detectron2/modeling/backbone/fpn.py", line 126, in forward
bottom_up_features = self.bottom_up(x)
File "/home/puneetm/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/puneetm/anaconda3/lib/python3.8/site-packages/detectron2/modeling/backbone/resnet.py", line 449, in forward
x = stage(x)
File "/home/puneetm/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/puneetm/anaconda3/lib/python3.8/site-packages/torch/nn/modules/container.py", line 139, in forward
input = module(input)
File "/home/puneetm/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/puneetm/anaconda3/lib/python3.8/site-packages/detectron2/modeling/backbone/resnet.py", line 195, in forward
out = self.conv1(x)
File "/home/puneetm/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/puneetm/anaconda3/lib/python3.8/site-packages/detectron2/layers/wrappers.py", line 84, in forward
x = F.conv2d(
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking arugment for argument weight in method wrapper_cudnn_convolution)
Are you running all of this in a notebook or as a script? The authors defined everything in a Python script, which they then launch with torch.distributed.launch. That's the recommended way to train deep learning models with PyTorch on multiple GPUs. torch.distributed.launch is a helper utility that can be used to launch multiple processes per node for distributed training.
It would be great if we could add an example script for LayoutLMv2/LayoutXLM to the examples folder of HuggingFace Transformers. It would mean updating the Python script so that it works with HuggingFace Transformers instead of the original unilm repository.
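For reference, a multi-GPU run with this utility is typically started with one process per GPU, along these lines (a sketch only; the GPU count and script name are placeholders, not the authors' exact command):

```
python -m torch.distributed.launch --nproc_per_node=2 your_training_script.py
```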
Are you interested in contributing this?
Hi @NielsRogge, thanks for your quick response. I looked at that repo just a couple of minutes ago. The problem I face with that solution is that it gives this error:
I read the above-linked post. The OP there faces the same problem, and you recommend the following:
Using this in the code forces me to implement DistributedDataParallel instead of the conventional DataParallel. Can you suggest something to help further?
It requires setting up the backend, rank, and world_size for DistributedDataParallel. Is this the way to go? Can you give an example of a running script that handles batch synchronization without forcing DataParallel?
Currently, I have added the following lines of code to my script:
The terminal hangs and no output is displayed.
Any help on this would be highly appreciated! Thanks once again!
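For context, below is a minimal sketch of what a DistributedDataParallel setup typically looks like when launched with torch.distributed.launch. Everything here is illustrative: the file name, num_labels value, batch size, learning rate, and the train_dataset argument (assumed to be the preprocessed FUNSD split from the script above, formatted as torch tensors without a device) are placeholders, not the code actually used in this issue.

```python
# ddp_sketch.py -- hypothetical skeleton, not the script used in this issue
# Run with one process per GPU via torch.distributed.launch (see the command above).
import argparse

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler
from transformers import AdamW, LayoutLMv2ForTokenClassification


def train_ddp(train_dataset, num_labels, local_rank, num_epochs=6):
    # train_dataset is assumed to be the preprocessed FUNSD split built earlier
    # in this issue, with set_format(type="torch") and *no* device argument.
    torch.cuda.set_device(local_rank)
    # torch.distributed.launch exports MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE,
    # so the default env:// initialization works here.
    dist.init_process_group(backend="nccl")

    model = LayoutLMv2ForTokenClassification.from_pretrained(
        "microsoft/layoutlmv2-base-uncased", num_labels=num_labels
    )
    model.to(local_rank)
    model = DDP(model, device_ids=[local_rank], output_device=local_rank)

    # DistributedSampler gives every process its own shard of the data,
    # so the per-process batches stay disjoint and in sync.
    sampler = DistributedSampler(train_dataset)
    dataloader = DataLoader(train_dataset, batch_size=4, sampler=sampler)

    optimizer = AdamW(model.parameters(), lr=5e-5)
    model.train()
    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)  # reshuffle the shards every epoch
        for batch in dataloader:
            batch = {k: v.to(local_rank) for k, v in batch.items()}
            optimizer.zero_grad()
            loss = model(**batch).loss  # a scalar in each process
            loss.backward()             # DDP averages gradients across processes
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # torch.distributed.launch passes --local_rank to every process it spawns
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()
    # Building train_dataset (the preprocessed FUNSD split) is omitted here;
    # plug it in and call, e.g.:
    # train_ddp(train_dataset, num_labels=len(labels), local_rank=args.local_rank)
```

Each process owns a single GPU: DDP averages gradients during the backward pass, and DistributedSampler keeps the per-process batches disjoint, which is what handles the batch synchronization asked about above.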