DDP (multi-GPU) IterableDataset is not working as expected?
Bug description
Hi,
I am currently testing IterableDataset with DDP.

Total examples - 10000
Batch size - 32
Num GPUs - 2

With an IterableDataset on 2 GPUs, one epoch should ideally run 157 steps (10000 examples / 32 per batch / 2 GPUs). Instead, it runs for 314 steps (10000 / 32), i.e. each GPU iterates over the full dataset.

This issue only occurs with IterableDataset. When I use a normal map-style Dataset from torch, everything works as expected. Is there any reason for this particular behaviour?
How to reproduce the bug
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

import time

import lightning as L
import torch
from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification

BATCH_SIZE = 32
NUM_WORKERS = 1

# Load dataset into memory and tokenize it
imdb_data = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_text(batch):
    return tokenizer(batch["text"], truncation=True, padding=True)

imdb_tokenized = imdb_data.map(tokenize_text, batched=True, batch_size=None)
imdb_tokenized.set_format("torch", columns=["input_ids", "attention_mask", "label"])

def custom_iterator():
    for item in imdb_tokenized["train"]:
        inputs = {"input_ids": item["input_ids"], "attention_mask": item["attention_mask"]}
        labels = {"labels": item["label"]}
        yield inputs, labels

class MyIterableDataset(torch.utils.data.IterableDataset):
    def __iter__(self):
        yield from custom_iterator()

train_dataset = MyIterableDataset()
train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=BATCH_SIZE,
    num_workers=NUM_WORKERS,
    persistent_workers=False,
)

# Load model
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Lightning module
class LightningModel(L.LightningModule):
    def __init__(self, model, learning_rate=5e-5):
        super().__init__()
        self.learning_rate = learning_rate
        self.model = model

    def forward(self, input_ids, attention_mask, labels):
        return self.model(input_ids, attention_mask=attention_mask, labels=labels)

    def training_step(self, batch, batch_idx):
        inputs, labels = batch
        outputs = self(inputs["input_ids"], attention_mask=inputs["attention_mask"],
                       labels=labels["labels"])
        self.log("train_loss", outputs["loss"])
        # Log the sum of the input tensor to check which examples each rank sees
        self.log("tensor_sum", torch.sum(inputs["input_ids"]))
        return outputs["loss"]  # the returned loss is passed to the optimizer

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.learning_rate)
        return optimizer

lightning_model = LightningModel(model)

from pytorch_lightning.loggers import CSVLogger, WandbLogger

name = "train_ddp_map-iterable"
logger = CSVLogger(save_dir="logs/", name=name)
wandb_logger = WandbLogger(project="DDP_exps", name=name)

def train_model():
    max_epochs = 2
    if os.path.exists("checkpoints"):
        import shutil
        shutil.rmtree("checkpoints")
    trainer = L.Trainer(
        max_epochs=max_epochs,
        callbacks=None,
        accelerator="gpu",
        devices=[0, 1],
        logger=[logger, wandb_logger],
        strategy="ddp",
        enable_progress_bar=True,
        log_every_n_steps=1,
    )
    trainer.fit(model=lightning_model, train_dataloaders=train_loader)

if __name__ == "__main__":
    start_time = time.time()
    train_model()
    end_time = time.time()
    print()
    print("Time taken to train model is {} seconds".format(end_time - start_time))
Error messages and logs
# Error messages and logs here please
Environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 1.10):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):
More info
No response
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
No. num_workers has nothing to do with the sampling of the data.
Read more about workers here: https://pytorch.org/docs/stable/data.html#multi-process-data-loading
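For context on that link: with an IterableDataset, each DataLoader worker also receives the full stream, so the dataset typically has to shard per worker as well. A minimal sketch using torch.utils.data.get_worker_info(); the class name WorkerShardedDataset is illustrative, not from the original thread:

import torch
from torch.utils.data import IterableDataset, get_worker_info

class WorkerShardedDataset(IterableDataset):
    """Sketch: each DataLoader worker yields a disjoint slice of the stream."""
    def __init__(self, data):
        self.data = data

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            # Single-process loading: yield everything
            yield from self.data
        else:
            # Stride the stream so worker k gets items k, k + num_workers, ...
            for i, item in enumerate(self.data):
                if i % info.num_workers == info.id:
                    yield item

Note this handles duplication across DataLoader workers only; duplication across DDP ranks is a separate axis, addressed below.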
Yes, this is expected. Lightning can't know how to shard the data/iterator you provide. You need to make sure your iterator returns half of the data on GPU 0 and the other half on GPU 1. You can do this, for example, by changing your for loop to something like this (typos expected):
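(The maintainer's original snippet did not survive the page capture; below is a minimal reconstruction of the described approach, striding the poster's custom_iterator by rank. The parameters global_rank and world_size are illustrative names for the process index and process count.)

def custom_iterator(global_rank, world_size):
    # Each DDP rank keeps every world_size-th example, so rank 0 and
    # rank 1 see disjoint halves of the stream.
    for counter, item in enumerate(imdb_tokenized["train"]):
        if counter % world_size != global_rank:
            continue
        inputs = {"input_ids": item["input_ids"], "attention_mask": item["attention_mask"]}
        labels = {"labels": item["label"]}
        yield inputs, labels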
This shards your data. The rank can be accessed, for example, through trainer.global_rank. If you do this, you need to make sure the iterator returns the same amount of data on each rank (e.g., drop the remainder). Another way would be to use the DistributedSampler inside your iterable dataset.
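A minimal sketch of that second suggestion, assuming the underlying data supports len() and indexing and that the DDP process group is already initialized when the loader iterates (the class name and wiring are illustrative, not from the original comment):

from torch.utils.data import IterableDataset
from torch.utils.data.distributed import DistributedSampler

class DistributedIterableDataset(IterableDataset):
    """Sketch: reuse DistributedSampler to pick this rank's indices."""
    def __init__(self, indexable_data):
        self.data = indexable_data
        # DistributedSampler reads rank/world size from the initialized
        # process group; drop_last=True drops the remainder so all ranks
        # yield the same number of examples.
        self.sampler = DistributedSampler(self.data, shuffle=False, drop_last=True)

    def __iter__(self):
        for idx in self.sampler:
            yield self.data[idx]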