
DDP (multi-GPU) IterableDataset is not working as expected?

See original GitHub issue

Bug description

Hi,

I am currently testing with IterableDataset and DDP.

Total examples: 10000, batch size: 32, number of GPUs: 2.

While using an IterableDataset with 2 GPUs, each epoch should ideally run 157 steps (10000 examples / 32 batch size / 2 GPUs). Instead, it runs for 314 steps (10000 / 32).

This happens only with IterableDataset. When I use a normal map-style Dataset from torch, everything works as expected. Is there a reason for this behaviour?
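
As a quick sanity check on the arithmetic above (an illustrative snippet, not part of the original report): with the data sharded across ranks each GPU should see half the examples, whereas without sharding every rank iterates the full dataset.

import math

total_examples = 10_000
batch_size = 32
num_gpus = 2

# Per-epoch steps per GPU when each rank sees a disjoint shard of the data.
steps_sharded = math.ceil(total_examples / batch_size / num_gpus)  # 157
# Per-epoch steps per GPU when every rank iterates the whole dataset
# (what the issue observes, reported as 314).
steps_unsharded = math.ceil(total_examples / batch_size)  # 313
print(steps_sharded, steps_unsharded)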

How to reproduce the bug

import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
import lightning as L
import torch
import time
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from torch.utils.data import DataLoader, Dataset


BATCH_SIZE = 32
NUM_WORKERS = 1

# Load Dataset in Memory

imdb_data = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_text(batch):
    return tokenizer(batch["text"], truncation=True, padding=True)

imdb_dataset = imdb_data
imdb_tokenized = imdb_dataset.map(tokenize_text, batched=True, batch_size=None)
imdb_tokenized.set_format("torch", columns=["input_ids", "attention_mask", "label"])

def custom_iterator():
    counter = 0
    for item in imdb_tokenized['train']:
        
        inputs = {'input_ids': item['input_ids'], 'attention_mask': item['attention_mask']}
        labels = {'labels': item['label']}
        counter += 1
        yield inputs, labels


class MyIterableDataset(torch.utils.data.IterableDataset):
    def __init__(self):
        super().__init__()

    def __iter__(self):
        yield from custom_iterator()

train_dataset = MyIterableDataset()

train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=BATCH_SIZE,
    num_workers=NUM_WORKERS,
    persistent_workers=False
)

# Load Model
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Lightning Module
class LightningModel(L.LightningModule):
    def __init__(self, model, learning_rate=5e-5):
        super().__init__()

        self.learning_rate = learning_rate
        self.model = model

    def forward(self, input_ids, attention_mask, labels):
        return self.model(input_ids, attention_mask=attention_mask, labels=labels)
        
    def training_step(self, batch, batch_idx):
        
        inputs, labels = batch
        outputs = self(inputs["input_ids"], attention_mask=inputs["attention_mask"],
                    labels=labels["labels"])        
        self.log("train_loss", outputs["loss"])
        
        # print(" Tensor sum ", torch.sum(inputs['input_ids']))
        # print("-------------------")
        # print(3*"\n")
        
        self.log("tensor_sum", torch.sum(inputs['input_ids']))
        
        return outputs["loss"]  # this is passed to the optimizer for training

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.learning_rate)
        return optimizer
    

lightning_model = LightningModel(model)

from lightning.pytorch.loggers import CSVLogger, WandbLogger  # loggers from the unified `lightning` package imported above

name = "train_ddp_map-iterable"
logger = CSVLogger(save_dir="logs/", name=name)
wandb_logger = WandbLogger(project="DDP_exps", name=name)

def train_model():
    
    max_epochs = 2
    
    if os.path.exists('checkpoints'):
        import shutil
        shutil.rmtree('checkpoints')
        
    trainer = L.Trainer(
        max_epochs=max_epochs,
        callbacks=None,
        accelerator="gpu",
        devices=[0, 1],
        logger=[logger, wandb_logger],
        strategy='ddp',
        enable_progress_bar=True,  # show the progress bar
        log_every_n_steps=1,
    )

    trainer.fit(model=lightning_model,
                train_dataloaders=train_loader)
    
if __name__ == '__main__':
    
    start_time = time.time()
    train_model()
    end_time = time.time()
    print()
    print("Time taken to train model is {} seconds".format(end_time-start_time))

Error messages and logs


# Error messages and logs here please

Environment


#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 1.10):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

No response

Issue Analytics

  • State: open
  • Created: 10 months ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

1 reaction
awaelchli commented, Nov 21, 2022

If my num_workers=2, it is looping 628 times on each GPU. Is this expected?

No. num_workers has nothing to do with the sampling of the data.

Because num_workers=2 is supposed to make the DataLoader pipeline faster, right?

Read more about workers here: https://pytorch.org/docs/stable/data.html#multi-process-data-loading
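
The linked docs note that, with an IterableDataset, each DataLoader worker process gets its own copy of the dataset, so the iterator should also be split per worker to avoid duplicated batches when num_workers > 1. A minimal illustrative sketch (not from the issue; the class name is hypothetical) using torch.utils.data.get_worker_info():

from torch.utils.data import IterableDataset, get_worker_info

class WorkerShardedIterableDataset(IterableDataset):
    # Hypothetical example: each DataLoader worker yields a disjoint stride of
    # the data instead of a full copy, so num_workers > 1 does not duplicate batches.
    def __init__(self, examples):
        super().__init__()
        self.examples = examples  # any indexable sequence of examples

    def __iter__(self):
        info = get_worker_info()  # None when num_workers=0 (iteration in the main process)
        worker_id = info.id if info is not None else 0
        num_workers = info.num_workers if info is not None else 1
        for idx in range(worker_id, len(self.examples), num_workers):
            yield self.examples[idx]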

Is there any concept of steps_per_epoch in Lightning? Say epochs=10 and steps_per_epoch=1000: I want each epoch to run at most 1000 steps.

Trainer(limit_train_batches=1000, max_epochs=10)

1 reaction
awaelchli commented, Nov 20, 2022

Yes, this is expected. Lightning can’t know how to shard the data/iterator you provide. You need to make sure your iterator returns half of the data on GPU 0 and the other half on GPU 1. You can do this, for example, by changing your for loop to something like this (typos expected):

for item in imdb_tokenized['train'][rank::num_gpus]:
    ...

This shards your data. The rank can be accessed, for example, through trainer.global_rank. If you do this, you need to make sure the iterator returns the same amount of data on each rank (e.g., drop the remainder).
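
A minimal sketch of that idea (illustrative only; the class name is hypothetical), assuming Lightning’s DDP strategy has initialized torch.distributed by the time the dataloader is iterated:

import torch.distributed as dist
from torch.utils.data import IterableDataset

class RankShardedIterableDataset(IterableDataset):
    # Hypothetical example: each DDP rank yields a disjoint, equally sized slice of the data.
    def __init__(self, examples):
        super().__init__()
        self.examples = examples  # any indexable, sized collection of examples

    def __iter__(self):
        if dist.is_available() and dist.is_initialized():
            rank, world_size = dist.get_rank(), dist.get_world_size()
        else:
            rank, world_size = 0, 1
        # Drop the remainder so every rank yields the same number of examples.
        usable = (len(self.examples) // world_size) * world_size
        for idx in range(rank, usable, world_size):
            yield self.examples[idx]

With 10000 examples, a batch size of 32, and 2 GPUs, each rank would then run roughly the expected 157 steps per epoch.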

Another way would be to use a DistributedSampler inside your iterable dataset.
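
A sketch of that alternative (again illustrative and not from the issue; the class name is hypothetical): build a torch.utils.data.distributed.DistributedSampler inside __iter__ and yield the examples it selects.

import torch.distributed as dist
from torch.utils.data import IterableDataset
from torch.utils.data.distributed import DistributedSampler

class SamplerBackedIterableDataset(IterableDataset):
    # Hypothetical example: let DistributedSampler compute the per-rank index split.
    def __init__(self, examples):
        super().__init__()
        self.examples = examples  # any indexable, sized collection of examples

    def __iter__(self):
        if dist.is_available() and dist.is_initialized():
            rank, world_size = dist.get_rank(), dist.get_world_size()
        else:
            rank, world_size = 0, 1
        # drop_last=True keeps the number of yielded examples identical on every rank.
        sampler = DistributedSampler(
            self.examples, num_replicas=world_size, rank=rank,
            shuffle=False, drop_last=True,
        )
        for idx in sampler:
            yield self.examples[idx]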
