Updating lightning from v1.7.7 to 1.8.0 significantly affects results

Bug description

The training curves and the final performance of models are significantly affected by the update from lightning v1.7.7 to v1.8.0 when training with the ddp strategy. Here are the curves that I obtained with the code below.

[Screenshot 2022-11-14: training curves, two groups of four, v1.7.7 (higher accuracy) vs. v1.8.0 (lower accuracy)]

The four curves with higher accuracy (lower loss) were obtained with 1.7.7; the four curves with lower accuracy (higher loss) were obtained with 1.8.0. All runs used a single node with 8 GPUs.

Each curve uses a different seed.

I do not know whether a similar issue happens with single-GPU training or with other strategies (didn't test; a minimal single-GPU variant is sketched below).

The code is essentially taken from the pytorch-cifar10-94%-baseline tutorial. (Note that with other code/experiments of mine, the training curves end up being even further apart than in this example.)
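
A minimal single-GPU variant of the Trainer setup (a sketch, not part of the original report): it reuses the model and cifar10_dm objects defined in the reproduction script below and keeps the 1.7.x-style gpus argument that the script uses. The logger name is a hypothetical choice.

from pytorch_lightning import Trainer
from pytorch_lightning.loggers import CSVLogger

# Hypothetical single-GPU check (assumption, not from the issue):
# `model` and `cifar10_dm` are the objects built in the reproduction script below.
trainer = Trainer(
    max_epochs=100,
    accelerator="gpu",
    gpus=[0],  # one device, no ddp, to separate the strategy from the version change
    logger=CSVLogger(save_dir="logs/", name="csv_single_gpu"),
    enable_progress_bar=False,
)
trainer.fit(model, cifar10_dm)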

How to reproduce the bug

import os

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
from pl_bolts.datamodules import CIFAR10DataModule
from pl_bolts.transforms.dataset_normalizations import cifar10_normalization
from pytorch_lightning import LightningModule, Trainer, seed_everything
from pytorch_lightning.callbacks import LearningRateMonitor
from pytorch_lightning.loggers import CSVLogger, TensorBoardLogger
from torch.optim.lr_scheduler import OneCycleLR
from torchmetrics.functional import accuracy

seed_everything(7)

PATH_DATASETS = os.environ.get("PATH_DATASETS", ".")
BATCH_SIZE = 256 if torch.cuda.is_available() else 64
NUM_WORKERS = int(os.cpu_count() / 2)

train_transforms = torchvision.transforms.Compose(
    [
        torchvision.transforms.RandomCrop(32, padding=4),
        torchvision.transforms.RandomHorizontalFlip(),
        torchvision.transforms.ToTensor(),
        cifar10_normalization(),
    ]
)

test_transforms = torchvision.transforms.Compose(
    [
        torchvision.transforms.ToTensor(),
        cifar10_normalization(),
    ]
)

cifar10_dm = CIFAR10DataModule(
    data_dir=PATH_DATASETS,
    batch_size=BATCH_SIZE,
    num_workers=NUM_WORKERS,
    train_transforms=train_transforms,
    test_transforms=test_transforms,
    val_transforms=test_transforms,
)


def create_model():
    model = torchvision.models.resnet18(pretrained=False, num_classes=10)
    model.conv1 = nn.Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    model.maxpool = nn.Identity()
    return model


class LitResnet(LightningModule):
    def __init__(self, lr=0.05):
        super().__init__()

        self.save_hyperparameters()
        self.model = create_model()

    def forward(self, x):
        out = self.model(x)
        return F.log_softmax(out, dim=1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        self.log("train_loss", loss)
        return loss

    def evaluate(self, batch, stage=None):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        preds = torch.argmax(logits, dim=1)
        acc = accuracy(preds, y)

        if stage:
            self.log(f"{stage}_loss", loss, prog_bar=True)
            self.log(f"{stage}_acc", acc, prog_bar=True)

    def validation_step(self, batch, batch_idx):
        self.evaluate(batch, "val")

    def test_step(self, batch, batch_idx):
        self.evaluate(batch, "test")

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(
            self.parameters(),
            lr=self.hparams.lr,
            momentum=0.9,
            weight_decay=5e-4,
        )
        steps_per_epoch = 45000 // BATCH_SIZE
        scheduler_dict = {
            "scheduler": OneCycleLR(
                optimizer,
                0.1,
                epochs=self.trainer.max_epochs,
                steps_per_epoch=steps_per_epoch,
            ),
            "interval": "step",
        }
        return {"optimizer": optimizer, "lr_scheduler": scheduler_dict}


if __name__ == "__main__":
    model = LitResnet(lr=0.05)

    trainer = Trainer(
        max_epochs=100,
        accelerator="auto",
        gpus=list(range(torch.cuda.device_count())),
        strategy="ddp",
        logger=[
            TensorBoardLogger(save_dir="logs/", name="tensorboard"),
            CSVLogger(save_dir="logs/", name="csv", version=3),
        ],
        callbacks=[LearningRateMonitor(logging_interval="step")],
        enable_progress_bar=False,
    )

    trainer.fit(model, cifar10_dm)
    trainer.test(model, datamodule=cifar10_dm)

Environment

With lightning 1.7.7:

* CUDA:
    - GPU:
        - Tesla T4
    - available:         True
    - version:           11.7
* Lightning:
    - lightning-bolts:   0.6.0.post1
    - lightning-utilities: 0.4.1
    - pytorch-lightning: 1.7.7
    - torch:             1.13.0
    - torchmetrics:      0.10.2
    - torchvision:       0.14.0
* Packages:
    - absl-py:           1.3.0
    - aiohttp:           3.8.3
    - aiosignal:         1.3.1
    - async-timeout:     4.0.2
    - attrs:             22.1.0
    - cachetools:        5.2.0
    - certifi:           2022.9.24
    - charset-normalizer: 2.1.1
    - frozenlist:        1.3.3
    - fsspec:            2022.11.0
    - google-auth:       2.14.1
    - google-auth-oauthlib: 0.4.6
    - grpcio:            1.50.0
    - idna:              3.4
    - importlib-metadata: 5.0.0
    - lightning-bolts:   0.6.0.post1
    - lightning-utilities: 0.4.1
    - markdown:          3.4.1
    - markupsafe:        2.1.1
    - multidict:         6.0.2
    - numpy:             1.23.4
    - nvidia-cublas-cu11: 11.10.3.66
    - nvidia-cuda-nvrtc-cu11: 11.7.99
    - nvidia-cuda-runtime-cu11: 11.7.99
    - nvidia-cudnn-cu11: 8.5.0.96
    - oauthlib:          3.2.2
    - packaging:         21.3
    - pillow:            9.3.0
    - pip:               22.3.1
    - protobuf:          3.20.3
    - pyasn1:            0.4.8
    - pyasn1-modules:    0.2.8
    - pydeprecate:       0.3.2
    - pyparsing:         3.0.9
    - pytorch-lightning: 1.7.7
    - pyyaml:            6.0
    - requests:          2.28.1
    - requests-oauthlib: 1.3.1
    - rsa:               4.9
    - setuptools:        65.3.0
    - six:               1.16.0
    - tensorboard:       2.11.0
    - tensorboard-data-server: 0.6.1
    - tensorboard-plugin-wit: 1.8.1
    - torch:             1.13.0
    - torchmetrics:      0.10.2
    - torchvision:       0.14.0
    - tqdm:              4.64.1
    - typing-extensions: 4.4.0
    - urllib3:           1.26.12
    - werkzeug:          2.2.2
    - wheel:             0.37.1
    - yarl:              1.8.1
    - zipp:              3.10.0
* System:
    - OS:                Linux
    - architecture:
        - 64bit
        - ELF
    - processor:         x86_64
    - python:            3.8.13
    - version:           #1 SMP Wed Jun 29 23:49:26 UTC 2022

With lightning 1.8.0:

* CUDA:
    - GPU:
        - Tesla T4
    - available:         True
    - version:           11.7
* Lightning:
    - lightning-bolts:   0.6.0.post1
    - lightning-lite:    1.8.0
    - lightning-utilities: 0.3.0
    - pytorch-lightning: 1.8.0
    - torch:             1.13.0
    - torchmetrics:      0.10.2
    - torchvision:       0.14.0
* Packages:
    - absl-py:           1.3.0
    - aiohttp:           3.8.3
    - aiosignal:         1.3.1
    - async-timeout:     4.0.2
    - attrs:             22.1.0
    - cachetools:        5.2.0
    - certifi:           2022.9.24
    - charset-normalizer: 2.1.1
    - fire:              0.4.0
    - frozenlist:        1.3.3
    - fsspec:            2022.11.0
    - google-auth:       2.14.1
    - google-auth-oauthlib: 0.4.6
    - grpcio:            1.50.0
    - idna:              3.4
    - importlib-metadata: 5.0.0
    - lightning-bolts:   0.6.0.post1
    - lightning-lite:    1.8.0
    - lightning-utilities: 0.3.0
    - markdown:          3.4.1
    - markupsafe:        2.1.1
    - multidict:         6.0.2
    - numpy:             1.23.4
    - nvidia-cublas-cu11: 11.10.3.66
    - nvidia-cuda-nvrtc-cu11: 11.7.99
    - nvidia-cuda-runtime-cu11: 11.7.99
    - nvidia-cudnn-cu11: 8.5.0.96
    - oauthlib:          3.2.2
    - packaging:         21.3
    - pillow:            9.3.0
    - pip:               22.3.1
    - protobuf:          3.20.3
    - pyasn1:            0.4.8
    - pyasn1-modules:    0.2.8
    - pydeprecate:       0.3.2
    - pyparsing:         3.0.9
    - pytorch-lightning: 1.8.0
    - pyyaml:            6.0
    - requests:          2.28.1
    - requests-oauthlib: 1.3.1
    - rsa:               4.9
    - setuptools:        65.3.0
    - six:               1.16.0
    - tensorboard:       2.11.0
    - tensorboard-data-server: 0.6.1
    - tensorboard-plugin-wit: 1.8.1
    - termcolor:         2.1.0
    - torch:             1.13.0
    - torchmetrics:      0.10.2
    - torchvision:       0.14.0
    - tqdm:              4.64.1
    - typing-extensions: 4.4.0
    - urllib3:           1.26.12
    - werkzeug:          2.2.2
    - wheel:             0.37.1
    - yarl:              1.8.1
    - zipp:              3.10.0
* System:
    - OS:                Linux
    - architecture:
        - 64bit
        - ELF
    - processor:         x86_64
    - python:            3.8.13
    - version:           #1 SMP Wed Jun 29 23:49:26 UTC 2022

Differences:

diff env_details_177.txt env_details_180.txt
8,9c8,10
< 	- lightning-utilities: 0.4.1
< 	- pytorch-lightning: 1.7.7
---
> 	- lightning-lite:    1.8.0
> 	- lightning-utilities: 0.3.0
> 	- pytorch-lightning: 1.8.0
21a23
> 	- fire:              0.4.0
30c32,33
< 	- lightning-utilities: 0.4.1
---
> 	- lightning-lite:    1.8.0
> 	- lightning-utilities: 0.3.0
48c51
< 	- pytorch-lightning: 1.7.7
---
> 	- pytorch-lightning: 1.8.0
57a61
> 	- termcolor:         2.1.0
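
For reference, a minimal Python equivalent of the diff command above (a sketch, assuming the two env_details_*.txt listings were saved to disk with those exact file names):

import difflib

with open("env_details_177.txt") as f:
    lines_177 = f.readlines()
with open("env_details_180.txt") as f:
    lines_180 = f.readlines()

# unified_diff emits only the changed hunks, similar to `diff -u`
for line in difflib.unified_diff(
    lines_177, lines_180,
    fromfile="env_details_177.txt", tofile="env_details_180.txt",
):
    print(line, end="")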

cc @tchaton @justusschock @awaelchli @akihironitta @borda

Issue Analytics

  • State: open
  • Created: 10 months ago
  • Comments: 9 (3 by maintainers)

Top GitHub Comments

awaelchli commented, Dec 5, 2022 (2 reactions)

I ran the provided code on 1.7.7 and master (commit a86584d6dd4d50388c7dcef4f3854b0e8355b346). I get similar loss curves. After setting deterministic=True and rerunning both versions, I get identical results (8 GPUs being used).

Note that there are some multiprocessing issues with the provided code, since not all of it is guarded by if __name__ == "__main__". This does not affect the results but should be fixed.

@cjsg Could you give me the raw printout of your pip freeze command so that I can install the same environment? Thanks.

For reference, here is the complete modified code I ran to make results deterministic:

import os

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
from pl_bolts.datamodules import CIFAR10DataModule
from pl_bolts.transforms.dataset_normalizations import cifar10_normalization
from pytorch_lightning import LightningModule, Trainer, seed_everything
from pytorch_lightning.callbacks import LearningRateMonitor
from pytorch_lightning.loggers import CSVLogger, TensorBoardLogger
from torch.optim.lr_scheduler import OneCycleLR
from torchmetrics.functional import accuracy


def create_model():
    model = torchvision.models.resnet18(pretrained=False, num_classes=10)
    model.conv1 = nn.Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    model.maxpool = nn.Identity()
    return model


class LitResnet(LightningModule):
    def __init__(self, lr=0.05):
        super().__init__()

        self.save_hyperparameters()
        self.model = create_model()

    def forward(self, x):
        out = self.model(x)
        return F.log_softmax(out, dim=1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        self.log("train_loss", loss)
        return loss

    def evaluate(self, batch, stage=None):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        preds = torch.argmax(logits, dim=1)
        acc = accuracy(preds, y)

        if stage:
            self.log(f"{stage}_loss", loss, prog_bar=True)
            self.log(f"{stage}_acc", acc, prog_bar=True)

    def validation_step(self, batch, batch_idx):
        self.evaluate(batch, "val")

    def test_step(self, batch, batch_idx):
        self.evaluate(batch, "test")

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(
            self.parameters(),
            lr=self.hparams.lr,
            momentum=0.9,
            weight_decay=5e-4,
        )
        steps_per_epoch = 45000 // BATCH_SIZE
        scheduler_dict = {
            "scheduler": OneCycleLR(
                optimizer,
                0.1,
                epochs=self.trainer.max_epochs,
                steps_per_epoch=steps_per_epoch,
            ),
            "interval": "step",
        }
        return {"optimizer": optimizer, "lr_scheduler": scheduler_dict}


if __name__ == "__main__":
    seed_everything(7)

    PATH_DATASETS = os.environ.get("PATH_DATASETS", ".")
    BATCH_SIZE = 256 if torch.cuda.is_available() else 64
    NUM_WORKERS = int(os.cpu_count() / 2)

    train_transforms = torchvision.transforms.Compose(
        [
            torchvision.transforms.RandomCrop(32, padding=4),
            torchvision.transforms.RandomHorizontalFlip(),
            torchvision.transforms.ToTensor(),
            cifar10_normalization(),
        ]
    )

    test_transforms = torchvision.transforms.Compose(
        [
            torchvision.transforms.ToTensor(),
            cifar10_normalization(),
        ]
    )

    cifar10_dm = CIFAR10DataModule(
        data_dir=PATH_DATASETS,
        batch_size=BATCH_SIZE,
        num_workers=NUM_WORKERS,
        train_transforms=train_transforms,
        test_transforms=test_transforms,
        val_transforms=test_transforms,
    )

    model = LitResnet(lr=0.05)

    trainer = Trainer(
        max_epochs=100,
        accelerator="auto",
        gpus=list(range(torch.cuda.device_count())),
        strategy="ddp",
        logger=[
            TensorBoardLogger(save_dir="logs/", name="tensorboard"),
            CSVLogger(save_dir="logs/", name="csv", version=3),
        ],
        callbacks=[LearningRateMonitor(logging_interval="step")],
        enable_progress_bar=False,
        deterministic=True,
    )

    trainer.fit(model, cifar10_dm)
    trainer.test(model, datamodule=cifar10_dm)

cjsg commented, Dec 6, 2022 (0 reactions)

@awaelchli Just finished testing with the lightning git repo installs and your commands from above (master@05dbf48ad). The new curves essentially overlap with the ones I got using pip lightning 1.7.7 / 1.8.0 installs. See plots below.

[Screenshot 2022-12-06: overlaid training curves for pip and git installs of 1.7.7 and >= 1.8.0]

There are two clear “beams” of curves: one for lightning 1.7.7 and one for lightning >= 1.8.0. Each beam contains curves generated with pip installs (1.7.7 and 1.8.0 respectively) and with git checkout installs (tags/1.7.7 and master respectively), both with and without the deterministic=True option.
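
A minimal sketch of how such comparison plots can be regenerated from the CSV logs (the logs/csv/version_*/metrics.csv layout matches what the script's CSVLogger writes; the glob pattern and the use of pandas/matplotlib are assumptions, not part of the issue):

import glob

import matplotlib.pyplot as plt
import pandas as pd

for path in sorted(glob.glob("logs/csv/version_*/metrics.csv")):
    df = pd.read_csv(path)
    # val_acc is only filled on validation rows; training-step rows are NaN
    val = df.dropna(subset=["val_acc"])
    plt.plot(val["epoch"], val["val_acc"], label=path)

plt.xlabel("epoch")
plt.ylabel("val_acc")
plt.legend()
plt.show()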
