Updating lightning from v1.7.7 to 1.8.0 significantly affects results

Bug description

The training curves and the final performance of models are significantly affected by the update from lightning v1.7.7 to v1.8.0 when training with the ddp strategy. Here are the curves that I obtained with the code below.

[Screenshot 2022-11-14: training curves, two groups of four, v1.7.7 (higher accuracy) vs. v1.8.0 (lower accuracy)]

The four curves with higher accuracy (lower loss) were obtained with 1.7.7; the four curves with lower accuracy (higher loss) were obtained with 1.8.0. All runs used a single node with 8 GPUs.

Each curve uses a different seed.

I do not know whether a similar issue happens with single-GPU training or with other strategies (didn't test; a minimal single-GPU variant is sketched below).

The code is essentially taken from the pytorch-cifar10-94%-baseline tutorial. (Note that with other code/experiments of mine, the training curves end up being even further apart than in this example.)
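
A minimal single-GPU variant of the Trainer setup (a sketch, not part of the original report): it reuses the model and cifar10_dm objects defined in the reproduction script below and keeps the 1.7.x-style gpus argument that the script uses. The logger name is a hypothetical choice.

from pytorch_lightning import Trainer
from pytorch_lightning.loggers import CSVLogger

# Hypothetical single-GPU check (assumption, not from the issue):
# `model` and `cifar10_dm` are the objects built in the reproduction script below.
trainer = Trainer(
    max_epochs=100,
    accelerator="gpu",
    gpus=[0],  # one device, no ddp, to separate the strategy from the version change
    logger=CSVLogger(save_dir="logs/", name="csv_single_gpu"),
    enable_progress_bar=False,
)
trainer.fit(model, cifar10_dm)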

How to reproduce the bug

import os

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
from pl_bolts.datamodules import CIFAR10DataModule
from pl_bolts.transforms.dataset_normalizations import cifar10_normalization
from pytorch_lightning import LightningModule, Trainer, seed_everything
from pytorch_lightning.callbacks import LearningRateMonitor
from pytorch_lightning.loggers import CSVLogger, TensorBoardLogger
from torch.optim.lr_scheduler import OneCycleLR
from torchmetrics.functional import accuracy

seed_everything(7)

PATH_DATASETS = os.environ.get("PATH_DATASETS", ".")
BATCH_SIZE = 256 if torch.cuda.is_available() else 64
NUM_WORKERS = int(os.cpu_count() / 2)

train_transforms = torchvision.transforms.Compose(
    [
        torchvision.transforms.RandomCrop(32, padding=4),
        torchvision.transforms.RandomHorizontalFlip(),
        torchvision.transforms.ToTensor(),
        cifar10_normalization(),
    ]
)

test_transforms = torchvision.transforms.Compose(
    [
        torchvision.transforms.ToTensor(),
        cifar10_normalization(),
    ]
)

cifar10_dm = CIFAR10DataModule(
    data_dir=PATH_DATASETS,
    batch_size=BATCH_SIZE,
    num_workers=NUM_WORKERS,
    train_transforms=train_transforms,
    test_transforms=test_transforms,
    val_transforms=test_transforms,
)


def create_model():
    model = torchvision.models.resnet18(pretrained=False, num_classes=10)
    model.conv1 = nn.Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    model.maxpool = nn.Identity()
    return model


class LitResnet(LightningModule):
    def __init__(self, lr=0.05):
        super().__init__()

        self.save_hyperparameters()
        self.model = create_model()

    def forward(self, x):
        out = self.model(x)
        return F.log_softmax(out, dim=1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        self.log("train_loss", loss)
        return loss

    def evaluate(self, batch, stage=None):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        preds = torch.argmax(logits, dim=1)
        acc = accuracy(preds, y)

        if stage:
            self.log(f"{stage}_loss", loss, prog_bar=True)
            self.log(f"{stage}_acc", acc, prog_bar=True)

    def validation_step(self, batch, batch_idx):
        self.evaluate(batch, "val")

    def test_step(self, batch, batch_idx):
        self.evaluate(batch, "test")

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(
            self.parameters(),
            lr=self.hparams.lr,
            momentum=0.9,
            weight_decay=5e-4,
        )
        steps_per_epoch = 45000 // BATCH_SIZE
        scheduler_dict = {
            "scheduler": OneCycleLR(
                optimizer,
                0.1,
                epochs=self.trainer.max_epochs,
                steps_per_epoch=steps_per_epoch,
            ),
            "interval": "step",
        }
        return {"optimizer": optimizer, "lr_scheduler": scheduler_dict}


if __name__ == "__main__":
    model = LitResnet(lr=0.05)

    trainer = Trainer(
        max_epochs=100,
        accelerator="auto",
        gpus=list(range(torch.cuda.device_count())),
        strategy="ddp",
        logger=[
            TensorBoardLogger(save_dir="logs/", name="tensorboard"),
            CSVLogger(save_dir="logs/", name="csv", version=3),
        ],
        callbacks=[LearningRateMonitor(logging_interval="step")],
        enable_progress_bar=False,
    )

    trainer.fit(model, cifar10_dm)
    trainer.test(model, datamodule=cifar10_dm)

Environment

With lightning 1.7.7:

* CUDA:
    - GPU:
        - Tesla T4
    - available:         True
    - version:           11.7
* Lightning:
    - lightning-bolts:   0.6.0.post1
    - lightning-utilities: 0.4.1
    - pytorch-lightning: 1.7.7
    - torch:             1.13.0
    - torchmetrics:      0.10.2
    - torchvision:       0.14.0
* Packages:
    - absl-py:           1.3.0
    - aiohttp:           3.8.3
    - aiosignal:         1.3.1
    - async-timeout:     4.0.2
    - attrs:             22.1.0
    - cachetools:        5.2.0
    - certifi:           2022.9.24
    - charset-normalizer: 2.1.1
    - frozenlist:        1.3.3
    - fsspec:            2022.11.0
    - google-auth:       2.14.1
    - google-auth-oauthlib: 0.4.6
    - grpcio:            1.50.0
    - idna:              3.4
    - importlib-metadata: 5.0.0
    - lightning-bolts:   0.6.0.post1
    - lightning-utilities: 0.4.1
    - markdown:          3.4.1
    - markupsafe:        2.1.1
    - multidict:         6.0.2
    - numpy:             1.23.4
    - nvidia-cublas-cu11: 11.10.3.66
    - nvidia-cuda-nvrtc-cu11: 11.7.99
    - nvidia-cuda-runtime-cu11: 11.7.99
    - nvidia-cudnn-cu11: 8.5.0.96
    - oauthlib:          3.2.2
    - packaging:         21.3
    - pillow:            9.3.0
    - pip:               22.3.1
    - protobuf:          3.20.3
    - pyasn1:            0.4.8
    - pyasn1-modules:    0.2.8
    - pydeprecate:       0.3.2
    - pyparsing:         3.0.9
    - pytorch-lightning: 1.7.7
    - pyyaml:            6.0
    - requests:          2.28.1
    - requests-oauthlib: 1.3.1
    - rsa:               4.9
    - setuptools:        65.3.0
    - six:               1.16.0
    - tensorboard:       2.11.0
    - tensorboard-data-server: 0.6.1
    - tensorboard-plugin-wit: 1.8.1
    - torch:             1.13.0
    - torchmetrics:      0.10.2
    - torchvision:       0.14.0
    - tqdm:              4.64.1
    - typing-extensions: 4.4.0
    - urllib3:           1.26.12
    - werkzeug:          2.2.2
    - wheel:             0.37.1
    - yarl:              1.8.1
    - zipp:              3.10.0
* System:
    - OS:                Linux
    - architecture:
        - 64bit
        - ELF
    - processor:         x86_64
    - python:            3.8.13
    - version:           #1 SMP Wed Jun 29 23:49:26 UTC 2022

With lightning 1.8.0:

* CUDA:
    - GPU:
        - Tesla T4
    - available:         True
    - version:           11.7
* Lightning:
    - lightning-bolts:   0.6.0.post1
    - lightning-lite:    1.8.0
    - lightning-utilities: 0.3.0
    - pytorch-lightning: 1.8.0
    - torch:             1.13.0
    - torchmetrics:      0.10.2
    - torchvision:       0.14.0
* Packages:
    - absl-py:           1.3.0
    - aiohttp:           3.8.3
    - aiosignal:         1.3.1
    - async-timeout:     4.0.2
    - attrs:             22.1.0
    - cachetools:        5.2.0
    - certifi:           2022.9.24
    - charset-normalizer: 2.1.1
    - fire:              0.4.0
    - frozenlist:        1.3.3
    - fsspec:            2022.11.0
    - google-auth:       2.14.1
    - google-auth-oauthlib: 0.4.6
    - grpcio:            1.50.0
    - idna:              3.4
    - importlib-metadata: 5.0.0
    - lightning-bolts:   0.6.0.post1
    - lightning-lite:    1.8.0
    - lightning-utilities: 0.3.0
    - markdown:          3.4.1
    - markupsafe:        2.1.1
    - multidict:         6.0.2
    - numpy:             1.23.4
    - nvidia-cublas-cu11: 11.10.3.66
    - nvidia-cuda-nvrtc-cu11: 11.7.99
    - nvidia-cuda-runtime-cu11: 11.7.99
    - nvidia-cudnn-cu11: 8.5.0.96
    - oauthlib:          3.2.2
    - packaging:         21.3
    - pillow:            9.3.0
    - pip:               22.3.1
    - protobuf:          3.20.3
    - pyasn1:            0.4.8
    - pyasn1-modules:    0.2.8
    - pydeprecate:       0.3.2
    - pyparsing:         3.0.9
    - pytorch-lightning: 1.8.0
    - pyyaml:            6.0
    - requests:          2.28.1
    - requests-oauthlib: 1.3.1
    - rsa:               4.9
    - setuptools:        65.3.0
    - six:               1.16.0
    - tensorboard:       2.11.0
    - tensorboard-data-server: 0.6.1
    - tensorboard-plugin-wit: 1.8.1
    - termcolor:         2.1.0
    - torch:             1.13.0
    - torchmetrics:      0.10.2
    - torchvision:       0.14.0
    - tqdm:              4.64.1
    - typing-extensions: 4.4.0
    - urllib3:           1.26.12
    - werkzeug:          2.2.2
    - wheel:             0.37.1
    - yarl:              1.8.1
    - zipp:              3.10.0
* System:
    - OS:                Linux
    - architecture:
        - 64bit
        - ELF
    - processor:         x86_64
    - python:            3.8.13
    - version:           #1 SMP Wed Jun 29 23:49:26 UTC 2022

Differences:

diff env_details_177.txt env_details_180.txt
8,9c8,10
< 	- lightning-utilities: 0.4.1
< 	- pytorch-lightning: 1.7.7
---
> 	- lightning-lite:    1.8.0
> 	- lightning-utilities: 0.3.0
> 	- pytorch-lightning: 1.8.0
21a23
> 	- fire:              0.4.0
30c32,33
< 	- lightning-utilities: 0.4.1
---
> 	- lightning-lite:    1.8.0
> 	- lightning-utilities: 0.3.0
48c51
< 	- pytorch-lightning: 1.7.7
---
> 	- pytorch-lightning: 1.8.0
57a61
> 	- termcolor:         2.1.0
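
For reference, a minimal Python equivalent of the diff command above (a sketch, assuming the two env_details_*.txt listings were saved to disk with those exact file names):

import difflib

with open("env_details_177.txt") as f:
    lines_177 = f.readlines()
with open("env_details_180.txt") as f:
    lines_180 = f.readlines()

# unified_diff emits only the changed hunks, similar to `diff -u`
for line in difflib.unified_diff(
    lines_177, lines_180,
    fromfile="env_details_177.txt", tofile="env_details_180.txt",
):
    print(line, end="")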

cc @tchaton @justusschock @awaelchli @akihironitta @borda

Issue Analytics

  • State: open
  • Created: 10 months ago
  • Comments: 9 (3 by maintainers)

Top GitHub Comments

awaelchli commented, Dec 5, 2022 (2 reactions)

I ran the provided code on 1.7.7 and master (commit a86584d6dd4d50388c7dcef4f3854b0e8355b346). I get similar loss curves. After setting deterministic=True and rerunning both versions, I get identical results (8 GPUs being used).

Note that there are some multiprocessing issues with the provided code, since not all of it is guarded by if __name__ == "__main__". This does not affect the results but should be fixed.

@cjsg Could you give me the raw printout of your pip freeze command so that I can install the same environment? Thanks.

For reference, here is the complete modified code I ran to make results deterministic:

import os

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
from pl_bolts.datamodules import CIFAR10DataModule
from pl_bolts.transforms.dataset_normalizations import cifar10_normalization
from pytorch_lightning import LightningModule, Trainer, seed_everything
from pytorch_lightning.callbacks import LearningRateMonitor
from pytorch_lightning.loggers import CSVLogger, TensorBoardLogger
from torch.optim.lr_scheduler import OneCycleLR
from torchmetrics.functional import accuracy


def create_model():
    model = torchvision.models.resnet18(pretrained=False, num_classes=10)
    model.conv1 = nn.Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    model.maxpool = nn.Identity()
    return model


class LitResnet(LightningModule):
    def __init__(self, lr=0.05):
        super().__init__()

        self.save_hyperparameters()
        self.model = create_model()

    def forward(self, x):
        out = self.model(x)
        return F.log_softmax(out, dim=1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        self.log("train_loss", loss)
        return loss

    def evaluate(self, batch, stage=None):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        preds = torch.argmax(logits, dim=1)
        acc = accuracy(preds, y)

        if stage:
            self.log(f"{stage}_loss", loss, prog_bar=True)
            self.log(f"{stage}_acc", acc, prog_bar=True)

    def validation_step(self, batch, batch_idx):
        self.evaluate(batch, "val")

    def test_step(self, batch, batch_idx):
        self.evaluate(batch, "test")

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(
            self.parameters(),
            lr=self.hparams.lr,
            momentum=0.9,
            weight_decay=5e-4,
        )
        steps_per_epoch = 45000 // BATCH_SIZE
        scheduler_dict = {
            "scheduler": OneCycleLR(
                optimizer,
                0.1,
                epochs=self.trainer.max_epochs,
                steps_per_epoch=steps_per_epoch,
            ),
            "interval": "step",
        }
        return {"optimizer": optimizer, "lr_scheduler": scheduler_dict}


if __name__ == "__main__":
    seed_everything(7)

    PATH_DATASETS = os.environ.get("PATH_DATASETS", ".")
    BATCH_SIZE = 256 if torch.cuda.is_available() else 64
    NUM_WORKERS = int(os.cpu_count() / 2)

    train_transforms = torchvision.transforms.Compose(
        [
            torchvision.transforms.RandomCrop(32, padding=4),
            torchvision.transforms.RandomHorizontalFlip(),
            torchvision.transforms.ToTensor(),
            cifar10_normalization(),
        ]
    )

    test_transforms = torchvision.transforms.Compose(
        [
            torchvision.transforms.ToTensor(),
            cifar10_normalization(),
        ]
    )

    cifar10_dm = CIFAR10DataModule(
        data_dir=PATH_DATASETS,
        batch_size=BATCH_SIZE,
        num_workers=NUM_WORKERS,
        train_transforms=train_transforms,
        test_transforms=test_transforms,
        val_transforms=test_transforms,
    )

    model = LitResnet(lr=0.05)

    trainer = Trainer(
        max_epochs=100,
        accelerator="auto",
        gpus=list(range(torch.cuda.device_count())),
        strategy="ddp",
        logger=[
            TensorBoardLogger(save_dir="logs/", name="tensorboard"),
            CSVLogger(save_dir="logs/", name="csv", version=3),
        ],
        callbacks=[LearningRateMonitor(logging_interval="step")],
        enable_progress_bar=False,
        deterministic=True,
    )

    trainer.fit(model, cifar10_dm)
    trainer.test(model, datamodule=cifar10_dm)

cjsg commented, Dec 6, 2022 (0 reactions)

@awaelchli Just finished testing with the lightning git repo installs and your commands from above (master@05dbf48ad). The new curves essentially overlap with the ones I got using pip lightning 1.7.7 / 1.8.0 installs. See plots below.

[Screenshot 2022-12-06: overlaid training curves for pip and git installs of 1.7.7 and >= 1.8.0]

There are two clear “beams” of curves: one for lightning 1.7.7 and one for lightning >= 1.8.0. Each beam contains curves generated with pip installs (1.7.7 and 1.8.0 respectively) and with git checkout installs (tags/1.7.7 and master respectively), both with and without the deterministic=True option.
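
A minimal sketch of how such comparison plots can be regenerated from the CSV logs (the logs/csv/version_*/metrics.csv layout matches what the script's CSVLogger writes; the glob pattern and the use of pandas/matplotlib are assumptions, not part of the issue):

import glob

import matplotlib.pyplot as plt
import pandas as pd

for path in sorted(glob.glob("logs/csv/version_*/metrics.csv")):
    df = pd.read_csv(path)
    # val_acc is only filled on validation rows; training-step rows are NaN
    val = df.dropna(subset=["val_acc"])
    plt.plot(val["epoch"], val["val_acc"], label=path)

plt.xlabel("epoch")
plt.ylabel("val_acc")
plt.legend()
plt.show()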
