Updating lightning from v1.7.7 to 1.8.0 significantly affects results
Bug description
The training curves and the final performance of models are significantly affected by the update from lightning v1.7.7 to v1.8.0 when training with the ddp strategy. Here are the curves that I obtained with the code below.

The 4 curves with higher accuracy (lower loss) are obtained with 1.7.7, using a single node with 8 gpus for training. The 4 curves with lower accuracy (higher loss) are obtained with 1.8.0, using a single node with 8 gpus for training.
Each curve uses a different seed.
I do not know whether a similar issue happens with single-gpu training or with other strategies (didn’t test).
The code is essentially taken from the pytorch-cifar10-94%-baseline tutorial. (Note that with other code/experiments of mine, the training curves end up being even further apart than in this example.)
How to reproduce the bug
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
from pl_bolts.datamodules import CIFAR10DataModule
from pl_bolts.transforms.dataset_normalizations import cifar10_normalization
from pytorch_lightning import LightningModule, Trainer, seed_everything
from pytorch_lightning.callbacks import LearningRateMonitor
from pytorch_lightning.loggers import CSVLogger, TensorBoardLogger
from torch.optim.lr_scheduler import OneCycleLR
from torchmetrics.functional import accuracy
seed_everything(7)
PATH_DATASETS = os.environ.get("PATH_DATASETS", ".")
BATCH_SIZE = 256 if torch.cuda.is_available() else 64
NUM_WORKERS = int(os.cpu_count() / 2)

train_transforms = torchvision.transforms.Compose(
    [
        torchvision.transforms.RandomCrop(32, padding=4),
        torchvision.transforms.RandomHorizontalFlip(),
        torchvision.transforms.ToTensor(),
        cifar10_normalization(),
    ]
)

test_transforms = torchvision.transforms.Compose(
    [
        torchvision.transforms.ToTensor(),
        cifar10_normalization(),
    ]
)

cifar10_dm = CIFAR10DataModule(
    data_dir=PATH_DATASETS,
    batch_size=BATCH_SIZE,
    num_workers=NUM_WORKERS,
    train_transforms=train_transforms,
    test_transforms=test_transforms,
    val_transforms=test_transforms,
)
def create_model():
    model = torchvision.models.resnet18(pretrained=False, num_classes=10)
    model.conv1 = nn.Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    model.maxpool = nn.Identity()
    return model


class LitResnet(LightningModule):
    def __init__(self, lr=0.05):
        super().__init__()
        self.save_hyperparameters()
        self.model = create_model()

    def forward(self, x):
        out = self.model(x)
        return F.log_softmax(out, dim=1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        self.log("train_loss", loss)
        return loss

    def evaluate(self, batch, stage=None):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        preds = torch.argmax(logits, dim=1)
        acc = accuracy(preds, y)
        if stage:
            self.log(f"{stage}_loss", loss, prog_bar=True)
            self.log(f"{stage}_acc", acc, prog_bar=True)

    def validation_step(self, batch, batch_idx):
        self.evaluate(batch, "val")

    def test_step(self, batch, batch_idx):
        self.evaluate(batch, "test")

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(
            self.parameters(),
            lr=self.hparams.lr,
            momentum=0.9,
            weight_decay=5e-4,
        )
        steps_per_epoch = 45000 // BATCH_SIZE
        scheduler_dict = {
            "scheduler": OneCycleLR(
                optimizer,
                0.1,
                epochs=self.trainer.max_epochs,
                steps_per_epoch=steps_per_epoch,
            ),
            "interval": "step",
        }
        return {"optimizer": optimizer, "lr_scheduler": scheduler_dict}


if __name__ == "__main__":
    model = LitResnet(lr=0.05)
    trainer = Trainer(
        max_epochs=100,
        accelerator="auto",
        gpus=list(range(torch.cuda.device_count())),
        strategy="ddp",
        logger=[
            TensorBoardLogger(save_dir="logs/", name="tensorboard"),
            CSVLogger(save_dir="logs/", name="csv", version=3),
        ],
        callbacks=[LearningRateMonitor(logging_interval="step")],
        enable_progress_bar=False,
    )
    trainer.fit(model, cifar10_dm)
    trainer.test(model, datamodule=cifar10_dm)
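Since the script above logs through CSVLogger, the curves from two installs can also be compared numerically rather than only by eye. The following is a minimal sketch, assuming the usual metrics.csv layout (one column per logged metric plus epoch and step, with empty cells on rows where a metric was not logged). read_curve is a hypothetical helper name and is not part of Lightning.

```python
import csv
from typing import List, TextIO, Tuple


def read_curve(handle: TextIO, metric: str) -> List[Tuple[int, float]]:
    """Return (step, value) pairs for every row on which `metric` was logged."""
    curve = []
    for row in csv.DictReader(handle):
        value = row.get(metric)
        if not value:  # empty cell or missing column: metric not logged on this row
            continue
        # int(float(...)) tolerates a step written as "174" or "174.0"
        curve.append((int(float(row["step"])), float(value)))
    return curve


# Usage sketch. The path is an assumption derived from the logger settings in
# the report (save_dir="logs/", name="csv", version=3) and is not verified here.
# with open("logs/csv/version_3/metrics.csv", newline="") as f:
#     val_acc = read_curve(f, "val_acc")
```

Running this once per install and per seed gives lists of (step, value) pairs that can be compared directly.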
Environment
With lightning 1.7.7:
* CUDA:
- GPU:
  - Tesla T4
- available: True
- version: 11.7
* Lightning:
- lightning-bolts: 0.6.0.post1
- lightning-utilities: 0.4.1
- pytorch-lightning: 1.7.7
- torch: 1.13.0
- torchmetrics: 0.10.2
- torchvision: 0.14.0
* Packages:
- absl-py: 1.3.0
- aiohttp: 3.8.3
- aiosignal: 1.3.1
- async-timeout: 4.0.2
- attrs: 22.1.0
- cachetools: 5.2.0
- certifi: 2022.9.24
- charset-normalizer: 2.1.1
- frozenlist: 1.3.3
- fsspec: 2022.11.0
- google-auth: 2.14.1
- google-auth-oauthlib: 0.4.6
- grpcio: 1.50.0
- idna: 3.4
- importlib-metadata: 5.0.0
- lightning-bolts: 0.6.0.post1
- lightning-utilities: 0.4.1
- markdown: 3.4.1
- markupsafe: 2.1.1
- multidict: 6.0.2
- numpy: 1.23.4
- nvidia-cublas-cu11: 11.10.3.66
- nvidia-cuda-nvrtc-cu11: 11.7.99
- nvidia-cuda-runtime-cu11: 11.7.99
- nvidia-cudnn-cu11: 8.5.0.96
- oauthlib: 3.2.2
- packaging: 21.3
- pillow: 9.3.0
- pip: 22.3.1
- protobuf: 3.20.3
- pyasn1: 0.4.8
- pyasn1-modules: 0.2.8
- pydeprecate: 0.3.2
- pyparsing: 3.0.9
- pytorch-lightning: 1.7.7
- pyyaml: 6.0
- requests: 2.28.1
- requests-oauthlib: 1.3.1
- rsa: 4.9
- setuptools: 65.3.0
- six: 1.16.0
- tensorboard: 2.11.0
- tensorboard-data-server: 0.6.1
- tensorboard-plugin-wit: 1.8.1
- torch: 1.13.0
- torchmetrics: 0.10.2
- torchvision: 0.14.0
- tqdm: 4.64.1
- typing-extensions: 4.4.0
- urllib3: 1.26.12
- werkzeug: 2.2.2
- wheel: 0.37.1
- yarl: 1.8.1
- zipp: 3.10.0
* System:
- OS: Linux
- architecture:
  - 64bit
  - ELF
- processor: x86_64
- python: 3.8.13
- version: #1 SMP Wed Jun 29 23:49:26 UTC 2022
With lightning 1.8.0:
* CUDA:
- GPU:
  - Tesla T4
- available: True
- version: 11.7
* Lightning:
- lightning-bolts: 0.6.0.post1
- lightning-lite: 1.8.0
- lightning-utilities: 0.3.0
- pytorch-lightning: 1.8.0
- torch: 1.13.0
- torchmetrics: 0.10.2
- torchvision: 0.14.0
* Packages:
- absl-py: 1.3.0
- aiohttp: 3.8.3
- aiosignal: 1.3.1
- async-timeout: 4.0.2
- attrs: 22.1.0
- cachetools: 5.2.0
- certifi: 2022.9.24
- charset-normalizer: 2.1.1
- fire: 0.4.0
- frozenlist: 1.3.3
- fsspec: 2022.11.0
- google-auth: 2.14.1
- google-auth-oauthlib: 0.4.6
- grpcio: 1.50.0
- idna: 3.4
- importlib-metadata: 5.0.0
- lightning-bolts: 0.6.0.post1
- lightning-lite: 1.8.0
- lightning-utilities: 0.3.0
- markdown: 3.4.1
- markupsafe: 2.1.1
- multidict: 6.0.2
- numpy: 1.23.4
- nvidia-cublas-cu11: 11.10.3.66
- nvidia-cuda-nvrtc-cu11: 11.7.99
- nvidia-cuda-runtime-cu11: 11.7.99
- nvidia-cudnn-cu11: 8.5.0.96
- oauthlib: 3.2.2
- packaging: 21.3
- pillow: 9.3.0
- pip: 22.3.1
- protobuf: 3.20.3
- pyasn1: 0.4.8
- pyasn1-modules: 0.2.8
- pydeprecate: 0.3.2
- pyparsing: 3.0.9
- pytorch-lightning: 1.8.0
- pyyaml: 6.0
- requests: 2.28.1
- requests-oauthlib: 1.3.1
- rsa: 4.9
- setuptools: 65.3.0
- six: 1.16.0
- tensorboard: 2.11.0
- tensorboard-data-server: 0.6.1
- tensorboard-plugin-wit: 1.8.1
- termcolor: 2.1.0
- torch: 1.13.0
- torchmetrics: 0.10.2
- torchvision: 0.14.0
- tqdm: 4.64.1
- typing-extensions: 4.4.0
- urllib3: 1.26.12
- werkzeug: 2.2.2
- wheel: 0.37.1
- yarl: 1.8.1
- zipp: 3.10.0
* System:
- OS: Linux
- architecture:
  - 64bit
  - ELF
- processor: x86_64
- python: 3.8.13
- version: #1 SMP Wed Jun 29 23:49:26 UTC 2022
Differences:
diff env_details_177.txt env_details_180.txt
8,9c8,10
< - lightning-utilities: 0.4.1
< - pytorch-lightning: 1.7.7
---
> - lightning-lite: 1.8.0
> - lightning-utilities: 0.3.0
> - pytorch-lightning: 1.8.0
21a23
> - fire: 0.4.0
30c32,33
< - lightning-utilities: 0.4.1
---
> - lightning-lite: 1.8.0
> - lightning-utilities: 0.3.0
48c51
< - pytorch-lightning: 1.7.7
---
> - pytorch-lightning: 1.8.0
57a61
> - termcolor: 2.1.0
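The environment diff above can also be produced as a list of named changes instead of raw diff hunks. This is a rough sketch, assuming both listings keep the "- name: version" line shape shown in the report. parse_packages and compare_envs are hypothetical helper names. The parser ignores section headers, so non-package keys such as OS or python are captured too, and a key that appears in several sections (for example version) keeps only its last value.

```python
from typing import Dict, List, Tuple


def parse_packages(text: str) -> Dict[str, str]:
    """Map name -> version for lines shaped like '- name: version'."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("- ") or ": " not in line:
            continue  # section headers ('* CUDA:') and valueless entries ('- GPU:')
        name, version = line[2:].split(": ", 1)
        packages[name.strip()] = version.strip()
    return packages


def compare_envs(old: Dict[str, str], new: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (name, old_version, new_version) for each differing entry.

    An empty string marks a package that is absent on that side.
    """
    changes = []
    for name in sorted(set(old) | set(new)):
        before, after = old.get(name, ""), new.get(name, "")
        if before != after:
            changes.append((name, before, after))
    return changes


# Usage sketch with the two file names from the diff command in the report:
# with open("env_details_177.txt") as f_old, open("env_details_180.txt") as f_new:
#     for name, before, after in compare_envs(parse_packages(f_old.read()),
#                                             parse_packages(f_new.read())):
#         print(f"{name}: {before or 'absent'} -> {after or 'absent'}")
```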
I ran the provided code on 1.7.7 and master (commit a86584d6dd4d50388c7dcef4f3854b0e8355b346). I get similar loss curves. After setting deterministic=True and rerunning both versions, I get identical results (8 GPUs being used).
Note that there are some multiprocessing issues with the provided code, since not all the code is guarded by if __name__ == "__main__". This does not affect results but should be fixed.
@cjsg Could you give me the raw printout of your pip freeze command so that I can install the same environment? Thanks.
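A claim of identical results between two runs can be checked step by step on the logged values. The snippet below is only an illustrative sketch and is not taken from the script used in this thread. first_divergence is a hypothetical helper, and the curves are assumed to be plain lists of floats logged at the same steps.

```python
from typing import Optional, Sequence


def first_divergence(a: Sequence[float], b: Sequence[float], tol: float = 0.0) -> Optional[int]:
    """Index of the first position where the curves differ by more than tol.

    Returns None when the curves agree everywhere. With tol=0.0 the check is
    for exact equality, which is what a deterministic rerun should give.
    """
    for i, (x, y) in enumerate(zip(a, b)):
        if abs(x - y) > tol:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))  # one run logged more values than the other
    return None
```

A return value of None with tol=0.0 means the two runs match exactly. A small index means they split early in training, and a large one suggests slow drift.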
For reference, here is the complete modified code I ran to make results deterministic:
@awaelchli Just finished testing with the lightning git repo installs and your commands from above (master@05dbf48ad). The new curves essentially overlap with the ones I got using pip lightning 1.7.7 / 1.8.0 installs. See plots below.
There are 2 clear “beams” of curves, one for lightning 1.7.7 and one for lightning >= 1.8.0. In both cases, the beams contain curves generated with pip installs (1.7.7 and 1.8.0 respectively) and with git checkout installs (tags/1.7.7 and master respectively), with and without the deterministic=True option.
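The "two beams" observation can be stated as a simple check: collect the final value of a metric for every run in a group, take the range per group, and test whether the ranges overlap. This is a sketch under assumptions. beam_ranges and beams_separated are hypothetical names, and the group labels and values would come from the logs of the runs described above.

```python
from typing import Dict, Sequence, Tuple


def beam_ranges(finals: Dict[str, Sequence[float]]) -> Dict[str, Tuple[float, float]]:
    """Map each group label to the (min, max) of its per-run final metric."""
    return {label: (min(values), max(values)) for label, values in finals.items()}


def beams_separated(range_a: Tuple[float, float], range_b: Tuple[float, float]) -> bool:
    """True when the two (min, max) ranges do not overlap."""
    return range_a[1] < range_b[0] or range_b[1] < range_a[0]
```

If the ranges for the 1.7.7 group and the >= 1.8.0 group do not overlap across seeds and install methods, seed noise alone is an unlikely explanation for the gap.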