v1.6 is slower than v1.5


🐛 Bug

Running the same code with the same parameters, pytorch-lightning 1.6.0 is 5 times slower than 1.5.0.

To Reproduce

1.6.0

> pip install pytorch-lightning==1.6.0
...
> python run.py
{'optimizers': {'lr': 0.001, 'weight_decay': 0}, 'train': {'milestones': [20, 40, 60, 80, 100, 120, 160, 180, 200, 220], 'gamma': 0.5, 'batch_size': 16}, 'general': {'save_dir': 'logs'}, 'trainer': {'gpus': [0], 'accelerator': 'gpu', 'max_epochs': 1, 'val_check_interval': 1, 'limit_train_batches': 0.01, 'limit_val_batches': 0.01, 'profiler': 'simple'}, 'data': {'cls_type': 'cat'}, 'exp_name': 'test'}
Global seed set to 1234
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
`Trainer(val_check_interval=1)` was configured so validation will run after every batch.
Missing logger folder: logs/lightning_logs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name      | Type               | Params
-------------------------------------------------
0 | model     | PvnetModelResnet18 | 13.0 M
1 | vote_crit | SmoothL1Loss       | 0     
2 | seg_crit  | CrossEntropyLoss   | 0     
-------------------------------------------------
13.0 M    Trainable params
0         Non-trainable params
13.0 M    Total params
51.831    Total estimated model params size (MB)
loading annotations into memory...
Done (t=1.30s)
creating index...
index created!
Epoch 0: 100%|██████████████████████████████████████████████████████| 12/12 [00:25<00:00,  2.11s/it, loss=0.861, v_num=0, train_ver_loss=0.205, train_seg_loss=0.656]
FIT Profiler Report

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|  Action                                                                                                                                                                                                            |  Mean duration (s)    |  Num calls            |  Total time (s)       |  Percentage %         |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|  Total                                                                                                                                                                                                             |  -                    |  574                  |  29.691               |  100 %                |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|  run_training_epoch                                                                                                                                                                                                |  25.136               |  1                    |  25.136               |  84.66                |
|  run_training_batch                                                                                                                                                                                                |  1.7876               |  12                   |  21.451               |  72.247               |
|  [LightningModule]LitPvnet.optimizer_step                                                                                                                                                                          |  1.787                |  12                   |  21.444               |  72.225               |
|  [Strategy]SingleDeviceStrategy.backward                                                                                                                                                                           |  1.1394               |  12                   |  13.673               |  46.05                |
|  [Strategy]SingleDeviceStrategy.training_step                                                                                                                                                                      |  0.6367               |  12                   |  7.6404               |  25.733               |
|  [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_train_epoch_end                         |  0.23877              |  1                    |  0.23877              |  0.80419              |
|  on_train_batch_end                                                                                                                                                                                                |  0.0010752            |  12                   |  0.012903             |  0.043458             |

1.5.0

> pip install pytorch-lightning==1.5.0
...
> python run.py
{'optimizers': {'lr': 0.001, 'weight_decay': 0}, 'train': {'milestones': [20, 40, 60, 80, 100, 120, 160, 180, 200, 220], 'gamma': 0.5, 'batch_size': 16}, 'general': {'save_dir': 'logs'}, 'trainer': {'gpus': [0], 'accelerator': 'gpu', 'max_epochs': 1, 'val_check_interval': 1, 'limit_train_batches': 0.01, 'limit_val_batches': 0.01, 'profiler': 'simple'}, 'data': {'cls_type': 'cat'}, 'exp_name': 'test'}
Global seed set to 1234
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name      | Type               | Params
-------------------------------------------------
0 | model     | PvnetModelResnet18 | 13.0 M
1 | vote_crit | SmoothL1Loss       | 0     
2 | seg_crit  | CrossEntropyLoss   | 0     
-------------------------------------------------
13.0 M    Trainable params
0         Non-trainable params
13.0 M    Total params
51.831    Total estimated model params size (MB)
loading annotations into memory...
Done (t=1.37s)
creating index...
index created!
Epoch 0: 100%|██████████████████████████████████████████████████████| 12/12 [00:09<00:00,  1.23it/s, loss=0.861, v_num=0, train_ver_loss=0.205, train_seg_loss=0.656]
FIT Profiler Report

Action                                  |  Mean duration (s)    |Num calls              |  Total time (s)       |  Percentage %         |
--------------------------------------------------------------------------------------------------------------------------------------
Total                                   |  -                    |_                      |  14.094               |  100 %                |
--------------------------------------------------------------------------------------------------------------------------------------
run_training_epoch                      |  9.7956               |1                      |  9.7956               |  69.5                 |
run_training_batch                      |  0.44689              |12                     |  5.3626               |  38.048               |
get_train_batch                         |  0.23755              |13                     |  3.0882               |  21.911               |
fetch_next_train_batch                  |  0.23753              |13                     |  3.0879               |  21.909               |
optimizer_step_with_closure_0           |  0.2286               |12                     |  2.7432               |  19.463               |
training_step_and_backward              |  0.2224               |12                     |  2.6688               |  18.935               |
model_forward                           |  0.18607              |12                     |  2.2329               |  15.842               |
training_step                           |  0.18589              |12                     |  2.2306               |  15.827               |
backward                                |  0.035633             |12                     |  0.4276               |  3.0338               |
on_train_epoch_end                      |  0.4268               |1                      |  0.4268               |  3.0282               |
on_train_batch_end                      |  0.0063143            |12                     |  0.075772             |  0.53761              |
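To see where the extra time goes, the totals from the two profiler reports above can be compared directly. The snippet below is only a rough sketch: the numbers are copied from the reports, and it assumes that the 1.6.0 actions `[Strategy]SingleDeviceStrategy.backward` and `[Strategy]SingleDeviceStrategy.training_step` correspond to the 1.5.0 actions `backward` and `training_step`, which the reports themselves do not confirm. The `slowdown` helper is a hypothetical name used for illustration.

```python
# Total time (s) per action, copied from the two FIT Profiler Reports above.
# Assumption: the [Strategy]SingleDeviceStrategy.* rows in 1.6.0 measure the
# same work as the plain `training_step` / `backward` rows in 1.5.0.
v16 = {
    "run_training_epoch": 25.136,
    "run_training_batch": 21.451,
    "training_step": 7.6404,
    "backward": 13.673,
}
v15 = {
    "run_training_epoch": 9.7956,
    "run_training_batch": 5.3626,
    "training_step": 2.2306,
    "backward": 0.4276,
}


def slowdown(new, old):
    """Return new/old total-time ratios for the actions present in both reports."""
    return {name: new[name] / old[name] for name in new if name in old}


ratios = slowdown(v16, v15)

# Print the actions from the largest slowdown to the smallest.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<20} {ratio:6.1f}x")
```

Under that mapping, the whole epoch is roughly 2.6 times slower and `run_training_batch` roughly 4 times slower, while `backward` shows by far the largest ratio, so the backward pass is the first place to look.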

Environment

  • PyTorch Version: 1.8.1+cu111
  • Python version: 3.8.13
  • OS : Ubuntu
  • CUDA/cuDNN version: 11.4
  • How you installed PyTorch (conda, pip, source): conda

cc @borda @akihironitta

Issue Analytics

  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 10 (4 by maintainers)

Top GitHub Comments

7 reactions
carmocca commented, May 25, 2022

Hi @piraka9011! Sorry for the frustration caused.

We did not foresee the problems that could be caused by setting this flag by default. This was implemented in https://github.com/PyTorchLightning/pytorch-lightning/pull/11944 and mentioned in the “Changed” section of the release.

I am leaning towards defaulting to what torch defaults to, which is False, meaning reverting this change.

Does anybody from @PyTorchLightning/core-lightning object to this? This “fix” can be done by reviving this PR: https://github.com/PyTorchLightning/pytorch-lightning/pull/12020

2 reactions
akihironitta commented, May 24, 2022

Hi @piraka9011, thank you for the detailed information provided in the linked issue.

One possible cause is that we enabled Trainer(benchmark=True) by default from v1.6. It can lead to slower training if the input size varies frequently, and it may also lead to more memory consumption depending on the algorithms.

refs:

Could you try disabling it (Trainer(benchmark=False)) and check the performance again?


@flyinghu123 @piraka9011 Do you see the regression when running without GPUs, too?
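The `Trainer(benchmark=False)` check suggested in this comment could be sketched as follows. The trainer settings are copied from the config printed by `run.py` in the issue; `with_benchmark` and `build_trainer` are hypothetical helper names, not part of PyTorch Lightning, and `build_trainer` assumes that pytorch-lightning is installed and a GPU is available.

```python
# Trainer settings copied from the config printed by run.py in the issue.
trainer_cfg = {
    "gpus": [0],
    "accelerator": "gpu",
    "max_epochs": 1,
    "val_check_interval": 1,
    "limit_train_batches": 0.01,
    "limit_val_batches": 0.01,
    "profiler": "simple",
}


def with_benchmark(cfg, enabled):
    """Return a copy of cfg with the `benchmark` flag set explicitly."""
    kwargs = dict(cfg)
    kwargs["benchmark"] = enabled
    return kwargs


def build_trainer(cfg, benchmark=False):
    """Hypothetical helper: build a Trainer with the benchmark flag pinned."""
    # Imported lazily so the rest of the sketch runs without the library.
    import pytorch_lightning as pl

    return pl.Trainer(**with_benchmark(cfg, benchmark))


# Keyword arguments for the run with cudnn benchmarking turned off.
kwargs = with_benchmark(trainer_cfg, False)
print(kwargs)
```

On a GPU machine, running once with `build_trainer(trainer_cfg, benchmark=False)` and once with `benchmark=True`, then comparing the two profiler reports, would show whether the flag accounts for the regression.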
