
Memory usage and training time increase a lot after epoch 0


šŸ› Bug

When I run training, epoch 0 is normal: memory usage holds steady at around 20 G and the epoch takes about 1.5 hours. During epoch 1, however, memory usage climbs steadily from 20 G to roughly 60 G as training goes on, and the epoch time also grows from about 1.75 h to maybe 2.75 h. Eventually memory usage exceeds the available RAM and the process is killed. The training_step code in epoch 0 and epoch 1 is exactly the same. What is the possible reason for this problem?

Epoch 0

[screenshot: memory usage ~20 G, steady epoch time]

Epoch 1

[screenshot: memory usage growing toward ~60 G, increasing epoch time]
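One stdlib way to narrow down what keeps growing between epochs is to diff tracemalloc snapshots taken at epoch boundaries. The sketch below is illustrative only; run_epoch and store are stand-ins, not code from this issue:

```python
import tracemalloc

def run_epoch(store):
    # Stand-in for one training epoch; the leak suspect is anything
    # appended to across epochs and never cleared.
    for _ in range(1000):
        store.append([0.0] * 100)

tracemalloc.start()
store = []

run_epoch(store)
snap1 = tracemalloc.take_snapshot()
run_epoch(store)
snap2 = tracemalloc.take_snapshot()

# Allocation sites whose size grew between the two snapshots point
# straight at the code that is accumulating memory.
for stat in snap2.compare_to(snap1, "lineno")[:3]:
    print(stat)
```

Wrapping real epochs this way (e.g. in `on_train_epoch_end`) would show which line of the training code is responsible for the growth.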

To reproduce

trainer = pl.Trainer(
        gpus=1,
        logger=experiment_loggers,
        max_epochs=hparams.TRAINING.MAX_EPOCHS,  # expects an int, not a list
        callbacks=[ckpt_callback],
        log_every_n_steps=50,
        terminate_on_nan=True,
        default_root_dir=log_dir,
        progress_bar_refresh_rate=1,
        check_val_every_n_epoch=1,
        reload_dataloaders_every_n_epochs=1,
        num_sanity_val_steps=0,
        fast_dev_run=fast_dev_run,
        **amp_params,
    )

Expected behavior

Epoch 1’s memory usage and training time should remain roughly the same as epoch 0’s.

Environment

* CUDA:
	- GPU:
		- NVIDIA GeForce RTX 3080 Ti
	- available:         True
	- version:           11.1
* Lightning:
	- neural-renderer-pytorch: 1.1.3
	- pytorch-lightning: 1.6.0
	- pytorch3d:         0.6.1
	- torch:             1.8.2
	- torchgeometry:     0.1.2
	- torchmetrics:      0.7.2
	- torchvision:       0.9.2
* Packages:
	- absl-py:           1.0.0
	- addict:            2.4.0
	- aiohttp:           3.8.1
	- aiosignal:         1.2.0
	- albumentations:    1.2.1
	- anyio:             3.5.0
	- argon2-cffi:       21.3.0
	- argon2-cffi-bindings: 21.2.0
	- asgiref:           3.5.0
	- asttokens:         2.0.5
	- async-timeout:     4.0.2
	- asyncer:           0.0.1
	- attrs:             21.4.0
	- autobahn:          22.5.1
	- automat:           20.2.0
	- babel:             2.10.1
	- backcall:          0.2.0
	- beautifulsoup4:    4.10.0
	- bleach:            4.1.0
	- build:             0.8.0
	- cachetools:        5.0.0
	- certifi:           2021.10.8
	- cffi:              1.15.0
	- charset-normalizer: 2.0.12
	- chumpy:            0.70
	- click:             8.0.3
	- colorama:          0.4.4
	- conda-pack:        0.6.0
	- constantly:        15.1.0
	- cryptography:      37.0.2
	- cycler:            0.11.0
	- cython:            0.29.20
	- dataclasses:       0.8
	- decorator:         5.1.1
	- defusedxml:        0.7.1
	- deprecated:        1.2.13
	- deprecation:       2.1.0
	- entrypoints:       0.4
	- executing:         0.8.3
	- fastapi:           0.72.0
	- fastjsonschema:    2.15.3
	- filelock:          3.6.0
	- filetype:          1.0.9
	- filterpy:          1.4.5
	- flatbuffers:       2.0
	- flatten-dict:      0.4.2
	- fonttools:         4.29.1
	- freetype-py:       2.3.0
	- frozenlist:        1.3.0
	- fsspec:            2022.2.0
	- future:            0.18.2
	- fvcore:            0.1.5.post20220305
	- gdown:             4.4.0
	- google-auth:       2.6.0
	- google-auth-oauthlib: 0.4.6
	- grpcio:            1.44.0
	- h11:               0.13.0
	- human-det:         0.0.2
	- hyperlink:         21.0.0
	- idna:              3.3
	- imageio:           2.16.1
	- importlib-metadata: 4.11.2
	- importlib-resources: 5.4.0
	- incremental:       21.3.0
	- iopath:            0.1.9
	- ipdb:              0.13.9
	- ipykernel:         5.3.4
	- ipython:           8.1.1
	- ipython-genutils:  0.2.0
	- ipywidgets:        7.6.5
	- jedi:              0.18.1
	- jinja2:            3.0.3
	- joblib:            1.1.0
	- jpeg4py:           0.1.4
	- json5:             0.9.6
	- jsonschema:        4.4.0
	- jupyter-client:    7.1.2
	- jupyter-core:      4.9.2
	- jupyter-packaging: 0.12.0
	- jupyter-server:    1.16.0
	- jupyterlab:        3.3.4
	- jupyterlab-pygments: 0.1.2
	- jupyterlab-server: 2.13.0
	- jupyterlab-widgets: 1.0.2
	- kaolin:            0.10.0
	- kiwisolver:        1.3.2
	- kornia:            0.6.3
	- llvmlite:          0.38.0
	- loguru:            0.5.3
	- markdown:          3.3.6
	- markupsafe:        2.1.0
	- matplotlib:        3.5.0
	- matplotlib-inline: 0.1.3
	- mistune:           0.8.4
	- mkl-fft:           1.3.1
	- mkl-random:        1.2.2
	- mkl-service:       2.4.0
	- multi-person-tracker: 0.1
	- multidict:         6.0.2
	- nbclassic:         0.3.7
	- nbclient:          0.5.12
	- nbconvert:         6.5.0
	- nbformat:          5.3.0
	- nest-asyncio:      1.5.4
	- networkx:          2.7.1
	- neural-renderer-pytorch: 1.1.3
	- notebook:          6.4.8
	- notebook-shim:     0.1.0
	- numba:             0.55.1
	- numpy:             1.21.2
	- oauthlib:          3.2.0
	- olefile:           0.46
	- onnxruntime:       1.10.0
	- open3d:            0.15.2
	- opencv-contrib-python: 4.5.5.62
	- opencv-python:     4.5.5.62
	- opencv-python-headless: 4.6.0.66
	- packaging:         21.3
	- pandas:            1.4.2
	- pandocfilters:     1.5.0
	- parso:             0.8.3
	- pep517:            0.13.0
	- pexpect:           4.8.0
	- pickleshare:       0.7.5
	- pillow:            9.0.0
	- pip:               22.0.4
	- pip-tools:         6.8.0
	- portalocker:       2.4.0
	- prometheus-client: 0.13.1
	- prompt-toolkit:    3.0.28
	- protobuf:          3.19.4
	- ptyprocess:        0.7.0
	- pure-eval:         0.2.2
	- pyasn1:            0.4.8
	- pyasn1-modules:    0.2.8
	- pycparser:         2.21
	- pydantic:          1.9.0
	- pydeprecate:       0.3.2
	- pyembree:          0.1.6
	- pyglet:            1.5.23
	- pygments:          2.11.2
	- pymatting:         1.1.5
	- pymcubes:          0.1.2
	- pymeshlab:         2022.2.post2
	- pyopengl:          3.1.0
	- pyopengl-accelerate: 3.1.5
	- pyparsing:         3.0.7
	- pyquaternion:      0.9.9
	- pyrender:          0.1.45
	- pyrsistent:        0.18.1
	- pysocks:           1.7.1
	- python-dateutil:   2.8.2
	- python-multipart:  0.0.5
	- pytorch-lightning: 1.6.0
	- pytorch3d:         0.6.1
	- pytube:            12.1.0
	- pytz:              2022.1
	- pywavelets:        1.2.0
	- pyyaml:            6.0
	- pyzmq:             22.3.0
	- qudida:            0.0.4
	- rembg:             2.0.8
	- requests:          2.27.1
	- requests-oauthlib: 1.3.1
	- rsa:               4.8
	- rtree:             0.9.7
	- scikit-image:      0.19.1
	- scikit-learn:      1.0.2
	- scipy:             1.5.2
	- send2trash:        1.8.0
	- setuptools:        60.9.3
	- setuptools-scm:    6.4.2
	- shapely:           1.7.1
	- six:               1.16.0
	- smplx:             0.1.26
	- sniffio:           1.2.0
	- soupsieve:         2.3.1
	- stack-data:        0.2.0
	- starlette:         0.17.1
	- tabulate:          0.8.9
	- tensorboard:       2.8.0
	- tensorboard-data-server: 0.6.1
	- tensorboard-plugin-wit: 1.8.1
	- termcolor:         1.1.0
	- terminado:         0.13.3
	- testpath:          0.6.0
	- threadpoolctl:     3.1.0
	- tifffile:          2022.2.9
	- tinycss2:          1.1.1
	- toml:              0.10.2
	- tomli:             2.0.1
	- tomlkit:           0.10.2
	- torch:             1.8.2
	- torchgeometry:     0.1.2
	- torchmetrics:      0.7.2
	- torchvision:       0.9.2
	- tornado:           6.1
	- tqdm:              4.62.3
	- traitlets:         5.1.1
	- trimesh:           3.9.35
	- twisted:           22.4.0
	- txaio:             22.2.1
	- typing-extensions: 4.1.1
	- urllib3:           1.26.8
	- usd-core:          22.3
	- uvicorn:           0.17.0
	- vedo:              2022.2.3
	- voxelize-cuda:     0.0.0
	- vtk:               9.0.3
	- wcwidth:           0.2.5
	- webencodings:      0.5.1
	- websocket-client:  1.3.2
	- werkzeug:          2.0.3
	- wheel:             0.37.1
	- widgetsnbextension: 3.5.2
	- wrapt:             1.14.1
	- wslink:            1.4.3
	- yacs:              0.1.8
	- yarl:              1.7.2
	- yolov3:            0.1
	- zipp:              3.7.0
	- zope.interface:    5.4.0
* System:
	- OS:                Linux
	- architecture:
		- 64bit
		- ELF
	- processor:         x86_64
	- python:            3.8.12
	- version:           #140~18.04.1-Ubuntu SMP Fri Aug 5 11:43:34 UTC 2022

cc @borda @akihironitta

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 16 (7 by maintainers)

Top GitHub Comments

1 reaction
MooreManor commented, Aug 23, 2022

@awaelchli

https://github.com/mkocabas/PARE/blob/5278450e08189dbc25487a28d93c13942182ed6a/pare/core/trainer.py#L559

I commented out this line of code, which appends (b, 6890, 3) vertex locations to self.evaluation_results, and the program then ran fine.

https://github.com/mkocabas/PARE/blob/5278450e08189dbc25487a28d93c13942182ed6a/pare/core/trainer.py#L788

Although this line frees self.evaluation_results by assigning it an empty value at validation_epoch_end, it is strange that memory did not drop and kept growing. The variable is not referenced anywhere else either. I am still trying to work out why the line I commented out caused the bug.
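A common cause of this pattern is storing tensors that are still attached to the autograd graph (or still resident on the GPU): every stored tensor then keeps its whole computation graph alive, so clearing the list later does not undo the growth already suffered mid-epoch. The toy sketch below models the effect in plain Python with no PyTorch dependency; the FakeTensor class and detach_cpu method are stand-ins, not real torch API:

```python
import sys

class FakeTensor:
    """Toy stand-in for a tensor still attached to its autograd graph:
    it keeps a reference to a large backing buffer (the 'graph')."""
    def __init__(self, values, graph):
        self.values = values  # small payload you actually want to keep
        self.graph = graph    # large per-step activations kept alive

    def detach_cpu(self):
        # Analogue of tensor.detach().cpu(): drop the graph reference,
        # keep only the plain values.
        return list(self.values)

leaky, safe = [], []
for step in range(100):
    activations = bytearray(10_000)             # per-step "graph" memory
    out = FakeTensor([1.0, 2.0, 3.0], activations)
    leaky.append(out)                           # retains every buffer
    safe.append(out.detach_cpu())               # retains only the values

leaky_bytes = sum(sys.getsizeof(t.graph) for t in leaky)
safe_bytes = sum(sys.getsizeof(v) for v in safe)
print(leaky_bytes, safe_bytes)
```

In the real trainer, the analogous fix would be appending detached, CPU-side copies (e.g. `out.detach().cpu()`, or plain numbers via `.item()`) to self.evaluation_results rather than live tensors, so only the data is kept, not the graph.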

1 reaction
MooreManor commented, Aug 19, 2022

@awaelchli

Thanks for your detailed advice!

I have tried option 1, but memory usage still went up. I would like to confirm one thing: once the dataset has been consumed completely (i.e. the dataloader has gone through every item in the dataset), will lightning_module.train_dataloader() be called again?

I will try option 2 next.
