Memory usage and training time increase significantly after epoch 0
See original GitHub issue
🐛 Bug
When I run training, epoch 0 is normal: memory usage stays steady at around 20 GB and the epoch takes about 1.5 hours. But during epoch 1 the memory usage keeps climbing from 20 GB to roughly 60 GB as training goes on, and the epoch time also increases from about 1.75 h to roughly 2.75 h. Eventually memory usage exceeded the machine's limit and the program was killed. The training_step code in epoch 0 and epoch 1 is exactly the same. What is the possible reason for this problem?
Epoch 0 / Epoch 1: memory-usage screenshots (see original issue)
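For reference, this kind of per-epoch host-memory growth can be made visible with a small callback that logs resident memory (a rough sketch using psutil; not necessarily how the screenshots above were produced):

import psutil
import pytorch_lightning as pl


class HostMemoryMonitor(pl.Callback):
    # Logs resident host memory (RSS) at the end of every training epoch.
    def on_train_epoch_end(self, trainer, pl_module):
        rss_gb = psutil.Process().memory_info().rss / 1024 ** 3
        print(f"epoch {trainer.current_epoch}: host RSS = {rss_gb:.1f} GB")

Adding an instance of it to the callbacks list of the Trainer below shows the growth epoch by epoch.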
To reproduce
trainer = pl.Trainer(
    gpus=1,
    logger=experiment_loggers,
    max_epochs=hparams.TRAINING.MAX_EPOCHS,
    callbacks=[ckpt_callback],
    log_every_n_steps=50,
    terminate_on_nan=True,
    default_root_dir=log_dir,
    progress_bar_refresh_rate=1,
    check_val_every_n_epoch=1,
    reload_dataloaders_every_n_epochs=1,
    num_sanity_val_steps=0,
    fast_dev_run=fast_dev_run,
    **amp_params,
)
Expected behavior
Epoch 1's memory usage and training time should remain roughly the same as epoch 0's.
Environment
* CUDA:
- GPU:
- NVIDIA GeForce RTX 3080 Ti
- available: True
- version: 11.1
* Lightning:
- neural-renderer-pytorch: 1.1.3
- pytorch-lightning: 1.6.0
- pytorch3d: 0.6.1
- torch: 1.8.2
- torchgeometry: 0.1.2
- torchmetrics: 0.7.2
- torchvision: 0.9.2
* Packages:
- absl-py: 1.0.0
- addict: 2.4.0
- aiohttp: 3.8.1
- aiosignal: 1.2.0
- albumentations: 1.2.1
- anyio: 3.5.0
- argon2-cffi: 21.3.0
- argon2-cffi-bindings: 21.2.0
- asgiref: 3.5.0
- asttokens: 2.0.5
- async-timeout: 4.0.2
- asyncer: 0.0.1
- attrs: 21.4.0
- autobahn: 22.5.1
- automat: 20.2.0
- babel: 2.10.1
- backcall: 0.2.0
- beautifulsoup4: 4.10.0
- bleach: 4.1.0
- build: 0.8.0
- cachetools: 5.0.0
- certifi: 2021.10.8
- cffi: 1.15.0
- charset-normalizer: 2.0.12
- chumpy: 0.70
- click: 8.0.3
- colorama: 0.4.4
- conda-pack: 0.6.0
- constantly: 15.1.0
- cryptography: 37.0.2
- cycler: 0.11.0
- cython: 0.29.20
- dataclasses: 0.8
- decorator: 5.1.1
- defusedxml: 0.7.1
- deprecated: 1.2.13
- deprecation: 2.1.0
- entrypoints: 0.4
- executing: 0.8.3
- fastapi: 0.72.0
- fastjsonschema: 2.15.3
- filelock: 3.6.0
- filetype: 1.0.9
- filterpy: 1.4.5
- flatbuffers: 2.0
- flatten-dict: 0.4.2
- fonttools: 4.29.1
- freetype-py: 2.3.0
- frozenlist: 1.3.0
- fsspec: 2022.2.0
- future: 0.18.2
- fvcore: 0.1.5.post20220305
- gdown: 4.4.0
- google-auth: 2.6.0
- google-auth-oauthlib: 0.4.6
- grpcio: 1.44.0
- h11: 0.13.0
- human-det: 0.0.2
- hyperlink: 21.0.0
- idna: 3.3
- imageio: 2.16.1
- importlib-metadata: 4.11.2
- importlib-resources: 5.4.0
- incremental: 21.3.0
- iopath: 0.1.9
- ipdb: 0.13.9
- ipykernel: 5.3.4
- ipython: 8.1.1
- ipython-genutils: 0.2.0
- ipywidgets: 7.6.5
- jedi: 0.18.1
- jinja2: 3.0.3
- joblib: 1.1.0
- jpeg4py: 0.1.4
- json5: 0.9.6
- jsonschema: 4.4.0
- jupyter-client: 7.1.2
- jupyter-core: 4.9.2
- jupyter-packaging: 0.12.0
- jupyter-server: 1.16.0
- jupyterlab: 3.3.4
- jupyterlab-pygments: 0.1.2
- jupyterlab-server: 2.13.0
- jupyterlab-widgets: 1.0.2
- kaolin: 0.10.0
- kiwisolver: 1.3.2
- kornia: 0.6.3
- llvmlite: 0.38.0
- loguru: 0.5.3
- markdown: 3.3.6
- markupsafe: 2.1.0
- matplotlib: 3.5.0
- matplotlib-inline: 0.1.3
- mistune: 0.8.4
- mkl-fft: 1.3.1
- mkl-random: 1.2.2
- mkl-service: 2.4.0
- multi-person-tracker: 0.1
- multidict: 6.0.2
- nbclassic: 0.3.7
- nbclient: 0.5.12
- nbconvert: 6.5.0
- nbformat: 5.3.0
- nest-asyncio: 1.5.4
- networkx: 2.7.1
- neural-renderer-pytorch: 1.1.3
- notebook: 6.4.8
- notebook-shim: 0.1.0
- numba: 0.55.1
- numpy: 1.21.2
- oauthlib: 3.2.0
- olefile: 0.46
- onnxruntime: 1.10.0
- open3d: 0.15.2
- opencv-contrib-python: 4.5.5.62
- opencv-python: 4.5.5.62
- opencv-python-headless: 4.6.0.66
- packaging: 21.3
- pandas: 1.4.2
- pandocfilters: 1.5.0
- parso: 0.8.3
- pep517: 0.13.0
- pexpect: 4.8.0
- pickleshare: 0.7.5
- pillow: 9.0.0
- pip: 22.0.4
- pip-tools: 6.8.0
- portalocker: 2.4.0
- prometheus-client: 0.13.1
- prompt-toolkit: 3.0.28
- protobuf: 3.19.4
- ptyprocess: 0.7.0
- pure-eval: 0.2.2
- pyasn1: 0.4.8
- pyasn1-modules: 0.2.8
- pycparser: 2.21
- pydantic: 1.9.0
- pydeprecate: 0.3.2
- pyembree: 0.1.6
- pyglet: 1.5.23
- pygments: 2.11.2
- pymatting: 1.1.5
- pymcubes: 0.1.2
- pymeshlab: 2022.2.post2
- pyopengl: 3.1.0
- pyopengl-accelerate: 3.1.5
- pyparsing: 3.0.7
- pyquaternion: 0.9.9
- pyrender: 0.1.45
- pyrsistent: 0.18.1
- pysocks: 1.7.1
- python-dateutil: 2.8.2
- python-multipart: 0.0.5
- pytorch-lightning: 1.6.0
- pytorch3d: 0.6.1
- pytube: 12.1.0
- pytz: 2022.1
- pywavelets: 1.2.0
- pyyaml: 6.0
- pyzmq: 22.3.0
- qudida: 0.0.4
- rembg: 2.0.8
- requests: 2.27.1
- requests-oauthlib: 1.3.1
- rsa: 4.8
- rtree: 0.9.7
- scikit-image: 0.19.1
- scikit-learn: 1.0.2
- scipy: 1.5.2
- send2trash: 1.8.0
- setuptools: 60.9.3
- setuptools-scm: 6.4.2
- shapely: 1.7.1
- six: 1.16.0
- smplx: 0.1.26
- sniffio: 1.2.0
- soupsieve: 2.3.1
- stack-data: 0.2.0
- starlette: 0.17.1
- tabulate: 0.8.9
- tensorboard: 2.8.0
- tensorboard-data-server: 0.6.1
- tensorboard-plugin-wit: 1.8.1
- termcolor: 1.1.0
- terminado: 0.13.3
- testpath: 0.6.0
- threadpoolctl: 3.1.0
- tifffile: 2022.2.9
- tinycss2: 1.1.1
- toml: 0.10.2
- tomli: 2.0.1
- tomlkit: 0.10.2
- torch: 1.8.2
- torchgeometry: 0.1.2
- torchmetrics: 0.7.2
- torchvision: 0.9.2
- tornado: 6.1
- tqdm: 4.62.3
- traitlets: 5.1.1
- trimesh: 3.9.35
- twisted: 22.4.0
- txaio: 22.2.1
- typing-extensions: 4.1.1
- urllib3: 1.26.8
- usd-core: 22.3
- uvicorn: 0.17.0
- vedo: 2022.2.3
- voxelize-cuda: 0.0.0
- vtk: 9.0.3
- wcwidth: 0.2.5
- webencodings: 0.5.1
- websocket-client: 1.3.2
- werkzeug: 2.0.3
- wheel: 0.37.1
- widgetsnbextension: 3.5.2
- wrapt: 1.14.1
- wslink: 1.4.3
- yacs: 0.1.8
- yarl: 1.7.2
- yolov3: 0.1
- zipp: 3.7.0
- zope.interface: 5.4.0
* System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.8.12
- version: #140~18.04.1-Ubuntu SMP Fri Aug 5 11:43:34 UTC 2022
@awaelchli
https://github.com/mkocabas/PARE/blob/5278450e08189dbc25487a28d93c13942182ed6a/pare/core/trainer.py#L559
I commented out this line of code, which appends (b, 6890, 3) vertex locations to self.evaluation_results, and the program then ran normally.
https://github.com/mkocabas/PARE/blob/5278450e08189dbc25487a28d93c13942182ed6a/pare/core/trainer.py#L788
Although this line releases self.evaluation_results by assigning a null value to it in validation_epoch_end, it's weird that the memory didn't go down and kept growing. This variable isn't referenced by any other variable either. I'm still trying to figure out why the commented-out line caused the bug.
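To illustrate one common cause of this kind of growth (a rough sketch with illustrative names, not the actual PARE code): appending tensors that still carry their autograd graph keeps every intermediate buffer alive, whereas detaching, moving to CPU, and rebinding the container at epoch end lets Python free the old data:

import torch


class EvaluationBuffer:
    # Rough sketch of the suspected leak pattern (illustrative names only).
    def __init__(self):
        self.evaluation_results = {"vertices": []}

    def append_leaky(self, pred_vertices: torch.Tensor):
        # Leaky: the stored tensor still references its autograd graph (and any
        # GPU storage), so every appended batch keeps a chain of intermediate
        # buffers alive until the graph itself is released.
        self.evaluation_results["vertices"].append(pred_vertices)

    def append_safe(self, pred_vertices: torch.Tensor):
        # Safer: break the graph reference and copy to host memory first.
        self.evaluation_results["vertices"].append(pred_vertices.detach().cpu())

    def reset(self):
        # Rebinding to a fresh, empty container drops the old references so the
        # accumulated tensors can actually be garbage-collected.
        self.evaluation_results = {"vertices": []}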
@awaelchli
Thanks for your detailed advice!
I have tried option 1, but the memory usage still went up. I want to confirm one thing: once the dataset has been consumed completely (i.e. the dataloader has gone through every item in the dataset), will lightning_module.train_dataloader() be called again?
I will try solution 2 next.
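One quick way to check that is to count calls to the hook with a dummy module (a rough sketch; with reload_dataloaders_every_n_epochs=1 it should fire once per epoch, otherwise only once at the start of training):

import logging

import pytorch_lightning as pl
import torch
from torch.utils.data import DataLoader, TensorDataset

log = logging.getLogger(__name__)


class DataloaderReloadCheck(pl.LightningModule):
    # Dummy module that counts how often train_dataloader() is rebuilt.
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)
        self.dataloader_calls = 0

    def train_dataloader(self):
        self.dataloader_calls += 1
        log.info("train_dataloader() call #%d (epoch %d)",
                 self.dataloader_calls, self.current_epoch)
        dataset = TensorDataset(torch.randn(64, 4), torch.randn(64, 1))
        return DataLoader(dataset, batch_size=8)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

Running it with pl.Trainer(max_epochs=3, reload_dataloaders_every_n_epochs=1) should log three calls; without that flag, only one.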