Memory usage and training time increase significantly after epoch 0
See original GitHub issue
🐛 Bug
When I run training, epoch 0 is normal: memory usage stays steady at around 20 GB and the epoch takes about 1.5 hours. But during epoch 1 the memory usage keeps climbing from 20 GB to roughly 60 GB as training goes on, and the epoch time also increases from about 1.75 h to roughly 2.75 h. Eventually memory usage exceeded the machine's limit and the program was killed. The training_step code in epoch 0 and epoch 1 is exactly the same. What is the possible reason for this problem?
Epoch 0 / Epoch 1: memory-usage screenshots (see original issue)
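For reference, this kind of per-epoch host-memory growth can be made visible with a small callback that logs resident memory (a rough sketch using psutil; not necessarily how the screenshots above were produced):

import psutil
import pytorch_lightning as pl


class HostMemoryMonitor(pl.Callback):
    # Logs resident host memory (RSS) at the end of every training epoch.
    def on_train_epoch_end(self, trainer, pl_module):
        rss_gb = psutil.Process().memory_info().rss / 1024 ** 3
        print(f"epoch {trainer.current_epoch}: host RSS = {rss_gb:.1f} GB")

Adding an instance of it to the callbacks list of the Trainer below shows the growth epoch by epoch.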
To reproduce
trainer = pl.Trainer(
    gpus=1,
    logger=experiment_loggers,
    max_epochs=hparams.TRAINING.MAX_EPOCHS,
    callbacks=[ckpt_callback],
    log_every_n_steps=50,
    terminate_on_nan=True,
    default_root_dir=log_dir,
    progress_bar_refresh_rate=1,
    check_val_every_n_epoch=1,
    reload_dataloaders_every_n_epochs=1,
    num_sanity_val_steps=0,
    fast_dev_run=fast_dev_run,
    **amp_params,
)
Expected behavior
Epoch 1's memory usage and training time should remain roughly the same as epoch 0's.
Environment
* CUDA:
- GPU:
- NVIDIA GeForce RTX 3080 Ti
- available: True
- version: 11.1
* Lightning:
- neural-renderer-pytorch: 1.1.3
- pytorch-lightning: 1.6.0
- pytorch3d: 0.6.1
- torch: 1.8.2
- torchgeometry: 0.1.2
- torchmetrics: 0.7.2
- torchvision: 0.9.2
* Packages:
- absl-py: 1.0.0
- addict: 2.4.0
- aiohttp: 3.8.1
- aiosignal: 1.2.0
- albumentations: 1.2.1
- anyio: 3.5.0
- argon2-cffi: 21.3.0
- argon2-cffi-bindings: 21.2.0
- asgiref: 3.5.0
- asttokens: 2.0.5
- async-timeout: 4.0.2
- asyncer: 0.0.1
- attrs: 21.4.0
- autobahn: 22.5.1
- automat: 20.2.0
- babel: 2.10.1
- backcall: 0.2.0
- beautifulsoup4: 4.10.0
- bleach: 4.1.0
- build: 0.8.0
- cachetools: 5.0.0
- certifi: 2021.10.8
- cffi: 1.15.0
- charset-normalizer: 2.0.12
- chumpy: 0.70
- click: 8.0.3
- colorama: 0.4.4
- conda-pack: 0.6.0
- constantly: 15.1.0
- cryptography: 37.0.2
- cycler: 0.11.0
- cython: 0.29.20
- dataclasses: 0.8
- decorator: 5.1.1
- defusedxml: 0.7.1
- deprecated: 1.2.13
- deprecation: 2.1.0
- entrypoints: 0.4
- executing: 0.8.3
- fastapi: 0.72.0
- fastjsonschema: 2.15.3
- filelock: 3.6.0
- filetype: 1.0.9
- filterpy: 1.4.5
- flatbuffers: 2.0
- flatten-dict: 0.4.2
- fonttools: 4.29.1
- freetype-py: 2.3.0
- frozenlist: 1.3.0
- fsspec: 2022.2.0
- future: 0.18.2
- fvcore: 0.1.5.post20220305
- gdown: 4.4.0
- google-auth: 2.6.0
- google-auth-oauthlib: 0.4.6
- grpcio: 1.44.0
- h11: 0.13.0
- human-det: 0.0.2
- hyperlink: 21.0.0
- idna: 3.3
- imageio: 2.16.1
- importlib-metadata: 4.11.2
- importlib-resources: 5.4.0
- incremental: 21.3.0
- iopath: 0.1.9
- ipdb: 0.13.9
- ipykernel: 5.3.4
- ipython: 8.1.1
- ipython-genutils: 0.2.0
- ipywidgets: 7.6.5
- jedi: 0.18.1
- jinja2: 3.0.3
- joblib: 1.1.0
- jpeg4py: 0.1.4
- json5: 0.9.6
- jsonschema: 4.4.0
- jupyter-client: 7.1.2
- jupyter-core: 4.9.2
- jupyter-packaging: 0.12.0
- jupyter-server: 1.16.0
- jupyterlab: 3.3.4
- jupyterlab-pygments: 0.1.2
- jupyterlab-server: 2.13.0
- jupyterlab-widgets: 1.0.2
- kaolin: 0.10.0
- kiwisolver: 1.3.2
- kornia: 0.6.3
- llvmlite: 0.38.0
- loguru: 0.5.3
- markdown: 3.3.6
- markupsafe: 2.1.0
- matplotlib: 3.5.0
- matplotlib-inline: 0.1.3
- mistune: 0.8.4
- mkl-fft: 1.3.1
- mkl-random: 1.2.2
- mkl-service: 2.4.0
- multi-person-tracker: 0.1
- multidict: 6.0.2
- nbclassic: 0.3.7
- nbclient: 0.5.12
- nbconvert: 6.5.0
- nbformat: 5.3.0
- nest-asyncio: 1.5.4
- networkx: 2.7.1
- neural-renderer-pytorch: 1.1.3
- notebook: 6.4.8
- notebook-shim: 0.1.0
- numba: 0.55.1
- numpy: 1.21.2
- oauthlib: 3.2.0
- olefile: 0.46
- onnxruntime: 1.10.0
- open3d: 0.15.2
- opencv-contrib-python: 4.5.5.62
- opencv-python: 4.5.5.62
- opencv-python-headless: 4.6.0.66
- packaging: 21.3
- pandas: 1.4.2
- pandocfilters: 1.5.0
- parso: 0.8.3
- pep517: 0.13.0
- pexpect: 4.8.0
- pickleshare: 0.7.5
- pillow: 9.0.0
- pip: 22.0.4
- pip-tools: 6.8.0
- portalocker: 2.4.0
- prometheus-client: 0.13.1
- prompt-toolkit: 3.0.28
- protobuf: 3.19.4
- ptyprocess: 0.7.0
- pure-eval: 0.2.2
- pyasn1: 0.4.8
- pyasn1-modules: 0.2.8
- pycparser: 2.21
- pydantic: 1.9.0
- pydeprecate: 0.3.2
- pyembree: 0.1.6
- pyglet: 1.5.23
- pygments: 2.11.2
- pymatting: 1.1.5
- pymcubes: 0.1.2
- pymeshlab: 2022.2.post2
- pyopengl: 3.1.0
- pyopengl-accelerate: 3.1.5
- pyparsing: 3.0.7
- pyquaternion: 0.9.9
- pyrender: 0.1.45
- pyrsistent: 0.18.1
- pysocks: 1.7.1
- python-dateutil: 2.8.2
- python-multipart: 0.0.5
- pytorch-lightning: 1.6.0
- pytorch3d: 0.6.1
- pytube: 12.1.0
- pytz: 2022.1
- pywavelets: 1.2.0
- pyyaml: 6.0
- pyzmq: 22.3.0
- qudida: 0.0.4
- rembg: 2.0.8
- requests: 2.27.1
- requests-oauthlib: 1.3.1
- rsa: 4.8
- rtree: 0.9.7
- scikit-image: 0.19.1
- scikit-learn: 1.0.2
- scipy: 1.5.2
- send2trash: 1.8.0
- setuptools: 60.9.3
- setuptools-scm: 6.4.2
- shapely: 1.7.1
- six: 1.16.0
- smplx: 0.1.26
- sniffio: 1.2.0
- soupsieve: 2.3.1
- stack-data: 0.2.0
- starlette: 0.17.1
- tabulate: 0.8.9
- tensorboard: 2.8.0
- tensorboard-data-server: 0.6.1
- tensorboard-plugin-wit: 1.8.1
- termcolor: 1.1.0
- terminado: 0.13.3
- testpath: 0.6.0
- threadpoolctl: 3.1.0
- tifffile: 2022.2.9
- tinycss2: 1.1.1
- toml: 0.10.2
- tomli: 2.0.1
- tomlkit: 0.10.2
- torch: 1.8.2
- torchgeometry: 0.1.2
- torchmetrics: 0.7.2
- torchvision: 0.9.2
- tornado: 6.1
- tqdm: 4.62.3
- traitlets: 5.1.1
- trimesh: 3.9.35
- twisted: 22.4.0
- txaio: 22.2.1
- typing-extensions: 4.1.1
- urllib3: 1.26.8
- usd-core: 22.3
- uvicorn: 0.17.0
- vedo: 2022.2.3
- voxelize-cuda: 0.0.0
- vtk: 9.0.3
- wcwidth: 0.2.5
- webencodings: 0.5.1
- websocket-client: 1.3.2
- werkzeug: 2.0.3
- wheel: 0.37.1
- widgetsnbextension: 3.5.2
- wrapt: 1.14.1
- wslink: 1.4.3
- yacs: 0.1.8
- yarl: 1.7.2
- yolov3: 0.1
- zipp: 3.7.0
- zope.interface: 5.4.0
* System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.8.12
- version: #140~18.04.1-Ubuntu SMP Fri Aug 5 11:43:34 UTC 2022
@awaelchli
https://github.com/mkocabas/PARE/blob/5278450e08189dbc25487a28d93c13942182ed6a/pare/core/trainer.py#L559
I commented out this line of code, which appends (b, 6890, 3) vertex locations to self.evaluation_results, and the program then ran normally.
https://github.com/mkocabas/PARE/blob/5278450e08189dbc25487a28d93c13942182ed6a/pare/core/trainer.py#L788
Although this line releases self.evaluation_results by assigning a null value to it in validation_epoch_end, it's weird that the memory didn't go down and kept growing. This variable isn't referenced by any other variable either. I'm still trying to figure out why the commented-out line caused the bug.
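To illustrate one common cause of this kind of growth (a rough sketch with illustrative names, not the actual PARE code): appending tensors that still carry their autograd graph keeps every intermediate buffer alive, whereas detaching, moving to CPU, and rebinding the container at epoch end lets Python free the old data:

import torch


class EvaluationBuffer:
    # Rough sketch of the suspected leak pattern (illustrative names only).
    def __init__(self):
        self.evaluation_results = {"vertices": []}

    def append_leaky(self, pred_vertices: torch.Tensor):
        # Leaky: the stored tensor still references its autograd graph (and any
        # GPU storage), so every appended batch keeps a chain of intermediate
        # buffers alive until the graph itself is released.
        self.evaluation_results["vertices"].append(pred_vertices)

    def append_safe(self, pred_vertices: torch.Tensor):
        # Safer: break the graph reference and copy to host memory first.
        self.evaluation_results["vertices"].append(pred_vertices.detach().cpu())

    def reset(self):
        # Rebinding to a fresh, empty container drops the old references so the
        # accumulated tensors can actually be garbage-collected.
        self.evaluation_results = {"vertices": []}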
@awaelchli
Thanks for your detailed advice!
I have tried option 1, but the memory usage still went up. I want to confirm one thing: once the dataset has been consumed completely (i.e. the dataloader has gone through every item in the dataset), will lightning_module.train_dataloader() be called again?
I will try solution 2 next.
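One quick way to check that is to count calls to the hook with a dummy module (a rough sketch; with reload_dataloaders_every_n_epochs=1 it should fire once per epoch, otherwise only once at the start of training):

import logging

import pytorch_lightning as pl
import torch
from torch.utils.data import DataLoader, TensorDataset

log = logging.getLogger(__name__)


class DataloaderReloadCheck(pl.LightningModule):
    # Dummy module that counts how often train_dataloader() is rebuilt.
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)
        self.dataloader_calls = 0

    def train_dataloader(self):
        self.dataloader_calls += 1
        log.info("train_dataloader() call #%d (epoch %d)",
                 self.dataloader_calls, self.current_epoch)
        dataset = TensorDataset(torch.randn(64, 4), torch.randn(64, 1))
        return DataLoader(dataset, batch_size=8)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

Running it with pl.Trainer(max_epochs=3, reload_dataloaders_every_n_epochs=1) should log three calls; without that flag, only one.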