
[BUG] getting started `03-Training-with-TF` nb gives OOM on a 16 GB GPU


Describe the bug

I am getting an OOM error when I run the `03-Training-with-TF` notebook on a 16 GB GPU. This can be a problem for users who run these notebooks in the cloud on GPUs with similar memory sizes.

I can avoid the OOM if I comment out the `os.environ["TF_MEMORY_ALLOCATION"] = "0.7"` line. The notebook then runs fine.
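To put the numbers in perspective, here is a rough sketch of how a fractional `TF_MEMORY_ALLOCATION` value would divide a 16 GB GPU. It assumes that a value below 1 is interpreted as a fraction of total device memory, which the issue does not state explicitly. The helper `split_gpu_memory` is hypothetical and is not part of NVTabular or TensorFlow.

```python
from typing import Tuple


def split_gpu_memory(total_gb: float, tf_fraction: float) -> Tuple[float, float]:
    """Return (GB reserved for TensorFlow, GB left for everything else).

    Hypothetical helper for illustration only. It assumes the setting is a
    fraction of total GPU memory, which is an assumption, not a documented fact.
    """
    if not 0 < tf_fraction < 1:
        raise ValueError("tf_fraction must be strictly between 0 and 1")
    tf_gb = total_gb * tf_fraction
    return tf_gb, total_gb - tf_gb


# Compare the value set in the notebook (0.7) with a lower value (0.5).
for fraction in (0.7, 0.5):
    tf_gb, rest_gb = split_gpu_memory(16, fraction)
    print(f"fraction={fraction}: TensorFlow {tf_gb:.1f} GB, remaining {rest_gb:.1f} GB")
```

Under this assumption, a larger fraction leaves less memory for other GPU consumers in the same process. Whether that is the actual cause of the OOM is not confirmed in the issue.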

Steps/Code to reproduce bug

Run 03-Training-with-TF to repro.

Expected behavior

The notebook should run without OOM. I recommend removing the `os.environ["TF_MEMORY_ALLOCATION"] = "0.7"` line from the example nb.

Environment details (please complete the following information):

Environment location: [Bare-metal, Docker, Cloud(specify cloud provider)]
Method of NVTabular install: [conda, Docker, or from source]
- If method of install is [Docker], provide docker pull & docker run commands used

I am using merlin-tensorflow-training:22.04.


Top GitHub Comments


rnyak commented, May 4, 2022

@EvenOldridge and @karlhigley My understanding is that `os.environ["TF_MEMORY_ALLOCATION"] = "0.5"` is currently the default behavior under the hood (because of `configure_tensorflow()`) if we use the NVT Keras dataloader. So I don't think we need to define it again in the notebook.
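The fallback described in this comment can be sketched as follows. This is a minimal illustration, not the library's code: `effective_tf_memory_allocation` is a hypothetical helper, it does not call `configure_tensorflow()`, and the `0.5` default is taken from the comment rather than verified against NVTabular.

```python
import os
from typing import Mapping, Optional

# Default fraction as reported in the comment above (assumption, not verified).
ASSUMED_DEFAULT = "0.5"


def effective_tf_memory_allocation(env: Optional[Mapping[str, str]] = None) -> float:
    """Return the value that would apply for a given environment.

    Hypothetical helper: reads TF_MEMORY_ALLOCATION if it is set and
    otherwise falls back to the assumed default.
    """
    env = os.environ if env is None else env
    return float(env.get("TF_MEMORY_ALLOCATION", ASSUMED_DEFAULT))


# With the notebook line present, the explicit value wins.
print(effective_tf_memory_allocation({"TF_MEMORY_ALLOCATION": "0.7"}))
# With the line removed, the assumed default applies.
print(effective_tf_memory_allocation({}))
```

If the default really is applied this way, removing the line from the notebook changes the effective value from 0.7 to 0.5 instead of leaving the setting undefined.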


karlhigley commented, Apr 22, 2022

Which example is this for? Let me see how it goes on an 11GB GPU. 😺