OOM and CUDA Errors
Hi,
I am getting an OOM error on Colab (P100, 16 GB) with the following:
cd DiffAugment-stylegan2
python run_few_shot.py --dataset=100-shot-obama --num-gpus=1
Traceback (most recent call last):
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[32,128,256,256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node GPU0/loss/D_1/256x256/Conv0/FusedBiasAct}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[TrainG/Apply0/cond_111/pred_id/_2541]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
(1) Resource exhausted: OOM when allocating tensor with shape[32,128,256,256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node GPU0/loss/D_1/256x256/Conv0/FusedBiasAct}}]]
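Following the hint in the traceback, the per-tensor allocation report can be enabled by passing RunOptions into the session call. A minimal TF 1.x sketch (the toy op stands in for the repo's actual training op):

```python
import tensorflow as tf  # TF 1.x, matching the Colab environment above

# Ask TF to report live tensor allocations if an OOM occurs during this run.
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)

x = tf.random.normal([4, 4])  # toy op standing in for the real training op
with tf.Session() as sess:
    sess.run(x, options=run_options)
```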
So I tried it on 8x V100s. With my dataset at 1024x1024 it still gave an OOM error, but the obama dataset got as far as the log below and then failed with a different error.
tick 0 kimg 0.1 lod 0.00 minibatch 32 time 49s sec/tick 49.1 sec/kimg 383.77 maintenance 0.0 gpumem 6.3
Downloading http://d36zk2xti64re0.cloudfront.net/stylegan1/networks/metrics/inception_v3_features.pkl ... done
network-snapshot-000000 time 3m 09s fid5k-train 396.6058
2020-07-02 13:34:00.935725: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-07-02 13:34:00.935780: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1
Aborted (core dumped)
CUDA details:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
What is the maximum resolution supported on 16 GB of GPU memory? Sorry to mix two issues; I can open a separate issue for the CUDA error if needed.
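Not an authoritative answer, but as a back-of-the-envelope bound: the single activation named in the OOM message above is already 1 GiB, and training holds many such activations plus gradients and optimizer state, so batch 32 at 256x256 is tight on 16 GB even before going to 1024:

```python
# Size of the failing tensor from the OOM message: [32, 128, 256, 256], float32.
batch, channels, height, width = 32, 128, 256, 256
size_gib = batch * channels * height * width * 4 / 2**30  # 4 bytes per float32
print(f"{size_gib:.1f} GiB")  # -> 1.0 GiB for this one activation alone
```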
The adaptations described in Appendix A.2 are all minor compared to DiffAugment itself. So my recommendation is to first throw away all those hyperparameter changes and just apply the strongest DiffAugment to see whether it works. After that, there are plenty of hyperparameters to play with, but gamma and batch size seem to matter the most (for 32x32, a small gamma, e.g. 0.1, is likely the better choice). You could also start from run_cifar.py or run_few_shot.py.
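For context, the gamma above is the weight of StyleGAN2's R1 gradient penalty on real images. A minimal, self-contained TF 1.x sketch of the term it scales (the one-layer discriminator here is a toy stand-in, not the repo's network):

```python
import numpy as np
import tensorflow as tf  # TF 1.x, as used by DiffAugment-stylegan2

# Toy stand-in discriminator, for illustration only.
reals = tf.placeholder(tf.float32, [None, 3, 32, 32])
flat = tf.reshape(reals, [-1, 3 * 32 * 32])
real_scores = tf.layers.dense(flat, 1)

def r1_penalty(scores, images, gamma=0.1):
    # R1 regularization: (gamma / 2) * E[ ||grad_x D(x)||^2 ] at real images.
    grads = tf.gradients(tf.reduce_sum(scores), [images])[0]
    sq_norm = tf.reduce_sum(tf.square(grads), axis=[1, 2, 3])
    return gamma * 0.5 * tf.reduce_mean(sq_norm)

loss = r1_penalty(real_scores, reals, gamma=0.1)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    x = np.random.randn(4, 3, 32, 32).astype(np.float32)
    print(sess.run(loss, {reals: x}))
```

A smaller gamma means a weaker penalty on the discriminator, which tends to suit small, low-resolution datasets like 32x32 CIFAR.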
There might be some miscommunication here, so let me try to clarify. Feel free to correct me.
For the truncation trick: in StyleGAN2, truncation is only used for some figures (e.g., Fig. 12); it is not used in their FID calculation. We also don't use truncation in our FID calculations or figures.
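For reference, the truncation trick just pulls sampled latents toward the average latent to trade diversity for visual fidelity, which is why it is reserved for figures and disabled for FID. A minimal sketch with dummy arrays (w_avg and psi are the standard StyleGAN names):

```python
import numpy as np

def truncate(w, w_avg, psi=0.7):
    # Truncation trick: interpolate each sampled latent toward the average w.
    # psi = 1.0 leaves w unchanged, i.e., truncation disabled (as in FID runs).
    return w_avg + psi * (w - w_avg)

w_avg = np.zeros(512, dtype=np.float32)          # dummy average latent
w = np.random.randn(4, 512).astype(np.float32)   # dummy sampled latents
print(truncate(w, w_avg).shape)                  # (4, 512)
```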
There are multiple experiments involving StyleGAN2 in the paper, and I'm not sure which one you (@andersonfaaria) want to replicate. For the CIFAR-10 and CIFAR-100 experiments, as the images are smaller, we modify the architectures and reduce the capacity; this is detailed in Appendix A.2. We also use different losses (ones commonly used in CIFAR experiments).
For the few-shot generation experiments, we mostly follow the StyleGAN2 losses and architectures and only make small adjustments (e.g., gamma).