
OOM and CUDA Errors

See original GitHub issue

Hi,

I am getting an OOM error on Colab (P100, 16 GB GPU memory) with the following:

cd DiffAugment-stylegan2
python run_few_shot.py --dataset=100-shot-obama --num-gpus=1
Traceback (most recent call last):
  File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[32,128,256,256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node GPU0/loss/D_1/256x256/Conv0/FusedBiasAct}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[TrainG/Apply0/cond_111/pred_id/_2541]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[32,128,256,256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node GPU0/loss/D_1/256x256/Conv0/FusedBiasAct}}]]
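For scale, the single activation the allocator failed on is already enormous on its own. A quick back-of-the-envelope check (plain Python, assuming the float32 shape reported in the error above):

```python
# The tensor that failed to allocate: shape [32, 128, 256, 256], dtype float32.
batch, channels, height, width = 32, 128, 256, 256
bytes_needed = batch * channels * height * width * 4  # 4 bytes per float32 element
print(bytes_needed / 2**30)  # → 1.0 (GiB for this single activation)
```

That is one GiB for a single intermediate feature map, before counting weights, gradients, optimizer state, or any other layer's activations, which is why a 16 GB card can run out at minibatch 32.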

So I tried it on 8x V100s. With my dataset at 1024 resolution it still gave an OOM error, but the obama dataset got as far as the log below and then failed with a different error.

tick 0     kimg 0.1      lod 0.00  minibatch 32   time 49s          sec/tick 49.1    sec/kimg 383.77  maintenance 0.0    gpumem 6.3
Downloading http://d36zk2xti64re0.cloudfront.net/stylegan1/networks/metrics/inception_v3_features.pkl ... done
network-snapshot-000000        time 3m 09s       fid5k-train 396.6058
2020-07-02 13:34:00.935725: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-07-02 13:34:00.935780: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1
Aborted (core dumped)

CUDA details:

nvcc: NVIDIA ® Cuda compiler driver
Copyright © 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

What is the maximum resolution supported on 16 GB of GPU memory? Sorry to mix two issues; I can open a separate issue for the CUDA error if needed.
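On the resolution question, a rough way to reason about it: at a fixed batch size and channel count, activation memory grows with the square of the resolution. A sketch in plain Python (the channel count is taken from the failing tensor above, not the whole network, so these are lower bounds):

```python
def activation_gib(batch, channels, res, bytes_per_elem=4):
    """Rough size of one [batch, channels, res, res] float32 feature map in GiB."""
    return batch * channels * res * res * bytes_per_elem / 2**30

for res in (256, 512, 1024):
    print(f"{res}px: {activation_gib(32, 128, res):.1f} GiB")
# → 256px: 1.0 GiB, 512px: 4.0 GiB, 1024px: 16.0 GiB
```

So at 1024 resolution a single such feature map would already fill a 16 GB card at minibatch 32. Memory scales linearly in the per-GPU minibatch, which is why shrinking it (or the channel multiplier) is the usual first fix.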

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 11

Top GitHub Comments

zsyzzsoft commented, Jul 9, 2020 (2 reactions)

There might be some miscommunication here. Let me try to clarify them here. Feel free to correct me.

  1. For the truncation trick, truncation was only used in StyleGAN2 in some figures (e.g., Fig 12). It is not used in their FID calculation. We also don’t use truncation in our FID calculation and figures.
  2. There are multiple experiments that involve StyleGAN2 in the paper. Not sure which one you (@andersonfaaria) want to replicate. For CIFAR 10 and 100 experiments, as images are smaller, we modify the architectures and reduce the capacity. It is detailed in Appendix A.2. We also use different losses (which are commonly used in CIFAR experiments)
  3. For few-shot generation experiments, we mostly follow StyleGAN2 losses and architectures and only make small adjustments. (e.g., gamma)

Thank you so much. At first I thought you were only using BigGAN for CIFAR, but that was my misreading. To put it in simpler terms, I'm actually trying to build a version of StyleGAN2 that works best for a problem I'm tackling. I'll give some more background, because I'm really amazed by the results you achieved with this version:

  • I have a very diverse dataset of ~10k pixel-art images in 32x32 format with a pink background. They are basically several different items and creatures from open-source databases that I crawled. Because the set is so diverse, I used a classifier to separate them into weapons, armor parts, creatures, and miscellaneous. Within those groups we also track what kind of weapon or armor part each one is (helmet, armor, legs, boots, shields), and so on.

Counting all weapons together I have around ~300 samples, but split by kind I have a little under 100 of each. It seems to me, at least, that I would benefit more from training each kind individually rather than all of them at once.

I've been training with StyleGAN2 for a few months now, but even with a learning rate of 0.001 it usually mode-collapses with < 100 samples. Increasing the batch size made mode collapse worse; decreasing it helped, but the GAN ended up practically overfitted (to the point that the results were no longer interpretable). Since my dataset is so small, it seems perfect for the approach you were showing.

My question is: in your experience, do you think I should follow the CIFAR-10 approach described in A.2? My goal is to minimize noisy results and generate outputs that don't look like copies of the training data.

I'd really appreciate it if you could share some insights.

The adaptations described in A.2 are all minor compared to DiffAugment. So as a first step I recommend throwing away all those hyperparameter changes and just applying the strongest DiffAugment to see whether it works. After that, there are tons of hyperparameters to play with, but gamma and batch size seem to matter most (for 32x32, a small gamma, e.g. 0.1, is likely a better choice). You could also begin with run_cifar.py or run_few_shot.py.
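For context on why gamma matters so much: in StyleGAN2, gamma is the weight of the R1 gradient penalty on the discriminator, R1 = (gamma / 2) * E[||∇_x D(x)||²] over real images. A toy numeric sketch (NumPy; the linear critic here is hypothetical, chosen only because its input gradient is exact, not anything from the DiffAugment code):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=32 * 32)          # weights of a toy linear critic on 32x32 inputs
# For D(x) = w . x, the input gradient is w everywhere, so the penalty is exact:
grad_norm_sq = float(np.sum(w ** 2))  # ||grad_x D(x)||^2

for gamma in (10.0, 1.0, 0.1):        # 10 is a common StyleGAN2 default; 0.1 as suggested above
    r1_penalty = 0.5 * gamma * grad_norm_sq
    print(f"gamma={gamma}: R1 penalty = {r1_penalty:.1f}")
```

Dropping gamma from 10 to 0.1 weakens the smoothness pressure on D by a factor of 100, which shifts the training equilibrium far more than most other knobs, hence the advice to tune it first.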

junyanz commented, Jul 8, 2020 (1 reaction)

There might be some miscommunication here. Let me try to clarify them here. Feel free to correct me.

  1. For the truncation trick, truncation was only used in StyleGAN2 in some figures (e.g., Fig 12). It is not used in their FID calculation. We also don’t use truncation in our FID calculation and figures.

  2. There are multiple experiments that involve StyleGAN2 in the paper. Not sure which one you (@andersonfaaria) want to replicate. For CIFAR 10 and 100 experiments, as images are smaller, we modify the architectures and reduce the capacity. It is detailed in Appendix A.2. We also use different losses (which are commonly used in CIFAR experiments)

  3. For few-shot generation experiments, we mostly follow StyleGAN2 losses and architectures and only make small adjustments. (e.g., gamma)
