
CUDA_ERROR_LAUNCH_FAILED when training on GPU locally

See original GitHub issue

Hi, I’m trying to train a model locally (adapting the code from train_autoencoder.ipynb), and I’m getting the error in the title just before training is supposed to start. The complete log is below. My configuration is as follows:

  • TensorFlow 2.1
  • CUDA 10.1
  • cuDNN 7.6.5 for CUDA 10.1
2020-02-21 13:39:39.259132: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-02-21 13:39:41.110202: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
I0221 13:39:43.156791  2672 train_util.py:56] Defaulting to MirroredStrategy
2020-02-21 13:39:43.164404: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2020-02-21 13:39:43.237886: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1558] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce RTX 2060 SUPER computeCapability: 7.5
coreClock: 1.68GHz coreCount: 34 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 417.29GiB/s
2020-02-21 13:39:43.241122: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-02-21 13:39:43.246274: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-02-21 13:39:43.250949: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-02-21 13:39:43.253287: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-02-21 13:39:43.257189: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-02-21 13:39:43.261498: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-02-21 13:39:43.269133: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-02-21 13:39:43.271574: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1700] Adding visible gpu devices: 0
2020-02-21 13:39:43.272927: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-02-21 13:39:43.275556: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1558] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce RTX 2060 SUPER computeCapability: 7.5
coreClock: 1.68GHz coreCount: 34 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 417.29GiB/s
2020-02-21 13:39:43.278705: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-02-21 13:39:43.280447: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-02-21 13:39:43.282142: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-02-21 13:39:43.283834: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-02-21 13:39:43.285671: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-02-21 13:39:43.287438: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-02-21 13:39:43.289994: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-02-21 13:39:43.291835: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1700] Adding visible gpu devices: 0
2020-02-21 13:39:43.970857: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1099] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-02-21 13:39:43.973353: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105]      0
2020-02-21 13:39:43.974871: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1118] 0:   N
2020-02-21 13:39:43.976781: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1244] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6306 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2060 SUPER, pci bus id: 0000:01:00.0, compute capability: 7.5)
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I0221 13:39:43.974044  2672 mirrored_strategy.py:501] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I0221 13:39:44.343264  2672 train_util.py:201] Building the model...
WARNING:tensorflow:From c:\users\andrey\anaconda3\envs\test\lib\site-packages\tensorflow\python\ops\resource_variable_ops.py:1809: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
W0221 13:39:48.817270  3952 deprecation.py:506] From c:\users\andrey\anaconda3\envs\test\lib\site-packages\tensorflow\python\ops\resource_variable_ops.py:1809: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
2020-02-21 13:39:52.821030: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-02-21 13:39:53.103556: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-02-21 13:39:53.327462: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
I0221 13:39:54.833573  2672 train_util.py:172] Restoring from checkpoint...
I0221 13:39:54.833573  2672 train_util.py:184] No checkpoint, skipping.
I0221 13:39:54.833573  2672 train_util.py:256] Creating metrics for ListWrapper(['spectral_loss', 'total_loss'])
2020-02-21 13:40:02.551385: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-02-21 13:40:02.554137: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1
Fatal Python error: Aborted

Thread 0x00000a70 (most recent call first):
  File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\tensorflow\python\eager\execute.py", line 60 in quick_execute
  File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\tensorflow\python\eager\function.py", line 598 in call
  File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\tensorflow\python\eager\function.py", line 1741 in _call_flat
  File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\tensorflow\python\eager\function.py", line 1660 in _filtered_call
  File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\tensorflow\python\eager\def_function.py", line 646 in _call
  File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\tensorflow\python\eager\def_function.py", line 576 in __call__
  File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\ddsp\training\train_util.py", line 273 in train
  File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\gin\config.py", line 1055 in gin_wrapper
  File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\ddsp\training\ddsp_run.py", line 151 in main
  File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\absl\app.py", line 250 in _run_main
  File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\absl\app.py", line 299 in run
  File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\ddsp\training\ddsp_run.py", line 172 in console_entry_point
  File "C:\Users\andrey\Anaconda3\envs\test\Scripts\ddsp_run.exe\__main__.py", line 7 in <module>
  File "c:\users\andrey\anaconda3\envs\test\lib\runpy.py", line 85 in _run_code
  File "c:\users\andrey\anaconda3\envs\test\lib\runpy.py", line 193 in _run_module_as_main

I can’t put my finger on where the problem is, because:

  • TensorFlow trains correctly on the GPU with a toy example, so it is configured to work with CUDA
  • TensorFlow trains DDSP correctly when run on the CPU
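The first bullet can be reproduced with a toy check along these lines (a minimal sketch, assuming TensorFlow 2.x with GPU support is installed; the model and data here are arbitrary placeholders):

```python
import tensorflow as tf

# Sanity check: if a tiny model trains on the GPU without a launch
# failure, the CUDA/cuDNN installation itself is fine and the crash
# is specific to the DDSP training run.
print(tf.config.experimental.list_physical_devices('GPU'))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(8,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse')

# Random training data; one epoch is enough to exercise the GPU path.
x = tf.random.normal((256, 8))
y = tf.random.normal((256, 1))
history = model.fit(x, y, epochs=1, batch_size=32, verbose=0)
```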

This is on Windows. On Ubuntu the situation was the same, except the error there was: Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled

Any help would be appreciated.

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 1
  • Comments: 8 (7 by maintainers)

Top GitHub Comments

jesseengel commented on Feb 26, 2020 (2 reactions)

Thanks for looking into this!

It seems you’re using a GPU with about half the memory of the ones we’ve been testing on (V100), so sorry you bumped into this edge case.

I am a little confused why that code snippet works (since we don’t use sessions in 2.0), but I assume it’s somehow tapping into the same backend. Can you try the TF 2.0 code from https://www.tensorflow.org/guide/gpu#limiting_gpu_memory_growth and see if it works for you too?

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  try:
    # Currently, memory growth needs to be the same across GPUs
    for gpu in gpus:
      tf.config.experimental.set_memory_growth(gpu, True)
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Memory growth must be set before GPUs have been initialized
    print(e)
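If growth alone doesn’t help on a smaller card, a related workaround from the same TensorFlow guide (a sketch, not something tested in this thread; the 4096 MB figure is illustrative, pick a value below the card’s capacity) is to hard-cap how much GPU memory TensorFlow may allocate:

```python
import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  try:
    # Cap TensorFlow at ~4 GB on the first GPU instead of letting it
    # grab (nearly) all device memory up front.
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=4096)])
  except RuntimeError as e:
    # Virtual devices must be configured before GPUs have been initialized
    print(e)
```

Like set_memory_growth, this has to run before any op touches the GPU, so it belongs at the very top of the training script.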
erl-j commented on May 8, 2020 (0 reactions)

I ran into a similar issue while training on a T4:

failed to initialize batched cufft plan with customized allocator: Failed to make cuFFT batched plan. Fatal Python error: Aborted

The code suggested by jesseengel (https://github.com/magenta/ddsp/issues/29#issuecomment-591199173) fixed the issue.

