CUDA_ERROR_LAUNCH_FAILED when training on GPU locally
Hi, I’m trying to train a model locally (adapting the code from train_autoencoder.ipynb), and I’m getting the error in the title just before the model is supposed to start training. The complete log is copied below. My configuration is as follows:
- Tensorflow 2.1
- CUDA 10.1
- cudnn 7.6.5 for CUDA 10.1
2020-02-21 13:39:39.259132: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-02-21 13:39:41.110202: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
I0221 13:39:43.156791 2672 train_util.py:56] Defaulting to MirroredStrategy
2020-02-21 13:39:43.164404: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2020-02-21 13:39:43.237886: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1558] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce RTX 2060 SUPER computeCapability: 7.5
coreClock: 1.68GHz coreCount: 34 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 417.29GiB/s
2020-02-21 13:39:43.241122: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-02-21 13:39:43.246274: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-02-21 13:39:43.250949: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-02-21 13:39:43.253287: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-02-21 13:39:43.257189: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-02-21 13:39:43.261498: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-02-21 13:39:43.269133: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-02-21 13:39:43.271574: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1700] Adding visible gpu devices: 0
2020-02-21 13:39:43.272927: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-02-21 13:39:43.275556: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1558] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce RTX 2060 SUPER computeCapability: 7.5
coreClock: 1.68GHz coreCount: 34 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 417.29GiB/s
2020-02-21 13:39:43.278705: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-02-21 13:39:43.280447: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-02-21 13:39:43.282142: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-02-21 13:39:43.283834: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-02-21 13:39:43.285671: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-02-21 13:39:43.287438: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-02-21 13:39:43.289994: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-02-21 13:39:43.291835: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1700] Adding visible gpu devices: 0
2020-02-21 13:39:43.970857: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1099] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-02-21 13:39:43.973353: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] 0
2020-02-21 13:39:43.974871: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1118] 0: N
2020-02-21 13:39:43.976781: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1244] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6306 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2060 SUPER, pci bus id: 0000:01:00.0, compute capability: 7.5)
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I0221 13:39:43.974044 2672 mirrored_strategy.py:501] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I0221 13:39:44.343264 2672 train_util.py:201] Building the model...
WARNING:tensorflow:From c:\users\andrey\anaconda3\envs\test\lib\site-packages\tensorflow\python\ops\resource_variable_ops.py:1809: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
W0221 13:39:48.817270 3952 deprecation.py:506] From c:\users\andrey\anaconda3\envs\test\lib\site-packages\tensorflow\python\ops\resource_variable_ops.py:1809: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
2020-02-21 13:39:52.821030: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-02-21 13:39:53.103556: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-02-21 13:39:53.327462: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
I0221 13:39:54.833573 2672 train_util.py:172] Restoring from checkpoint...
I0221 13:39:54.833573 2672 train_util.py:184] No checkpoint, skipping.
I0221 13:39:54.833573 2672 train_util.py:256] Creating metrics for ListWrapper(['spectral_loss', 'total_loss'])
2020-02-21 13:40:02.551385: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-02-21 13:40:02.554137: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1
Fatal Python error: Aborted
Thread 0x00000a70 (most recent call first):
File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\tensorflow\python\eager\execute.py", line 60 in quick_execute
File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\tensorflow\python\eager\function.py", line 598 in call
File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\tensorflow\python\eager\function.py", line 1741 in _call_flat
File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\tensorflow\python\eager\function.py", line 1660 in _filtered_call
File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\tensorflow\python\eager\def_function.py", line 646 in _call
File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\tensorflow\python\eager\def_function.py", line 576 in __call__
File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\ddsp\training\train_util.py", line 273 in train
File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\gin\config.py", line 1055 in gin_wrapper
File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\ddsp\training\ddsp_run.py", line 151 in main
File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\absl\app.py", line 250 in _run_main
File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\absl\app.py", line 299 in run
File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\ddsp\training\ddsp_run.py", line 172 in console_entry_point
File "C:\Users\andrey\Anaconda3\envs\test\Scripts\ddsp_run.exe\__main__.py", line 7 in <module>
File "c:\users\andrey\anaconda3\envs\test\lib\runpy.py", line 85 in _run_code
File "c:\users\andrey\anaconda3\envs\test\lib\runpy.py", line 193 in _run_module_as_main
I can’t pinpoint where the problem is, because:
- TensorFlow trains correctly on the GPU with a toy example (see the sketch after this list), so it is configured correctly to work with CUDA
- TensorFlow trains DDSP correctly when run on the CPU
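For reference, the toy check is roughly along these lines (a minimal sketch with made-up layer sizes, not the exact script I ran) and it trains on the GPU without errors:

```python
# Hypothetical minimal GPU sanity check (sketch, not the original toy example):
# train a tiny Keras model on random data to confirm TF can launch kernels on the GPU.
import numpy as np
import tensorflow as tf

# Should list the RTX 2060 SUPER as a physical GPU.
print(tf.config.experimental.list_physical_devices('GPU'))

x = np.random.rand(1024, 32).astype('float32')
y = np.random.rand(1024, 1).astype('float32')

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(32,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse')
model.fit(x, y, epochs=2, batch_size=64)  # completes without any CUDA errors
```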
This is on a Windows system. On Ubuntu the situation was the same, but I was getting the following error instead:
Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
Any help will be appreciated.
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Thanks for looking into this!
It seems you’re using a GPU with about half the memory of what we’ve been testing on (a V100), so sorry you bumped into this edge case.
I am a little confused why that code snippet works (since we don’t use sessions in 2.0), but I assume it’s somehow tapping into the same backend. Can you try the TF 2.0 code from https://www.tensorflow.org/guide/gpu#limiting_gpu_memory_growth and see if it works for you too?
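For context, the memory-growth pattern from that guide looks roughly like the sketch below (TF 2.x API; not a verbatim copy of the linked snippet). It has to run before the GPU is initialized, i.e. before the model is built:

```python
# Enable GPU memory growth so TF allocates GPU memory incrementally
# instead of grabbing (nearly) all of it up front.
import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Memory growth must be set the same way for every visible GPU.
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), 'Physical GPUs,', len(logical_gpus), 'Logical GPUs')
    except RuntimeError as e:
        # Memory growth can only be set before the GPUs have been initialized.
        print(e)
```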
I got a similar issue while training on a T4:
failed to initialize batched cufft plan with customized allocator: Failed to make cuFFT batched plan. Fatal Python error: Aborted
The code suggested by jesseengel (https://github.com/magenta/ddsp/issues/29#issuecomment-591199173) fixed the issue.