Segmentation fault when using multiple GPUs
Training stops after one epoch when using multiple GPUs, ending with a segmentation fault. With the same script and model weights, training succeeds on a single GPU. [The memory on the single GPU was the same as the memory of each GPU on the multi-GPU machine: 16 GB.]
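For context, a minimal sketch of the two invocations being compared, assuming the standard ESPnet1 recipe layout where run.sh exposes an --ngpu option (option names and defaults may differ in your local copy of the AN4 recipe):

    # single GPU (16 GB): training completes normally
    ./run.sh --ngpu 1

    # four GPUs (16 GB each): training segfaults after the first epoch
    ./run.sh --ngpu 4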
Basic environments:
- OS information: Linux 5.4.0-1037-aws #39~18.04.1-Ubuntu SMP Fri Jan 15 02:48:42 UTC 2021 x86_64
- python version: 3.7.9 | packaged by conda-forge | (default, Feb 13 2021, 20:03:11) [GCC 9.3.0]
- espnet version: espnet 0.9.7
- Git hash: b1753d8397546c0684556504aab75efec9aacb22
- Commit date: Sun Feb 14 19:54:42 2021 +0900
- pytorch version: pytorch 1.4.0
Environments from torch.utils.collect_env:
Collecting environment information…
PyTorch version: 1.4.0
Is debug build: No
CUDA used to build PyTorch: 10.1
OS: Ubuntu 18.04.5 LTS
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
CMake version: version 3.18.2
Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration:
GPU 0: Tesla T4
GPU 1: Tesla T4
GPU 2: Tesla T4
GPU 3: Tesla T4
Nvidia driver version: 450.80.02
cuDNN version: Probably one of the following:
/usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudnn.so.7.6.5
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn.so.7.6.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn.so.8.0.4
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.0.4
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.0.4
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.0.4
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.0.4
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.0.4
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.0.4
Versions of relevant libraries:
[pip3] numpy==1.20.1
[pip3] pytorch-ranger==0.1.1
[pip3] pytorch-wpe==0.0.0
[pip3] torch==1.4.0
[pip3] torch-complex==0.2.0
[pip3] torch-optimizer==0.1.0
[pip3] torchaudio==0.4.0
[pip3] warpctc-pytorch==0.2.1
[conda] mkl 2020.2 256
[conda] pytorch 1.4.0 py3.7_cuda10.1.243_cudnn7.6.3_0 pytorch
[conda] pytorch-ranger 0.1.1 pypi_0 pypi
[conda] pytorch-wpe 0.0.0 pypi_0 pypi
[conda] torch-complex 0.2.0 pypi_0 pypi
[conda] torch-optimizer 0.1.0 pypi_0 pypi
[conda] torchaudio 0.4.0 pypi_0 pypi
[conda] warpctc-pytorch 0.2.1 pypi_0 pypi
Task information:
- Task: ASR
- ESPnet1
Error: Segmentation Fault. [No log file or additional error description in the output.]
Top GitHub Comments
Thanks.
I don't know the details of PYTHONFAULTHANDLER. How about gdb, if you want to dive into it further?
I think ESPnet2 may work in your environment, but I recommend reinstalling Python and PyTorch rather than sticking with your current environment. Indeed, your problem is not a fault of espnet. Anyway, since there is nothing we can do on the espnet side, I'll close this issue.
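A minimal sketch of the gdb approach suggested above, assuming the training process is the Python entry point recorded in the experiment's train.log (copy the exact interpreter path and arguments from there; the paths below are placeholders):

    gdb --args python /path/to/espnet/bin/asr_train.py <arguments copied from train.log>
    (gdb) run
    # ... training runs until the crash ...
    (gdb) bt                   # native backtrace at the segmentation fault
    (gdb) thread apply all bt  # backtraces for all threads, useful with data loaders
    (gdb) py-bt                # Python-level backtrace, if gdb's python extensions are available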
Hi, I have used the scripts from the AN4 egs directory; is there any recommendation for which recipe is considered the golden recipe for ASR?
Also, I exported PYTHONFAULTHANDLER=0 but couldn't get the trace for the segmentation error. Can you guide me on how to find the location where the segmentation fault is happening?
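As a hedged follow-up sketch: per the CPython documentation, faulthandler is enabled when PYTHONFAULTHANDLER is set to a non-empty string (conventionally 1), or via the -X faulthandler interpreter option; it prints the Python traceback of each thread on SIGSEGV, but for crashes inside native CUDA/cuDNN code the gdb backtrace above is usually more informative. The training commands below are placeholders:

    # enable the fault handler for the training process and any Python children it spawns
    export PYTHONFAULTHANDLER=1
    ./run.sh --ngpu 4

    # or enable it for a single invocation without touching the environment
    python -X faulthandler /path/to/espnet/bin/asr_train.py <arguments copied from train.log>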