Segmentation fault when using multiple GPUs
Training stops after one epoch when using multiple GPUs, ending with a segmentation fault. With the same script and model weights, training succeeds on a single GPU. [The memory on the single GPU was the same as the memory of each GPU on the multi-GPU machine: 16 GB.]
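For context, a minimal sketch of the two invocations being compared, assuming the standard ESPnet1 recipe layout where run.sh exposes an --ngpu option (option names and defaults may differ in your local copy of the AN4 recipe):

    # single GPU (16 GB): training completes normally
    ./run.sh --ngpu 1

    # four GPUs (16 GB each): training segfaults after the first epoch
    ./run.sh --ngpu 4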
Basic environments:
- OS information: Linux 5.4.0-1037-aws #39~18.04.1-Ubuntu SMP Fri Jan 15 02:48:42 UTC 2021 x86_64
- python version: 3.7.9 | packaged by conda-forge | (default, Feb 13 2021, 20:03:11) [GCC 9.3.0]
- espnet version: espnet 0.9.7
- Git hash: b1753d8397546c0684556504aab75efec9aacb22
- Commit date: Sun Feb 14 19:54:42 2021 +0900
- pytorch version: pytorch 1.4.0
Environments from torch.utils.collect_env:
Collecting environment information…
PyTorch version: 1.4.0
Is debug build: No
CUDA used to build PyTorch: 10.1
OS: Ubuntu 18.04.5 LTS
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
CMake version: version 3.18.2
Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration:
GPU 0: Tesla T4
GPU 1: Tesla T4
GPU 2: Tesla T4
GPU 3: Tesla T4
Nvidia driver version: 450.80.02
cuDNN version: Probably one of the following:
/usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudnn.so.7.6.5
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn.so.7.6.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn.so.8.0.4
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.0.4
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.0.4
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.0.4
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.0.4
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.0.4
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.0.4
Versions of relevant libraries:
[pip3] numpy==1.20.1
[pip3] pytorch-ranger==0.1.1
[pip3] pytorch-wpe==0.0.0
[pip3] torch==1.4.0
[pip3] torch-complex==0.2.0
[pip3] torch-optimizer==0.1.0
[pip3] torchaudio==0.4.0
[pip3] warpctc-pytorch==0.2.1
[conda] mkl 2020.2 256
[conda] pytorch 1.4.0 py3.7_cuda10.1.243_cudnn7.6.3_0 pytorch
[conda] pytorch-ranger 0.1.1 pypi_0 pypi
[conda] pytorch-wpe 0.0.0 pypi_0 pypi
[conda] torch-complex 0.2.0 pypi_0 pypi
[conda] torch-optimizer 0.1.0 pypi_0 pypi
[conda] torchaudio 0.4.0 pypi_0 pypi
[conda] warpctc-pytorch 0.2.1 pypi_0 pypi
Task information:
- Task: ASR
- ESPnet1
Error: Segmentation Fault. [No log file or additional error description in the output.]
Top GitHub Comments
Thanks.
I don't know the details of PYTHONFAULTHANDLER. How about gdb, if you want to dive into it further?
I think ESPnet2 may work in your environment, but I recommend reinstalling Python and PyTorch rather than sticking with your current environment. Indeed, your problem is not a fault of espnet. Anyway, since there is nothing we can do on the espnet side, I'll close this issue.
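A minimal sketch of the gdb approach suggested above, assuming the training process is the Python entry point recorded in the experiment's train.log (copy the exact interpreter path and arguments from there; the paths below are placeholders):

    gdb --args python /path/to/espnet/bin/asr_train.py <arguments copied from train.log>
    (gdb) run
    # ... training runs until the crash ...
    (gdb) bt                   # native backtrace at the segmentation fault
    (gdb) thread apply all bt  # backtraces for all threads, useful with data loaders
    (gdb) py-bt                # Python-level backtrace, if gdb's python extensions are available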
Hi, I have used the scripts from the AN4 egs directory; is there any recommendation for which recipe is considered the golden recipe for ASR?
Also, I exported PYTHONFAULTHANDLER=0 but couldn't get the trace for the segmentation error. Can you guide me on how to find the location where the segmentation fault is happening?
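As a hedged follow-up sketch: per the CPython documentation, faulthandler is enabled when PYTHONFAULTHANDLER is set to a non-empty string (conventionally 1), or via the -X faulthandler interpreter option; it prints the Python traceback of each thread on SIGSEGV, but for crashes inside native CUDA/cuDNN code the gdb backtrace above is usually more informative. The training commands below are placeholders:

    # enable the fault handler for the training process and any Python children it spawns
    export PYTHONFAULTHANDLER=1
    ./run.sh --ngpu 4

    # or enable it for a single invocation without touching the environment
    python -X faulthandler /path/to/espnet/bin/asr_train.py <arguments copied from train.log>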