
Segmentation fault on using multiple GPUs

See original GitHub issue

Segmentation Fault Error.

Training stops after one epoch when using multiple GPUs, ending with a segmentation fault. With the same script and model weights, training completes successfully on a single GPU. [The memory of the single GPU was the same as the memory of each GPU on the multi-GPU machine: 16 GB.]

Basic environments:

  • OS information: Linux 5.4.0-1037-aws #39~18.04.1-Ubuntu SMP Fri Jan 15 02:48:42 UTC 2021 x86_64
  • Python version: 3.7.9 | packaged by conda-forge | (default, Feb 13 2021, 20:03:11) [GCC 9.3.0]
  • ESPnet version: espnet 0.9.7
  • Git hash: b1753d8397546c0684556504aab75efec9aacb22
    • Commit date: Sun Feb 14 19:54:42 2021 +0900
  • PyTorch version: pytorch 1.4.0

Environments from torch.utils.collect_env:

Collecting environment information…
PyTorch version: 1.4.0
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: Ubuntu 18.04.5 LTS
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
CMake version: version 3.18.2

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration:
GPU 0: Tesla T4
GPU 1: Tesla T4
GPU 2: Tesla T4
GPU 3: Tesla T4

Nvidia driver version: 450.80.02
cuDNN version: Probably one of the following:
/usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudnn.so.7.6.5
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn.so.7.6.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn.so.8.0.4
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.0.4
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.0.4
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.0.4
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.0.4
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.0.4
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.0.4

Versions of relevant libraries:
[pip3] numpy==1.20.1
[pip3] pytorch-ranger==0.1.1
[pip3] pytorch-wpe==0.0.0
[pip3] torch==1.4.0
[pip3] torch-complex==0.2.0
[pip3] torch-optimizer==0.1.0
[pip3] torchaudio==0.4.0
[pip3] warpctc-pytorch==0.2.1
[conda] mkl 2020.2 256
[conda] pytorch 1.4.0 py3.7_cuda10.1.243_cudnn7.6.3_0 pytorch
[conda] pytorch-ranger 0.1.1 pypi_0 pypi
[conda] pytorch-wpe 0.0.0 pypi_0 pypi
[conda] torch-complex 0.2.0 pypi_0 pypi
[conda] torch-optimizer 0.1.0 pypi_0 pypi
[conda] torchaudio 0.4.0 pypi_0 pypi
[conda] warpctc-pytorch 0.2.1 pypi_0 pypi
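
The report above comes from PyTorch's bundled environment collector (torch.utils.collect_env). If it needs to be regenerated, the same module can be called directly; a minimal sketch, assuming only a working PyTorch install:

    # Print the same environment summary that `python -m torch.utils.collect_env` emits.
    from torch.utils import collect_env

    print(collect_env.get_pretty_env_info())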

Task information:

  • Task: ASR
  • ESPnet1

Error: Segmentation fault. [No log file or additional error description in the output.]

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 7

Top GitHub Comments

1 reaction
kamo-naoyuki commented, Mar 17, 2021

Thanks.

I don’t know the details of PYTHONFAULTHANDLER. How about gdb, if you want to dive into it more?
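
For reference, a minimal sketch of what these suggestions amount to, assuming a generic PyTorch training script (the file names are placeholders, not part of any ESPnet recipe). PYTHONFAULTHANDLER simply turns on Python’s built-in faulthandler module at interpreter startup, which prints the Python traceback of every thread when the process receives a fatal signal such as SIGSEGV; the same effect can be obtained explicitly in the training entry point:

    # Sketch: what PYTHONFAULTHANDLER enables, done explicitly in code.
    # (Placeholder script, not ESPnet code.)
    import faulthandler

    # Dump the Python traceback of every thread when the process receives a
    # fatal signal such as SIGSEGV. Writing to a dedicated file keeps the
    # dump even if stderr is redirected by the launcher scripts.
    fault_log = open("segfault_trace.log", "w")
    faulthandler.enable(file=fault_log, all_threads=True)

    # ... the rest of the training code runs as usual ...

If the fault happens inside native code (CUDA, cuDNN, warp-ctc), faulthandler can only show the last Python frame that was executing; running the script under gdb instead (gdb --args python train.py, then run, and bt after the crash) gives the native backtrace.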

I think ESPnet2 may work in your environment, but I recommend reinstalling Python and PyTorch rather than sticking with your current environment. Indeed, your problem is not a fault of ESPnet. Anyway, there is nothing we can do on the ESPnet side, so I’ll close this issue.

0 reactions
SiddheshSingh commented, Mar 16, 2021

Could you tell us which recipe you used, for our records?

You can detect where the seg fault comes from using PYTHONFAULTHANDLER. This is a common technique.

export PYTHONFAULTHANDLER=0

However, in most cases the segmentation fault can’t be avoided in such an environment, because it’s a bug in Python or some other component. (In your case, some library might not be thread safe, so you might be able to avoid it with DistributedDataParallel, but espnet1 doesn’t support it; see the sketch after this comment.)

I recommend reinstalling your Python. Perhaps the Python build from conda-forge has some problems.

3.7.9 | packaged by conda-forge | (default, Feb 13 2021, 20:03:11) [GCC 9.3.0]

Hi, I used the scripts from the AN4 egs directory; is there any recommendation for which recipe is considered the golden one for ASR?

Also, I exported PYTHONFAULTHANDLER=0 but couldn’t get a trace for the segmentation error. Can you guide me on how to find the location where the segmentation fault is happening?
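
For context on the DistributedDataParallel remark above: DataParallel replicates the model across GPUs inside a single process using threads, while DistributedDataParallel runs one process per GPU, so a library that is not thread safe is never entered from several threads at once. Below is a minimal, generic PyTorch sketch of that pattern; it is not ESPnet code (espnet1 does not support this mode), and the script and model are placeholders.

    # Minimal DistributedDataParallel sketch: one process per GPU.
    # Launch (PyTorch 1.4 era) with:
    #   python -m torch.distributed.launch --nproc_per_node=4 ddp_sketch.py
    import argparse

    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP


    def main():
        parser = argparse.ArgumentParser()
        # torch.distributed.launch passes --local_rank to each process.
        parser.add_argument("--local_rank", type=int, default=0)
        args = parser.parse_args()

        # Each process binds to exactly one GPU and joins the process group.
        torch.cuda.set_device(args.local_rank)
        dist.init_process_group(backend="nccl", init_method="env://")

        model = nn.Linear(80, 40).cuda(args.local_rank)  # toy stand-in model
        model = DDP(model, device_ids=[args.local_rank])

        x = torch.randn(8, 80).cuda(args.local_rank)
        loss = model(x).sum()
        loss.backward()  # gradients are all-reduced across the processes

        dist.destroy_process_group()


    if __name__ == "__main__":
        main()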

Read more comments on GitHub >

Top Results From Across the Web

  • Segmentation fault on OpenCL when using multiple GPUs
    My simulation runs successfully with one GPU. I can either allow OpenMM to find the GPU, or specify it explicitly with simulation properties....
  • Tensorflow segmentation fault with single machine multiple ...
    I believe that what I really do is to collect and average gradients from multiple GPUs and then, update the parameters in my...
  • Segmentation Fault when using GPU - Google Groups
    Everything seems to work fine with the CPU but I get seg faults with the GPU. rescomp-12-250088:Project Brett$ python gputest.py.
  • Multigpu, Segmentation fault - PyTorch Forums
    I encountered this problem when training with multi gpu after a few epochs. Sometimes the error is “Segmentation fault (core dumped)”.
  • Segmentation fault when using different GPUs - Isaac Gym
    Note that the sim_device parameter uses CUDA style device syntax, while the graphics_device parameter uses the vulkan device ID, which may not ...
