
Espnet2 train VITS fails on multi-GPU environment (8xA100 and 8xV100)


Describe the bug

The training fails partway through the first epoch. I have checked the datasets and they appear to be clean.

Basic environments:

  • OS information: Ubuntu 20.04.3 LTS
  • python version: Python 3.8.10
  • espnet version: 0.10.7a1
  • Git hash cb8181a99bb59f8444b59a36c38878f95570faaf
    • Commit date Tue Mar 8 15:58:27 2022 -0500
  • pytorch version: pytorch 1.10.1+cu113

You can obtain this information with the following commands:

cd <espnet-root>/tools
. ./activate_python.sh

echo "- OS information: `uname -mrsv`"
python3 << EOF
import sys, espnet, torch
pyversion = sys.version.replace('\n', ' ')
print(f"""- python version: \`{pyversion}\`
- espnet version: \`espnet {espnet.__version__}\`
- pytorch version: \`pytorch {torch.__version__}\`""")
EOF
cat << EOF
- Git hash: \`$(git rev-parse HEAD)\`
  - Commit date: \`$(git log -1 --format='%cd')\`
EOF

Environments from torch.utils.collect_env:

Collecting environment information...
PyTorch version: 1.10.1+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04) 9.4.0
Clang version: Could not collect
CMake version: version 3.22.2
Libc version: glibc-2.31

Python version: 3.8.10 (default, Nov 26 2021, 20:14:08)  [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-5.13.0-1021-oracle-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.6.112
GPU models and configuration:
GPU 0: Tesla V100-SXM2-16GB
GPU 1: Tesla V100-SXM2-16GB
GPU 2: Tesla V100-SXM2-16GB
GPU 3: Tesla V100-SXM2-16GB
GPU 4: Tesla V100-SXM2-16GB
GPU 5: Tesla V100-SXM2-16GB
GPU 6: Tesla V100-SXM2-16GB
GPU 7: Tesla V100-SXM2-16GB

Nvidia driver version: 510.47.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.21.3
[pip3] pytorch-ranger==0.1.1
[pip3] pytorch-wpe==0.0.1
[pip3] torch==1.10.1+cu113
[pip3] torch-complex==0.4.3
[pip3] torch-optimizer==0.3.0
[pip3] torchaudio==0.10.1+cu113
[conda] Could not collect

You can obtain this output with the following commands:

cd <espnet-root>/tools
. ./activate_python.sh
python3 -m torch.utils.collect_env

Task information:

  • Task: TTS
  • Recipe: LJSpeech
  • ESPnet2

To Reproduce

Steps to reproduce the behavior:

  1. Move to a recipe directory, e.g., cd egs/librispeech/asr1
  2. Execute run.sh with specific arguments, e.g., run.sh --stage 3 --ngpu 1
  3. Specify the error log, e.g., exp/xxx/yyy.log

Error logs

Attaching the failed training log: failed_train.log

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 10 (5 by maintainers)

Top GitHub Comments

kan-bayashi commented on Mar 15, 2022 (1 reaction)

Great. Thank you for your patient investigation.

Does find_unused_parameters=True slow down training significantly? Is it ok to disable find_unused_parameters?

I forgot the reason for setting find_unused_parameters=True. Maybe I encountered an error and set it to True, but it is worthwhile to try disabling it.
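The tradeoff discussed here can be illustrated with a minimal DistributedDataParallel sketch. The toy model, single-rank CPU setup, and gloo backend below are illustrative assumptions for the sake of a self-contained example, not ESPnet code:

```python
# Minimal sketch of the find_unused_parameters flag discussed above.
# Uses a single-rank CPU "world" with the gloo backend so it runs anywhere.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


class Branchy(torch.nn.Module):
    """A model with a branch that forward() never calls."""

    def __init__(self):
        super().__init__()
        self.used = torch.nn.Linear(4, 4)
        self.unused = torch.nn.Linear(4, 4)  # never used below

    def forward(self, x):
        return self.used(x)


def run_sketch():
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29501")
    dist.init_process_group(backend="gloo", rank=0, world_size=1)

    # With find_unused_parameters=False (the default), DDP expects a
    # gradient for every parameter, so the unused branch makes backward
    # fail. With True, DDP traverses the autograd graph each iteration
    # to detect and skip parameters that received no gradient.
    model = DDP(Branchy(), find_unused_parameters=True)
    loss = model(torch.randn(2, 4)).sum()
    loss.backward()

    grad_present = model.module.used.weight.grad is not None
    dist.destroy_process_group()
    return grad_present


ok = run_sketch()
print("used-branch grad present:", ok)
```

The per-iteration graph traversal is the overhead kan-bayashi is asking about: disabling the flag is faster, but it raises an error if any parameter is skipped during the forward pass, which may be why it was enabled for VITS in the first place.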

kan-bayashi commented on Mar 15, 2022 (1 reaction)

@prajwaljpj Please clarify whether the ljspeech recipe itself works when using VITS. I want to identify whether the error comes from the data or the environment.
