
Espnet2 train VITS fails on multi-GPU environment (8xA100 and 8xV100)


Describe the bug

The training fails partway through the first epoch. I have checked the datasets and they appear to be clean.

Basic environments:

  • OS information: Ubuntu 20.04.3 LTS
  • python version: Python 3.8.10
  • espnet version: 0.10.7a1
  • Git hash cb8181a99bb59f8444b59a36c38878f95570faaf
    • Commit date Tue Mar 8 15:58:27 2022 -0500
  • pytorch version: pytorch 1.10.1+cu113

You can obtain this information with the following commands:

cd <espnet-root>/tools
. ./activate_python.sh

echo "- OS information: `uname -mrsv`"
python3 << EOF
import sys, espnet, torch
pyversion = sys.version.replace('\n', ' ')
print(f"""- python version: \`{pyversion}\`
- espnet version: \`espnet {espnet.__version__}\`
- pytorch version: \`pytorch {torch.__version__}\`""")
EOF
cat << EOF
- Git hash: \`$(git rev-parse HEAD)\`
  - Commit date: \`$(git log -1 --format='%cd')\`
EOF

Environments from torch.utils.collect_env:

Collecting environment information...
PyTorch version: 1.10.1+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04) 9.4.0
Clang version: Could not collect
CMake version: version 3.22.2
Libc version: glibc-2.31

Python version: 3.8.10 (default, Nov 26 2021, 20:14:08)  [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-5.13.0-1021-oracle-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.6.112
GPU models and configuration:
GPU 0: Tesla V100-SXM2-16GB
GPU 1: Tesla V100-SXM2-16GB
GPU 2: Tesla V100-SXM2-16GB
GPU 3: Tesla V100-SXM2-16GB
GPU 4: Tesla V100-SXM2-16GB
GPU 5: Tesla V100-SXM2-16GB
GPU 6: Tesla V100-SXM2-16GB
GPU 7: Tesla V100-SXM2-16GB

Nvidia driver version: 510.47.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.21.3
[pip3] pytorch-ranger==0.1.1
[pip3] pytorch-wpe==0.0.1
[pip3] torch==1.10.1+cu113
[pip3] torch-complex==0.4.3
[pip3] torch-optimizer==0.3.0
[pip3] torchaudio==0.10.1+cu113
[conda] Could not collect

You can obtain this output with the following commands:

cd <espnet-root>/tools
. ./activate_python.sh
python3 -m torch.utils.collect_env

Task information:

  • Task: TTS
  • Recipe: LJSpeech
  • ESPnet2

To Reproduce

Steps to reproduce the behavior:

  1. Move to a recipe directory, e.g., cd egs/librispeech/asr1
  2. Execute run.sh with specific arguments, e.g., run.sh --stage 3 --ngpu 1
  3. Specify the error log, e.g., exp/xxx/yyy.log

Error logs

Attaching the failed training log: failed_train.log

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 10 (5 by maintainers)

Top GitHub Comments

kan-bayashi commented on Mar 15, 2022 (1 reaction)

Great. Thank you for your patient investigation.

Does find_unused_parameters=True slow down training significantly? Is it ok to disable find_unused_parameters?

I forgot the reason for setting find_unused_parameters=True. Maybe I encountered an error and set it to True, but it is worthwhile to try disabling it.
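The tradeoff discussed here can be illustrated with a minimal DistributedDataParallel sketch. The toy model, single-rank CPU setup, and gloo backend below are illustrative assumptions for the sake of a self-contained example, not ESPnet code:

```python
# Minimal sketch of the find_unused_parameters flag discussed above.
# Uses a single-rank CPU "world" with the gloo backend so it runs anywhere.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


class Branchy(torch.nn.Module):
    """A model with a branch that forward() never calls."""

    def __init__(self):
        super().__init__()
        self.used = torch.nn.Linear(4, 4)
        self.unused = torch.nn.Linear(4, 4)  # never used below

    def forward(self, x):
        return self.used(x)


def run_sketch():
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29501")
    dist.init_process_group(backend="gloo", rank=0, world_size=1)

    # With find_unused_parameters=False (the default), DDP expects a
    # gradient for every parameter, so the unused branch makes backward
    # fail. With True, DDP traverses the autograd graph each iteration
    # to detect and skip parameters that received no gradient.
    model = DDP(Branchy(), find_unused_parameters=True)
    loss = model(torch.randn(2, 4)).sum()
    loss.backward()

    grad_present = model.module.used.weight.grad is not None
    dist.destroy_process_group()
    return grad_present


ok = run_sketch()
print("used-branch grad present:", ok)
```

The per-iteration graph traversal is the overhead kan-bayashi is asking about: disabling the flag is faster, but it raises an error if any parameter is skipped during the forward pass, which may be why it was enabled for VITS in the first place.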

kan-bayashi commented on Mar 15, 2022 (1 reaction)

@prajwaljpj Please clarify whether the ljspeech recipe itself works when using VITS. I want to identify whether the error comes from the data or the environment.
