Espnet2 train VITS fails on multi-GPU environment (8xA100 and 8xV100)
Describe the bug
The training fails partway through the first epoch. I have checked the dataset and it seems to be clean.
Basic environments:
- OS information: Ubuntu 20.04.3 LTS
- python version: Python 3.8.10
- espnet version: 0.10.7a1
- Git hash: cb8181a99bb59f8444b59a36c38878f95570faaf
- Commit date: Tue Mar 8 15:58:27 2022 -0500
- pytorch version: pytorch 1.10.1+cu113
You can obtain them by running the following commands:
cd <espnet-root>/tools
. ./activate_python.sh
echo "- OS information: `uname -mrsv`"
python3 << EOF
import sys, espnet, torch
pyversion = sys.version.replace('\n', ' ')
print(f"""- python version: \`{pyversion}\`
- espnet version: \`espnet {espnet.__version__}\`
- pytorch version: \`pytorch {torch.__version__}\`""")
EOF
cat << EOF
- Git hash: \`$(git rev-parse HEAD)\`
- Commit date: \`$(git log -1 --format='%cd')\`
EOF
Environments from torch.utils.collect_env:
Collecting environment information...
PyTorch version: 1.10.1+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04) 9.4.0
Clang version: Could not collect
CMake version: version 3.22.2
Libc version: glibc-2.31
Python version: 3.8.10 (default, Nov 26 2021, 20:14:08) [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-5.13.0-1021-oracle-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.6.112
GPU models and configuration:
GPU 0: Tesla V100-SXM2-16GB
GPU 1: Tesla V100-SXM2-16GB
GPU 2: Tesla V100-SXM2-16GB
GPU 3: Tesla V100-SXM2-16GB
GPU 4: Tesla V100-SXM2-16GB
GPU 5: Tesla V100-SXM2-16GB
GPU 6: Tesla V100-SXM2-16GB
GPU 7: Tesla V100-SXM2-16GB
Nvidia driver version: 510.47.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.21.3
[pip3] pytorch-ranger==0.1.1
[pip3] pytorch-wpe==0.0.1
[pip3] torch==1.10.1+cu113
[pip3] torch-complex==0.4.3
[pip3] torch-optimizer==0.3.0
[pip3] torchaudio==0.10.1+cu113
[conda] Could not collect
You can obtain them by running the following command:
cd <espnet-root>/tools
. ./activate_python.sh
python3 -m torch.utils.collect_env
Task information:
- Task: TTS
- Recipe: LJSpeech
- ESPnet2
To Reproduce
Steps to reproduce the behavior:
- move to a recipe directory, e.g.,
cd egs/librispeech/asr1
- execute run.sh with specific arguments, e.g.,
run.sh --stage 3 --ngpu 1
(a concrete multi-GPU sketch for this report follows the list)
- specify the error log, e.g.,
exp/xxx/yyy.log
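Since this report concerns the LJSpeech TTS recipe rather than the Librispeech ASR example above, here is a minimal reproduction sketch. It assumes the standard egs2/ljspeech/tts1 recipe layout, that stage 6 is the TTS training stage, and that the VITS config is conf/tuning/train_vits.yaml; all three are assumptions that may differ in your checkout.
# Assumed recipe path, stage number, and config name; adjust to your checkout
cd egs2/ljspeech/tts1
./run.sh --stage 6 --stop_stage 6 --ngpu 8 \
    --train_config conf/tuning/train_vits.yaml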
Error logs
Attaching the failed training log: failed_train.log
Issue Analytics
- State:
- Created 2 years ago
- Comments: 10 (5 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Great. Thank you for your patient investigation.
I forgot the reason for find_unused_parameters=True. Maybe I encountered the error and set it to True, but it is worthwhile to try.
@prajwaljpj Please clarify whether the ljspeech recipe itself works when using VITS. I want to identify whether the error comes from the data or from the environment.
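For reference, find_unused_parameters is the option forwarded to torch.nn.parallel.DistributedDataParallel. A hedged sketch of enabling it from the recipe, assuming the ESPnet2 trainer exposes it as --unused_parameters and that tts.sh forwards extra trainer flags through --train_args (verify both option names against your ESPnet version):
# Assumed option names (--train_args, --unused_parameters); verify before use
./run.sh --stage 6 --ngpu 8 \
    --train_config conf/tuning/train_vits.yaml \
    --train_args "--unused_parameters true"
If those options differ in your version, the equivalent key can usually be set directly in the training config YAML instead.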