ESPnet2-TTS development plan

TODO

Documentation
- README.md
- egs2/TEMPLATE/tts1/README.md
  - joint training
  - VITS training
- egs2/jvs/tts1/README.md
  - FastSpeech adaptation
  - VITS adaptation

New Features

Text embedding frontend (Hugging Face Transformers?)
Support from_pretrained function @kan-bayashi
Support GAN-based training @kan-bayashi #3436
Support speaker id input @kan-bayashi #3452 #3453 #3490
Joint training of text2mel and vocoder @kan-bayashi #3501 #3508
Support language id input @kan-bayashi #3489 #3490
Integrate the use of parallel_wavegan’s vocoder in inference @kan-bayashi #3513
Jointly trainable vocoders @kan-bayashi
- HiFi-GAN
- Parallel WaveGAN #3515
- MelGAN #3516
- StyleMelGAN #3517
- loss modules

Models

VITS @kan-bayashi #3436 #3437 #3438 #3439 #3448 #3449
AdaSpeech https://arxiv.org/abs/2103.00993
AdaSpeech2 https://arxiv.org/abs/2104.09715
DenoiSpeech
Translatotron2

Vocoders

Pretrained models of HiFi-GAN or StyleMelGAN @kan-bayashi
- LibriTTS
- VCTK
- CSMSC
- LJSpeech
- JSUT
HiFi-GAN @kan-bayashi
- Initial implementation https://github.com/kan-bayashi/ParallelWaveGAN/pull/273 https://github.com/kan-bayashi/ParallelWaveGAN/pull/275 https://github.com/kan-bayashi/ParallelWaveGAN/pull/276 https://github.com/kan-bayashi/ParallelWaveGAN/pull/277
- Tuning https://github.com/kan-bayashi/ParallelWaveGAN/issues/278
StyleMelGAN @kan-bayashi
- Initial implementation https://github.com/kan-bayashi/ParallelWaveGAN/pull/274
- Tuning https://github.com/kan-bayashi/ParallelWaveGAN/issues/282

Recipe

Tsukuyomi-chan Corpus (つくよみちゃんコーパス) @kan-bayashi #3552
CSS10 @kan-bayashi #3464
RUSLAN @kan-bayashi #3378 #3390
HUI-audio-corpus-german @kan-bayashi #3375 #3381 #3391
KKS dataset @kan-bayashi #3383 #3400
JTubeSpeech @Takaaki-Saeki #3459
J-MAC
J-KAC @TanUkkii007 #3468
JMD @takenori-y #3394
AISHELL-3 @ftshijt #3473
SynPaFlex-Corpus
The SIWIS French Speech Synthesis Database @takenori-y #3460
CMU INDIC @peter-yh-wu #3401
Hi-Fi TTS
THCHS30 @ftshijt #3473
DiDiSpeech
IndicSpeech @peter-yh-wu #3435

Functions

Multi-lingual G2P
Korean G2P @kan-bayashi #3383
Russian G2P @kan-bayashi #3377
German G2P @kan-bayashi #3371
Spanish G2P @kan-bayashi #3373
French G2P @kan-bayashi #3372
Greek G2P @kan-bayashi #3463
Finnish G2P @kan-bayashi #3463
Hungarian G2P @kan-bayashi #3463
Dutch G2P @kan-bayashi #3463
Enhanced Japanese G2P @kan-bayashi #3558 #3561
Silence trimming at the beginning and the end of audio @kan-bayashi #3380
Silence trimming at the middle of audio
Conversion of MFA alignment to durations file
Audio quality checker for filtering
Transcription quality checker for filtering
Evaluation stage
- ASR eval @kan-bayashi #3569
- ~~MOSnet eval~~
- MCD eval
- ~~FDSD eval~~
Quantized decoding
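
For the "Conversion of MFA alignment to durations file" item, a minimal sketch of the core step could look like the following. It assumes the alignment has already been parsed into contiguous `(start_sec, end_sec, label)` tuples; the function name `intervals_to_durations` and its default parameters are hypothetical and are not part of ESPnet or MFA.

```python
from typing import List, Tuple


def intervals_to_durations(
    intervals: List[Tuple[float, float, str]],
    sample_rate: int = 22050,
    hop_length: int = 256,
) -> List[int]:
    """Convert phone intervals in seconds to per-phone frame counts.

    Hypothetical helper: each boundary is rounded to the nearest frame
    index, and a phone's duration is the difference between its end and
    start frame indices. For contiguous intervals the durations therefore
    sum to the frame index of the final boundary.
    """
    frames_per_second = sample_rate / hop_length
    durations = []
    for start, end, _label in intervals:
        start_frame = int(round(start * frames_per_second))
        end_frame = int(round(end * frames_per_second))
        # Guard against malformed intervals where end < start.
        durations.append(max(end_frame - start_frame, 0))
    return durations


if __name__ == "__main__":
    # Toy alignment: 16 kHz audio with a 10 ms hop (100 frames per second).
    alignment = [(0.0, 0.1, "HH"), (0.1, 0.25, "AH"), (0.25, 0.5, "sil")]
    print(intervals_to_durations(alignment, sample_rate=16000, hop_length=160))
```

Rounding boundaries rather than individual durations keeps the total frame count consistent with the utterance length, which matters when the durations are used as targets for a duration predictor.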

Minor functions

Overwrite the decoding params @kan-bayashi
Fix the seed in the inference @kan-bayashi
TTS inference interface modification @kan-bayashi
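
As an illustration of the "Fix the seed in the inference" item, one possible sketch is shown below. The helper name `set_inference_seed` is hypothetical and is not ESPnet's actual interface; it seeds Python's `random` module and, only if they happen to be installed, NumPy and PyTorch.

```python
import random


def set_inference_seed(seed: int) -> None:
    """Seed the available random number generators (hypothetical helper)."""
    random.seed(seed)
    try:
        import numpy as np

        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is optional in this sketch.
    try:
        import torch

        torch.manual_seed(seed)
    except ImportError:
        pass  # PyTorch is optional in this sketch.


if __name__ == "__main__":
    set_inference_seed(777)
    first = random.random()
    set_inference_seed(777)
    second = random.random()
    print(first == second)
```

Calling the helper before each synthesis run should make stochastic components, such as sampling in VITS-style models, repeatable on the same hardware and software stack.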

Any suggestions are welcome.

Top GitHub Comments

sw005320 commented, Aug 2, 2021 (1 reaction)

Quantization?

Freddy-pp commented, Aug 2, 2021 (1 reaction)

Any plans for Fre-GAN? https://arxiv.org/abs/2106.02297 https://github.com/rishikksh20/Fre-GAN-pytorch
