Textless NLP / GSLM: Speech resynthesis produces something unrelated to source speech
What is your question?
As far as I understand, examples/textless_nlp/gslm/tools/resynthesize_speech.py
should take a speech sample (audio), encode it to units, and generate output speech from these units. The output speech should resemble the input sample.
However, when I do this with the released pre-trained models, the output is gibberish that doesn't sound like the input at all.
I attach the samples and the steps I took. Is there anything I'm doing wrong?
Thank you!
Code
- Download pre-trained models (HuBERT-km200 in this example):
mkdir -p /content/speech/hubert200
cd /content/speech/hubert200
wget https://dl.fbaipublicfiles.com/hubert/hubert_base_ls960.pt -nc
wget https://dl.fbaipublicfiles.com/textless_nlp/gslm/hubert/km200/km.bin -nc
wget https://dl.fbaipublicfiles.com/textless_nlp/gslm/hubert/tts_km200/tts_checkpoint_best.pt -nc
wget https://dl.fbaipublicfiles.com/textless_nlp/gslm/waveglow_256channels_new.pt -nc
- Generate the code_dict.txt file. I didn't find an "official" description of how to do it, so I used this comment. Note that if I use a dict of size 199 or 200, the models will fail (a quick codebook-size check is sketched after this list):
with open("code_dict.txt", "wt") as f:
for i in range(1, 199): # Effectively 198 items
f.write(str(i) + "\n")
- Download and convert source audio sample from the speech resynthesis example site (a sample-rate check is sketched after this list):
wget https://speechbot.github.io/resynthesis/audio/teaser/p269_182.mp3 -nc
ffmpeg -y -i p269_182.mp3 sample.input.wav
- Run resynthesis:
export FAIRSEQ_ROOT=/home/ubuntu/fairseq
export DATA=/content/speech/hubert200
export TYPE=hubert
echo sample.input.wav > input.txt
echo sample.out.layer5.wav >> input.txt
PYTHONPATH=${FAIRSEQ_ROOT}:${FAIRSEQ_ROOT}/examples/textless_nlp/gslm/unit2speech python ${FAIRSEQ_ROOT}/examples/textless_nlp/gslm/tools/resynthesize_speech.py \
--feature_type $TYPE \
--layer 5 \
--acoustic_model_path $DATA/hubert_base_ls960.pt \
--kmeans_model_path $DATA/km.bin \
--tts_model_path $DATA/tts_checkpoint_best.pt \
--code_dict_path $DATA/code_dict.txt \
--waveglow_path $DATA/waveglow_256channels_new.pt \
--max_decoder_steps 1000 < input.txt
- Check the result (in the attachment). It doesn't sound like the original audio at all.
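Regarding the code_dict.txt step above: a quick way to confirm how many clusters the released km.bin actually contains (and therefore which unit ids the dictionary needs to cover) is to load it directly. A minimal sketch, assuming km.bin is a joblib-pickled scikit-learn k-means model, as in the GSLM speech2unit clustering code:

import joblib

# Assumption: km.bin is a joblib-pickled scikit-learn (MiniBatch)KMeans model,
# as used by the GSLM speech2unit clustering scripts.
km = joblib.load("km.bin")
print("n_clusters:", km.n_clusters)                  # expect 200 for the km200 model
print("codebook shape:", km.cluster_centers_.shape)  # (n_clusters, feature_dim)

With 200 clusters the predicted unit ids run from 0 to 199, which is where the questions about the right code_dict.txt size and offset come from.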
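Also, on the audio conversion step above: a plain ffmpeg mp3-to-wav conversion keeps the source sample rate, while the hubert_base_ls960.pt checkpoint expects 16 kHz mono input. A small check on the converted file (using the soundfile package here, purely as an example) can rule this out:

import soundfile as sf

# Check that the converted input is 16 kHz mono, which is what the
# HuBERT checkpoint (hubert_base_ls960.pt) was trained on.
info = sf.info("sample.input.wav")
print(info.samplerate, info.channels)  # ideally 16000 and 1; resample/downmix otherwise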
What have you tried?
I tried running resynthesis with different numbers of units, different HuBERT layers for the features, different audio, and different offsets for code_dict.txt.
In addition to the steps outlined above, I tried generating speech with unit2speech
directly from units in the dev set. It still produces gibberish. This makes me think that the problem may lie in a bad pre-trained TTS checkpoint.
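For reference, the speech2unit half of the pipeline essentially assigns each feature frame to its nearest k-means cluster, so the intermediate unit sequence can be inspected independently of the TTS stage. A minimal sketch, assuming the per-frame HuBERT features for the utterance have already been dumped to a hypothetical sample_features.npy (shape: frames x feature_dim):

import joblib
import numpy as np

# Minimal sketch of the speech2unit step.
# `sample_features.npy` is a hypothetical dump of per-frame HuBERT features
# for the input utterance; extracting those features is not shown here.
km = joblib.load("km.bin")
feats = np.load("sample_features.npy")
units = km.predict(feats)                    # one cluster id per frame
print(" ".join(str(int(u)) for u in units))  # the unit string fed to the TTS stage

Comparing such a unit string against the released devset units would help confirm whether the encoding side behaves as expected, independently of the TTS checkpoint.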
What’s your environment?
- fairseq Version (e.g., 1.0 or main): main
- PyTorch Version (e.g., 1.0): 1.9.1
- OS (e.g., Linux): Ubuntu 18.04
- How you installed fairseq (pip, source): source
- Build command you used (if compiling from source): pip install -e .
- Python version: 3.7.0
- CUDA/cuDNN version: cuda_11.1.TC455_06.29190527_0
- GPU models and configuration: Tesla V100-SXM2
- Any other relevant information:
samples.zip contains the generated samples - both audio and units.
Top GitHub Comments
@eugene-kharitonov, the updated checkpoint df4a9c6f works great! Having spent over a week in futile attempts to reproduce the results, this newly generated sample sounds like heavenly music to my ears – just can't stop listening to it! 😃
Now, all the other checkpoints I tried (hubert200, hubert500, logmel-100) had the same problem of generating gibberish. Could you please double-check whether those files (likely all the other TTS checkpoints) should be re-released as well?
Thanks a lot for an impressive piece of work!
@asivokon I've updated the TTS checkpoints + provided code_dict files, and manually verified that a few of the checkpoints work. @bradgrimm Unfortunately, it seems we don't have a good Hubert500 model. As those were not used in the paper, we decided not to support the case of 500-unit models. Sorry about the confusion.
Thanks for your help!