Textless NLP / GSLM: Speech resynthesis produces something unrelated to source speech
What is your question?
As far as I understand, examples/textless_nlp/gslm/tools/resynthesize_speech.py
should take a speech sample (audio), encode it to units, and generate output speech from these units. The output speech should resemble the input sample.
However, when I do this with the released pre-trained models, the output is gibberish that doesn't sound like the input at all.
I attach the samples and the steps I took. Is there anything I'm doing wrong?
Thank you!
Code
- Download pre-trained models (HuBERT-km200 in this example):
mkdir -p /content/speech/hubert200
cd /content/speech/hubert200
wget https://dl.fbaipublicfiles.com/hubert/hubert_base_ls960.pt -nc
wget https://dl.fbaipublicfiles.com/textless_nlp/gslm/hubert/km200/km.bin -nc
wget https://dl.fbaipublicfiles.com/textless_nlp/gslm/hubert/tts_km200/tts_checkpoint_best.pt -nc
wget https://dl.fbaipublicfiles.com/textless_nlp/gslm/waveglow_256channels_new.pt -nc
- Generate the code_dict.txt file. I didn't find an "official" description of how to do it, so I used this comment. Note that if I use a dict of size 199 or 200, the models will fail (a quick codebook-size check is sketched after this list):
with open("code_dict.txt", "wt") as f:
for i in range(1, 199): # Effectively 198 items
f.write(str(i) + "\n")
- Download and convert source audio sample from the speech resynthesis example site (a sample-rate check is sketched after this list):
wget https://speechbot.github.io/resynthesis/audio/teaser/p269_182.mp3 -nc
ffmpeg -y -i p269_182.mp3 sample.input.wav
- Run resynthesis:
export FAIRSEQ_ROOT=/home/ubuntu/fairseq
export DATA=/content/speech/hubert200
export TYPE=hubert
echo sample.input.wav > input.txt
echo sample.out.layer5.wav >> input.txt
PYTHONPATH=${FAIRSEQ_ROOT}:${FAIRSEQ_ROOT}/examples/textless_nlp/gslm/unit2speech python ${FAIRSEQ_ROOT}/examples/textless_nlp/gslm/tools/resynthesize_speech.py \
--feature_type $TYPE \
--layer 5 \
--acoustic_model_path $DATA/hubert_base_ls960.pt \
--kmeans_model_path $DATA/km.bin \
--tts_model_path $DATA/tts_checkpoint_best.pt \
--code_dict_path $DATA/code_dict.txt \
--waveglow_path $DATA/waveglow_256channels_new.pt \
--max_decoder_steps 1000 < input.txt
- Check the result (in the attachment). It doesn't sound like the original audio at all.
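Regarding the code_dict.txt step above: a quick way to confirm how many clusters the released km.bin actually contains (and therefore which unit ids the dictionary needs to cover) is to load it directly. A minimal sketch, assuming km.bin is a joblib-pickled scikit-learn k-means model, as in the GSLM speech2unit clustering code:

import joblib

# Assumption: km.bin is a joblib-pickled scikit-learn (MiniBatch)KMeans model,
# as used by the GSLM speech2unit clustering scripts.
km = joblib.load("km.bin")
print("n_clusters:", km.n_clusters)                  # expect 200 for the km200 model
print("codebook shape:", km.cluster_centers_.shape)  # (n_clusters, feature_dim)

With 200 clusters the predicted unit ids run from 0 to 199, which is where the questions about the right code_dict.txt size and offset come from.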
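Also, on the audio conversion step above: a plain ffmpeg mp3-to-wav conversion keeps the source sample rate, while the hubert_base_ls960.pt checkpoint expects 16 kHz mono input. A small check on the converted file (using the soundfile package here, purely as an example) can rule this out:

import soundfile as sf

# Check that the converted input is 16 kHz mono, which is what the
# HuBERT checkpoint (hubert_base_ls960.pt) was trained on.
info = sf.info("sample.input.wav")
print(info.samplerate, info.channels)  # ideally 16000 and 1; resample/downmix otherwise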
What have you tried?
I tried running resynthesis with different numbers of units, different HuBERT layers for the features, different audio, and different offsets for code_dict.txt.
In addition to the steps outlined above, I tried generating speech with unit2speech
directly from units in the dev set. It still produces gibberish. This makes me think that the problem may lie in a bad pre-trained TTS checkpoint.
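For reference, the speech2unit half of the pipeline essentially assigns each feature frame to its nearest k-means cluster, so the intermediate unit sequence can be inspected independently of the TTS stage. A minimal sketch, assuming the per-frame HuBERT features for the utterance have already been dumped to a hypothetical sample_features.npy (shape: frames x feature_dim):

import joblib
import numpy as np

# Minimal sketch of the speech2unit step.
# `sample_features.npy` is a hypothetical dump of per-frame HuBERT features
# for the input utterance; extracting those features is not shown here.
km = joblib.load("km.bin")
feats = np.load("sample_features.npy")
units = km.predict(feats)                    # one cluster id per frame
print(" ".join(str(int(u)) for u in units))  # the unit string fed to the TTS stage

Comparing such a unit string against the released devset units would help confirm whether the encoding side behaves as expected, independently of the TTS checkpoint.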
What’s your environment?
- fairseq Version (e.g., 1.0 or main): main
- PyTorch Version (e.g., 1.0): 1.9.1
- OS (e.g., Linux): Ubuntu 18.04
- How you installed fairseq (pip, source): source
- Build command you used (if compiling from source): pip install -e .
- Python version: 3.7.0
- CUDA/cuDNN version: cuda_11.1.TC455_06.29190527_0
- GPU models and configuration: Tesla V100-SXM2
- Any other relevant information:
samples.zip contains the generated samples - both audio and units.
Top GitHub Comments
@eugene-kharitonov, the updated checkpoint df4a9c6f works great! Having spent over a week in futile attempts to reproduce the results, this newly generated sample sounds like heavenly music to my ears – just can't stop listening to it! 😃
Now, all the other checkpoints I tried (hubert200, hubert500, logmel-100) had the same problem of generating gibberish. Could you please double-check whether those files (likely all the other TTS checkpoints) should be re-released as well?
Thanks a lot for an impressive piece of work!
@asivokon I've updated the TTS checkpoints + provided code_dict files, and manually verified that a few of the checkpoints work. @bradgrimm Unfortunately, it seems we don't have a good Hubert500 model. As those were not used in the paper, we decided not to support the case of 500-unit models. Sorry about the confusion.
Thanks for your help!