How to stream speech enhancement?
Hello.
I want to process the microphone input signal with speech enhancement. For testing, I split the file into fixed-size frames and processed them as shown in the code below.
from espnet2.bin.enh_inference import *
import soundfile
import numpy as np
import time

pth_yaml_dir = "./pth_yaml_origin/"

def main():
    separate_speech_frm = SeparateSpeech(
        pth_yaml_dir + "/sep_trans.yaml",
        pth_yaml_dir + "coef.pth",
        normalize_output_wav=False,
    )
    audio, rate = soundfile.read("mic.wav")

    # frame-by-frame processing
    start_time = time.perf_counter()
    frame_size = 3072
    frame_num = int(len(audio) / frame_size)
    out_data = np.zeros(0)
    for i in range(frame_num):
        data = audio[i * frame_size:(i + 1) * frame_size]
        input = data[np.newaxis, :]
        separated_audio = separate_speech_frm(input)
        out_data = np.append(out_data, separated_audio[0].reshape(-1) * 0.5)
    execution_time = time.perf_counter() - start_time

    soundfile.write("separated_frm.wav", out_data, rate, format="WAV", subtype="PCM_16")
    print(f"execution_time:{execution_time:.5f}[sec]")

if __name__ == "__main__":
    main()
However, an odd noise occurred at every frame boundary (see the waveform and spectrogram below).
Is there any way to make the speech enhancement task work with streaming?
Best regards.
Issue Analytics
- State:
- Created: 10 months ago
- Comments: 11
Top GitHub Comments
Sure, you could refer to https://github.com/espnet/espnet/blob/master/espnet2/bin/enh_inference.py#L275-L299.
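For what it's worth, the segment-wise path in that file can also be enabled directly when constructing SeparateSpeech, so the chunking and the merging of overlapping segments happen inside the wrapper. A minimal sketch, assuming the segment_size and hop_size arguments are accepted in seconds as in the linked code (please check the signature in your installed version):

from espnet2.bin.enh_inference import SeparateSpeech
import soundfile

# Assumption: segment_size / hop_size are given in seconds and enable the
# internal segment-wise (overlap-add) inference referenced in the link above.
separate_speech = SeparateSpeech(
    "./pth_yaml_origin/sep_trans.yaml",
    "./pth_yaml_origin/coef.pth",
    segment_size=2.4,   # process 2.4 s segments
    hop_size=0.8,       # 0.8 s hop, i.e. overlapping segments
    normalize_output_wav=False,
)

audio, rate = soundfile.read("mic.wav")
enhanced = separate_speech(audio[None, :], fs=rate)[0].reshape(-1)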
I am not sure I understand what you meant by “the processing between FFT sizes”, but basically the discontinuity comes from your manual splitting into chunks (with frame_size=3072).
Because the pretrained model has its own (STFT) encoder, the input signal will be framed again by the internal sliding window of length 512 with hop 128. So your 3072-point input will be divided into 21 overlapping frames inside the encoder.
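As a quick sanity check on those numbers (plain arithmetic, not tied to any particular espnet API):

# frames produced by a sliding window of length 512 with hop 128
win, hop, chunk = 512, 128, 3072
n_frames = (chunk - win) // hop + 1
print(n_frames)  # -> 21 overlapping frames per 3072-point chunk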
After these frames are processed by the separator, the decoder reconstructs the 3072-point signal using the overlap-add approach. In this sense, each 3072-point signal is continuous after processing.
The problem here is that adjacent 3072-point signals do not overlap at all (unlike the frames within the STFT). This is likely to cause discontinuities between consecutive 3072-point signals, similar to the Gibbs phenomenon.
So I would suggest dividing the signal into overlapping 3072-point chunks and manually merging the overlapping regions, as in the iSTFT, to improve continuity.
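A minimal sketch of that idea, assuming 50%-overlapping 3072-point chunks merged by Hann-windowed overlap-add (the chunk and hop sizes, and the helper name, are illustrative choices, not part of ESPnet):

import numpy as np

def enhance_streaming(audio, separate, chunk=3072, hop=1536):
    """Enhance overlapping chunks and merge them by windowed overlap-add."""
    window = np.hanning(chunk)           # synthesis window for the merge
    out = np.zeros(len(audio))
    norm = np.zeros(len(audio))          # running sum of the window for normalization
    for start in range(0, len(audio) - chunk + 1, hop):
        seg = audio[start:start + chunk]
        enhanced = separate(seg[np.newaxis, :])[0].reshape(-1)
        out[start:start + chunk] += enhanced * window
        norm[start:start + chunk] += window
    return out / np.maximum(norm, 1e-8)  # avoid division by zero at the edges

# usage with the SeparateSpeech instance from the question:
# out_data = enhance_streaming(audio, separate_speech_frm)

With hop equal to half the chunk length the Hann windows sum to a roughly constant value, and the explicit normalization handles the chunk edges, so adjacent chunks blend smoothly instead of butting against each other.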