
How to stream speech enhancement?

See original GitHub issue

Hello.

I want to process the microphone input signal with the speech enhancement. For testing, I split the file and processed it as shown in the code below.

from espnet2.bin.enh_inference import SeparateSpeech
import numpy as np
import soundfile
import time

pth_yaml_dir = "./pth_yaml_origin/"

def main():
    separate_speech_frm = SeparateSpeech(
        pth_yaml_dir + "sep_trans.yaml",
        pth_yaml_dir + "coef.pth",
        normalize_output_wav=False,
    )
    audio, rate = soundfile.read("mic.wav")

    # frame-by-frame processing
    start_time = time.perf_counter()
    frame_size = 3072
    frame_num = int(len(audio) / frame_size)
    out_data = np.zeros(0)

    for i in range(frame_num):
        frame = audio[i * frame_size:(i + 1) * frame_size]
        separated_audio = separate_speech_frm(frame[np.newaxis, :])
        out_data = np.append(out_data, separated_audio[0].reshape(-1) * 0.5)

    execution_time = time.perf_counter() - start_time

    soundfile.write("separated_frm.wav", out_data, rate, format="WAV", subtype="PCM_16")
    print(f"execution_time:{execution_time:.5f}[sec]")

if __name__ == "__main__":
    main()

However, an odd noise occurred at every frame boundary (see the waveform and spectrogram below).

[waveform and spectrogram figure omitted]

Is there any way to make the speech enhance task work with streaming?

Best regards.

Issue Analytics

  • State: closed
  • Created: 10 months ago
  • Comments: 11

Top GitHub Comments

1 reaction
Emrys365 commented, Nov 17, 2022

If it is possible, could you please provide me with the reference code on ESPnet?

Sure, you could refer to https://github.com/espnet/espnet/blob/master/espnet2/bin/enh_inference.py#L275-L299.

1 reaction
Emrys365 commented, Nov 17, 2022

So let me ask a few questions.

  • Is my prediction above correct?
  • Where are the encoder and decoder codes?

I am not sure I understand what you meant by “the processing between FFT sizes”. But basically, the discontinuity comes from your manual splitting of the signal into chunks (with frame_size = 3072).

Because the pretrained model has its own (STFT) encoder, the input signal will be framed again by the internal sliding window of length 512 and hop 128. So your 3072-point input will be divided into 21 overlapped frames in the encoder.
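The frame count mentioned above can be checked directly: with a 512-point window and a hop of 128, a 3072-sample chunk yields 1 + (3072 − 512) / 128 = 21 frames.

```python
win_length = 512   # internal STFT window length of the encoder
hop_length = 128   # internal STFT hop size
chunk_len = 3072   # length of each manually split chunk

# standard framing formula: one frame, plus one more per full hop
num_frames = 1 + (chunk_len - win_length) // hop_length
print(num_frames)  # 21
```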

After the separator has processed them, the decoder reconstructs the 3072-point-long signal using the overlap-add approach. In this sense, each 3072-point signal is continuous after processing.

The problem here is that adjacent 3072-point signals do not overlap with each other (unlike the frames inside an STFT). This is likely to cause discontinuities (similar to the Gibbs phenomenon) at the boundaries between different 3072-point signals.

So I would suggest that you divide the signal into overlapped 3072-point chunks and manually merge the overlapped regions, as in the iSTFT, to improve the continuity.
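The suggestion above can be sketched as follows. `enhance_fn` is a hypothetical stand-in for a `SeparateSpeech`-style callable, and the Hann window with 50% overlap is an assumption for illustration, not a value prescribed by the model:

```python
import numpy as np

def enhance_streaming(audio, enhance_fn, frame_size=3072, hop_size=1536):
    """Process audio in overlapped chunks and merge them by overlap-add.

    enhance_fn takes a (1, frame_size) array and returns a list whose first
    element is the enhanced signal (mirroring the SeparateSpeech call).
    """
    window = np.hanning(frame_size)           # synthesis window for smooth joins
    out = np.zeros(len(audio) + frame_size)
    norm = np.zeros(len(audio) + frame_size)  # running window sum for normalization

    for start in range(0, len(audio) - frame_size + 1, hop_size):
        chunk = audio[start:start + frame_size]
        enhanced = enhance_fn(chunk[np.newaxis, :])[0].reshape(-1)
        out[start:start + frame_size] += enhanced * window
        norm[start:start + frame_size] += window

    # divide by the accumulated window sum, avoiding near-zero samples
    nonzero = norm > 1e-8
    out[nonzero] /= norm[nonzero]
    return out[:len(audio)]
```

With an identity `enhance_fn`, this pipeline reconstructs the input (away from the very edges), which is a quick sanity check before plugging in the real model.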


