How to stream speech enhancement?
Hello.
I want to process the microphone input signal with speech enhancement. For testing, I split the file into fixed-size frames and processed them as shown in the code below.
from espnet2.bin.enh_inference import *
import soundfile
import numpy as np
import time

pth_yaml_dir = "./pth_yaml_origin/"

def main():
    separate_speech_frm = SeparateSpeech(
        pth_yaml_dir + "/sep_trans.yaml",
        pth_yaml_dir + "coef.pth",
        normalize_output_wav=False,
    )
    audio, rate = soundfile.read("mic.wav")

    # frame-by-frame processing
    start_time = time.perf_counter()
    frame_size = 3072
    frame_num = int(len(audio) / frame_size)
    out_data = np.zeros(0)
    for i in range(frame_num):
        data = audio[i * frame_size:(i + 1) * frame_size]
        input = data[np.newaxis, :]
        separated_audio = separate_speech_frm(input)
        out_data = np.append(out_data, separated_audio[0].reshape(-1) * 0.5)
    execution_time = time.perf_counter() - start_time

    soundfile.write("separated_frm.wav", out_data, rate, format="WAV", subtype="PCM_16")
    print(f"execution_time:{execution_time:.5f}[sec]")

if __name__ == "__main__":
    main()
However, an odd noise occurred at every frame boundary (see the waveform and spectrogram below).
Is there any way to make the speech enhancement task work with streaming?
Best regards.
Issue Analytics
- State:
- Created: 10 months ago
- Comments: 11
Top GitHub Comments
Sure, you could refer to https://github.com/espnet/espnet/blob/master/espnet2/bin/enh_inference.py#L275-L299.
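For what it's worth, the segment-wise path in that file can also be enabled directly when constructing SeparateSpeech, so the chunking and the merging of overlapping segments happen inside the wrapper. A minimal sketch, assuming the segment_size and hop_size arguments are accepted in seconds as in the linked code (please check the signature in your installed version):

from espnet2.bin.enh_inference import SeparateSpeech
import soundfile

# Assumption: segment_size / hop_size are given in seconds and enable the
# internal segment-wise (overlap-add) inference referenced in the link above.
separate_speech = SeparateSpeech(
    "./pth_yaml_origin/sep_trans.yaml",
    "./pth_yaml_origin/coef.pth",
    segment_size=2.4,   # process 2.4 s segments
    hop_size=0.8,       # 0.8 s hop, i.e. overlapping segments
    normalize_output_wav=False,
)

audio, rate = soundfile.read("mic.wav")
enhanced = separate_speech(audio[None, :], fs=rate)[0].reshape(-1)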
I am not sure I understand what you meant by “the processing between FFT sizes”, but basically the discontinuity comes from your manual splitting into chunks (with frame_size=3072).
Because the pretrained model has its own (STFT) encoder, the input signal will be framed again by the internal sliding window of length 512 with hop 128. So your 3072-point input will be divided into 21 overlapping frames inside the encoder.
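As a quick sanity check on those numbers (plain arithmetic, not tied to any particular espnet API):

# frames produced by a sliding window of length 512 with hop 128
win, hop, chunk = 512, 128, 3072
n_frames = (chunk - win) // hop + 1
print(n_frames)  # -> 21 overlapping frames per 3072-point chunk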
After these frames are processed by the separator, the decoder reconstructs the 3072-point signal using the overlap-add approach. In this sense, each 3072-point signal is continuous after processing.
The problem here is that adjacent 3072-point signals do not overlap at all (unlike the frames within the STFT). This is likely to cause discontinuities between consecutive 3072-point signals, similar to the Gibbs phenomenon.
So I would suggest dividing the signal into overlapping 3072-point chunks and manually merging the overlapping regions, as in the iSTFT, to improve continuity.
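A minimal sketch of that idea, assuming 50%-overlapping 3072-point chunks merged by Hann-windowed overlap-add (the chunk and hop sizes, and the helper name, are illustrative choices, not part of ESPnet):

import numpy as np

def enhance_streaming(audio, separate, chunk=3072, hop=1536):
    """Enhance overlapping chunks and merge them by windowed overlap-add."""
    window = np.hanning(chunk)           # synthesis window for the merge
    out = np.zeros(len(audio))
    norm = np.zeros(len(audio))          # running sum of the window for normalization
    for start in range(0, len(audio) - chunk + 1, hop):
        seg = audio[start:start + chunk]
        enhanced = separate(seg[np.newaxis, :])[0].reshape(-1)
        out[start:start + chunk] += enhanced * window
        norm[start:start + chunk] += window
    return out / np.maximum(norm, 1e-8)  # avoid division by zero at the edges

# usage with the SeparateSpeech instance from the question:
# out_data = enhance_streaming(audio, separate_speech_frm)

With hop equal to half the chunk length the Hann windows sum to a roughly constant value, and the explicit normalization handles the chunk edges, so adjacent chunks blend smoothly instead of butting against each other.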